agentlang-index · model deep-dive
gpt-4o-mini
One model, twenty tasks, five languages, byte-exact scoring. The pass rate is 56% overall (56 of 100 attempts) and 0% in Zero; the average tax versus the other four languages, which together pass 56 of 80 attempts, is +70%.
Per-task results
Every cell is a single attempt. A pass means stdout matched byte-for-byte on every public and hidden test case, with empty stderr and exit code zero. Failed cells show the failure class.
| Task | Zero | TypeScript | Rust | Go | Python |
|---|---|---|---|---|---|
| 000-hello-stdout Hello, stdout | compile | ✓ | ✓ | ✓ | ✓ |
| 001-fibonacci-memoized Fibonacci with memoization | compile | ✓ | ✓ | ✓ | ✓ |
| 002-sieve-prime-count Prime count via Sieve of Eratosthenes | compile | ✓ | other | ✓ | ✓ |
| 003-levenshtein-distance Levenshtein edit distance | compile | other | ✓ | ✓ | ✓ |
| 004-matrix-multiply Square integer matrix multiply | compile | ✓ | ✓ | other | ✓ |
| 005-balanced-parens Balanced bracket checker | compile | ✓ | ✓ | ✓ | ✓ |
| 006-substring-count Non-overlapping substring count | compile | ✓ | ✓ | ✓ | ✓ |
| 007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) | compile | ✓ | ✓ | other | other |
| 008-word-reverse Reverse the order of words on a line | compile | ✓ | ✓ | ✓ | ✓ |
| 009-word-count Count whitespace-separated tokens in input | compile | ✓ | ✓ | wrong output | wrong output |
| 010-byte-frequency Per-byte frequency table sorted by byte value | compile | ✓ | wrong output | ✓ | wrong output |
| 011-rle-encode Run-length encode the input as count/byte pairs | compile | ✓ | ✓ | ✓ | ✓ |
| 012-http-status-code GET a URL and write the HTTP status code | compile | wrong output | wrong output | ✓ | other |
| 013-http-json-sum POST a JSON pair and extract the sum | compile | wrong output | wrong output | ✓ | other |
| 014-http-header-echo GET a URL and echo a named response header | compile | runtime | wrong output | ✓ | other |
| 015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure | compile | other | ✓ | ✓ | ✓ |
| 016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure | compile | ✓ | wrong output | ✓ | ✓ |
| 017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow | compile | ✓ | ✓ | ✓ | ✓ |
| 018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input | compile | ✓ | ✓ | other | ✓ |
| 019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input | compile | wrong output | other | wrong output | ✓ |
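The pass criterion stated above the table can be sketched as a small check. This is a minimal illustration, not the benchmark's actual harness: the `CaseResult` structure and both function names are hypothetical, and the only rules encoded are the three the page states (byte-exact stdout, empty stderr, exit zero, on every case).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaseResult:
    """Captured outcome of one test case (hypothetical structure)."""
    expected_stdout: bytes
    stdout: bytes
    stderr: bytes
    exit_code: int


def case_passes(r: CaseResult) -> bool:
    # Byte-exact: compare raw bytes, with no decoding and no whitespace trimming.
    return r.stdout == r.expected_stdout and r.stderr == b"" and r.exit_code == 0


def attempt_passes(cases: List[CaseResult]) -> bool:
    # Every public and hidden case must pass. An empty case list is treated
    # as a failure here, which is an assumption of this sketch.
    return bool(cases) and all(case_passes(c) for c in cases)
```

Comparing bytes rather than decoded text means a missing trailing newline or a stray carriage return is enough to fail a case, which is consistent with byte-exact scoring.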
Failure modes
Each failed attempt is classified as compile (a parser, type checker, codegen, or build-system error before the program could run), runtime (the program ran but crashed or threw), or wrong output (the program ran cleanly but emitted the wrong bytes). Failures that fit none of these three are counted as other.
| Language | pass | compile | runtime | wrong output | other |
|---|---|---|---|---|---|
| Zero | 0 | 20 | 0 | 0 | 0 |
| TypeScript | 14 | 0 | 1 | 3 | 2 |
| Rust | 13 | 0 | 0 | 5 | 2 |
| Go | 15 | 0 | 0 | 2 | 3 |
| Python | 14 | 0 | 0 | 2 | 4 |
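One way to read the failure classes is as an ordered decision: did the program build, did it exit cleanly, did the bytes match. The sketch below assumes exactly that ordering; the function name and its three inputs are hypothetical, treating a nonzero exit code as "crashed or threw" is an assumption, and the page does not define what lands in other.

```python
def classify_failure(built: bool, exit_code: int, stdout_matches: bool) -> str:
    """Map one failed attempt to a failure class (hypothetical signature)."""
    if not built:
        # Parser, type checker, codegen, or build-system error before any run.
        return "compile"
    if exit_code != 0:
        # Assumption: a nonzero exit stands in for "crashed or threw".
        return "runtime"
    if not stdout_matches:
        # Ran cleanly but emitted the wrong bytes.
        return "wrong output"
    # Failed for a reason outside the three named classes; the source
    # does not say what these cases are.
    return "other"
```

Checking in this order means each failed attempt receives exactly one label, so the five columns in each row of the table sum to the twenty tasks.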
Zero deep-dive
Every Zero attempt failed. Below is each task with the first line of the captured diagnostic. The pattern across tasks is the signal worth reading: only two error codes appear, with PAR100 accounting for 16 of the 20 first-line diagnostics and IMP001 for the other 4.
| Task | First diagnostic line |
|---|---|
| 000-hello-stdout | ref.zero:1:1 PAR100: expected '{' before block |
| 001-fibonacci-memoized | ref.zero:4:1 PAR100: expected '{' before block |
| 002-sieve-prime-count | ref.zero:4:1 PAR100: expected '{' before block |
| 003-levenshtein-distance | ref.zero:9:1 PAR100: expected '{' before block |
| 004-matrix-multiply | ref.zero:3:10 PAR100: expected '{' before block |
| 005-balanced-parens | ref.zero:3:18 PAR100: expected expression |
| 006-substring-count | ref.zero:1:1 PAR100: expected '{' before block |
| 007-csv-line-tokenize | ref.zero:1:15 PAR100: expected expression |
| 008-word-reverse | ref.zero:3:8 PAR100: expected '{' before block |
| 009-word-count | ref.zero:3:17 PAR100: expected '{' before block |
| 010-byte-frequency | ref.zero:1:15 PAR100: expected expression |
| 011-rle-encode | ref.zero:1:15 PAR100: expected expression |
| 012-http-status-code | ref.zero:1:1 IMP001: unknown package-local import 'lib http' |
| 013-http-json-sum | ref.zero:1:1 IMP001: unknown package-local import 'lib http' |
| 014-http-header-echo | ref.zero:1:1 IMP001: unknown package-local import 'lib http' |
| 015-checked-divide-u32 | ref.zero:3:13 PAR100: expected '{' before block |
| 016-parse-list-sum | ref.zero:3:13 PAR100: expected '{' before block |
| 017-checked-add-overflow | ref.zero:3:13 PAR100: expected '{' before block |
| 018-caesar-cipher | zero/src/main.0:1:1 IMP001: unknown package-local import '"src/lib.0"' |
| 019-run-length-encode | zero/src/lib.0:23:1 PAR100: unexpected character '`' |
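To see which error codes recur, the first-line diagnostics can be tallied mechanically. The sketch below assumes the `path:line:col CODE: message` shape inferred from the lines shown on this page, not a documented Zero compiler format; the function names are hypothetical and the sample holds three lines copied from the listing above.

```python
import re
from collections import Counter

# Assumed shape, inferred from the diagnostics on this page:
#   path:line:col CODE: message
DIAG = re.compile(
    r"^(?P<path>\S+):(?P<line>\d+):(?P<col>\d+) (?P<code>[A-Z]+\d+): (?P<message>.*)$"
)


def parse_diagnostic(text):
    """Split one diagnostic line into its fields, or return None if it does not match."""
    m = DIAG.match(text.strip())
    if m is None:
        return None
    return {
        "path": m["path"],
        "line": int(m["line"]),
        "col": int(m["col"]),
        "code": m["code"],
        "message": m["message"],
    }


def count_codes(lines):
    """Count how often each error code appears, skipping lines that do not parse."""
    parsed = (parse_diagnostic(line) for line in lines)
    return Counter(d["code"] for d in parsed if d is not None)


SAMPLE = [
    "ref.zero:1:1 PAR100: expected '{' before block",
    "ref.zero:3:18 PAR100: expected expression",
    "ref.zero:1:1 IMP001: unknown package-local import 'lib http'",
]

if __name__ == "__main__":
    print(count_codes(SAMPLE))
```

Grouping by code alone hides the fact that PAR100 carries more than one message on this page, so a finer tally could key on the code and message together.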
Cost
| Metric | Value |
|---|---|
| Prompt tokens | 64,610 |
| Completion tokens | 21,038 |
| Total tokens | 85,648 |
| Attempts | 100 (56 passed) |
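The cost figures are related by simple arithmetic, sketched below. The four input numbers come from the table; the per-attempt and per-pass averages are derived here for illustration and are not reported on the page, and the function name is hypothetical.

```python
def summarize_cost(prompt_tokens: int, completion_tokens: int, attempts: int, passed: int) -> dict:
    """Derive totals and averages from the raw cost figures (hypothetical helper)."""
    total = prompt_tokens + completion_tokens
    return {
        "total_tokens": total,
        "pass_rate": passed / attempts,
        # Derived averages, not figures reported on the page.
        "tokens_per_attempt": total / attempts,
        "tokens_per_pass": total / passed,
    }


# Figures taken from the cost table.
summary = summarize_cost(prompt_tokens=64_610, completion_tokens=21_038, attempts=100, passed=56)

if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name}: {value}")
```

The total here reproduces the 85,648 in the table, and the pass rate reproduces the 56% headline figure.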
Compare
Other models in this run: gpt-5, gpt-4o. Or back to the leaderboard and methodology.
Want to re-run this end-to-end? See the per-run reproducibility page: reproduce this run.