agentlang-index · model deep-dive
gpt-4o
One model, twenty tasks, five languages, byte-exact scoring. The pass rate is 70% overall and 0% in Zero; the average tax versus the other four languages is +88%.
Per-task results
Every cell is a single attempt. Pass means stdout matched byte-exact on every public and hidden test case, stderr was empty, and the exit code was zero. A failed cell shows its failure class.
| Task | Zero | TypeScript | Rust | Go | Python |
|---|---|---|---|---|---|
| 000-hello-stdout Hello, stdout | compile | ✓ | ✓ | ✓ | ✓ |
| 001-fibonacci-memoized Fibonacci with memoization | compile | ✓ | ✓ | ✓ | ✓ |
| 002-sieve-prime-count Prime count via Sieve of Eratosthenes | compile | ✓ | ✓ | ✓ | ✓ |
| 003-levenshtein-distance Levenshtein edit distance | compile | ✓ | ✓ | ✓ | ✓ |
| 004-matrix-multiply Square integer matrix multiply | compile | ✓ | ✓ | other | ✓ |
| 005-balanced-parens Balanced bracket checker | compile | ✓ | ✓ | ✓ | ✓ |
| 006-substring-count Non-overlapping substring count | compile | ✓ | ✓ | ✓ | ✓ |
| 007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) | compile | ✓ | other | ✓ | ✓ |
| 008-word-reverse Reverse the order of words on a line | compile | ✓ | ✓ | ✓ | ✓ |
| 009-word-count Count whitespace-separated tokens in input | compile | ✓ | ✓ | ✓ | ✓ |
| 010-byte-frequency Per-byte frequency table sorted by byte value | compile | ✓ | ✓ | ✓ | ✓ |
| 011-rle-encode Run-length encode the input as count/byte pairs | compile | ✓ | ✓ | ✓ | ✓ |
| 012-http-status-code GET a URL and write the HTTP status code | compile | ✓ | wrong output | ✓ | other |
| 013-http-json-sum POST a JSON pair and extract the sum | compile | ✓ | wrong output | ✓ | other |
| 014-http-header-echo GET a URL and echo a named response header | compile | ✓ | wrong output | ✓ | other |
| 015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure | compile | ✓ | ✓ | ✓ | ✓ |
| 016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure | compile | ✓ | wrong output | ✓ | ✓ |
| 017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow | compile | ✓ | ✓ | ✓ | ✓ |
| 018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input | compile | ✓ | ✓ | ✓ | ✓ |
| 019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input | compile | wrong output | ✓ | ✓ | ✓ |
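The pass criterion stated above the table (byte-exact stdout on every case, empty stderr, exit code zero) can be sketched as a small check. This is a minimal illustration, not the benchmark's actual harness: the `TestCase` type, the `attempt_passes` name, and the timeout are all hypothetical.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class TestCase:
    """Hypothetical container for one public or hidden test case."""
    stdin: bytes
    expected_stdout: bytes


def attempt_passes(cmd: list[str], cases: list[TestCase], timeout_s: float = 10.0) -> bool:
    """Return True only if every case matches stdout byte-for-byte,
    leaves stderr empty, and exits with status zero.

    The timeout is an assumption; the source does not mention one.
    """
    for case in cases:
        try:
            proc = subprocess.run(
                cmd, input=case.stdin, capture_output=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            return False
        if proc.returncode != 0:
            return False
        if proc.stderr != b"":
            return False
        if proc.stdout != case.expected_stdout:  # bytes compared, no trimming
            return False
    return True


if __name__ == "__main__":
    # A trailing newline matters: b"hi" and b"hi\n" are different outputs.
    hello = [sys.executable, "-c", "print('hi')"]
    print(attempt_passes(hello, [TestCase(stdin=b"", expected_stdout=b"hi\n")]))
```

Comparing raw bytes rather than decoded, stripped text is what makes the scoring strict: a missing newline or a stray warning on stderr is enough to fail an attempt.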
Failure modes
Each failed attempt is classified as compile (parser, type checker, codegen, or build-system error before the program could run), runtime (program ran but crashed or threw), or wrong output (program ran cleanly but emitted the wrong bytes). Failures labelled other fall outside these three classes.
| Language | pass | compile | runtime | wrong output | other |
|---|---|---|---|---|---|
| Zero | 0 | 20 | 0 | 0 | 0 |
| TypeScript | 19 | 0 | 0 | 1 | 0 |
| Rust | 15 | 0 | 0 | 4 | 1 |
| Go | 19 | 0 | 0 | 0 | 1 |
| Python | 17 | 0 | 0 | 0 | 3 |
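The failure classes can be read as an ordered decision: did the program build, did it exit cleanly, did it emit the right bytes. The sketch below shows that ordering for a single test case. The `AttemptResult` type and `classify` function are hypothetical, and the mapping of "correct stdout but non-empty stderr" to other is an assumption, since the source does not define that class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttemptResult:
    """Hypothetical record of one attempt on one test case."""
    built: bool               # False if the toolchain rejected the program
    exit_code: Optional[int]  # None if the program never ran
    stdout: bytes
    stderr: bytes


def classify(result: AttemptResult, expected_stdout: bytes) -> str:
    """Map one attempt to a class name used in the tables above."""
    if not result.built:
        return "compile"       # failed before the program could run
    if result.exit_code != 0:
        return "runtime"       # ran, but crashed or threw
    if result.stdout != expected_stdout:
        return "wrong output"  # ran cleanly, wrong bytes
    if result.stderr != b"":
        return "other"         # ASSUMPTION: not defined by the source
    return "pass"


if __name__ == "__main__":
    rejected = AttemptResult(built=False, exit_code=None, stdout=b"", stderr=b"IMP001")
    print(classify(rejected, b"hello\n"))
```

The order of the checks matters: a program that never built has no meaningful exit code or output, so compile has to be tested first.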
Zero deep-dive
Every Zero attempt failed. Below is each task with the first line of the captured diagnostic. The pattern across tasks is the signal worth reading: the same two error codes, IMP001 and PAR100, recur.
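| Task | Class | First diagnostic line |
|---|---|---|
| 000-hello-stdout Hello, stdout | compile | ref.zero:1:1 IMP001: unknown package-local import 'std' |
| 001-fibonacci-memoized Fibonacci with memoization | compile | ref.zero:1:1 IMP001: unknown package-local import 'std' |
| 002-sieve-prime-count Prime count via Sieve of Eratosthenes | compile | ref.zero:1:1 IMP001: unknown package-local import 'std' |
| 003-levenshtein-distance Levenshtein edit distance | compile | ref.zero:1:1 IMP001: unknown package-local import 'std' |
| 004-matrix-multiply Square integer matrix multiply | compile | ref.zero:1:1 IMP001: unknown package-local import 'std' |
| 005-balanced-parens Balanced bracket checker | compile | ref.zero:1:1 IMP001: unknown package-local import 'std' |
| 006-substring-count Non-overlapping substring count | compile | ref.zero:1:1 IMP001: unknown package-local import 'std' |
| 007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) | compile | ref.zero:1:1 IMP001: unknown package-local import 'std' |
| 008-word-reverse Reverse the order of words on a line | compile | ref.zero:1:1 IMP001: unknown package-local import 'std' |
| 009-word-count Count whitespace-separated tokens in input | compile | ref.zero:6:8 PAR100: expected '{' before block |
| 010-byte-frequency Per-byte frequency table sorted by byte value | compile | ref.zero:6:8 PAR100: expected '{' before block |
| 011-rle-encode Run-length encode the input as count/byte pairs | compile | ref.zero:1:1 IMP001: unknown package-local import 'std' |
| 012-http-status-code GET a URL and write the HTTP status code | compile | ref.zero:1:1 IMP001: unknown package-local import 'std::net' |
| 013-http-json-sum POST a JSON pair and extract the sum | compile | ref.zero:1:1 IMP001: unknown package-local import 'std::net' |
| 014-http-header-echo GET a URL and echo a named response header | compile | ref.zero:1:1 IMP001: unknown package-local import 'http' |
| 015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure | compile | ref.zero:1:1 IMP001: unknown package-local import 'std' |
| 016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure | compile | ref.zero:6:8 PAR100: expected '{' before block |
| 017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow | compile | ref.zero:1:1 IMP001: unknown package-local import 'std' |
| 018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input | compile | zero/src/lib.0:8:1 PAR100: unexpected character '`' |
| 019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input | compile | zero/src/lib.0:28:1 PAR100: unexpected character '`' |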
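To make the recurring pattern concrete, the twenty first-line diagnostics above can be tallied by error code. The sketch below assumes each diagnostic has the shape `file:line:col CODE: message`, which is inferred from the lines on this page and not from any Zero compiler documentation. The regular expression and the function name are illustrative.

```python
import re
from collections import Counter

# Assumed shape: <file>:<line>:<col> <CODE>: <message>
DIAG = re.compile(
    r"^(?P<file>\S+?):(?P<line>\d+):(?P<col>\d+) (?P<code>[A-Z]+\d+): (?P<msg>.*)$"
)


def tally_codes(lines):
    """Count diagnostics per error code, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = DIAG.match(line.strip())
        if match:
            counts[match.group("code")] += 1
    return counts


# The 20 first-line diagnostics from this page, grouped by identical text.
diagnostics = (
    ["ref.zero:1:1 IMP001: unknown package-local import 'std'"] * 12
    + ["ref.zero:1:1 IMP001: unknown package-local import 'std::net'"] * 2
    + ["ref.zero:1:1 IMP001: unknown package-local import 'http'"]
    + ["ref.zero:6:8 PAR100: expected '{' before block"] * 3
    + ["zero/src/lib.0:8:1 PAR100: unexpected character '`'"]
    + ["zero/src/lib.0:28:1 PAR100: unexpected character '`'"]
)

if __name__ == "__main__":
    for code, count in tally_codes(diagnostics).most_common():
        print(code, count)
    # IMP001 15
    # PAR100 5
```

On these inputs, 15 of the 20 failures are IMP001 import errors raised at line 1, column 1, and the remaining 5 are PAR100 parse errors.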
Cost
| Metric | Value |
|---|---|
| Prompt tokens | 64,610 |
| Completion tokens | 22,803 |
| Total tokens | 87,413 |
| Attempts | 100 (70 passed) |
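The cost figures can be cross-checked and turned into per-attempt averages with a few lines of arithmetic. The inputs below are the numbers from the table; the two derived averages do not appear on the original page and are shown only as an illustration.

```python
prompt_tokens = 64_610
completion_tokens = 22_803
attempts = 100
passed = 70

# The reported total is the sum of prompt and completion tokens.
total_tokens = prompt_tokens + completion_tokens
assert total_tokens == 87_413

# Derived figures (not reported on the page).
tokens_per_attempt = total_tokens / attempts
tokens_per_pass = total_tokens / passed

print(f"pass rate: {passed / attempts:.0%}")               # pass rate: 70%
print(f"tokens per attempt: {tokens_per_attempt:.1f}")     # tokens per attempt: 874.1
print(f"tokens per passing attempt: {tokens_per_pass:.1f}")  # tokens per passing attempt: 1248.8
```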
Compare
Other models in this run: gpt-5, gpt-4o-mini.