agentlang-index · model deep-dive
gpt-5
One model, twenty tasks, five languages, byte-exact scoring. Pass rate is 79% overall (79 of 100 attempts) and 0% in Zero; the average tax versus the other four languages is +99%.
Per-task results
Every cell is a single attempt. Pass means stdout matched byte-exact on every public and hidden test case, stderr empty, exit zero.
| Task | Zero | TypeScript | Rust | Go | Python |
|---|---|---|---|---|---|
| 000-hello-stdout Hello, stdout | compile | ✓ | ✓ | ✓ | ✓ |
| 001-fibonacci-memoized Fibonacci with memoization | compile | ✓ | ✓ | ✓ | ✓ |
| 002-sieve-prime-count Prime count via Sieve of Eratosthenes | compile | ✓ | ✓ | ✓ | ✓ |
| 003-levenshtein-distance Levenshtein edit distance | compile | ✓ | ✓ | ✓ | ✓ |
| 004-matrix-multiply Square integer matrix multiply | compile | ✓ | ✓ | ✓ | ✓ |
| 005-balanced-parens Balanced bracket checker | compile | ✓ | ✓ | ✓ | ✓ |
| 006-substring-count Non-overlapping substring count | compile | ✓ | ✓ | ✓ | ✓ |
| 007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) | wrong output | ✓ | ✓ | ✓ | ✓ |
| 008-word-reverse Reverse the order of words on a line | compile | ✓ | ✓ | ✓ | ✓ |
| 009-word-count Count whitespace-separated tokens in input | compile | ✓ | ✓ | ✓ | ✓ |
| 010-byte-frequency Per-byte frequency table sorted by byte value | compile | ✓ | ✓ | ✓ | ✓ |
| 011-rle-encode Run-length encode the input as count/byte pairs | compile | ✓ | ✓ | ✓ | ✓ |
| 012-http-status-code GET a URL and write the HTTP status code | compile | ✓ | ✓ | ✓ | ✓ |
| 013-http-json-sum POST a JSON pair and extract the sum | wrong output | ✓ | wrong output | ✓ | ✓ |
| 014-http-header-echo GET a URL and echo a named response header | wrong output | ✓ | ✓ | ✓ | ✓ |
| 015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure | compile | ✓ | ✓ | ✓ | ✓ |
| 016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure | compile | ✓ | ✓ | ✓ | ✓ |
| 017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow | wrong output | ✓ | ✓ | ✓ | ✓ |
| 018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input | compile | ✓ | ✓ | ✓ | ✓ |
| 019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input | compile | ✓ | ✓ | ✓ | ✓ |
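The pass criterion stated above the table (byte-exact stdout, empty stderr, exit zero, on every test case) can be sketched as follows. This is a minimal illustration and not the benchmark's actual harness: the function names, the `(stdin, expected)` case layout, and the 10-second timeout are all assumptions.

```python
import subprocess


def case_passes(returncode: int, stdout: bytes, stderr: bytes, expected: bytes) -> bool:
    """A case passes only if all three conditions hold at once."""
    return returncode == 0 and stderr == b"" and stdout == expected


def run_case(cmd, stdin_bytes: bytes, expected: bytes, timeout_s: float = 10.0) -> bool:
    """Run one test case. The timeout value is a hypothetical choice."""
    try:
        proc = subprocess.run(
            cmd, input=stdin_bytes, capture_output=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False
    # Compare raw bytes: no decoding, no whitespace trimming.
    return case_passes(proc.returncode, proc.stdout, proc.stderr, expected)


def attempt_passes(cmd, cases) -> bool:
    """An attempt passes only if every (stdin, expected) case passes."""
    return all(run_case(cmd, stdin_bytes, expected) for stdin_bytes, expected in cases)
```

Comparing bytes rather than decoded text means a missing trailing newline or a stray warning on stderr is enough to fail a case.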
Failure modes
Each failed attempt is classified as compile (parser, type checker, codegen, or build-system error before the program could run), runtime (program ran but crashed or threw), or wrong output (program ran cleanly but emitted the wrong bytes).
| Language | pass | compile | runtime | wrong output | other |
|---|---|---|---|---|---|
| Zero | 0 | 16 | 0 | 4 | 0 |
| TypeScript | 20 | 0 | 0 | 0 | 0 |
| Rust | 19 | 0 | 0 | 1 | 0 |
| Go | 20 | 0 | 0 | 0 | 0 |
| Python | 20 | 0 | 0 | 0 | 0 |
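The three failure classes described above could be assigned along these lines. This is a sketch under stated assumptions: the `Attempt` record and its fields are hypothetical, each attempt is reduced to a single expected output for brevity, and routing a clean run with correct stdout but non-empty stderr to "other" is a guess, since the page does not say how that case is bucketed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Attempt:
    """Hypothetical record of one attempt (field names are assumptions)."""
    built: bool              # False if any step before execution failed
    exit_code: int = 0
    stdout: bytes = b""
    stderr: bytes = b""
    expected: bytes = b""


def classify(a: Attempt) -> str:
    if not a.built:
        return "compile"       # never reached execution
    if a.exit_code != 0:
        return "runtime"       # ran, but crashed or threw
    if a.stdout != a.expected:
        return "wrong output"  # ran cleanly, wrong bytes
    if a.stderr:
        return "other"         # assumption: the page does not define this bucket
    return "pass"


def tally(attempts) -> Counter:
    """Count attempts per class, as in one row of the table."""
    return Counter(classify(a) for a in attempts)
```

Checking the build step first and the output last mirrors the order in which an attempt can fail, so each attempt lands in exactly one column.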
Zero deep-dive
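Every Zero attempt failed. Below is each task with the first line of the captured diagnostic. The pattern across tasks is the signal worth reading: two error codes, IMP001 (11 tasks) and PAR100 (5 tasks), account for all 16 compile failures, and the four wrong-output attempts captured no diagnostic.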
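- 000-hello-stdout Hello, stdout (compile): ref.zero:1:10 PAR100: expected '{' before block
- 001-fibonacci-memoized Fibonacci with memoization (compile): ref.zero:1:1 IMP001: unknown package-local import 'std'
- 002-sieve-prime-count Prime count via Sieve of Eratosthenes (compile): ref.zero:1:1 IMP001: unknown package-local import 'std'
- 003-levenshtein-distance Levenshtein edit distance (compile): ref.zero:3:8 PAR100: expected '{' before block
- 004-matrix-multiply Square integer matrix multiply (compile): ref.zero:1:1 IMP001: unknown package-local import 'std'
- 005-balanced-parens Balanced bracket checker (compile): ref.zero:1:1 IMP001: unknown package-local import 'std'
- 006-substring-count Non-overlapping substring count (compile): ref.zero:1:1 IMP001: unknown package-local import 'std'
- 007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) (wrong output): (no diagnostic captured)
- 008-word-reverse Reverse the order of words on a line (compile): ref.zero:1:1 IMP001: unknown package-local import 'std'
- 009-word-count Count whitespace-separated tokens in input (compile): ref.zero:3:9 PAR100: expected '{' before block
- 010-byte-frequency Per-byte frequency table sorted by byte value (compile): ref.zero:1:1 IMP001: unknown package-local import 'std'
- 011-rle-encode Run-length encode the input as count/byte pairs (compile): ref.zero:1:1 IMP001: unknown package-local import 'std'
- 012-http-status-code GET a URL and write the HTTP status code (compile): ref.zero:1:1 IMP001: unknown package-local import 'std'
- 013-http-json-sum POST a JSON pair and extract the sum (wrong output): (no diagnostic captured)
- 014-http-header-echo GET a URL and echo a named response header (wrong output): (no diagnostic captured)
- 015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure (compile): ref.zero:3:9 PAR100: expected '{' before block
- 016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure (compile): ref.zero:1:1 IMP001: unknown package-local import 'std'
- 017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow (wrong output): (no diagnostic captured)
- 018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input (compile): zero/src/lib.0:10:1 PAR100: unexpected character '`'
- 019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input (compile): zero/src/main.0:1:1 IMP001: unknown package-local import 'world'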
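To make the recurrence of error codes concrete, the first diagnostic lines listed above can be tallied by code. The sketch below is illustrative only: the regular expression is inferred from the lines shown on this page (`file:line:col CODE: message`) and is not a documented Zero diagnostic format, and `None` stands in for "(no diagnostic captured)".

```python
import re
from collections import Counter

# Shape inferred from the diagnostics on this page; an assumption, not a spec.
DIAG_RE = re.compile(
    r"^(?P<file>[^:\s]+):(?P<line>\d+):(?P<col>\d+) "
    r"(?P<code>[A-Z]{3}\d{3}): (?P<msg>.*)$"
)


def error_code(first_line):
    """Return the error code from a first diagnostic line, or None."""
    if first_line is None:
        return None
    m = DIAG_RE.match(first_line.strip())
    return m.group("code") if m else None


IMP_STD = "ref.zero:1:1 IMP001: unknown package-local import 'std'"

# First diagnostic line per task, in page order (tasks 000 to 019).
first_lines = [
    "ref.zero:1:10 PAR100: expected '{' before block",       # 000
    IMP_STD, IMP_STD,                                         # 001-002
    "ref.zero:3:8 PAR100: expected '{' before block",        # 003
    IMP_STD, IMP_STD, IMP_STD,                                # 004-006
    None,                                                     # 007
    IMP_STD,                                                  # 008
    "ref.zero:3:9 PAR100: expected '{' before block",        # 009
    IMP_STD, IMP_STD, IMP_STD,                                # 010-012
    None, None,                                               # 013-014
    "ref.zero:3:9 PAR100: expected '{' before block",        # 015
    IMP_STD,                                                  # 016
    None,                                                     # 017
    "zero/src/lib.0:10:1 PAR100: unexpected character '`'",  # 018
    "zero/src/main.0:1:1 IMP001: unknown package-local import 'world'",  # 019
]

codes = Counter(error_code(line) for line in first_lines)
print(codes)
```

On the twenty lines above, the tally comes to 11 IMP001, 5 PAR100, and 4 entries with no diagnostic, which matches the 16 compile and 4 wrong-output failures in the failure-mode table.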
Cost
| Metric | Value |
|---|---|
| Prompt tokens | 64,510 |
| Completion tokens | 348,714 |
| Total tokens | 413,224 |
| Attempts | 100 (79 passed) |
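The figures in this table and in the failure-mode table can be cross-checked with a few lines of arithmetic. The snippet below uses only numbers reported on this page; the derived per-attempt average and completion share are illustrative and are not metrics the page itself reports.

```python
prompt_tokens = 64_510
completion_tokens = 348_714
reported_total = 413_224
attempts = 100

# Passes per language, copied from the failure-mode table.
passes = {"Zero": 0, "TypeScript": 20, "Rust": 19, "Go": 20, "Python": 20}

total_tokens = prompt_tokens + completion_tokens
assert total_tokens == reported_total  # the token rows add up

total_passed = sum(passes.values())
assert total_passed == 79              # matches "100 (79 passed)"

pass_rate = total_passed / attempts               # 0.79
tokens_per_attempt = total_tokens / attempts      # 4132.24
completion_share = completion_tokens / total_tokens

print(f"pass rate: {pass_rate:.0%}")
print(f"tokens per attempt: {tokens_per_attempt:.2f}")
print(f"completion share of tokens: {completion_share:.1%}")
```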
Compare
Other models in this run: gpt-4o, gpt-4o-mini.