agentlang-index · model deep-dive

gpt-4o-mini

One model, twenty tasks, five languages, byte-exact scoring. The pass rate is 56% overall (56 of 100 attempts). Zero passes 0% against a 70% average across the other four languages; the average tax versus those languages is +70%.

0% Zero
70% TypeScript
65% Rust
75% Go
70% Python
Pass rate per language. Each bar is the share of the 20 tasks the model passed in that language (for example, 70% is 14 of 20).
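The headline figures can be re-derived from the per-language pass counts. The sketch below is only a cross-check, not the index's own scoring code. It assumes that "tax" means the mean pass rate of the other four languages minus Zero's pass rate, which matches the +70 figure but is not spelled out on this page. The names `passes`, `pass_rate`, and `zero_tax` are illustrative.

```python
# Pass counts per language, out of 20 tasks each (from the failure-mode table).
passes = {"Zero": 0, "TypeScript": 14, "Rust": 13, "Go": 15, "Python": 14}
TASKS = 20


def pass_rate(count: int, total: int = TASKS) -> float:
    """Pass rate as a percentage."""
    return 100.0 * count / total


# Overall rate: all passes over all attempts (20 tasks x 5 languages).
overall = pass_rate(sum(passes.values()), TASKS * len(passes))

# Baseline: mean pass rate of every language except Zero.
others = [pass_rate(c) for lang, c in passes.items() if lang != "Zero"]
baseline = sum(others) / len(others)

# ASSUMPTION: "tax" = baseline pass rate minus Zero's pass rate, in points.
zero_tax = baseline - pass_rate(passes["Zero"])

print(f"overall {overall:.0f}%, baseline {baseline:.0f}%, Zero tax {zero_tax:+.0f} points")
# prints: overall 56%, baseline 70%, Zero tax +70 points
```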

Per-task results

Every cell is a single attempt. Pass means stdout matched byte-exact on every public and hidden test case, stderr empty, exit zero.
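The pass criterion above (byte-exact stdout, empty stderr, exit code zero, on every case) could be checked roughly as follows. This is a minimal sketch, not the harness used for this run. The function name `attempt_passes`, the `(stdin, expected_stdout)` case format, and the timeout are assumptions made for illustration.

```python
import subprocess
import sys


def attempt_passes(cmd, cases, timeout=10.0):
    """Return True only if every case gives byte-exact stdout, empty stderr, exit 0.

    cmd     -- argv list for the compiled or interpreted solution (hypothetical)
    cases   -- iterable of (stdin_bytes, expected_stdout_bytes) pairs
    timeout -- ASSUMPTION: the page does not mention a time limit
    """
    for stdin_bytes, expected_stdout in cases:
        try:
            proc = subprocess.run(
                cmd, input=stdin_bytes, capture_output=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            return False
        # Compare raw bytes: no decoding, no whitespace trimming.
        if proc.returncode != 0 or proc.stderr != b"" or proc.stdout != expected_stdout:
            return False
    return True


# Hypothetical usage: a stand-in "solution" that upper-cases its input.
cmd = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
print(attempt_passes(cmd, [(b"abc", b"ABC"), (b"", b"")]))
```

Comparing bytes rather than decoded text means a stray trailing newline or a warning on stderr is enough to fail a case.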

Task Zero TypeScript Rust Go Python
000-hello-stdout Hello, stdout compile
001-fibonacci-memoized Fibonacci with memoization compile
002-sieve-prime-count Prime count via Sieve of Eratosthenes compile other
003-levenshtein-distance Levenshtein edit distance compile other
004-matrix-multiply Square integer matrix multiply compile other
005-balanced-parens Balanced bracket checker compile
006-substring-count Non-overlapping substring count compile
007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) compile other other
008-word-reverse Reverse the order of words on a line compile
009-word-count Count whitespace-separated tokens in input compile wrong output wrong output
010-byte-frequency Per-byte frequency table sorted by byte value compile wrong output wrong output
011-rle-encode Run-length encode the input as count/byte pairs compile
012-http-status-code GET a URL and write the HTTP status code compile wrong output wrong output other
013-http-json-sum POST a JSON pair and extract the sum compile wrong output wrong output other
014-http-header-echo GET a URL and echo a named response header compile runtime wrong output other
015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure compile other
016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure compile wrong output
017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow compile
018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input compile other
019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input compile wrong output other wrong output

Failure modes

Each failed attempt is classified as compile (parser, type checker, codegen, or build-system error before the program could run), runtime (program ran but crashed or threw), or wrong output (program ran cleanly but emitted the wrong bytes). Failures that fit none of these three classes are counted as other.

Language     pass  compile  runtime  wrong output  other
Zero            0       20        0             0      0
TypeScript     14        0        1             3      2
Rust           13        0        0             5      2
Go             15        0        0             2      3
Python         14        0        0             2      4
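The failure classes in this section can be read as a decision order: did it build, did it exit cleanly, did the bytes match. The sketch below follows that order. The `Attempt` record and its fields are hypothetical. The page does not define what lands in other, so the catch-all branch here (clean exit and correct stdout, but non-empty stderr) is an assumption used only to make the example complete.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    """Hypothetical record of one attempt; field names are illustrative."""
    built: bool                # False if parse, type check, codegen, or build failed
    exit_code: Optional[int]   # None if the program never ran
    stdout: bytes
    stderr: bytes
    expected: bytes


def classify(a: Attempt) -> str:
    if not a.built:
        return "compile"        # failed before the program could run
    if a.exit_code != 0:
        return "runtime"        # ran, but crashed or threw
    if a.stdout != a.expected:
        return "wrong output"   # ran cleanly, wrong bytes
    if a.stderr == b"":
        return "pass"
    # ASSUMPTION: anything left over is "other"; the page does not define it.
    return "other"


print(classify(Attempt(True, 0, b"41\n", b"", b"42\n")))  # prints: wrong output
```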

Zero deep-dive

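Every Zero attempt failed at the compile stage. Below is each task with the first line of the captured diagnostic. The pattern across tasks is the signal worth reading: only two error codes occur. PAR100 (a parser error) accounts for 16 of the 20 tasks, and IMP001 (an unknown import) accounts for the other 4.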

  1. 000-hello-stdout Hello, stdout compile
    ref.zero:1:1 PAR100: expected '{' before block
  2. 001-fibonacci-memoized Fibonacci with memoization compile
    ref.zero:4:1 PAR100: expected '{' before block
  3. 002-sieve-prime-count Prime count via Sieve of Eratosthenes compile
    ref.zero:4:1 PAR100: expected '{' before block
  4. 003-levenshtein-distance Levenshtein edit distance compile
    ref.zero:9:1 PAR100: expected '{' before block
  5. 004-matrix-multiply Square integer matrix multiply compile
    ref.zero:3:10 PAR100: expected '{' before block
  6. 005-balanced-parens Balanced bracket checker compile
    ref.zero:3:18 PAR100: expected expression
  7. 006-substring-count Non-overlapping substring count compile
    ref.zero:1:1 PAR100: expected '{' before block
  8. 007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) compile
    ref.zero:1:15 PAR100: expected expression
  9. 008-word-reverse Reverse the order of words on a line compile
    ref.zero:3:8 PAR100: expected '{' before block
  10. 009-word-count Count whitespace-separated tokens in input compile
    ref.zero:3:17 PAR100: expected '{' before block
  11. 010-byte-frequency Per-byte frequency table sorted by byte value compile
    ref.zero:1:15 PAR100: expected expression
  12. 011-rle-encode Run-length encode the input as count/byte pairs compile
    ref.zero:1:15 PAR100: expected expression
  13. 012-http-status-code GET a URL and write the HTTP status code compile
    ref.zero:1:1 IMP001: unknown package-local import 'lib http'
  14. 013-http-json-sum POST a JSON pair and extract the sum compile
    ref.zero:1:1 IMP001: unknown package-local import 'lib http'
  15. 014-http-header-echo GET a URL and echo a named response header compile
    ref.zero:1:1 IMP001: unknown package-local import 'lib http'
  16. 015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure compile
    ref.zero:3:13 PAR100: expected '{' before block
  17. 016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure compile
    ref.zero:3:13 PAR100: expected '{' before block
  18. 017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow compile
    ref.zero:3:13 PAR100: expected '{' before block
  19. 018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input compile
    zero/src/main.0:1:1 IMP001: unknown package-local import '"src/lib.0"'
  20. 019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input compile
    zero/src/lib.0:23:1 PAR100: unexpected character '`'
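To see how often each error code recurs, the diagnostic lines above can be tallied mechanically. The sketch below assumes every diagnostic has the shape `path:line:col CODE: message`, which holds for the twenty lines listed here but is not a documented format. The regular expression and the `tally` helper are illustrative, and `sample` is a four-line excerpt of the list, not the full set.

```python
import re
from collections import Counter

# ASSUMPTION: diagnostics look like "path:line:col CODE: message".
DIAG = re.compile(
    r"^(?P<file>\S+):(?P<line>\d+):(?P<col>\d+) (?P<code>[A-Z]+\d+): (?P<msg>.*)$"
)


def tally(lines):
    """Count diagnostics by error code and by (code, message) pair."""
    codes, messages = Counter(), Counter()
    for line in lines:
        m = DIAG.match(line.strip())
        if m:
            codes[m["code"]] += 1
            messages[(m["code"], m["msg"])] += 1
    return codes, messages


# Excerpt of the diagnostics listed above.
sample = [
    "ref.zero:1:1 PAR100: expected '{' before block",
    "ref.zero:3:18 PAR100: expected expression",
    "ref.zero:1:1 IMP001: unknown package-local import 'lib http'",
    "zero/src/lib.0:23:1 PAR100: unexpected character '`'",
]
codes, messages = tally(sample)
print(codes.most_common())  # prints: [('PAR100', 3), ('IMP001', 1)]
```

Grouping by message as well as code separates the "expected '{' before block" cases from the "expected expression" cases, which share the PAR100 code.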

Cost

Prompt tokens 64,610
Completion tokens 21,038
Total tokens 85,648
Attempts 100 (56 passed)
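The token totals above also give rough per-attempt figures. The sketch below only does arithmetic on the numbers in this section. "Tokens per passing attempt" is a derived figure introduced here for illustration, not a metric the index reports.

```python
prompt_tokens = 64_610
completion_tokens = 21_038
attempts, passed = 100, 56

# The reported total is the sum of prompt and completion tokens.
total = prompt_tokens + completion_tokens
assert total == 85_648

per_attempt = total / attempts   # average tokens spent on one attempt
per_pass = total / passed        # tokens spent per attempt that passed

print(f"{per_attempt:.1f} tokens per attempt, {per_pass:.1f} tokens per passing attempt")
# prints: 856.5 tokens per attempt, 1529.4 tokens per passing attempt
```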

Compare

Other models in this run: gpt-5, gpt-4o. Or go back to the leaderboard and methodology.

Want to re-run this end-to-end? See the per-run reproducibility page: reproduce this run.