agentlang-index · model deep-dive

gpt-4o

One model, twenty tasks, five languages, byte-exact scoring. The pass rate is 70% overall (70 of 100 attempts) and 0% in Zero; the average tax versus the other four languages is +88%.

Pass rate per language. Each figure is the share of tasks (out of 20) the model passed in that language.
Zero 0%
TypeScript 95%
Rust 75%
Go 95%
Python 85%
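
The headline figures can be re-derived from the per-language pass rates. The sketch below assumes that "tax" means the mean pass rate of the other four languages minus Zero's pass rate; that definition is an assumption here, not something this page states, and the helper `round_half_up` is illustrative only.

```python
import math

# Pass rates per language, in percent, as reported on this page.
pass_rate = {"Zero": 0, "TypeScript": 95, "Rust": 75, "Go": 95, "Python": 85}
TASKS_PER_LANGUAGE = 20

# Convert each percentage back to a task count (95% of 20 tasks = 19, and so on).
passed = {lang: pct * TASKS_PER_LANGUAGE // 100 for lang, pct in pass_rate.items()}

# Overall pass rate across all 5 x 20 attempts.
overall = 100 * sum(passed.values()) / (TASKS_PER_LANGUAGE * len(passed))

# ASSUMPTION: "tax" = mean pass rate of the other four languages minus Zero's.
others = [pct for lang, pct in pass_rate.items() if lang != "Zero"]
tax = sum(others) / len(others) - pass_rate["Zero"]


def round_half_up(x):
    """Round to the nearest integer, with .5 rounding up."""
    return math.floor(x + 0.5)


print(f"passed attempts: {sum(passed.values())}")        # 70
print(f"overall pass rate: {overall:.0f}%")              # 70%
print(f"tax vs. other four: +{round_half_up(tax)}")      # +88 (87.5 before rounding)
```

Under that assumed definition the reported +88 is a gap in percentage points, rounded up from 87.5.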

Per-task results

Every cell is a single attempt. Pass means stdout matched byte-for-byte on every public and hidden test case, stderr was empty, and the exit code was zero. Each failed cell is labelled with its failure class.
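
The pass criterion can be sketched as a small checker. This is a minimal illustration, not the benchmark's actual harness: the function name `attempt_passes`, the `(stdin, expected_stdout)` case format, and the echo program standing in for a solution are all hypothetical.

```python
import subprocess
import sys


def attempt_passes(cmd, cases):
    """Return True only if every case yields byte-identical stdout,
    empty stderr, and exit code zero."""
    for stdin_bytes, expected_stdout in cases:
        proc = subprocess.run(cmd, input=stdin_bytes, capture_output=True)
        if proc.returncode != 0:
            return False
        if proc.stderr != b"":
            return False
        if proc.stdout != expected_stdout:  # bytes comparison, no normalization
            return False
    return True


# Hypothetical stand-in for a solution binary: copy stdin to stdout unchanged.
echo = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]

print(attempt_passes(echo, [(b"hello\n", b"hello\n")]))  # True
print(attempt_passes(echo, [(b"hello\n", b"hello")]))    # False: trailing newline differs
```

Comparing raw bytes rather than decoded, stripped text is what makes the scoring byte-exact: a missing or extra trailing newline is enough to fail.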

Task Zero TypeScript Rust Go Python
000-hello-stdout Hello, stdout compile
001-fibonacci-memoized Fibonacci with memoization compile
002-sieve-prime-count Prime count via Sieve of Eratosthenes compile
003-levenshtein-distance Levenshtein edit distance compile
004-matrix-multiply Square integer matrix multiply compile other
005-balanced-parens Balanced bracket checker compile
006-substring-count Non-overlapping substring count compile
007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) compile other
008-word-reverse Reverse the order of words on a line compile
009-word-count Count whitespace-separated tokens in input compile
010-byte-frequency Per-byte frequency table sorted by byte value compile
011-rle-encode Run-length encode the input as count/byte pairs compile
012-http-status-code GET a URL and write the HTTP status code compile wrong output other
013-http-json-sum POST a JSON pair and extract the sum compile wrong output other
014-http-header-echo GET a URL and echo a named response header compile wrong output other
015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure compile
016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure compile wrong output
017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow compile
018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input compile
019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input compile wrong output

Failure modes

Each failed attempt is classified as compile (parser, type checker, codegen, or build-system error before the program could run), runtime (program ran but crashed or threw), or wrong output (program ran cleanly but emitted the wrong bytes). Failures that fit none of these three classes are counted as other.

Language pass compile runtime wrong output other
Zero 0 20 0 0 0
TypeScript 19 0 0 1 0
Rust 15 0 0 4 1
Go 19 0 0 0 1
Python 17 0 0 0 3
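
The three named failure classes can be expressed as a decision order over one attempt's result. The sketch below is an assumption about how such a classifier might look: the `Attempt` record and its fields are hypothetical, and since this page does not define "other", the code simply uses it for anything that is neither a pass nor one of the three named classes.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    built: bool      # True if the compile/build step succeeded
    exit_code: int   # process exit status (ignored when built is False)
    stdout: bytes    # captured program output
    stderr: bytes    # captured diagnostics
    expected: bytes  # reference stdout for the test case


def classify(a: Attempt) -> str:
    """Map one attempt to pass / compile / runtime / wrong output / other."""
    if not a.built:
        return "compile"       # failed before the program could run
    if a.exit_code != 0:
        return "runtime"       # ran, but crashed or threw
    if a.stdout != a.expected:
        return "wrong output"  # ran cleanly, emitted the wrong bytes
    if a.stderr:
        return "other"         # ASSUMPTION: catch-all, not defined by the source
    return "pass"


print(classify(Attempt(False, 0, b"", b"IMP001", b"hi\n")))  # compile
print(classify(Attempt(True, 1, b"", b"panic", b"hi\n")))    # runtime
print(classify(Attempt(True, 0, b"hi", b"", b"hi\n")))       # wrong output
print(classify(Attempt(True, 0, b"hi\n", b"", b"hi\n")))     # pass
```

The checks run in pipeline order, so an attempt that fails to build is never also counted as a runtime or output failure.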

Zero deep-dive

Every Zero attempt failed. Below is each task with the first line of the captured diagnostic. The pattern across tasks is the signal worth reading: only two error codes occur, IMP001 on 15 tasks and PAR100 on 5.

  1. 000-hello-stdout Hello, stdout compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  2. 001-fibonacci-memoized Fibonacci with memoization compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  3. 002-sieve-prime-count Prime count via Sieve of Eratosthenes compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  4. 003-levenshtein-distance Levenshtein edit distance compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  5. 004-matrix-multiply Square integer matrix multiply compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  6. 005-balanced-parens Balanced bracket checker compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  7. 006-substring-count Non-overlapping substring count compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  8. 007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  9. 008-word-reverse Reverse the order of words on a line compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  10. 009-word-count Count whitespace-separated tokens in input compile
    ref.zero:6:8 PAR100: expected '{' before block
  11. 010-byte-frequency Per-byte frequency table sorted by byte value compile
    ref.zero:6:8 PAR100: expected '{' before block
  12. 011-rle-encode Run-length encode the input as count/byte pairs compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  13. 012-http-status-code GET a URL and write the HTTP status code compile
    ref.zero:1:1 IMP001: unknown package-local import 'std::net'
  14. 013-http-json-sum POST a JSON pair and extract the sum compile
    ref.zero:1:1 IMP001: unknown package-local import 'std::net'
  15. 014-http-header-echo GET a URL and echo a named response header compile
    ref.zero:1:1 IMP001: unknown package-local import 'http'
  16. 015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  17. 016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure compile
    ref.zero:6:8 PAR100: expected '{' before block
  18. 017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  19. 018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input compile
    zero/src/lib.0:8:1 PAR100: unexpected character '`'
  20. 019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input compile
    zero/src/lib.0:28:1 PAR100: unexpected character '`'
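
The recurrence of error codes in the list above can be checked by parsing each diagnostic and tallying. This is a minimal sketch: the diagnostics are copied from the list, while the regular expression is an assumed reading of the `file:line:col CODE: message` layout, not a documented format.

```python
import re
from collections import Counter

IMPORT_STD = "ref.zero:1:1 IMP001: unknown package-local import 'std'"
IMPORT_NET = "ref.zero:1:1 IMP001: unknown package-local import 'std::net'"
BRACE = "ref.zero:6:8 PAR100: expected '{' before block"

# First-line diagnostics in task order (000 through 019), copied from the list.
diagnostics = (
    [IMPORT_STD] * 9                                              # 000-008
    + [BRACE] * 2                                                 # 009, 010
    + [IMPORT_STD]                                                # 011
    + [IMPORT_NET] * 2                                            # 012, 013
    + ["ref.zero:1:1 IMP001: unknown package-local import 'http'"]  # 014
    + [IMPORT_STD]                                                # 015
    + [BRACE]                                                     # 016
    + [IMPORT_STD]                                                # 017
    + ["zero/src/lib.0:8:1 PAR100: unexpected character '`'"]     # 018
    + ["zero/src/lib.0:28:1 PAR100: unexpected character '`'"]    # 019
)

# ASSUMED layout: <file>:<line>:<col> <CODE>: <message>
PATTERN = re.compile(
    r"^(?P<file>\S+):(?P<line>\d+):(?P<col>\d+) (?P<code>[A-Z]+\d+): (?P<msg>.*)$"
)

parsed = [PATTERN.match(d).groupdict() for d in diagnostics]
by_code = Counter(p["code"] for p in parsed)
by_message = Counter((p["code"], p["msg"]) for p in parsed)

print(by_code)  # IMP001: 15, PAR100: 5
for (code, msg), n in by_message.most_common():
    print(f"{n:2d}  {code}  {msg}")
```

Grouping by message as well as code separates the import failures (`'std'`, `'std::net'`, `'http'`) from the two distinct parser errors.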

Cost

Prompt tokens 64,610
Completion tokens 22,803
Total tokens 87,413
Attempts 100 (70 passed)
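
The cost figures above are enough to derive per-attempt averages. The sketch below uses only the four reported numbers; the derived metrics (tokens per attempt, tokens per passing attempt, completion share) are illustrative choices, not figures reported by this page.

```python
prompt_tokens = 64_610
completion_tokens = 22_803
attempts = 100
passed = 70

total_tokens = prompt_tokens + completion_tokens
assert total_tokens == 87_413  # matches the reported total

tokens_per_attempt = total_tokens / attempts
tokens_per_pass = total_tokens / passed  # spreads the cost of failures over passes
completion_share = 100 * completion_tokens / total_tokens

print(f"tokens per attempt:         {tokens_per_attempt:.1f}")
print(f"tokens per passing attempt: {tokens_per_pass:.1f}")
print(f"completion share of total:  {completion_share:.1f}%")
```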

Compare

Other models in this run: gpt-5, gpt-4o-mini. Or go back to the leaderboard and methodology.

Want to re-run this end-to-end? See the per-run reproducibility page: reproduce this run.