agentlang-index · run reproducibility

8286655-gpt-4o-mini

Run of gpt-4o-mini against the 20-task corpus at harness commit 8286655, captured 2026-05-19. Below are the exact invocation, the toolchain pins, and every attempt. Paste the invocation into a shell and the run should reproduce.

Reproduce this run

Three commands and an API key. The repo at 8286655 pins the corpus, the Zero compiler, the per-language scaffold, and the prompt. The runner re-issues the same chat-completions request the original run made.

git clone https://github.com/truffle-dev/agentlang-index
cd agentlang-index
git checkout 8286655
export OPENAI_API_KEY=...
bun run bench/runner.ts --model gpt-4o-mini

Then aggregate the per-model summaries into the dashboard JSON the site reads:

bun run bench/aggregate.ts --site

Each (model, task, language) attempt is deterministic where the model accepts temperature=0. gpt-5 and o-series models use default sampling, so their generated text may vary across re-runs, but every attempt is still scored byte-exact.
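As a rough illustration of the sampling rule above, the sketch below builds a chat-completions request body and pins temperature to 0 only for models that accept it. The helper names (acceptsTemperatureZero, buildRequest) and the prefix test used to spot gpt-5 and o-series models are assumptions made for this example, not the harness's actual code.

```typescript
// Hypothetical request shape; only the fields needed for the illustration.
interface ChatRequest {
  model: string;
  messages: { role: "system" | "user"; content: string }[];
  temperature?: number;
}

// Assumption: gpt-5 and o-series model ids can be recognised by prefix.
function acceptsTemperatureZero(model: string): boolean {
  const usesDefaultSampling = model.startsWith("gpt-5") || /^o\d/.test(model);
  return !usesDefaultSampling;
}

// Build the request, pinning temperature only where the model accepts it.
function buildRequest(model: string, prompt: string): ChatRequest {
  const request: ChatRequest = {
    model,
    messages: [{ role: "user", content: prompt }],
  };
  if (acceptsTemperatureZero(model)) {
    request.temperature = 0;
  }
  return request;
}
```

Under these assumptions, buildRequest("gpt-4o-mini", ...) carries temperature 0, while a gpt-5 request leaves the field unset and falls back to the provider's default sampling.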


Run environment

Harness SHA: 8286655
Zero version: 0.1.2
Corpus size: 20 tasks × 5 languages
Model: gpt-4o-mini
Model provider: OpenAI
Run timestamp: 2026-05-19 00:31:37 UTC
Total wall-clock: 479.6 s
Prompt tokens: 64,610
Completion tokens: 21,038
Total tokens: 85,648
Cost (USD): $0.02
Runner command: bun run bench/runner.ts --model gpt-4o-mini

Cost uses the published OpenAI per-million-token prices at run time: gpt-5 $5/$15 (prompt/completion), gpt-4o $2.50/$10, gpt-4o-mini $0.15/$0.60. If the model is not in the pricing table, the Cost row reads "not priced".
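The cost figure can be checked by hand. The sketch below applies the per-million-token prices quoted above to this run's token counts. The function name formatCost and the two-decimal rounding are assumptions for illustration; the prices and token counts are the ones listed on this page.

```typescript
// USD per million tokens, as quoted above.
type Price = { prompt: number; completion: number };

const PRICES = new Map<string, Price>([
  ["gpt-5", { prompt: 5, completion: 15 }],
  ["gpt-4o", { prompt: 2.5, completion: 10 }],
  ["gpt-4o-mini", { prompt: 0.15, completion: 0.6 }],
]);

// Hypothetical helper: returns a display string for the Cost row.
function formatCost(model: string, promptTokens: number, completionTokens: number): string {
  const price = PRICES.get(model);
  if (price === undefined) return "not priced";
  const usd =
    (promptTokens * price.prompt + completionTokens * price.completion) / 1_000_000;
  return `$${usd.toFixed(2)}`; // assumption: rounded to cents for display
}

// This run: 64,610 prompt tokens and 21,038 completion tokens.
console.log(formatCost("gpt-4o-mini", 64_610, 21_038)); // $0.02
```

The unrounded value is about $0.0223: roughly $0.0097 for prompt tokens plus $0.0126 for completion tokens.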


Per-language pass rate

0% Zero
70% TypeScript
65% Rust
75% Go
70% Python
Pass rate per language. Overall: 56% (56/100). Zero's average tax versus the other four languages is +70%.
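The summary numbers follow from the per-language rates. The sketch below recomputes them from pass counts implied by those rates over 20 tasks (for example, 70% of 20 is 14). The definition of "tax" used here, the mean pass rate of the other languages minus Zero's pass rate, is an assumption that happens to reproduce the +70 figure; the methodology page is the authority on the real definition.

```typescript
interface LanguageResult {
  language: string;
  passed: number;
  attempts: number;
}

// Pass counts implied by the rates above over 20 tasks per language.
const results: LanguageResult[] = [
  { language: "Zero", passed: 0, attempts: 20 },
  { language: "TypeScript", passed: 14, attempts: 20 },
  { language: "Rust", passed: 13, attempts: 20 },
  { language: "Go", passed: 15, attempts: 20 },
  { language: "Python", passed: 14, attempts: 20 },
];

const pct = (r: LanguageResult): number => (100 * r.passed) / r.attempts;

// Overall pass rate across every attempt.
function overallPct(rs: LanguageResult[]): number {
  const passed = rs.reduce((sum, r) => sum + r.passed, 0);
  const attempts = rs.reduce((sum, r) => sum + r.attempts, 0);
  return (100 * passed) / attempts;
}

// Assumed definition: mean pass rate of the other languages minus the baseline's.
function taxPoints(rs: LanguageResult[], baseline: string): number {
  const base = rs.find((r) => r.language === baseline);
  const others = rs.filter((r) => r.language !== baseline);
  if (base === undefined || others.length === 0) {
    throw new Error(`cannot compute tax for ${baseline}`);
  }
  const othersMean = others.reduce((sum, r) => sum + pct(r), 0) / others.length;
  return othersMean - pct(base);
}

console.log(overallPct(results), taxPoints(results, "Zero")); // 56 70
```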

Per-task results

Every cell is one attempt from this run. Pass means stdout matched byte-exact on every public and hidden test case, stderr was empty, and the exit code was zero.

Task | Zero | TypeScript | Rust | Go | Python
000-hello-stdout Hello, stdout | compile | pass | pass | pass | pass
001-fibonacci-memoized Fibonacci with memoization | compile | pass | pass | pass | pass
002-sieve-prime-count Prime count via Sieve of Eratosthenes | compile | pass | other | pass | pass
003-levenshtein-distance Levenshtein edit distance | compile | other | pass | pass | pass
004-matrix-multiply Square integer matrix multiply | compile | pass | pass | other | pass
005-balanced-parens Balanced bracket checker | compile | pass | pass | pass | pass
006-substring-count Non-overlapping substring count | compile | pass | pass | pass | pass
007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) | compile | pass | pass | other | other
008-word-reverse Reverse the order of words on a line | compile | pass | pass | pass | pass
009-word-count Count whitespace-separated tokens in input | compile | pass | pass | wrong output | wrong output
010-byte-frequency Per-byte frequency table sorted by byte value | compile | pass | wrong output | pass | wrong output
011-rle-encode Run-length encode the input as count/byte pairs | compile | pass | pass | pass | pass
012-http-status-code GET a URL and write the HTTP status code | compile | wrong output | wrong output | pass | other
013-http-json-sum POST a JSON pair and extract the sum | compile | wrong output | wrong output | pass | other
014-http-header-echo GET a URL and echo a named response header | compile | runtime | wrong output | pass | other
015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure | compile | other | pass | pass | pass
016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure | compile | pass | wrong output | pass | pass
017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow | compile | pass | pass | pass | pass
018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input | compile | pass | pass | other | pass
019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input | compile | wrong output | other | wrong output | pass
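The pass criterion for a cell (byte-exact stdout on every test case, empty stderr, exit code zero) can be written as a small predicate. This is a minimal sketch under stated assumptions: the CaseResult shape and the function names are hypothetical, and the real harness may capture and compare outputs differently.

```typescript
// Hypothetical record of one test case executed against one attempt.
interface CaseResult {
  stdout: Uint8Array;
  stderr: Uint8Array;
  exitCode: number;
  expectedStdout: Uint8Array;
}

// Byte-exact comparison: same length and same value at every index.
function bytesEqual(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

// One case passes only if all three conditions hold.
function casePasses(c: CaseResult): boolean {
  return (
    c.exitCode === 0 &&
    c.stderr.length === 0 &&
    bytesEqual(c.stdout, c.expectedStdout)
  );
}

// An attempt passes only if every public and hidden case passes.
// Assumption: an attempt with no recorded cases counts as a failure.
function attemptPasses(cases: CaseResult[]): boolean {
  return cases.length > 0 && cases.every(casePasses);
}
```

Comparing raw bytes rather than decoded strings means a missing trailing newline or a stray carriage return counts as wrong output, which is consistent with the byte-exact rule stated above.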

Failure callouts

44 of 100 attempts failed. Each entry is one (task, language) pair, followed by the first captured line of its diagnostic.

  1. 000-hello-stdout Hello, stdout Zero compile
    ref.zero:1:1 PAR100: expected '{' before block
  2. 001-fibonacci-memoized Fibonacci with memoization Zero compile
    ref.zero:4:1 PAR100: expected '{' before block
  3. 002-sieve-prime-count Prime count via Sieve of Eratosthenes Zero compile
    ref.zero:4:1 PAR100: expected '{' before block
  4. 002-sieve-prime-count Prime count via Sieve of Eratosthenes Rust other
    (no diagnostic captured)
  5. 003-levenshtein-distance Levenshtein edit distance Zero compile
    ref.zero:9:1 PAR100: expected '{' before block
  6. 003-levenshtein-distance Levenshtein edit distance TypeScript other
    (no diagnostic captured)
  7. 004-matrix-multiply Square integer matrix multiply Zero compile
    ref.zero:3:10 PAR100: expected '{' before block
  8. 004-matrix-multiply Square integer matrix multiply Go other
    # command-line-arguments
  9. 005-balanced-parens Balanced bracket checker Zero compile
    ref.zero:3:18 PAR100: expected expression
  10. 006-substring-count Non-overlapping substring count Zero compile
    ref.zero:1:1 PAR100: expected '{' before block
  11. 007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) Zero compile
    ref.zero:1:15 PAR100: expected expression
  12. 007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) Go other
    # command-line-arguments
  13. 007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) Python other
    (no diagnostic captured)
  14. 008-word-reverse Reverse the order of words on a line Zero compile
    ref.zero:3:8 PAR100: expected '{' before block
  15. 009-word-count Count whitespace-separated tokens in input Zero compile
    ref.zero:3:17 PAR100: expected '{' before block
  16. 009-word-count Count whitespace-separated tokens in input Go wrong output
    (no diagnostic captured)
  17. 009-word-count Count whitespace-separated tokens in input Python wrong output
    (no diagnostic captured)
  18. 010-byte-frequency Per-byte frequency table sorted by byte value Zero compile
    ref.zero:1:15 PAR100: expected expression
  19. 010-byte-frequency Per-byte frequency table sorted by byte value Rust wrong output
    (no diagnostic captured)
  20. 010-byte-frequency Per-byte frequency table sorted by byte value Python wrong output
    (no diagnostic captured)
  21. 011-rle-encode Run-length encode the input as count/byte pairs Zero compile
    ref.zero:1:15 PAR100: expected expression
  22. 012-http-status-code GET a URL and write the HTTP status code Zero compile
    ref.zero:1:1 IMP001: unknown package-local import 'lib http'
  23. 012-http-status-code GET a URL and write the HTTP status code TypeScript wrong output
    (no diagnostic captured)
  24. 012-http-status-code GET a URL and write the HTTP status code Rust wrong output
    (no diagnostic captured)
  25. 012-http-status-code GET a URL and write the HTTP status code Python other
    Traceback (most recent call last):
  26. 013-http-json-sum POST a JSON pair and extract the sum Zero compile
    ref.zero:1:1 IMP001: unknown package-local import 'lib http'
  27. 013-http-json-sum POST a JSON pair and extract the sum TypeScript wrong output
    (no diagnostic captured)
  28. 013-http-json-sum POST a JSON pair and extract the sum Rust wrong output
    (no diagnostic captured)
  29. 013-http-json-sum POST a JSON pair and extract the sum Python other
    Traceback (most recent call last):
  30. 014-http-header-echo GET a URL and echo a named response header Zero compile
    ref.zero:1:1 IMP001: unknown package-local import 'lib http'
  31. 014-http-header-echo GET a URL and echo a named response header TypeScript runtime
    28 |         input.push(chunk.toString());
  32. 014-http-header-echo GET a URL and echo a named response header Rust wrong output
    (no diagnostic captured)
  33. 014-http-header-echo GET a URL and echo a named response header Python other
    Traceback (most recent call last):
  34. 015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure Zero compile
    ref.zero:3:13 PAR100: expected '{' before block
  35. 015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure TypeScript other
    (no diagnostic captured)
  36. 016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure Zero compile
    ref.zero:3:13 PAR100: expected '{' before block
  37. 016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure Rust wrong output
    (no diagnostic captured)
  38. 017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow Zero compile
    ref.zero:3:13 PAR100: expected '{' before block
  39. 018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input Zero compile
    zero/src/main.0:1:1 IMP001: unknown package-local import '"src/lib.0"'
  40. 018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input Go other
    # command-line-arguments
  41. 019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input Zero compile
    zero/src/lib.0:23:1 PAR100: unexpected character '`'
  42. 019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input TypeScript wrong output
    (no diagnostic captured)
  43. 019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input Rust other
    (no diagnostic captured)
  44. 019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input Go wrong output
    (no diagnostic captured)
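The Zero diagnostics above share a visible shape: file, line, column, a code such as PAR100 or IMP001, then a message. The sketch below takes the first non-empty line of captured output and parses it when it fits that shape. The regular expression is inferred only from the lines shown on this page, and both function names and the placeholder fallback are assumptions about how such a callout could be produced.

```typescript
interface ZeroDiagnostic {
  file: string;
  line: number;
  column: number;
  code: string;
  message: string;
}

// Shape inferred from the callouts: <file>:<line>:<col> <CODE>: <message>
const DIAGNOSTIC = /^(.+?):(\d+):(\d+) ([A-Z]+\d+): (.*)$/;

// First non-empty line of the captured output, or the placeholder seen above.
function firstLine(output: string): string {
  const line = output.split(/\r?\n/).find((l) => l.trim().length > 0);
  return line === undefined ? "(no diagnostic captured)" : line.trim();
}

// Returns null for lines that are not Zero compiler diagnostics,
// such as a Go "# command-line-arguments" header or a Python traceback.
function parseDiagnostic(line: string): ZeroDiagnostic | null {
  const m = DIAGNOSTIC.exec(line);
  if (m === null) return null;
  return {
    file: m[1],
    line: Number(m[2]),
    column: Number(m[3]),
    code: m[4],
    message: m[5],
  };
}

const parsed = parseDiagnostic("ref.zero:3:10 PAR100: expected '{' before block");
console.log(parsed?.code, parsed?.line, parsed?.column); // PAR100 3 10
```

Grouping the parsed codes is one way to see that the Zero failures in this run fall into just two families: parser errors (PAR100) and unresolved imports (IMP001).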

Compare

Model deep-dive: gpt-4o-mini. Other runs at this harness commit: gpt-5, gpt-4o.
