agentlang-index · run reproducibility

8286655-gpt-5

Run of gpt-5 against the 20-task corpus at harness commit 8286655, captured 2026-05-19. Below are the exact invocation, the toolchain pins, and every attempt. Paste the commands into a shell and the run should reproduce.

Reproduce this run

Clone the repo, check out the commit, and run the benchmark with an API key set. The repo at 8286655 pins the corpus, the Zero compiler, the per-language scaffold, and the prompt. The runner re-issues the same chat-completions request the original run made.

git clone https://github.com/truffle-dev/agentlang-index
cd agentlang-index
git checkout 8286655
export OPENAI_API_KEY=...
bun run bench/runner.ts --model gpt-5

Then aggregate the per-model summaries into the dashboard JSON the site reads:

bun run bench/aggregate.ts --site

Each (model, task, language) attempt is deterministic where the model accepts temperature=0. gpt-5 and o-series models use default sampling, so their generated text may vary across re-runs; scoring still uses the same byte-exact comparison.
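
The temperature rule above can be sketched as a small request builder. This is an illustrative sketch, not the harness's code: `buildRequest` and `acceptsZeroTemperature` are hypothetical names, and matching gpt-5 and o-series models by name prefix is an assumption.

```typescript
// Hypothetical sketch: pin temperature to 0 only for models that accept it.
type ChatMessage = { role: "system" | "user"; content: string };

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  temperature?: number; // left out when the model only supports default sampling
}

// Assumption: gpt-5 and o-series models (o1, o3, ...) are recognised by name.
function acceptsZeroTemperature(model: string): boolean {
  return !(model.startsWith("gpt-5") || /^o\d/.test(model));
}

function buildRequest(model: string, prompt: string): ChatRequest {
  const request: ChatRequest = {
    model,
    messages: [{ role: "user", content: prompt }],
  };
  if (acceptsZeroTemperature(model)) {
    request.temperature = 0;
  }
  return request;
}

console.log(buildRequest("gpt-4o", "Write hello world").temperature); // 0
console.log(buildRequest("gpt-5", "Write hello world").temperature); // undefined
```

Leaving the field out, rather than sending a value the model might reject, keeps the request valid for both groups of models.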


Run environment

Harness SHA: 8286655
Zero version: 0.1.2
Corpus size: 20 tasks × 5 languages
Model: gpt-5
Model provider: OpenAI
Run timestamp: 2026-05-19 01:32:49 UTC
Total wall-clock: 3512.3 s
Prompt tokens: 64,510
Completion tokens: 348,714
Total tokens: 413,224
Cost (USD): $5.55
Runner command: bun run bench/runner.ts --model gpt-5

Cost uses the published OpenAI prices at run time, in USD per million tokens and given as prompt/completion: gpt-5 $5/$15, gpt-4o $2.50/$10, gpt-4o-mini $0.15/$0.60. If the model is not in the pricing table, the Cost row reads "not priced".
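
As a check on the arithmetic, the cost figure can be recomputed from the token counts and the prices quoted above. This is a minimal sketch: `PRICES` and `formatCost` are hypothetical names, not identifiers from the harness.

```typescript
// Prices in USD per million tokens, copied from the paragraph above.
const PRICES: Record<string, { prompt: number; completion: number }> = {
  "gpt-5": { prompt: 5, completion: 15 },
  "gpt-4o": { prompt: 2.5, completion: 10 },
  "gpt-4o-mini": { prompt: 0.15, completion: 0.6 },
};

// Hypothetical helper: formatted cost, or "not priced" for unknown models.
function formatCost(model: string, promptTokens: number, completionTokens: number): string {
  const price = PRICES[model];
  if (price === undefined) return "not priced";
  const usd =
    (promptTokens * price.prompt + completionTokens * price.completion) / 1_000_000;
  return `$${usd.toFixed(2)}`;
}

// 64,510 prompt tokens and 348,714 completion tokens from this run.
console.log(formatCost("gpt-5", 64_510, 348_714)); // $5.55
```

The prompt side contributes about $0.32 and the completion side about $5.23, so completion tokens account for nearly all of the cost.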


Per-language pass rate

0% Zero
100% TypeScript
95% Rust
100% Go
100% Python
Pass rate per language. Overall 79% (79/100). Average tax versus the other four languages is +99%.
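
The overall figure follows from the per-language rates, since each language has 20 attempts. The sketch below recomputes it. It also computes the gap between Zero and the mean of the other four languages, which is only one possible reading of the "tax" figure; that interpretation is an assumption, and all names here are hypothetical.

```typescript
const TASKS_PER_LANGUAGE = 20;

// Pass rates in percent, copied from the list above.
const passRatePercent: Record<string, number> = {
  Zero: 0,
  TypeScript: 100,
  Rust: 95,
  Go: 100,
  Python: 100,
};

// Convert each rate back into a count of passing attempts.
const passCounts = Object.values(passRatePercent).map((rate) =>
  Math.round((rate * TASKS_PER_LANGUAGE) / 100),
);
const totalPasses = passCounts.reduce((sum, n) => sum + n, 0);
const totalAttempts = TASKS_PER_LANGUAGE * passCounts.length;

// Assumed reading of "tax": mean pass rate of the other languages minus Zero's.
const otherRates = Object.entries(passRatePercent)
  .filter(([language]) => language !== "Zero")
  .map(([, rate]) => rate);
const meanOther = otherRates.reduce((sum, r) => sum + r, 0) / otherRates.length;
const gapPoints = meanOther - passRatePercent["Zero"];

console.log(`${totalPasses}/${totalAttempts}`); // 79/100
console.log(gapPoints); // 98.75
```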

Per-task results

Every cell is one attempt from this run. Pass means that, on every public and hidden test case, stdout matched byte for byte, stderr was empty, and the exit code was zero. Each task page has its prompt, acceptance criteria, references, and per-model breakdown.
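
The pass rule can be sketched as a predicate over the test cases of one attempt. This is an illustration under stated assumptions, not the harness's scorer: `CaseResult`, `bytesEqual`, and `attemptPasses` are hypothetical names, and treating an attempt with no test cases as a failure is also an assumption.

```typescript
// Hypothetical shape of one executed test case.
interface CaseResult {
  expectedStdout: Uint8Array;
  actualStdout: Uint8Array;
  actualStderr: Uint8Array;
  exitCode: number;
}

// Byte-exact comparison: same length and same value at every index.
function bytesEqual(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

// An attempt passes only if every case exits zero, writes nothing to stderr,
// and produces exactly the expected stdout bytes.
function attemptPasses(cases: CaseResult[]): boolean {
  return (
    cases.length > 0 &&
    cases.every(
      (c) =>
        c.exitCode === 0 &&
        c.actualStderr.length === 0 &&
        bytesEqual(c.expectedStdout, c.actualStdout),
    )
  );
}

const enc = new TextEncoder();
const good: CaseResult = {
  expectedStdout: enc.encode("hello\n"),
  actualStdout: enc.encode("hello\n"),
  actualStderr: new Uint8Array(0),
  exitCode: 0,
};
// A missing trailing newline is enough to fail a byte-exact match.
const missingNewline: CaseResult = { ...good, actualStdout: enc.encode("hello") };

console.log(attemptPasses([good])); // true
console.log(attemptPasses([good, missingNewline])); // false
```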

Task | Title | Zero | TypeScript | Rust | Go | Python
000-hello-stdout | Hello, stdout | compile | pass | pass | pass | pass
001-fibonacci-memoized | Fibonacci with memoization | compile | pass | pass | pass | pass
002-sieve-prime-count | Prime count via Sieve of Eratosthenes | compile | pass | pass | pass | pass
003-levenshtein-distance | Levenshtein edit distance | compile | pass | pass | pass | pass
004-matrix-multiply | Square integer matrix multiply | compile | pass | pass | pass | pass
005-balanced-parens | Balanced bracket checker | compile | pass | pass | pass | pass
006-substring-count | Non-overlapping substring count | compile | pass | pass | pass | pass
007-csv-line-tokenize | CSV line tokenizer (RFC 4180 subset) | wrong output | pass | pass | pass | pass
008-word-reverse | Reverse the order of words on a line | compile | pass | pass | pass | pass
009-word-count | Count whitespace-separated tokens in input | compile | pass | pass | pass | pass
010-byte-frequency | Per-byte frequency table sorted by byte value | compile | pass | pass | pass | pass
011-rle-encode | Run-length encode the input as count/byte pairs | compile | pass | pass | pass | pass
012-http-status-code | GET a URL and write the HTTP status code | compile | pass | pass | pass | pass
013-http-json-sum | POST a JSON pair and extract the sum | wrong output | pass | wrong output | pass | pass
014-http-header-echo | GET a URL and echo a named response header | wrong output | pass | pass | pass | pass
015-checked-divide-u32 | Parse two unsigned integers and write their integer quotient, or error on any failure | compile | pass | pass | pass | pass
016-parse-list-sum | Read a count then that many u32 integers and write their sum, or error on any failure | compile | pass | pass | pass | pass
017-checked-add-overflow | Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow | wrong output | pass | pass | pass | pass
018-caesar-cipher | Shift a lowercase ASCII string by a Caesar offset, or error on bad input | compile | pass | pass | pass | pass
019-run-length-encode | Run-length encode a lowercase ASCII string, or error on bad input | compile | pass | pass | pass | pass

Failure callouts

21 of 100 attempts failed. Each entry is one (task, language) pair, with the captured first line of the diagnostic.

  1. 000-hello-stdout Hello, stdout Zero compile
    ref.zero:1:10 PAR100: expected '{' before block
  2. 001-fibonacci-memoized Fibonacci with memoization Zero compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  3. 002-sieve-prime-count Prime count via Sieve of Eratosthenes Zero compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  4. 003-levenshtein-distance Levenshtein edit distance Zero compile
    ref.zero:3:8 PAR100: expected '{' before block
  5. 004-matrix-multiply Square integer matrix multiply Zero compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  6. 005-balanced-parens Balanced bracket checker Zero compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  7. 006-substring-count Non-overlapping substring count Zero compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  8. 007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) Zero wrong output
    (no diagnostic captured)
  9. 008-word-reverse Reverse the order of words on a line Zero compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  10. 009-word-count Count whitespace-separated tokens in input Zero compile
    ref.zero:3:9 PAR100: expected '{' before block
  11. 010-byte-frequency Per-byte frequency table sorted by byte value Zero compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  12. 011-rle-encode Run-length encode the input as count/byte pairs Zero compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  13. 012-http-status-code GET a URL and write the HTTP status code Zero compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  14. 013-http-json-sum POST a JSON pair and extract the sum Zero wrong output
    (no diagnostic captured)
  15. 013-http-json-sum POST a JSON pair and extract the sum Rust wrong output
    (no diagnostic captured)
  16. 014-http-header-echo GET a URL and echo a named response header Zero wrong output
    (no diagnostic captured)
  17. 015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure Zero compile
    ref.zero:3:9 PAR100: expected '{' before block
  18. 016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure Zero compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  19. 017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow Zero wrong output
    (no diagnostic captured)
  20. 018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input Zero compile
    zero/src/lib.0:10:1 PAR100: unexpected character '`'
  21. 019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input Zero compile
    zero/src/main.0:1:1 IMP001: unknown package-local import 'world'
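
The diagnostic lines above share a `file:line:column CODE: message` shape, so they can be split into fields and tallied by code. The sketch below assumes that shape holds for every captured diagnostic; `parseDiagnostic` and `countByCode` are hypothetical helpers, and the sample input is a small subset of the list, not the full run.

```typescript
interface Diagnostic {
  file: string;
  line: number;
  column: number;
  code: string;
  message: string;
}

// Returns null for lines that do not match, such as "(no diagnostic captured)".
function parseDiagnostic(firstLine: string): Diagnostic | null {
  const m = /^(\S+):(\d+):(\d+) ([A-Z]+\d+): (.*)$/.exec(firstLine.trim());
  if (m === null) return null;
  return {
    file: m[1],
    line: Number(m[2]),
    column: Number(m[3]),
    code: m[4],
    message: m[5],
  };
}

// Tally parsed diagnostics by their code, skipping unparseable lines.
function countByCode(lines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    const d = parseDiagnostic(line);
    if (d === null) continue;
    counts.set(d.code, (counts.get(d.code) ?? 0) + 1);
  }
  return counts;
}

// A few lines copied from the list above.
const sample = [
  "ref.zero:1:10 PAR100: expected '{' before block",
  "ref.zero:1:1 IMP001: unknown package-local import 'std'",
  "zero/src/main.0:1:1 IMP001: unknown package-local import 'world'",
  "(no diagnostic captured)",
];

console.log(parseDiagnostic(sample[0])?.column); // 10
console.log(countByCode(sample).get("IMP001")); // 2
```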

Compare

Model deep-dive: gpt-5. Other runs at this harness commit: gpt-4o, gpt-4o-mini.
