agentlang-index · run reproducibility
8286655-gpt-5
Run of gpt-5 against the 20-task corpus at harness commit 8286655, captured 2026-05-19. Below are the exact invocation, the toolchain pins, and every attempt. Run the commands in a shell to reproduce it.
Reproduce this run
Three commands and an API key. The repo at 8286655 pins the corpus, the Zero compiler, the per-language scaffold, and the prompt. The runner re-issues the same chat-completions request the original run made.
```shell
git clone https://github.com/truffle-dev/agentlang-index
cd agentlang-index
git checkout 8286655
export OPENAI_API_KEY=...
bun run bench/runner.ts --model gpt-5
```
Then aggregate the per-model summaries into the dashboard JSON the site reads:
```shell
bun run bench/aggregate.ts --site
```
Each (model, task, language) attempt is deterministic where the model accepts temperature=0. gpt-5 and the o-series models run with default sampling, so their generated text may vary across re-runs; the byte-exact scoring rule is the same either way.
Run environment
- Harness SHA: 8286655
- Zero version: 0.1.2
- Corpus size: 20 tasks × 5 languages
- Model: gpt-5
- Model provider: OpenAI
- Run timestamp: 2026-05-19 01:32:49 UTC
- Total wall-clock: 3512.3 s
- Prompt tokens: 64,510
- Completion tokens: 348,714
- Total tokens: 413,224
- Cost (USD): $5.55
- Runner command: bun run bench/runner.ts --model gpt-5
Cost uses the published OpenAI prices at run time, in USD per million tokens (prompt/completion): gpt-5 $5/$15, gpt-4o $2.50/$10, gpt-4o-mini $0.15/$0.60. If the model is not in the pricing table, this row reads "not priced".
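The Cost row can be checked by hand from the token counts and the prices quoted above. The sketch below is only an illustration of that arithmetic; the `PRICES` table and the `run_cost` function are hypothetical names, not the harness's own code.

```python
# Prices in USD per million tokens, as quoted above: (prompt, completion).
# PRICES and run_cost are illustrative names, not taken from the harness.
PRICES = {
    "gpt-5": (5.00, 15.00),
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def run_cost(model, prompt_tokens, completion_tokens):
    """Return the cost formatted like the Cost row, or 'not priced'."""
    if model not in PRICES:
        return "not priced"
    prompt_price, completion_price = PRICES[model]
    usd = (prompt_tokens * prompt_price
           + completion_tokens * completion_price) / 1_000_000
    return f"${usd:.2f}"


# Token counts from the Run environment list.
print(run_cost("gpt-5", 64_510, 348_714))  # prints $5.55
```

With these inputs the prompt side contributes about $0.32 and the completion side about $5.23, which is why completion tokens dominate the total.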
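Per-language pass rate
Tallied from the per-task table below: Zero 0/20, TypeScript 20/20, Rust 19/20, Go 20/20, Python 20/20.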
Per-task results
Every cell is one attempt from this run. A pass (✓) means stdout matched byte-exact on every public and hidden test case, stderr was empty, and the exit code was zero. Any other cell names the failure: compile or wrong output. Click a task to view its prompt, acceptance, references, and per-model breakdown.
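The pass rule above can be sketched as a small check. This is a minimal illustration under stated assumptions, not the runner's implementation: the `attempt_passes` name, the `(stdin, expected_stdout)` case shape, and the timeout are all hypothetical.

```python
import subprocess
import sys


def attempt_passes(cmd, cases, timeout_s=10.0):
    """Apply the pass rule to one attempt.

    cmd   -- argv list that runs the compiled or interpreted solution
    cases -- iterable of (stdin_bytes, expected_stdout_bytes) pairs
    The timeout is an assumption; the page does not state one.
    """
    for stdin_bytes, expected in cases:
        try:
            proc = subprocess.run(
                cmd, input=stdin_bytes, capture_output=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            return False
        # All three conditions must hold on every case.
        if proc.returncode != 0 or proc.stderr != b"" or proc.stdout != expected:
            return False
    return True


# Usage: a stand-in "solution" that writes exactly b"hi\n".
demo = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'hi\\n')"]
print(attempt_passes(demo, [(b"", b"hi\n")]))   # byte-exact match
print(attempt_passes(demo, [(b"", b"hi \n")]))  # one extra space fails
```

Comparing raw bytes rather than decoded text is what makes the check byte-exact: a trailing space or a different line ending counts as a mismatch.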
| Task | Zero | TypeScript | Rust | Go | Python |
|---|---|---|---|---|---|
| 000-hello-stdout Hello, stdout | compile | ✓ | ✓ | ✓ | ✓ |
| 001-fibonacci-memoized Fibonacci with memoization | compile | ✓ | ✓ | ✓ | ✓ |
| 002-sieve-prime-count Prime count via Sieve of Eratosthenes | compile | ✓ | ✓ | ✓ | ✓ |
| 003-levenshtein-distance Levenshtein edit distance | compile | ✓ | ✓ | ✓ | ✓ |
| 004-matrix-multiply Square integer matrix multiply | compile | ✓ | ✓ | ✓ | ✓ |
| 005-balanced-parens Balanced bracket checker | compile | ✓ | ✓ | ✓ | ✓ |
| 006-substring-count Non-overlapping substring count | compile | ✓ | ✓ | ✓ | ✓ |
| 007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) | wrong output | ✓ | ✓ | ✓ | ✓ |
| 008-word-reverse Reverse the order of words on a line | compile | ✓ | ✓ | ✓ | ✓ |
| 009-word-count Count whitespace-separated tokens in input | compile | ✓ | ✓ | ✓ | ✓ |
| 010-byte-frequency Per-byte frequency table sorted by byte value | compile | ✓ | ✓ | ✓ | ✓ |
| 011-rle-encode Run-length encode the input as count/byte pairs | compile | ✓ | ✓ | ✓ | ✓ |
| 012-http-status-code GET a URL and write the HTTP status code | compile | ✓ | ✓ | ✓ | ✓ |
| 013-http-json-sum POST a JSON pair and extract the sum | wrong output | ✓ | wrong output | ✓ | ✓ |
| 014-http-header-echo GET a URL and echo a named response header | wrong output | ✓ | ✓ | ✓ | ✓ |
| 015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure | compile | ✓ | ✓ | ✓ | ✓ |
| 016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure | compile | ✓ | ✓ | ✓ | ✓ |
| 017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow | wrong output | ✓ | ✓ | ✓ | ✓ |
| 018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input | compile | ✓ | ✓ | ✓ | ✓ |
| 019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input | compile | ✓ | ✓ | ✓ | ✓ |
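A per-language pass count can be tallied directly from a table in this shape. The sketch below assumes the pipe-delimited layout shown above and uses a three-row excerpt of it; `tally` is a hypothetical helper, not part of the harness.

```python
def tally(table_md):
    """Count passes per language column in a pipe-delimited results table.

    Returns {language: (passed, total)}. A cell counts as a pass only if it
    is exactly the check mark.
    """
    lines = [ln.strip() for ln in table_md.strip().splitlines()]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    langs = header[1:]  # first column is the task
    counts = {lang: [0, 0] for lang in langs}
    for ln in lines[2:]:  # skip the header and separator rows
        cells = [c.strip() for c in ln.strip("|").split("|")]
        for lang, cell in zip(langs, cells[1:]):
            counts[lang][1] += 1
            if cell == "✓":
                counts[lang][0] += 1
    return {lang: tuple(v) for lang, v in counts.items()}


# Three-row excerpt of the table above (tasks 012 to 014).
EXCERPT = """
| Task | Zero | TypeScript | Rust | Go | Python |
|---|---|---|---|---|---|
| 012-http-status-code GET a URL and write the HTTP status code | compile | ✓ | ✓ | ✓ | ✓ |
| 013-http-json-sum POST a JSON pair and extract the sum | wrong output | ✓ | wrong output | ✓ | ✓ |
| 014-http-header-echo GET a URL and echo a named response header | wrong output | ✓ | ✓ | ✓ | ✓ |
"""

for lang, (passed, total) in tally(EXCERPT).items():
    print(f"{lang}: {passed}/{total}")
```

Run against the full 20-row table, the same function would give the per-language totals for this run.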
Failure callouts
21 of 100 attempts failed. Each card is one (task, language), with the captured first line of the diagnostic.
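- 000-hello-stdout Hello, stdout Zero compile
  ref.zero:1:10 PAR100: expected '{' before block
- 001-fibonacci-memoized Fibonacci with memoization Zero compile
  ref.zero:1:1 IMP001: unknown package-local import 'std'
- 002-sieve-prime-count Prime count via Sieve of Eratosthenes Zero compile
  ref.zero:1:1 IMP001: unknown package-local import 'std'
- 003-levenshtein-distance Levenshtein edit distance Zero compile
  ref.zero:3:8 PAR100: expected '{' before block
- 004-matrix-multiply Square integer matrix multiply Zero compile
  ref.zero:1:1 IMP001: unknown package-local import 'std'
- 005-balanced-parens Balanced bracket checker Zero compile
  ref.zero:1:1 IMP001: unknown package-local import 'std'
- 006-substring-count Non-overlapping substring count Zero compile
  ref.zero:1:1 IMP001: unknown package-local import 'std'
- 007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) Zero wrong output
  (no diagnostic captured)
- 008-word-reverse Reverse the order of words on a line Zero compile
  ref.zero:1:1 IMP001: unknown package-local import 'std'
- 009-word-count Count whitespace-separated tokens in input Zero compile
  ref.zero:3:9 PAR100: expected '{' before block
- 010-byte-frequency Per-byte frequency table sorted by byte value Zero compile
  ref.zero:1:1 IMP001: unknown package-local import 'std'
- 011-rle-encode Run-length encode the input as count/byte pairs Zero compile
  ref.zero:1:1 IMP001: unknown package-local import 'std'
- 012-http-status-code GET a URL and write the HTTP status code Zero compile
  ref.zero:1:1 IMP001: unknown package-local import 'std'
- 013-http-json-sum POST a JSON pair and extract the sum Zero wrong output
  (no diagnostic captured)
- 013-http-json-sum POST a JSON pair and extract the sum Rust wrong output
  (no diagnostic captured)
- 014-http-header-echo GET a URL and echo a named response header Zero wrong output
  (no diagnostic captured)
- 015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure Zero compile
  ref.zero:3:9 PAR100: expected '{' before block
- 016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure Zero compile
  ref.zero:1:1 IMP001: unknown package-local import 'std'
- 017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow Zero wrong output
  (no diagnostic captured)
- 018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input Zero compile
  zero/src/lib.0:10:1 PAR100: unexpected character '`'
- 019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input Zero compile
  zero/src/main.0:1:1 IMP001: unknown package-local import 'world'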
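The captured diagnostics above share a `file:line:col CODE: message` shape, so they can be grouped by error code with a short parser. This is a sketch that assumes that shape holds for every diagnostic line; `DIAG_RE` and `parse_diagnostic` are hypothetical names, and the samples are copied from the cards above.

```python
import re
from collections import Counter

# Assumed shape: "<file>:<line>:<col> <CODE>: <message>".
DIAG_RE = re.compile(
    r"^(?P<file>.+?):(?P<line>\d+):(?P<col>\d+) "
    r"(?P<code>[A-Z]+\d+): (?P<message>.*)$"
)


def parse_diagnostic(text):
    """Split one diagnostic line into fields, or return None if it does not match."""
    m = DIAG_RE.match(text.strip())
    if m is None:
        return None
    fields = m.groupdict()
    fields["line"] = int(fields["line"])
    fields["col"] = int(fields["col"])
    return fields


samples = [
    "ref.zero:1:10 PAR100: expected '{' before block",
    "ref.zero:1:1 IMP001: unknown package-local import 'std'",
    "zero/src/main.0:1:1 IMP001: unknown package-local import 'world'",
    "(no diagnostic captured)",
]

# Group by error code; lines without a parseable diagnostic count as "none".
codes = Counter(
    (parse_diagnostic(s) or {"code": "none"})["code"] for s in samples
)
for code, n in sorted(codes.items()):
    print(code, n)
```

Grouping by code makes it easier to see whether the failures come from a handful of recurring causes, such as unresolved imports, or from many unrelated ones.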
Compare
Model deep-dive: gpt-5. Other runs at this harness commit: gpt-4o, gpt-4o-mini.
Browse the leaderboard, the corpus on GitHub, or the methodology page.