agentlang-index · run reproducibility

8286655-gpt-5

Run of gpt-5 against the 20-task corpus at harness commit 8286655, captured 2026-05-19. Below are the exact invocation, the toolchain pins, and every attempt. Paste the invocation into a shell and the run should reproduce.

Reproduce this run

Three steps and an API key: clone the repo, check out 8286655, and run the benchmark. The repo at 8286655 pins the corpus, the Zero compiler, the per-language scaffold, and the prompt. The runner re-issues the same chat-completions request the original run made.

```
git clone https://github.com/truffle-dev/agentlang-index
cd agentlang-index
git checkout 8286655
export OPENAI_API_KEY=...
bun run bench/runner.ts --model gpt-5
```

Then aggregate the per-model summaries into the dashboard JSON the site reads:

```
bun run bench/aggregate.ts --site
```

Each (model, task, language) attempt is deterministic where the model accepts temperature=0. gpt-5 and the o-series models use default sampling, so their generated text may vary across re-runs; the scoring applied to that text is still byte-exact.
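The sampling rule above can be sketched as a small request builder. This is a minimal illustration, not the runner's actual code: `usesDefaultSampling` and `buildRequest` are hypothetical names, and the name-based check for gpt-5 and o-series models is an assumption. Only the `model`, `messages`, and `temperature` fields of a chat-completions request body are used.

```typescript
// Hypothetical sketch; the real runner in bench/runner.ts may differ.
type ChatMessage = { role: "system" | "user"; content: string };

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  temperature?: number;
}

// Assumption: models that run with default sampling are detected by name.
// The source only states that gpt-5 and o-series models use default sampling.
function usesDefaultSampling(model: string): boolean {
  return model.startsWith("gpt-5") || /^o\d/.test(model);
}

function buildRequest(model: string, prompt: string): ChatRequest {
  const request: ChatRequest = {
    model,
    messages: [{ role: "user", content: prompt }],
  };
  if (!usesDefaultSampling(model)) {
    // Pin sampling only where the model accepts an explicit temperature.
    request.temperature = 0;
  }
  return request;
}

console.log(buildRequest("gpt-4o", "Print hello"));
console.log(buildRequest("gpt-5", "Print hello"));
```

Under these assumptions the gpt-4o request carries `temperature: 0` and the gpt-5 request omits the field, which is why only the former can be expected to return identical text on a re-run.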

Run environment

Harness SHA: 8286655
Zero version: 0.1.2
Corpus size: 20 tasks × 5 languages
Model: gpt-5
Model provider: OpenAI
Run timestamp: 2026-05-19 01:32:49 UTC
Total wall-clock: 3512.3 s
Prompt tokens: 64,510
Completion tokens: 348,714
Total tokens: 413,224
Cost (USD): $5.55
Runner command: bun run bench/runner.ts --model gpt-5

Cost is computed from the published OpenAI prices per million tokens at run time (prompt/completion): gpt-5 $5/$15, gpt-4o $2.50/$10, gpt-4o-mini $0.15/$0.60. If the model is not in the pricing table, the Cost row reads "not priced".
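The cost figure can be checked by hand from the token counts in the table. The sketch below assumes a simple linear formula (tokens times price, divided by one million); `PRICES` and `costUsd` are illustrative names, not identifiers from the harness.

```typescript
// Prices in USD per million tokens, copied from the pricing note above.
const PRICES: Record<string, { prompt: number; completion: number }> = {
  "gpt-5": { prompt: 5, completion: 15 },
  "gpt-4o": { prompt: 2.5, completion: 10 },
  "gpt-4o-mini": { prompt: 0.15, completion: 0.6 },
};

// Hypothetical helper: returns a formatted cost, or "not priced" for unknown models.
function costUsd(model: string, promptTokens: number, completionTokens: number): string {
  const price = PRICES[model];
  if (price === undefined) return "not priced";
  const usd =
    (promptTokens * price.prompt + completionTokens * price.completion) / 1_000_000;
  return `$${usd.toFixed(2)}`;
}

// Token counts from the Run environment table.
console.log(costUsd("gpt-5", 64_510, 348_714)); // $5.55
console.log(costUsd("some-other-model", 1, 1)); // not priced
```

For this run the prompt side contributes about $0.32 and the completion side about $5.23, so completion tokens account for almost all of the cost.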

Per-language pass rate

Language	Pass rate
Zero	0%
TypeScript	100%
Rust	95%
Go	100%
Python	100%

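Pass rate per language. Overall 79% (79/100). The average tax for Zero versus the other four languages is +99%, which matches the gap in percentage points between their mean pass rate (98.75%) and Zero's (0%).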
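The summary numbers can be recomputed from the per-language pass counts in the results table. The sketch below assumes one plausible reading of the "tax" figure, namely the mean pass rate of the other four languages minus Zero's pass rate, in percentage points; the page does not spell out the definition, so treat that as an assumption.

```typescript
// Passing attempts per language, counted from the per-task results table.
const PASSES: Record<string, number> = {
  Zero: 0,
  TypeScript: 20,
  Rust: 19,
  Go: 20,
  Python: 20,
};
const TASKS = 20;

// Pass rate in percent for one language.
function passRate(lang: string): number {
  return (PASSES[lang] * 100) / TASKS;
}

const languages = Object.keys(PASSES);
const totalPasses = languages.reduce((sum, lang) => sum + PASSES[lang], 0);
const overall = (totalPasses * 100) / (TASKS * languages.length);

// Assumed definition of "tax": mean of the other languages minus Zero.
const others = languages.filter((lang) => lang !== "Zero");
const meanOthers = others.reduce((sum, lang) => sum + passRate(lang), 0) / others.length;
const tax = meanOthers - passRate("Zero");

console.log(totalPasses, overall); // 79 79
console.log(meanOthers, Math.round(tax)); // 98.75 99
```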

Per-task results

Every cell is one attempt from this run. Pass (✓) means stdout matched byte-exact on every public and hidden test case, stderr was empty, and the exit code was zero. Failed cells show the failure kind instead: compile or wrong output.

Task	Zero	TypeScript	Rust	Go	Python
000-hello-stdout Hello, stdout	compile	✓	✓	✓	✓
001-fibonacci-memoized Fibonacci with memoization	compile	✓	✓	✓	✓
002-sieve-prime-count Prime count via Sieve of Eratosthenes	compile	✓	✓	✓	✓
003-levenshtein-distance Levenshtein edit distance	compile	✓	✓	✓	✓
004-matrix-multiply Square integer matrix multiply	compile	✓	✓	✓	✓
005-balanced-parens Balanced bracket checker	compile	✓	✓	✓	✓
006-substring-count Non-overlapping substring count	compile	✓	✓	✓	✓
007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset)	wrong output	✓	✓	✓	✓
008-word-reverse Reverse the order of words on a line	compile	✓	✓	✓	✓
009-word-count Count whitespace-separated tokens in input	compile	✓	✓	✓	✓
010-byte-frequency Per-byte frequency table sorted by byte value	compile	✓	✓	✓	✓
011-rle-encode Run-length encode the input as count/byte pairs	compile	✓	✓	✓	✓
012-http-status-code GET a URL and write the HTTP status code	compile	✓	✓	✓	✓
013-http-json-sum POST a JSON pair and extract the sum	wrong output	✓	wrong output	✓	✓
014-http-header-echo GET a URL and echo a named response header	wrong output	✓	✓	✓	✓
015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure	compile	✓	✓	✓	✓
016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure	compile	✓	✓	✓	✓
017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow	wrong output	✓	✓	✓	✓
018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input	compile	✓	✓	✓	✓
019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input	compile	✓	✓	✓	✓
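The pass criterion stated above the table (byte-exact stdout, empty stderr, exit code zero, on every test case) can be sketched as follows. The types and function names are hypothetical and only illustrate the rule; the harness's own scorer is not shown on this page.

```typescript
// Hypothetical shapes for one executed test case.
interface RunResult {
  stdout: Uint8Array;
  stderr: Uint8Array;
  exitCode: number;
}

interface ScoredCase {
  expectedStdout: Uint8Array;
  result: RunResult;
}

// Compare two byte arrays exactly; no trimming or newline normalization.
function bytesEqual(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

function casePasses(c: ScoredCase): boolean {
  return (
    c.result.exitCode === 0 &&
    c.result.stderr.length === 0 &&
    bytesEqual(c.expectedStdout, c.result.stdout)
  );
}

// An attempt passes only if every public and hidden case passes.
function attemptPasses(cases: ScoredCase[]): boolean {
  return cases.every(casePasses);
}

const enc = new TextEncoder();
const empty = new Uint8Array(0);
const good: ScoredCase = {
  expectedStdout: enc.encode("Hello\n"),
  result: { stdout: enc.encode("Hello\n"), stderr: empty, exitCode: 0 },
};
const missingNewline: ScoredCase = {
  expectedStdout: enc.encode("Hello\n"),
  result: { stdout: enc.encode("Hello"), stderr: empty, exitCode: 0 },
};

console.log(attemptPasses([good])); // true
console.log(attemptPasses([good, missingNewline])); // false
```

Because the comparison is on raw bytes, a missing trailing newline is enough to turn a cell into wrong output.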

Failure callouts

21 of 100 attempts failed. Each card is one (task, language), with the captured first line of the diagnostic.

000-hello-stdout Hello, stdout Zero compile
```
ref.zero:1:10 PAR100: expected '{' before block
```

001-fibonacci-memoized Fibonacci with memoization Zero compile
```
ref.zero:1:1 IMP001: unknown package-local import 'std'
```
002-sieve-prime-count Prime count via Sieve of Eratosthenes Zero compile
```
ref.zero:1:1 IMP001: unknown package-local import 'std'
```
003-levenshtein-distance Levenshtein edit distance Zero compile
```
ref.zero:3:8 PAR100: expected '{' before block
```
004-matrix-multiply Square integer matrix multiply Zero compile
```
ref.zero:1:1 IMP001: unknown package-local import 'std'
```

005-balanced-parens Balanced bracket checker Zero compile
```
ref.zero:1:1 IMP001: unknown package-local import 'std'
```

006-substring-count Non-overlapping substring count Zero compile
```
ref.zero:1:1 IMP001: unknown package-local import 'std'
```
007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) Zero wrong output
```
(no diagnostic captured)
```
008-word-reverse Reverse the order of words on a line Zero compile
```
ref.zero:1:1 IMP001: unknown package-local import 'std'
```
009-word-count Count whitespace-separated tokens in input Zero compile
```
ref.zero:3:9 PAR100: expected '{' before block
```
010-byte-frequency Per-byte frequency table sorted by byte value Zero compile
```
ref.zero:1:1 IMP001: unknown package-local import 'std'
```
011-rle-encode Run-length encode the input as count/byte pairs Zero compile
```
ref.zero:1:1 IMP001: unknown package-local import 'std'
```
012-http-status-code GET a URL and write the HTTP status code Zero compile
```
ref.zero:1:1 IMP001: unknown package-local import 'std'
```
013-http-json-sum POST a JSON pair and extract the sum Zero wrong output
```
(no diagnostic captured)
```
013-http-json-sum POST a JSON pair and extract the sum Rust wrong output
```
(no diagnostic captured)
```
014-http-header-echo GET a URL and echo a named response header Zero wrong output
```
(no diagnostic captured)
```
015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure Zero compile
```
ref.zero:3:9 PAR100: expected '{' before block
```
016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure Zero compile
```
ref.zero:1:1 IMP001: unknown package-local import 'std'
```
017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow Zero wrong output
```
(no diagnostic captured)
```
018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input Zero compile
```
zero/src/lib.0:10:1 PAR100: unexpected character '`'
```
019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input Zero compile
```
zero/src/main.0:1:1 IMP001: unknown package-local import 'world'
```
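The diagnostics above share a `file:line:column CODE: message` shape, so they can be grouped by error code with a short parser. This is a sketch under that assumption; the regular expression and helper names are illustrative and not part of the harness or the Zero compiler.

```typescript
interface Diagnostic {
  file: string;
  line: number;
  column: number;
  code: string;
  message: string;
}

// Assumed format: <file>:<line>:<column> <CODE>: <message>
const DIAGNOSTIC_RE = /^(.+?):(\d+):(\d+) ([A-Z]+\d+): (.*)$/;

function parseDiagnostic(text: string): Diagnostic | null {
  const m = DIAGNOSTIC_RE.exec(text);
  if (m === null) return null;
  return {
    file: m[1],
    line: Number(m[2]),
    column: Number(m[3]),
    code: m[4],
    message: m[5],
  };
}

// Count diagnostics per code; lines that do not parse are tallied as "none".
function tallyCodes(lines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    const code = parseDiagnostic(line)?.code ?? "none";
    counts.set(code, (counts.get(code) ?? 0) + 1);
  }
  return counts;
}

// A few first lines copied from the callouts above.
const sample = [
  "ref.zero:1:10 PAR100: expected '{' before block",
  "ref.zero:1:1 IMP001: unknown package-local import 'std'",
  "(no diagnostic captured)",
  "zero/src/main.0:1:1 IMP001: unknown package-local import 'world'",
];

console.log(tallyCodes(sample));
```

Counting all 21 callouts by hand gives 11 IMP001, 5 PAR100, and 5 with no diagnostic captured.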

Compare

Model deep-dive: gpt-5. Other runs at this harness commit: gpt-4o, gpt-4o-mini.

Browse the leaderboard, the corpus on GitHub, or the methodology page.

Truffle · run 8286655-gpt-5 · captured 2026-05-19