agentlang-index · run reproducibility

8286655-gpt-4o-mini

Run of gpt-4o-mini against the 20-task corpus at harness commit 8286655, captured 2026-05-19. Below are the exact invocation, the toolchain pins, and every attempt. Paste the commands into a shell and the run should reproduce.

Reproduce this run

Three steps and an API key: clone the repo, check out the commit, and run the benchmark. The repo at 8286655 pins the corpus, the Zero compiler, the per-language scaffold, and the prompt. The runner re-issues the same chat-completions request the original run made.

git clone https://github.com/truffle-dev/agentlang-index
cd agentlang-index
git checkout 8286655
export OPENAI_API_KEY=...
bun run bench/runner.ts --model gpt-4o-mini

Then aggregate the per-model summaries into the dashboard JSON the site reads:

bun run bench/aggregate.ts --site

Each (model, task, language) attempt is deterministic where the model accepts temperature=0. gpt-5 and o-series models use default sampling; their output text may vary across re-runs, but every attempt is still scored by the same byte-exact comparison.
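
The sampling rule above can be sketched as a small request builder. This is a minimal sketch, not the runner's actual code: `buildChatRequest` and `acceptsTemperatureZero` are hypothetical names, and detecting gpt-5 and o-series models by name prefix is an assumption. Only `model`, `messages`, and `temperature` are taken from the chat-completions request shape.

```typescript
// Hypothetical sketch: pin temperature to 0 only where the model accepts it.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  temperature?: number; // left out => the provider's default sampling applies
}

// Assumption: gpt-5 and o-series models can be recognized by name prefix.
function acceptsTemperatureZero(model: string): boolean {
  return !(model.startsWith("gpt-5") || /^o\d/.test(model));
}

function buildChatRequest(model: string, prompt: string): ChatRequest {
  const request: ChatRequest = {
    model,
    messages: [{ role: "user", content: prompt }],
  };
  if (acceptsTemperatureZero(model)) {
    request.temperature = 0;
  }
  return request;
}
```

Under these assumptions, a request for gpt-4o-mini carries `temperature: 0`, while a request for gpt-5 leaves the field out entirely.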

Run environment

Harness SHA: 8286655
Zero version: 0.1.2
Corpus size: 20 tasks × 5 languages
Model: gpt-4o-mini
Model provider: OpenAI
Run timestamp: 2026-05-19 00:31:37 UTC
Total wall-clock: 479.6 s
Prompt tokens: 64,610
Completion tokens: 21,038
Total tokens: 85,648
Cost (USD): $0.02
Runner command: bun run bench/runner.ts --model gpt-4o-mini

Cost is computed from OpenAI's published per-million-token prices at run time, listed as prompt/completion: gpt-5 $5/$15, gpt-4o $2.50/$10, gpt-4o-mini $0.15/$0.60. If the model is not in the pricing table, the Cost row reads "not priced".
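
As a worked check of the Cost row, the arithmetic can be sketched like this. The prices and token counts are the ones stated on this page; `PRICES` and `formatCost` are hypothetical names, and rounding to two decimals is an assumption based on how the row is displayed.

```typescript
// USD per million tokens, copied from the pricing note on this page.
interface Price {
  prompt: number;
  completion: number;
}

const PRICES = new Map<string, Price>([
  ["gpt-5", { prompt: 5, completion: 15 }],
  ["gpt-4o", { prompt: 2.5, completion: 10 }],
  ["gpt-4o-mini", { prompt: 0.15, completion: 0.6 }],
]);

// Returns the cost as displayed in the Cost row, or "not priced" when the
// model is missing from the table.
function formatCost(model: string, promptTokens: number, completionTokens: number): string {
  const price = PRICES.get(model);
  if (price === undefined) {
    return "not priced";
  }
  const usd =
    (promptTokens * price.prompt + completionTokens * price.completion) / 1_000_000;
  return `$${usd.toFixed(2)}`;
}

// This run: 64,610 prompt tokens and 21,038 completion tokens.
const runCost = formatCost("gpt-4o-mini", 64_610, 21_038); // "$0.02"
```

The unrounded figure is about $0.0223: roughly $0.0097 for prompt tokens plus $0.0126 for completion tokens.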

Per-language pass rate

Language	Pass rate
Zero	0%
TypeScript	70%
Rust	65%
Go	75%
Python	70%

Pass rate per language. Overall 56% (56/100). Average tax versus the other four languages is +70 percentage points (they average 70%; Zero passed 0%).
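
The summary figures can be re-derived from the per-language pass counts. The sketch below assumes that "tax" means the mean pass rate of the other four languages minus Zero's pass rate; the page does not define it, so treat that formula and the names `passRate` and `averageTax` as assumptions. The counts are the pass rates above multiplied by 20 tasks.

```typescript
const TASKS_PER_LANGUAGE = 20;

// Passing attempts per language in this run (pass rate x 20 tasks).
const PASSES: Record<string, number> = {
  Zero: 0,
  TypeScript: 14,
  Rust: 13,
  Go: 15,
  Python: 14,
};

// Percentage, computed as (100 * passed) / total to stay in exact integers.
function passRate(passed: number, total: number): number {
  return (100 * passed) / total;
}

// Assumed definition: mean pass rate of every other language minus the
// target language's pass rate, in percentage points.
function averageTax(target: string, passes: Record<string, number>): number {
  const others = Object.keys(passes).filter((lang) => lang !== target);
  const meanOthers =
    others.reduce((sum, lang) => sum + passRate(passes[lang], TASKS_PER_LANGUAGE), 0) /
    others.length;
  return meanOthers - passRate(passes[target], TASKS_PER_LANGUAGE);
}

const totalPasses = Object.values(PASSES).reduce((sum, n) => sum + n, 0);
const overall = passRate(totalPasses, TASKS_PER_LANGUAGE * Object.keys(PASSES).length); // 56
const zeroTax = averageTax("Zero", PASSES); // 70
```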

Per-task results

Every cell is one attempt from this run. Pass means stdout matched byte-for-byte on every public and hidden test case, stderr was empty, and the exit code was zero. A failing cell shows the failure category instead: compile, runtime, wrong output, or other.
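
The pass rule can be sketched as a predicate over per-case results. This is an illustration of the stated criteria, not the harness's code: the `CaseResult` shape and the function names are hypothetical.

```typescript
// Hypothetical record of one test case executed against one attempt.
interface CaseResult {
  expectedStdout: Uint8Array;
  actualStdout: Uint8Array;
  stderr: Uint8Array;
  exitCode: number;
}

// Byte-exact comparison: same length and the same value at every index.
function bytesEqual(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

function casePasses(result: CaseResult): boolean {
  return (
    result.exitCode === 0 &&
    result.stderr.length === 0 &&
    bytesEqual(result.expectedStdout, result.actualStdout)
  );
}

// An attempt passes only if every case passes, public and hidden alike.
// Assumption: an attempt with no cases does not count as a pass.
function attemptPasses(cases: CaseResult[]): boolean {
  return cases.length > 0 && cases.every(casePasses);
}
```

Because the comparison is byte-exact, a missing trailing newline or any stray output on stderr is enough to fail an otherwise correct program.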

Task	Zero	TypeScript	Rust	Go	Python
000-hello-stdout Hello, stdout	compile	✓	✓	✓	✓
001-fibonacci-memoized Fibonacci with memoization	compile	✓	✓	✓	✓
002-sieve-prime-count Prime count via Sieve of Eratosthenes	compile	✓	other	✓	✓
003-levenshtein-distance Levenshtein edit distance	compile	other	✓	✓	✓
004-matrix-multiply Square integer matrix multiply	compile	✓	✓	other	✓
005-balanced-parens Balanced bracket checker	compile	✓	✓	✓	✓
006-substring-count Non-overlapping substring count	compile	✓	✓	✓	✓
007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset)	compile	✓	✓	other	other
008-word-reverse Reverse the order of words on a line	compile	✓	✓	✓	✓
009-word-count Count whitespace-separated tokens in input	compile	✓	✓	wrong output	wrong output
010-byte-frequency Per-byte frequency table sorted by byte value	compile	✓	wrong output	✓	wrong output
011-rle-encode Run-length encode the input as count/byte pairs	compile	✓	✓	✓	✓
012-http-status-code GET a URL and write the HTTP status code	compile	wrong output	wrong output	✓	other
013-http-json-sum POST a JSON pair and extract the sum	compile	wrong output	wrong output	✓	other
014-http-header-echo GET a URL and echo a named response header	compile	runtime	wrong output	✓	other
015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure	compile	other	✓	✓	✓
016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure	compile	✓	wrong output	✓	✓
017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow	compile	✓	✓	✓	✓
018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input	compile	✓	✓	other	✓
019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input	compile	wrong output	other	wrong output	✓

Failure callouts

44 of 100 attempts failed. Each card is one (task, language), with the captured first line of the diagnostic.
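
How a card's text might be derived from raw tool output is sketched below. The harness's actual capture logic is not shown on this page, so `firstDiagnosticLine` is a hypothetical helper, and skipping blank leading lines is an assumption; the placeholder string is the one the cards use.

```typescript
const NO_DIAGNOSTIC = "(no diagnostic captured)";

// Returns the first non-blank line of the captured output, or the
// placeholder when nothing was captured.
function firstDiagnosticLine(output: string): string {
  for (const line of output.split(/\r?\n/)) {
    if (line.trim().length > 0) {
      return line;
    }
  }
  return NO_DIAGNOSTIC;
}
```

Keeping only the first line explains why some cards show just a header, such as `Traceback (most recent call last):` for Python or `# command-line-arguments` for Go, rather than the actual error.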

000-hello-stdout Hello, stdout Zero compile
```
ref.zero:1:1 PAR100: expected '{' before block
```
001-fibonacci-memoized Fibonacci with memoization Zero compile
```
ref.zero:4:1 PAR100: expected '{' before block
```
002-sieve-prime-count Prime count via Sieve of Eratosthenes Zero compile
```
ref.zero:4:1 PAR100: expected '{' before block
```
002-sieve-prime-count Prime count via Sieve of Eratosthenes Rust other
```
(no diagnostic captured)
```
003-levenshtein-distance Levenshtein edit distance Zero compile
```
ref.zero:9:1 PAR100: expected '{' before block
```
003-levenshtein-distance Levenshtein edit distance TypeScript other
```
(no diagnostic captured)
```
004-matrix-multiply Square integer matrix multiply Zero compile
```
ref.zero:3:10 PAR100: expected '{' before block
```
004-matrix-multiply Square integer matrix multiply Go other
```
# command-line-arguments
```
005-balanced-parens Balanced bracket checker Zero compile
```
ref.zero:3:18 PAR100: expected expression
```
006-substring-count Non-overlapping substring count Zero compile
```
ref.zero:1:1 PAR100: expected '{' before block
```
007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) Zero compile
```
ref.zero:1:15 PAR100: expected expression
```
007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) Go other
```
# command-line-arguments
```
007-csv-line-tokenize CSV line tokenizer (RFC 4180 subset) Python other
```
(no diagnostic captured)
```
008-word-reverse Reverse the order of words on a line Zero compile
```
ref.zero:3:8 PAR100: expected '{' before block
```
009-word-count Count whitespace-separated tokens in input Zero compile
```
ref.zero:3:17 PAR100: expected '{' before block
```
009-word-count Count whitespace-separated tokens in input Go wrong output
```
(no diagnostic captured)
```
009-word-count Count whitespace-separated tokens in input Python wrong output
```
(no diagnostic captured)
```
010-byte-frequency Per-byte frequency table sorted by byte value Zero compile
```
ref.zero:1:15 PAR100: expected expression
```
010-byte-frequency Per-byte frequency table sorted by byte value Rust wrong output
```
(no diagnostic captured)
```
010-byte-frequency Per-byte frequency table sorted by byte value Python wrong output
```
(no diagnostic captured)
```
011-rle-encode Run-length encode the input as count/byte pairs Zero compile
```
ref.zero:1:15 PAR100: expected expression
```
012-http-status-code GET a URL and write the HTTP status code Zero compile
```
ref.zero:1:1 IMP001: unknown package-local import 'lib http'
```
012-http-status-code GET a URL and write the HTTP status code TypeScript wrong output
```
(no diagnostic captured)
```
012-http-status-code GET a URL and write the HTTP status code Rust wrong output
```
(no diagnostic captured)
```
012-http-status-code GET a URL and write the HTTP status code Python other
```
Traceback (most recent call last):
```
013-http-json-sum POST a JSON pair and extract the sum Zero compile
```
ref.zero:1:1 IMP001: unknown package-local import 'lib http'
```
013-http-json-sum POST a JSON pair and extract the sum TypeScript wrong output
```
(no diagnostic captured)
```
013-http-json-sum POST a JSON pair and extract the sum Rust wrong output
```
(no diagnostic captured)
```
013-http-json-sum POST a JSON pair and extract the sum Python other
```
Traceback (most recent call last):
```
014-http-header-echo GET a URL and echo a named response header Zero compile
```
ref.zero:1:1 IMP001: unknown package-local import 'lib http'
```
014-http-header-echo GET a URL and echo a named response header TypeScript runtime
```
28 |         input.push(chunk.toString());
```
014-http-header-echo GET a URL and echo a named response header Rust wrong output
```
(no diagnostic captured)
```
014-http-header-echo GET a URL and echo a named response header Python other
```
Traceback (most recent call last):
```
015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure Zero compile
```
ref.zero:3:13 PAR100: expected '{' before block
```
015-checked-divide-u32 Parse two unsigned integers and write their integer quotient, or error on any failure TypeScript other
```
(no diagnostic captured)
```
016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure Zero compile
```
ref.zero:3:13 PAR100: expected '{' before block
```
016-parse-list-sum Read a count then that many u32 integers and write their sum, or error on any failure Rust wrong output
```
(no diagnostic captured)
```
017-checked-add-overflow Parse two unsigned integers and write their sum, or error on parse failure or u32 overflow Zero compile
```
ref.zero:3:13 PAR100: expected '{' before block
```
018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input Zero compile
```
zero/src/main.0:1:1 IMP001: unknown package-local import '"src/lib.0"'
```
018-caesar-cipher Shift a lowercase ASCII string by a Caesar offset, or error on bad input Go other
```
# command-line-arguments
```
019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input Zero compile
```
zero/src/lib.0:23:1 PAR100: unexpected character '`'
```
019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input TypeScript wrong output
```
(no diagnostic captured)
```
019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input Rust other
```
(no diagnostic captured)
```
019-run-length-encode Run-length encode a lowercase ASCII string, or error on bad input Go wrong output
```
(no diagnostic captured)
```
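
The Zero compile diagnostics in the cards above share a `path:line:column CODE: message` shape. The sketch below splits one into fields so failures can be grouped by code. The pattern is inferred from the captured lines on this page, not from Zero's documentation, and `parseZeroDiagnostic` is a hypothetical helper.

```typescript
interface ZeroDiagnostic {
  file: string;
  line: number;
  column: number;
  code: string; // e.g. "PAR100" or "IMP001"
  message: string;
}

// Inferred shape: <path>:<line>:<column> <CODE>: <message>
const DIAGNOSTIC_RE = /^(.+?):(\d+):(\d+) ([A-Z]+\d+): (.*)$/;

// Returns the parsed fields, or null for lines that do not match, such as
// "(no diagnostic captured)" or output from other toolchains.
function parseZeroDiagnostic(text: string): ZeroDiagnostic | null {
  const match = DIAGNOSTIC_RE.exec(text);
  if (match === null) {
    return null;
  }
  return {
    file: match[1],
    line: Number(match[2]),
    column: Number(match[3]),
    code: match[4],
    message: match[5],
  };
}

// Counts how often each diagnostic code appears among the parsable lines.
function countByCode(lines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const text of lines) {
    const parsed = parseZeroDiagnostic(text);
    if (parsed !== null) {
      counts.set(parsed.code, (counts.get(parsed.code) ?? 0) + 1);
    }
  }
  return counts;
}
```

Grouping this way separates parse errors (PAR100) from unresolved imports (IMP001), which are different kinds of failure even though both appear in the table as compile.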

Compare

Model deep-dive: gpt-4o-mini. Other runs at this harness commit: gpt-5, gpt-4o.


Truffle · run 8286655-gpt-4o-mini · captured 2026-05-19