agentlang-index · run-0001

How well do frontier models write Zero?

AgentLang Index measures how well frontier LLMs write programs in Zero, TypeScript, Rust, Go, and Python under identical conditions. Below is the first public run. Real numbers, reproducible.

Language	Average pass rate
Zero	0%
TypeScript	88%
Rust	78%
Go	90%
Python	85%

Average pass rate per language, across 3 models and 20 tasks. Zero is the agent-first language; the bar is empty.

The first public result

On 2026-05-23, 3 frontier models were prompted to write a complete reference implementation for each of 20 corpus tasks across 5 languages. Each generated program was compiled (where applicable) and executed against every public and hidden test case. A task counts as passed for a model+language only when every case agrees byte-exactly on stdout, with empty stderr and exit status 0.

Leaderboard

Model	Zero	TypeScript	Rust	Go	Python	Overall	Lang. tax	Run
gpt-5	0% 0/20	100% 20/20	95% 19/20	100% 20/20	100% 20/20	79% 79/100	+99%	reproduce
gpt-4o	0% 0/20	95% 19/20	75% 15/20	95% 19/20	85% 17/20	70% 70/100	+88%	reproduce
gpt-4o-mini	0% 0/20	70% 14/20	65% 13/20	75% 15/20	70% 14/20	56% 56/100	+70%	reproduce

Pass rate is task-level: every test case must agree byte-exactly for a task to count. Language tax is the average gap between each other language's pass rate and the Zero pass rate, in percentage points. Positive means Zero is harder for this model.
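The leaderboard arithmetic can be checked by hand. The sketch below recomputes the per-language averages and the language tax from the pass counts in the table above. The function names are illustrative and are not taken from the bench/ runner, and the sign follows the leaderboard figures, where a positive tax means the other languages pass more often than Zero.

```python
# Pass counts out of 20 tasks, copied from the leaderboard above.
PASSES = {
    "gpt-5":       {"Zero": 0, "TypeScript": 20, "Rust": 19, "Go": 20, "Python": 20},
    "gpt-4o":      {"Zero": 0, "TypeScript": 19, "Rust": 15, "Go": 19, "Python": 17},
    "gpt-4o-mini": {"Zero": 0, "TypeScript": 14, "Rust": 13, "Go": 15, "Python": 14},
}
TASKS = 20
OTHER_LANGUAGES = ["TypeScript", "Rust", "Go", "Python"]


def pass_rate(model: str, language: str) -> float:
    """Task-level pass rate in percent for one model and language."""
    return 100 * PASSES[model][language] / TASKS


def average_pass_rate(language: str) -> float:
    """Mean pass rate for one language across all models."""
    return sum(pass_rate(m, language) for m in PASSES) / len(PASSES)


def language_tax(model: str) -> float:
    """Mean of (other language pass rate - Zero pass rate), in percentage points."""
    zero = pass_rate(model, "Zero")
    gaps = [pass_rate(model, lang) - zero for lang in OTHER_LANGUAGES]
    return sum(gaps) / len(gaps)


for language in ["Zero"] + OTHER_LANGUAGES:
    print(f"{language}: {average_pass_rate(language):.0f}%")

for model in PASSES:
    print(f"{model}: {language_tax(model):+.0f}%")

average_tax = sum(language_tax(m) for m in PASSES) / len(PASSES)
print(f"average language tax: {average_tax:+.0f}%")
```

For gpt-5 the four gaps are 100, 95, 100, and 100 points, which average to 98.75 and display as +99%.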

The headline

Across this run, the average language tax is +85%. Zero is the agent-first language. If the thesis held, Zero would be the easiest language for these models to write. In this first run it is the hardest, by a wide margin.

One run is one data point, not a verdict. The next run adds repair-loop mode (the agent sees the runtime diagnostic and edits), which is where the agent-first thesis actually predicts the advantage. The first one-shot run is the baseline that lets the loop-vs-one-shot delta mean something.

Per-task results

Each cell shows whether the model passed that task in that language: ✓ for pass, ✗ for fail. Within each model column the four marks are TypeScript, Rust, Go, and Python, in that order. Zero is omitted because no model passed any Zero task.

Task	gpt-5 (TS Rust Go Py)	gpt-4o (TS Rust Go Py)	gpt-4o-mini (TS Rust Go Py)
000-hello-stdout	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓
001-fibonacci-memoized	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓
002-sieve-prime-count	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓	✓ ✗ ✓ ✓
003-levenshtein-distance	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓	✗ ✓ ✓ ✓
004-matrix-multiply	✓ ✓ ✓ ✓	✓ ✓ ✗ ✓	✓ ✓ ✗ ✓
005-balanced-parens	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓
006-substring-count	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓
007-csv-line-tokenize	✓ ✓ ✓ ✓	✓ ✗ ✓ ✓	✓ ✓ ✗ ✗
008-word-reverse	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓
009-word-count	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓	✓ ✓ ✗ ✗
010-byte-frequency	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓	✓ ✗ ✓ ✗
011-rle-encode	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓
012-http-status-code	✓ ✓ ✓ ✓	✓ ✗ ✓ ✗	✗ ✗ ✓ ✗
013-http-json-sum	✓ ✗ ✓ ✓	✓ ✗ ✓ ✗	✗ ✗ ✓ ✗
014-http-header-echo	✓ ✓ ✓ ✓	✓ ✗ ✓ ✗	✗ ✗ ✓ ✗
015-checked-divide-u32	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓	✗ ✓ ✓ ✓
016-parse-list-sum	✓ ✓ ✓ ✓	✓ ✗ ✓ ✓	✓ ✗ ✓ ✓
017-checked-add-overflow	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓
018-caesar-cipher	✓ ✓ ✓ ✓	✓ ✓ ✓ ✓	✓ ✓ ✗ ✓
019-run-length-encode	✓ ✓ ✓ ✓	✗ ✓ ✓ ✓	✗ ✗ ✗ ✓

Methodology

One-shot, deterministic. Each (model, task, language) gets one API call. Temperature is fixed at zero where the model accepts it. No retries, no agent-loop in this run. The model sees the spec, the per-language calling convention, and a "return only the source code" instruction.
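As a rough illustration of the one-shot setup, the sketch below enumerates one attempt per (model, task, language) triple. The request shape, the `build_attempt` helper, the `accepts_temperature` flag, and the placeholder prompt text are all assumptions made for this example. The actual prompts and runner live in the repository under bench/.

```python
from itertools import product

MODELS = ["gpt-5", "gpt-4o", "gpt-4o-mini"]
LANGUAGES = ["Zero", "TypeScript", "Rust", "Go", "Python"]
TASK_IDS = [f"{i:03d}" for i in range(20)]  # 000..019; task slugs omitted here


def build_attempt(model: str, task_id: str, language: str,
                  accepts_temperature: bool = True) -> dict:
    """Hypothetical request record: one call, no retries, no repair loop."""
    attempt = {
        "model": model,
        "task": task_id,
        "language": language,
        # Placeholder prompt: spec, per-language calling convention, instruction.
        "prompt": (
            f"<spec for task {task_id}>\n"
            f"<{language} calling convention>\n"
            "Return only the source code."
        ),
    }
    if accepts_temperature:
        attempt["temperature"] = 0  # fixed at zero where the model accepts it
    return attempt


attempts = [build_attempt(m, t, lang) for m, t, lang in product(MODELS, TASK_IDS, LANGUAGES)]
print(len(attempts))  # 3 models x 20 tasks x 5 languages = 300
```

The count matches the 300 per-attempt records in the packaged dataset.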

Real toolchains. Generated programs are compiled with the same Rust 2021 (cargo --release), Go 1.21, Bun-as-tsx, Python 3, and Zero 0.1.2 binaries the corpus reference implementations use.

Byte-exact. Every test case asserts a specific stdout (down to trailing newlines), empty stderr, and exit status 0. No fuzzy matching, no "if it kind of works."
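A minimal sketch of what a byte-exact check can look like, assuming each test case is a pair of stdin bytes and expected stdout bytes. The function names, the stdin handling, and the timeout are assumptions made for this example, not the runner's actual interface.

```python
import subprocess
import sys


def case_passes(cmd, stdin_bytes, expected_stdout, timeout_s=10.0):
    """True only if stdout matches byte-for-byte, stderr is empty, and exit status is 0."""
    try:
        proc = subprocess.run(cmd, input=stdin_bytes, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Treating a timeout as a failure is an assumption of this sketch.
        return False
    return (
        proc.stdout == expected_stdout
        and proc.stderr == b""
        and proc.returncode == 0
    )


def task_passes(cmd, cases):
    """A task counts as passed only when every case passes."""
    return all(case_passes(cmd, stdin, expected) for stdin, expected in cases)


# Stand-in program: writes exactly b"hello\n" to stdout.
hello = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'hello\\n')"]

print(task_passes(hello, [(b"", b"hello\n")]))  # exact match, including the newline
print(task_passes(hello, [(b"", b"hello")]))    # expected output lacks the trailing newline
```

Comparing raw bytes rather than decoded text is what makes a missing trailing newline count as a failure.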

Reproducible. The runner, prompts, model responses, scratch directories, test cases, and per-attempt result JSON all live in truffle-dev/agentlang-index under bench/. You can replay this run.

Long form. Corpus design, scoring rules, the ten Zero codegen quirks the first run exposed, and how to replay a run end-to-end live on the methodology page. The matching blog post is "Same prompt, five languages, byte-exact."

Repos

agentlang-index: Corpus, references, and the bench/ runner. v1.0 corpus is 20 tasks × 5 languages.
bench/results: Per-model run output. Every prompt, every response, every test case capture.
agentlang-index-data: Open dataset under CC-BY-4.0. Packaged exports of every run: manifest, dashboard rollups, and 300 per-attempt records.
agentlang-spec: Third-party companion (Zero CLI vendored, no-network).

Truffle · run generated 2026-05-23 22:29:44 UTC