agentlang-index · run-0001
How well do frontier models write Zero?
AgentLang Index measures how well frontier LLMs write programs in Zero, TypeScript, Rust, Go, and Python under identical conditions. Below is the first public run. Real numbers, reproducible.
The first public result
In this run, 3 frontier models were each prompted to write a complete reference implementation for each of 20 corpus tasks in each of 5 languages, for 300 attempts in total. Each generated program was compiled (where applicable) and executed against every public and hidden test case. A task counts as passed for a model+language only when every case agrees byte-exactly on stdout, with empty stderr and exit status 0.
Leaderboard
| Model | Zero | TypeScript | Rust | Go | Python | Overall | Lang. tax | Run |
|---|---|---|---|---|---|---|---|---|
| gpt-5 | 0% 0/20 | 100% 20/20 | 95% 19/20 | 100% 20/20 | 100% 20/20 | 79% 79/100 | +99% | reproduce |
| gpt-4o | 0% 0/20 | 95% 19/20 | 75% 15/20 | 95% 19/20 | 85% 17/20 | 70% 70/100 | +88% | reproduce |
| gpt-4o-mini | 0% 0/20 | 70% 14/20 | 65% 13/20 | 75% 15/20 | 70% 14/20 | 56% 56/100 | +70% | reproduce |
Pass rate is task-level: every test case must agree byte-exactly for a task to count. Language tax is the average gap between each other language's pass rate and the Zero pass rate (other language minus Zero). Positive means Zero is harder for this model.
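As a rough check of the Lang. tax column, the sketch below recomputes it from the pass counts in the leaderboard. It assumes the gap is taken as each other language's pass rate minus the Zero pass rate, then averaged over the four other languages. The function and variable names are illustrative and not taken from the bench/ runner.

```python
# Pass counts out of 20 tasks, copied from the leaderboard.
PASSES = {
    "gpt-5": {"Zero": 0, "TypeScript": 20, "Rust": 19, "Go": 20, "Python": 20},
    "gpt-4o": {"Zero": 0, "TypeScript": 19, "Rust": 15, "Go": 19, "Python": 17},
    "gpt-4o-mini": {"Zero": 0, "TypeScript": 14, "Rust": 13, "Go": 15, "Python": 14},
}
TASKS = 20


def language_tax(passes: dict, baseline: str = "Zero") -> float:
    """Average gap, in percentage points, between each other language and the baseline.

    Assumed sign convention: other language minus baseline, so a positive
    value means the baseline language is harder for the model.
    """
    others = [count for lang, count in passes.items() if lang != baseline]
    # Work on integer pass counts first to avoid float noise in the sum.
    total_gap = sum(count - passes[baseline] for count in others)
    return 100 * total_gap / (TASKS * len(others))


taxes = {model: language_tax(counts) for model, counts in PASSES.items()}
average_tax = sum(taxes.values()) / len(taxes)

for model, tax in taxes.items():
    print(f"{model}: {tax:+.2f}")
print(f"average: {average_tax:+.2f}")
```

Under that assumption the per-model values come out to 98.75, 87.5, and 70.0, which match the leaderboard's +99%, +88%, and +70% after rounding, and their mean is about 85.4.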
The headline
Across this run, the average language tax is +85%. Zero is billed as the agent-first language. If that thesis held for one-shot generation, Zero would be the easiest language for these models to write. In this first run it is the hardest by a wide margin: no model passed a single Zero task.
One run is one data point, not a verdict. The next run adds repair-loop mode (the agent sees the runtime diagnostic and edits), which is where the agent-first thesis actually predicts the advantage. The first one-shot run is the baseline that lets the loop-vs-one-shot delta mean something.
Per-task results
Each task is scored in 15 cells: 3 models × 5 languages (Z = Zero, T = TypeScript, R = Rust, G = Go, P = Python). The table gives the number of passing cells per task. The three Zero cells fail on every task, so 12 is the highest count in this run.
| Task | Cells passed (of 15) |
|---|---|
| 000-hello-stdout | 12 |
| 001-fibonacci-memoized | 12 |
| 002-sieve-prime-count | 11 |
| 003-levenshtein-distance | 11 |
| 004-matrix-multiply | 10 |
| 005-balanced-parens | 12 |
| 006-substring-count | 12 |
| 007-csv-line-tokenize | 9 |
| 008-word-reverse | 12 |
| 009-word-count | 10 |
| 010-byte-frequency | 10 |
| 011-rle-encode | 12 |
| 012-http-status-code | 7 |
| 013-http-json-sum | 6 |
| 014-http-header-echo | 7 |
| 015-checked-divide-u32 | 11 |
| 016-parse-list-sum | 10 |
| 017-checked-add-overflow | 12 |
| 018-caesar-cipher | 11 |
| 019-run-length-encode | 8 |
Methodology
One-shot, deterministic. Each (model, task, language) gets one API call. Temperature is fixed at zero where the model accepts it. No retries, no agent-loop in this run. The model sees the spec, the per-language calling convention, and a "return only the source code" instruction.
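The one-shot setup amounts to a fixed grid of attempts. A minimal sketch of that grid follows. The task ids, dictionary fields, and `build_prompt` helper are hypothetical, and the actual API call is left out.

```python
from itertools import product

MODELS = ["gpt-5", "gpt-4o", "gpt-4o-mini"]
LANGUAGES = ["zero", "typescript", "rust", "go", "python"]
# Placeholder ids; the real corpus uses names such as "000-hello-stdout".
TASKS = [f"task-{i:03d}" for i in range(20)]

INSTRUCTION = "Return only the source code."


def build_prompt(spec: str, calling_convention: str) -> str:
    """Hypothetical prompt layout: spec, per-language convention, instruction."""
    return "\n\n".join([spec, calling_convention, INSTRUCTION])


def plan_attempts() -> list[dict]:
    """One attempt per (model, task, language): no retries, no repair loop."""
    return [
        {
            "model": model,
            "task": task,
            "language": language,
            # Sent only where the model accepts a temperature setting.
            "temperature": 0,
            "max_calls": 1,
        }
        for model, task, language in product(MODELS, TASKS, LANGUAGES)
    ]


attempts = plan_attempts()
print(len(attempts))  # 3 models x 20 tasks x 5 languages = 300
```

The grid size lines up with the 300 per-attempt records listed for the packaged dataset.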
Real toolchains. Generated programs are built and run with the same toolchains the corpus reference implementations use: Rust 2021 (cargo --release), Go 1.21, Bun-as-tsx, Python 3, and Zero 0.1.2.
Byte-exact. Every test case asserts a specific stdout (down to trailing newlines), empty stderr, and exit status 0. No fuzzy matching, no "if it kind of works."
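The byte-exact rule can be sketched as a small check around a subprocess. This is an illustration under assumptions, not the bench/ runner: the function names, the idea that each case supplies stdin bytes, and the 10-second timeout are all hypothetical.

```python
import subprocess
import sys


def case_passes(cmd: list[str], stdin_bytes: bytes, expected_stdout: bytes,
                timeout: float = 10.0) -> bool:
    """True only if stdout matches byte for byte, stderr is empty, and exit status is 0."""
    try:
        proc = subprocess.run(
            cmd, input=stdin_bytes, capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # Assumption: a hung program counts as a failed case.
        return False
    return (
        proc.returncode == 0
        and proc.stderr == b""
        and proc.stdout == expected_stdout  # no trimming, no normalization
    )


def task_passes(cmd: list[str], cases: list[tuple[bytes, bytes]]) -> bool:
    """Task-level scoring: every (stdin, expected stdout) case must pass."""
    return all(case_passes(cmd, stdin, expected) for stdin, expected in cases)


# Usage: a stand-in "generated program" that writes hello plus a newline.
hello = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'hello\\n')"]
print(case_passes(hello, b"", b"hello\n"))  # exact match, trailing newline included
print(case_passes(hello, b"", b"hello"))    # missing trailing newline fails
```

Comparing raw bytes rather than decoded text is what makes a stray trailing newline or a warning on stderr count as a failure.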
Reproducible. The runner, prompts, model responses, scratch directories, test cases, and per-attempt result JSON all live in truffle-dev/agentlang-index under bench/. You can replay this run.
Long form. The methodology page covers corpus design, scoring rules, the ten Zero codegen quirks the first run exposed, and how to replay a run end-to-end. The matching blog post is "Same prompt, five languages, byte-exact."
Repos
- agentlang-index: Corpus, references, and the bench/ runner. v1.0 corpus is 20 tasks × 5 languages.
- bench/results: Per-model run output. Every prompt, every response, every test case capture.
- agentlang-index-data: Open dataset under CC-BY-4.0. Packaged exports of every run: manifest, dashboard rollups, and 300 per-attempt records.
- agentlang-spec: Third-party companion. Zero CLI vendored, no-network.