agentlang-index · run-0001

How well do frontier models write Zero?

AgentLang Index measures how well frontier LLMs write programs in Zero, TypeScript, Rust, Go, and Python under identical conditions. Below is the first public run. Real numbers, reproducible.

0% Zero
88% TypeScript
78% Rust
90% Go
85% Python
Average pass rate per language, across 3 models and 20 tasks. Zero is the agent-first language; the bar is empty.

The first public result

In this run, 3 frontier models were prompted to write a complete reference implementation for each of 20 corpus tasks across 5 languages. Each generated program was compiled (where applicable) and executed against every public and hidden test case. A task counts as passed for a given model and language only when every case matches byte-exactly on stdout, with empty stderr and exit status 0.
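The pass criterion above can be sketched as a small check. This is a minimal illustration, not the project's runner: the `Case` type, the function names, and the 10-second timeout are assumptions introduced here for the example.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Case:
    """One test case (hypothetical shape): bytes fed to stdin, bytes expected on stdout."""
    stdin: bytes
    expected_stdout: bytes


def case_passes(cmd, case, timeout=10.0):
    """True only if stdout matches byte-for-byte, stderr is empty, and exit status is 0.

    The timeout is an assumption of this sketch; a timed-out program counts as a failure.
    """
    try:
        proc = subprocess.run(cmd, input=case.stdin, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return (
        proc.returncode == 0
        and proc.stderr == b""
        and proc.stdout == case.expected_stdout
    )


def task_passes(cmd, cases):
    """Task-level pass: every case must pass; a single mismatch fails the task."""
    return all(case_passes(cmd, case) for case in cases)


if __name__ == "__main__":
    # A stand-in "generated program": a Python one-liner run as a subprocess.
    cmd = [sys.executable, "-c", "print('hello')"]
    print(task_passes(cmd, [Case(stdin=b"", expected_stdout=b"hello\n")]))
```

Comparing raw bytes rather than decoded text is what makes a missing trailing newline count as a failure.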


Leaderboard

Model | Zero | TypeScript | Rust | Go | Python | Overall | Lang. tax
gpt-5 | 0% (0/20) | 100% (20/20) | 95% (19/20) | 100% (20/20) | 100% (20/20) | 79% (79/100) | +99%
gpt-4o | 0% (0/20) | 95% (19/20) | 75% (15/20) | 95% (19/20) | 85% (17/20) | 70% (70/100) | +88%
gpt-4o-mini | 0% (0/20) | 70% (14/20) | 65% (13/20) | 75% (15/20) | 70% (14/20) | 56% (56/100) | +70%

Pass rate is task-level: every test case must agree byte-exactly for a task to count. Language tax is the average, over the four other languages, of that language's pass rate minus the Zero pass rate, in percentage points. Positive means Zero is harder for this model.
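The language tax column can be recomputed from the pass counts on the leaderboard. The sketch below assumes the gap is taken as the other language's pass rate minus Zero's pass rate, which is the sign that matches the positive values shown; the function and variable names are illustrative only.

```python
OTHER_LANGUAGES = ("TypeScript", "Rust", "Go", "Python")
TASKS_PER_LANGUAGE = 20

# Tasks passed per model and language, copied from the leaderboard.
PASSED = {
    "gpt-5": {"Zero": 0, "TypeScript": 20, "Rust": 19, "Go": 20, "Python": 20},
    "gpt-4o": {"Zero": 0, "TypeScript": 19, "Rust": 15, "Go": 19, "Python": 17},
    "gpt-4o-mini": {"Zero": 0, "TypeScript": 14, "Rust": 13, "Go": 15, "Python": 14},
}


def language_tax(passed):
    """Mean gap (other language minus Zero) in percentage points."""
    gap_in_tasks = sum(passed[lang] - passed["Zero"] for lang in OTHER_LANGUAGES)
    return 100 * gap_in_tasks / (len(OTHER_LANGUAGES) * TASKS_PER_LANGUAGE)


def overall_pass_rate(passed):
    """Share of all (task, language) pairs the model passed, in percent."""
    return 100 * sum(passed.values()) / (len(passed) * TASKS_PER_LANGUAGE)


taxes = {model: language_tax(counts) for model, counts in PASSED.items()}
average_tax = sum(taxes.values()) / len(taxes)

for model, tax in taxes.items():
    print(f"{model}: overall {overall_pass_rate(PASSED[model]):.0f}%, tax {tax:+.2f}")
print(f"average tax: {average_tax:+.2f}")
```

For gpt-5 this gives 98.75, which the leaderboard rounds to +99%; the three-model average comes to about 85.4.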


The headline

Across this run, the average language tax is +85%: averaged over the three models, the other four languages pass about 85 percentage points more tasks than Zero. Zero is the agent-first language. If the thesis held, Zero would be the easiest language for these models to write. In this first run it is the hardest by a wide margin.

One run is one data point, not a verdict. The next run adds repair-loop mode (the agent sees the runtime diagnostic and edits), which is where the agent-first thesis actually predicts the advantage. The first one-shot run is the baseline that lets the loop-vs-one-shot delta mean something.


Per-task results

Each cell is whether the model passed that task in that language. Green for pass, dim for fail.

Task | gpt-5 (Z T R G P) | gpt-4o (Z T R G P) | gpt-4o-mini (Z T R G P)
Columns: Z = Zero, T = TypeScript, R = Rust, G = Go, P = Python.
000-hello-stdout
001-fibonacci-memoized
002-sieve-prime-count
003-levenshtein-distance
004-matrix-multiply
005-balanced-parens
006-substring-count
007-csv-line-tokenize
008-word-reverse
009-word-count
010-byte-frequency
011-rle-encode
012-http-status-code
013-http-json-sum
014-http-header-echo
015-checked-divide-u32
016-parse-list-sum
017-checked-add-overflow
018-caesar-cipher
019-run-length-encode

Methodology

One-shot, deterministic. Each (model, task, language) gets one API call. Temperature is fixed at zero where the model accepts it. No retries, no agent-loop in this run. The model sees the spec, the per-language calling convention, and a "return only the source code" instruction.
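As a rough illustration of the one-shot setup, the attempt matrix can be enumerated directly: 3 models × 20 tasks × 5 languages gives 300 calls. The request shape below is hypothetical; the field names, the `accepts_temperature` flag, and the prompt layout are assumptions for this sketch, not the runner's actual format.

```python
from itertools import product

MODELS = ("gpt-5", "gpt-4o", "gpt-4o-mini")
LANGUAGES = ("Zero", "TypeScript", "Rust", "Go", "Python")
TASKS = (
    "000-hello-stdout", "001-fibonacci-memoized", "002-sieve-prime-count",
    "003-levenshtein-distance", "004-matrix-multiply", "005-balanced-parens",
    "006-substring-count", "007-csv-line-tokenize", "008-word-reverse",
    "009-word-count", "010-byte-frequency", "011-rle-encode",
    "012-http-status-code", "013-http-json-sum", "014-http-header-echo",
    "015-checked-divide-u32", "016-parse-list-sum", "017-checked-add-overflow",
    "018-caesar-cipher", "019-run-length-encode",
)


def build_request(model, task, language, spec, convention, accepts_temperature=True):
    """Build one request payload (hypothetical shape) for a single attempt.

    Temperature is pinned to 0 only when the model accepts the parameter.
    """
    prompt = "\n\n".join([spec, convention, "Return only the source code."])
    request = {"model": model, "task": task, "language": language, "prompt": prompt}
    if accepts_temperature:
        request["temperature"] = 0
    return request


# One attempt per (model, task, language); no retries, no agent loop.
attempts = list(product(MODELS, TASKS, LANGUAGES))
print(len(attempts))
```

Keeping the matrix as a flat list of triples makes it straightforward to write one result record per attempt.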

Real toolchains. Generated programs are built and run with the same toolchains the corpus reference implementations use: Rust 2021 (cargo --release), Go 1.21, Bun-as-tsx, Python 3, and Zero 0.1.2.

Byte-exact. Every test case asserts a specific stdout (down to trailing newlines), empty stderr, and exit status 0. No fuzzy matching, no "if it kind of works."

Reproducible. The runner, prompts, model responses, scratch directories, test cases, and per-attempt result JSON all live in truffle-dev/agentlang-index under bench/. You can replay this run.

Long form. Corpus design, scoring rules, the ten Zero codegen quirks the first run exposed, and how to replay a run end-to-end live on the methodology page. The matching blog post is "Same prompt, five languages, byte-exact."


Repos

  • agentlang-index: Corpus, references, and the bench/ runner. v1.0 corpus is 20 tasks × 5 languages.
  • bench/results: Per-model run output. Every prompt, every response, every test case capture.
  • agentlang-index-data: Open dataset under CC-BY-4.0. Packaged exports of every run: manifest, dashboard rollups, and 300 per-attempt records.
  • agentlang-spec: Third-party companion: Zero CLI vendored, no-network.