agentlang-index · methodology

What the benchmark actually measures.

AgentLang Index has one job: produce a comparable number for how often a given model writes a working program in Zero versus how often it writes a working program in TypeScript, Rust, Go, or Python under the same conditions. This page is the long form of that.

The thesis under test

Zero is Vercel Labs' agent-first programming language. The claim that motivates Zero is that frontier language models will write Zero more reliably than they write established languages, because Zero's grammar, error model, and standard library are designed for the way a model emits code rather than the way a human types code.

The thesis is testable. Pick a fixed set of tasks. Pick a fixed set of models. Ask each model to write each task in each of several languages under identical conditions. Score every output against byte-exact acceptance tests. Report the gaps. If Zero is easier for models, Zero's column should be the highest. If not, the numbers say so directly.

Corpus design

The v1.0 corpus is twenty tasks. Each task is a small, self-contained program with a written spec, a reference implementation in each of the five benchmarked languages, and a public test suite that asserts byte-exact stdout, empty stderr, and exit status 0. The reference implementations exist so the corpus tests itself before any model touches it: if my own reference does not pass the suite under the pinned toolchain, the task is not shippable.

Tasks cover six shapes, each chosen to exercise a different axis of language-and-stdlib usage:

I/O smoke. Read stdin or print to stdout in the most direct form the language allows. Catches calling-convention mistakes (000-hello-stdout, 009-word-count, 010-byte-frequency).
Algorithmic computation. Recursion, memoization, dynamic programming, fixed-size matrix math, prime sieve. The model has to translate a clear specification into integer arithmetic and control flow (001-fibonacci-memoized, 002-sieve-prime-count, 003-levenshtein-distance, 004-matrix-multiply).
Parsing and strings. Stack-based balanced-paren validation, substring counting, CSV state machine, word tokenization. Exercises slice operations, byte-vs-char distinctions, and whitespace handling (005-balanced-parens, 006-substring-count, 007-csv-line-tokenize, 008-word-reverse).
Encoding and transformation. Run-length encoding, byte frequency tabulation. Multi-record output forces correct loop boundaries and final-newline discipline (011-rle-encode, 019-run-length-encode).
Network and stdlib breadth. HTTP GET status code, HTTP JSON sum, HTTP header echo. The model has to find the language's HTTP client and JSON parser, then drive them correctly. This is where standard libraries differ most (012-http-status-code, 013-http-json-sum, 014-http-header-echo).
Error handling and module boundaries. Checked u32 division, list-sum with parse failure, u32 overflow detection, Caesar cipher and RLE split across a driver plus a library file. Exercises typed errors, Maybe/Option/Result shapes, and inter-module function signatures (015-checked-divide-u32, 016-parse-list-sum, 017-checked-add-overflow, 018-caesar-cipher).

Every task spec is a single JSON file with the slug, the natural-language prompt, the input schema, the output schema, byte-exact acceptance criteria, a token budget, and the languages it targets. The prompt the model sees is exactly the prompt field, prefixed by a language-specific calling convention block and suffixed by a "return only the source code" instruction. Nothing else.
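
The prompt assembly described above can be sketched roughly as follows. This is a minimal illustration, not the harness's code: only the prompt field is named in the text, so the other key names, the calling-convention wording, and the `build_prompt` helper are assumptions.

```python
# Hypothetical calling-convention blocks; the real text lives in the harness.
CALLING_CONVENTIONS = {
    "python": "Write a single-file Python 3.12 program. Read stdin, write stdout.",
    "go": "Write a single-file Go program in package main. Read stdin, write stdout.",
}

# Suffix quoted from the methodology text.
SUFFIX = "Return only the source code."

# Hypothetical spec; key names other than "prompt" are illustrative.
spec = {
    "slug": "000-hello-stdout",
    "prompt": "Print the line 'hello' followed by a newline.",
    "token_budget": 512,
    "languages": ["zero", "ts", "rust", "go", "python"],
}


def build_prompt(spec: dict, language: str) -> str:
    """Calling-convention block, then the spec's prompt field, then the suffix."""
    if language not in spec["languages"]:
        raise ValueError(f"{spec['slug']} does not target {language}")
    return "\n\n".join([CALLING_CONVENTIONS[language], spec["prompt"], SUFFIX])


print(build_prompt(spec, "python"))
```

Keeping the assembly to three fixed parts is what makes the conditions identical across languages: only the calling-convention block varies.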

Two run modes

One-shot. One API call per (model, task, language) triple. Temperature is fixed at zero where the model accepts it. No retries. No iteration. The model sees the prompt, returns code, the harness compiles and runs it. This is the baseline that lets every later improvement measure itself against something concrete.

Agent-loop. Up to five repair attempts. After the first attempt, the harness feeds the model structured diagnostics (compiler errors with file and line, runtime traces, failing test names with expected-vs-actual diffs) and the model edits. Identical to how a developer would use the model in an IDE. This is the mode where the agent-first thesis actually predicts a Zero advantage, because Zero's errors are designed to be machine-edited, not human-read.
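
A rough sketch of the repair loop, assuming "up to five repair attempts" means one initial attempt followed by at most five repairs. The `generate` and `evaluate` callables, the `RunResult` type, and the fake model used in the demo are hypothetical stand-ins, not the harness's real interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

MAX_REPAIRS = 5  # from the text; treated here as repairs after the first attempt


@dataclass
class RunResult:
    passed: bool
    diagnostics: str  # compiler errors, runtime traces, expected-vs-actual diffs


def agent_loop(
    prompt: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    evaluate: Callable[[str], RunResult],
    max_repairs: int = MAX_REPAIRS,
) -> Tuple[bool, int]:
    """Return (passed, attempts_used), stopping at the first passing attempt."""
    source = generate(prompt, None, None)  # first attempt sees only the prompt
    attempts = 1
    result = evaluate(source)
    while not result.passed and attempts <= max_repairs:
        # Feed the previous source and its diagnostics back to the model.
        source = generate(prompt, source, result.diagnostics)
        attempts += 1
        result = evaluate(source)
    return result.passed, attempts


# Demo with a fake model whose third version is the first one that passes.
def fake_generate(prompt, previous, diagnostics):
    return "v1" if previous is None else "v" + str(int(previous[1:]) + 1)


def fake_evaluate(source):
    ok = source == "v3"
    return RunResult(ok, "" if ok else f"{source}: stdout differs from expected")


print(agent_loop("task prompt", fake_generate, fake_evaluate))
```

One-shot mode is the same loop with `max_repairs=0`, which is one way to keep the two columns directly comparable.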

The first public run, dated 2026-05-19, is one-shot only. The agent-loop runner shipped on 2026-05-20 and will appear in subsequent runs alongside the one-shot column, so the loop-vs-one-shot delta is visible per model per language.

Scoring

A task counts as passed for a model+language only when every public test case and every hidden test case produces:

The exact stdout bytes specified by the spec (including trailing newlines).
Empty stderr.
Exit status 0.
Wall time under the per-task ceiling (typically 5 seconds, longer for the network tasks).

There is no fuzzy matching, no "off by one trailing newline is fine," no "the JSON parsed but the keys are in a different order." If the output is not byte-identical, the case fails, and any failing case fails the task. This shape is deliberately strict because the question the benchmark answers is not "does the model approximately understand the task" but "does the model produce a program that runs."
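
The four conditions can be expressed as a small check. This is a sketch under assumptions: the function names and the use of `subprocess.run` with a timeout as the wall-time ceiling are illustrative, and the real harness may measure time and invoke binaries differently.

```python
import subprocess
import sys


def case_passes(cmd, stdin_bytes, expected_stdout, ceiling_s=5.0):
    """True only if stdout is byte-identical, stderr is empty, exit status is 0,
    and the process finishes under the wall-time ceiling."""
    try:
        proc = subprocess.run(
            cmd, input=stdin_bytes, capture_output=True, timeout=ceiling_s
        )
    except subprocess.TimeoutExpired:
        return False
    return (
        proc.stdout == expected_stdout  # bytes comparison, no normalization
        and proc.stderr == b""
        and proc.returncode == 0
    )


def task_passes(cmd, cases, ceiling_s=5.0):
    """Any failing case fails the task."""
    return all(case_passes(cmd, i, o, ceiling_s) for i, o in cases)


# Demo: a stand-in "generated program" that upper-cases stdin.
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())",
]
print(case_passes(demo_cmd, b"hi\n", b"HI\n"))  # exact match
print(case_passes(demo_cmd, b"hi\n", b"HI"))    # missing trailing newline
```

Comparing raw bytes rather than decoded text is what rules out the "off by one trailing newline" class of near-misses.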

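Language tax. The headline metric. For a given model, it is the average of (Zero pass-rate minus per-language pass-rate) across TypeScript, Rust, Go, and Python. A negative tax means Zero is harder for that model. A positive tax means Zero is easier. The first run reports tax in the range of -70 to -95 percentage points per model; because every Zero task failed in that run, each model's tax is simply the negative of its average pass rate across the four baseline languages.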
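
As a worked sketch of that definition, the pass rates below are invented for illustration and are not results from any run; the `language_tax` function name and the dictionary keys are likewise assumptions.

```python
BASELINES = ("typescript", "rust", "go", "python")


def language_tax(pass_rates: dict) -> float:
    """Average of (Zero pass rate minus each baseline language's pass rate)."""
    zero = pass_rates["zero"]
    return sum(zero - pass_rates[lang] for lang in BASELINES) / len(BASELINES)


# Made-up pass rates for one hypothetical model (fractions of tasks passed).
example = {"zero": 0.0, "typescript": 0.90, "rust": 0.75, "go": 0.85, "python": 0.90}

tax = language_tax(example)
print(f"language tax: {tax * 100:.1f} percentage points")
```

With these invented inputs the four gaps are -0.90, -0.75, -0.85, and -0.90, so the tax is -3.40 / 4 = -0.85, or -85 percentage points.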

Toolchains

Generated programs are compiled and executed with the same binaries the corpus reference implementations use. No special handling per model, no permissive lints, no automatic formatting passes.

Zero 0.1.2 from agentlang-spec. Direct ELF64 backend.
TypeScript via bun 1.3.x, with the corpus's pinned tsconfig.json.
Rust 2021 edition via cargo --release, rust-toolchain pinned per corpus.
Go 1.21 via go build, module per task.
Python 3.12, no third-party packages beyond the standard library and a tiny HTTP helper where the task requires it.

Model providers

Two runner paths exist in the harness, one per provider family. Both write the same per-attempt record into the same SQLite schema, so the dashboard treats them identically.

OpenAI models. The TypeScript runner at bench/runner.ts calls the Chat Completions API directly. Temperature zero, no streaming, no tools. Requires OPENAI_API_KEY.
Anthropic models. The Python harness at harness/src/agentlang_harness/ shells out to the local claude CLI in --print --output-format json mode. No ANTHROPIC_API_KEY is read or required; the CLI carries its own authentication. Switched from the Anthropic Python SDK on 2026-06-07 so the harness composes with the agent SDK that operators already have installed.

Per-attempt token accounting (prompt, completion, cache read, cache write) is recorded for both paths, so cost reporting stays uniform across providers.
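
One way a shared per-attempt record could look, sketched with Python's built-in sqlite3 module. The table name, column names, and the second model name are assumptions for illustration; the harness's actual schema is not shown in this page.

```python
import sqlite3

# Hypothetical schema: one row per attempt, same columns for both providers.
SCHEMA = """
CREATE TABLE IF NOT EXISTS attempts (
    id                 INTEGER PRIMARY KEY,
    provider           TEXT    NOT NULL,
    model              TEXT    NOT NULL,
    task               TEXT    NOT NULL,
    language           TEXT    NOT NULL,
    attempt            INTEGER NOT NULL,
    passed             INTEGER NOT NULL,
    prompt_tokens      INTEGER NOT NULL,
    completion_tokens  INTEGER NOT NULL,
    cache_read_tokens  INTEGER NOT NULL DEFAULT 0,
    cache_write_tokens INTEGER NOT NULL DEFAULT 0
)
"""


def record_attempt(conn: sqlite3.Connection, **fields) -> None:
    """Insert one attempt row; keys must match column names in SCHEMA."""
    columns = ", ".join(fields)
    placeholders = ", ".join(":" + key for key in fields)
    conn.execute(f"INSERT INTO attempts ({columns}) VALUES ({placeholders})", fields)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Both runner paths write the same shape, so downstream queries stay uniform.
record_attempt(conn, provider="openai", model="gpt-4o-mini", task="000-hello-stdout",
               language="python", attempt=1, passed=1,
               prompt_tokens=120, completion_tokens=40)
record_attempt(conn, provider="anthropic", model="example-model", task="000-hello-stdout",
               language="python", attempt=1, passed=1,
               prompt_tokens=130, completion_tokens=35, cache_read_tokens=90)

totals = conn.execute(
    "SELECT provider, SUM(prompt_tokens + completion_tokens) "
    "FROM attempts GROUP BY provider ORDER BY provider"
).fetchall()
print(totals)
```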

What the first run exposed about Zero

The first public run is also the first systematic stress test of Zero 0.1.2 in the hands of code-generating models. Every Zero task failed across every model. The failures cluster around ten specific compiler or codegen quirks that the corpus surfaced. None of these are model failures of comprehension; the corresponding TypeScript, Rust, Go, and Python implementations are all correct. They are language-and-toolchain rough edges that an agent-first language has to smooth before it can deliver on the agent-first promise.

SIGFPE on inline (u64 / const) as u32 narrowing. A divisor that is a power of two on the right-hand side of u64 / divisor as u32 traps with SIGFPE (exit 136) in the direct ELF64 backend. Surfaced by the checked-divide tasks.
i32 versus usize requires explicit _usize suffix. A bare integer literal in an index position is inferred as i32, not usize, and the resulting type mismatch raises TYP002 at compile time. Surfaced by every loop that indexes a span.
bool is not Bool. The keyword is capitalized; the lowercase form raises TYP002 with no suggestion. Surfaced by every condition the models wrote.
&& does not short-circuit. Both sides of a logical-and are evaluated, so a bounds check followed by an array access traps if the array is empty. The fix is to nest two if blocks; the natural single-expression form is wrong.
No ! prefix operator. Negation is == false, written out. Models almost always reach for ! first.
Trailing write byte-count leaks to exit code. The number of bytes returned from std.io.write propagates as the process exit status unless the program ends with an explicit return 0. A program that prints correctly fails the exit-status check.
Span<u8> and shape values cannot cross user-function boundaries on direct ELF64. Function parameters and return types are restricted to scalars and Bool in the backend the harness uses. Helper functions that the model factors out for clarity stop compiling.
std.parse.parseU* rejects runtime strings. The parser entry points require a literal text argument and refuse a runtime String value. The natural shape for parsing stdin tokens is therefore not what the standard library supports.
User code cannot construct Maybe<T>::Some. The standard-library Maybe type accepts None construction in user code but rejects Some(...) at module boundaries with PAR100. Error paths must return scalars and a side-channel Bool, not Maybe<T>.
if is statement-only. let x = if cond { a } else { b } is rejected. Models port this idiom from Rust and TypeScript by default; the workaround is a mutable binding plus a separate if statement.

Each quirk above is reproducible from the corpus tasks listed in truffle-dev/agentlang-index/corpus. The individual notes.md files in each task directory show the exact minimal repro and the workaround the reference implementation uses. The intent is for the Zero team to see this list and decide which quirks block adoption and which are intentional, then for the next Zero release to flip the language-tax curve in a measurable way.

Reproducing a run

Everything required to replay the first run lives in truffle-dev/agentlang-index. The repo includes the harness, the corpus, the pinned vendor toolchains, the prompts, every model response captured during the run, and the per-attempt result JSON.

git clone https://github.com/truffle-dev/agentlang-index
cd agentlang-index
make setup

# OpenAI models go through the TypeScript runner.
export OPENAI_API_KEY=...
bun bench/runner.ts --models gpt-4o-mini --tasks 000-hello-stdout --languages zero,ts,rust,go,python

# Anthropic models go through the Python harness, which shells out
# to the local `claude` CLI. No ANTHROPIC_API_KEY is read.
uv run agentlang-run one-shot --task 000-hello-stdout --lang python

The runner produces a fresh dashboard JSON under bench/results/. To regenerate the published run end-to-end, drop the --tasks filter and replace the single --models value with the same model list the dashboard shows.

Dataset and license

The harness, corpus, and reference implementations are Apache-2.0. The published dataset (per-attempt model output, structured scores, run manifests) lives in truffle-dev/agentlang-index-data under CC-BY-4.0. Both repos pin the harness git SHA and the Zero version per export, so a replay always knows which corpus and which compiler produced a given number.

What the benchmark refuses to be

Not a Zero marketing instrument. Tasks where Zero scores poorly appear with the same prominence as tasks where it scores well, and the methodology page explains what went wrong rather than how to phrase around it.

Not a closed dataset. Every run is reproducible from the public corpus and the public dataset, with toolchain versions pinned.

Not an opinion piece. The thesis is testable, the artifact is the answer, and the next run will move whichever way the next run moves.
