agentlang-index · task easy
CSV line tokenizer (RFC 4180 subset)
007-csv-line-tokenize. Read one line of CSV input from standard input.
Prompt
This is the natural-language brief given to every model, verbatim. The harness prefixes a language-specific calling-convention block and suffixes a "return only the source code" instruction. Nothing else.
## Task: CSV line tokenizer (RFC 4180 subset)
Read one line of CSV input from standard input. Split it into fields per
a simplified RFC 4180 grammar and write each field on its own line of
stdout terminated by `\n`.
### Grammar
- Fields are separated by commas (`,`).
- A field may be **quoted** by enclosing the field in double quotes (`"`).
Inside a quoted field, a literal `"` is represented by two consecutive
double quotes (`""`), and commas are literal.
- An **unquoted** field contains no commas and no double quotes.
- The input is at most 1000 printable-ASCII characters and is terminated
by `\n`. The input is well-formed; you do not need to handle malformed
CSV.
- If the input line is empty, write nothing (zero fields, zero newlines).
### Output
Each field, in input order, followed by exactly one `\n`. Exit with status 0.
## Examples
| input | output |
| ----------------- | --------------------- |
| `a,b,c` | `a\nb\nc\n` |
| `"a,b",c` | `a,b\nc\n` |
| `,,` | `\n\n\n` |
| (empty) | (empty) |
| `"a""b","c"` | `a"b\nc\n` |
| `1,"hello, world",foo` | `1\nhello, world\nfoo\n` |
Two consecutive commas `,,` parse as three empty fields. A line with just
a single comma `,` parses as two empty fields. A pair of double quotes
`""` is one empty field.
## Language scaffold
{language_scaffold}
Acceptance
A task counts as passed only when every public and hidden test case agrees on these fields. No fuzzy matching, no "off by one trailing newline is fine."
| Field | Expected |
|---|---|
| stdout (byte-exact, per case) | true |
| stderr (exact bytes) | "" |
| exit code | 0 |
| wall time max (ms) | 5000 |
| tags | parsing, strings, state-machine |
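The acceptance rule above can be sketched as a single predicate over one captured run. This is a minimal sketch, not the harness's actual code: the `Case` type, the `passes` function, and the assumption that the candidate reads the case from stdin (the Zero reference takes argv instead) are all hypothetical.

```python
import subprocess
import sys
from dataclasses import dataclass

WALL_TIME_MAX_MS = 5000  # "wall time max (ms)" from the acceptance table


@dataclass
class Case:
    """Hypothetical test case: raw stdin bytes and the exact expected stdout."""
    stdin: bytes
    expected_stdout: bytes


def passes(cmd: list[str], case: Case) -> bool:
    """True only if stdout is byte-exact, stderr is empty, and the exit code is 0."""
    try:
        proc = subprocess.run(
            cmd,
            input=case.stdin,
            capture_output=True,
            timeout=WALL_TIME_MAX_MS / 1000,
        )
    except subprocess.TimeoutExpired:
        return False  # exceeded the wall-time budget
    return (
        proc.stdout == case.expected_stdout
        and proc.stderr == b""
        and proc.returncode == 0
    )


# Usage: a stand-in candidate that copies stdin to stdout unchanged.
ECHO_CMD = [sys.executable, "-c",
            "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]
```

Under this predicate a missing or extra trailing newline fails the case, because the comparison is on raw bytes rather than on stripped text.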
Results
Each cell is one attempt. A pass (✓) means stdout matched byte-exact on every test case, stderr was empty, and the exit code was zero. Any other cell names the failure class (compile, wrong output, or other).
| Model | Zero | TypeScript | Rust | Go | Python |
|---|---|---|---|---|---|
| gpt-4o | compile | ✓ | other | ✓ | ✓ |
| gpt-4o-mini | compile | ✓ | ✓ | other | other |
| gpt-5 | wrong output | ✓ | ✓ | ✓ | ✓ |
Failure excerpts
6 of 15 attempts failed. Each card is one attempt, with the captured first line of the diagnostic.
- ref.zero:1:1 IMP001: unknown package-local import 'std'
- (no diagnostic captured)
- ref.zero:1:15 PAR100: expected expression
- # command-line-arguments
- (no diagnostic captured)
- (no diagnostic captured)
Reference implementations
The hand-written reference each language ships with. Every reference passes the same public and hidden test suite under the pinned toolchain before any model touches the task.
Zero 90 lines
// CSV line tokenizer (RFC 4180 subset), Zero 0.1.2 direct backend.
//
// argv[1] is the CSV line (Zero 0.1.2 has no exposed stdin capability).
// All logic stays inside `pub fun main` to avoid Span/MutSpan parameter
// restrictions.
//
// State machine over input bytes:
//   0 FIELD_START         - start of a field, may begin quoted or not
//   1 IN_UNQUOTED         - inside an unquoted field
//   2 IN_QUOTED           - inside a quoted field
//   3 AFTER_CLOSING_QUOTE - saw a `"` in a quoted field; deciding
//                           whether it was an escape ("") or the close
//
// Byte values used below: 34 = `"`, 44 = `,`, 10 = `\n`.
//
// Output is written byte-by-byte into a stack [1200]u8. Every input byte
// emits at most one output byte, plus one final newline, so 1000 input
// chars produce at most 1001 output bytes (e.g. 1000 commas => 1001 empty
// fields => 1001 newlines), which fits in 1200 with headroom.

pub fun main(world: World) -> Void raises {
    let line_opt = std.args.get(1)
    let mut bytes: Span<u8> = std.mem.span("")
    if line_opt.has {
        bytes = std.mem.span(line_opt.value)
    }
    let n: usize = std.mem.len(bytes)
    let mut out_buf: [1200]u8 = [0_u8; 1200]
    let mut written: usize = 0

    if n > 0 {
        let mut state: u32 = 0_u32
        let mut i: usize = 0
        while i < n {
            let ch: u8 = bytes[i]
            if state == 0_u32 {
                if ch == 34_u8 {
                    state = 2_u32
                } else {
                    if ch == 44_u8 {
                        out_buf[written] = 10_u8
                        written = written + 1
                    } else {
                        out_buf[written] = ch
                        written = written + 1
                        state = 1_u32
                    }
                }
            } else {
                if state == 1_u32 {
                    if ch == 44_u8 {
                        out_buf[written] = 10_u8
                        written = written + 1
                        state = 0_u32
                    } else {
                        out_buf[written] = ch
                        written = written + 1
                    }
                } else {
                    if state == 2_u32 {
                        if ch == 34_u8 {
                            state = 3_u32
                        } else {
                            out_buf[written] = ch
                            written = written + 1
                        }
                    } else {
                        // state == 3 (AFTER_CLOSING_QUOTE)
                        if ch == 34_u8 {
                            out_buf[written] = 34_u8
                            written = written + 1
                            state = 2_u32
                        } else {
                            if ch == 44_u8 {
                                out_buf[written] = 10_u8
                                written = written + 1
                                state = 0_u32
                            }
                        }
                    }
                }
            }
            i = i + 1
        }
        // Terminator for the last field.
        out_buf[written] = 10_u8
        written = written + 1
    }

    let out: Span<u8> = out_buf[0..written]
    check world.out.write(out)
    return
}
TypeScript 56 lines
// CSV line tokenizer (RFC 4180 subset), TypeScript reference.
// Reads one line of CSV from stdin and writes each field on its own line.

import { readFileSync } from "node:fs";

const FIELD_START = 0;
const IN_UNQUOTED = 1;
const IN_QUOTED = 2;
const AFTER_CLOSING_QUOTE = 3;

function main(): void {
  let line = readFileSync(0, "utf8");
  if (line.endsWith("\n")) line = line.slice(0, -1);
  const out: string[] = [];
  let state = FIELD_START;
  if (line.length > 0) {
    for (const ch of line) {
      if (state === FIELD_START) {
        if (ch === '"') {
          state = IN_QUOTED;
        } else if (ch === ",") {
          out.push("\n");
        } else {
          out.push(ch);
          state = IN_UNQUOTED;
        }
      } else if (state === IN_UNQUOTED) {
        if (ch === ",") {
          out.push("\n");
          state = FIELD_START;
        } else {
          out.push(ch);
        }
      } else if (state === IN_QUOTED) {
        if (ch === '"') {
          state = AFTER_CLOSING_QUOTE;
        } else {
          out.push(ch);
        }
      } else if (state === AFTER_CLOSING_QUOTE) {
        if (ch === '"') {
          out.push('"');
          state = IN_QUOTED;
        } else if (ch === ",") {
          out.push("\n");
          state = FIELD_START;
        }
      }
    }
    out.push("\n");
  }
  process.stdout.write(out.join(""));
}

main();
Rust 64 lines
// CSV line tokenizer (RFC 4180 subset), Rust reference.

use std::io::{self, Read, Write};

const FIELD_START: u8 = 0;
const IN_UNQUOTED: u8 = 1;
const IN_QUOTED: u8 = 2;
const AFTER_CLOSING_QUOTE: u8 = 3;

fn main() {
    let mut input = String::new();
    io::stdin().read_to_string(&mut input).unwrap();
    if input.ends_with('\n') {
        input.pop();
    }
    let mut out: Vec<u8> = Vec::with_capacity(input.len() + 64);
    let mut state = FIELD_START;
    if !input.is_empty() {
        for ch in input.bytes() {
            match state {
                FIELD_START => {
                    if ch == b'"' {
                        state = IN_QUOTED;
                    } else if ch == b',' {
                        out.push(b'\n');
                    } else {
                        out.push(ch);
                        state = IN_UNQUOTED;
                    }
                }
                IN_UNQUOTED => {
                    if ch == b',' {
                        out.push(b'\n');
                        state = FIELD_START;
                    } else {
                        out.push(ch);
                    }
                }
                IN_QUOTED => {
                    if ch == b'"' {
                        state = AFTER_CLOSING_QUOTE;
                    } else {
                        out.push(ch);
                    }
                }
                AFTER_CLOSING_QUOTE => {
                    if ch == b'"' {
                        out.push(b'"');
                        state = IN_QUOTED;
                    } else if ch == b',' {
                        out.push(b'\n');
                        state = FIELD_START;
                    }
                }
                _ => unreachable!(),
            }
        }
        out.push(b'\n');
    }
    let stdout = io::stdout();
    let mut h = stdout.lock();
    h.write_all(&out).unwrap();
}
Go 64 lines
// CSV line tokenizer (RFC 4180 subset), Go reference.
package main

import (
	"bufio"
	"os"
	"strings"
)

const (
	fieldStart        = 0
	inUnquoted        = 1
	inQuoted          = 2
	afterClosingQuote = 3
)

func main() {
	reader := bufio.NewReader(os.Stdin)
	line, _ := reader.ReadString('\n')
	line = strings.TrimRight(line, "\n")
	var out strings.Builder
	state := fieldStart
	if len(line) > 0 {
		for i := 0; i < len(line); i++ {
			ch := line[i]
			switch state {
			case fieldStart:
				if ch == '"' {
					state = inQuoted
				} else if ch == ',' {
					out.WriteByte('\n')
				} else {
					out.WriteByte(ch)
					state = inUnquoted
				}
			case inUnquoted:
				if ch == ',' {
					out.WriteByte('\n')
					state = fieldStart
				} else {
					out.WriteByte(ch)
				}
			case inQuoted:
				if ch == '"' {
					state = afterClosingQuote
				} else {
					out.WriteByte(ch)
				}
			case afterClosingQuote:
				if ch == '"' {
					out.WriteByte('"')
					state = inQuoted
				} else if ch == ',' {
					out.WriteByte('\n')
					state = fieldStart
				}
			}
		}
		out.WriteByte('\n')
	}
	os.Stdout.WriteString(out.String())
}
Python 56 lines
"""CSV line tokenizer (RFC 4180 subset), Python reference.

Reads one line of CSV from stdin and writes each field on its own line.
Uses an explicit state machine rather than the csv module to keep parity
with the other language references (and to make the trap shape - empty
input vs ",," vs '""' - explicit).
"""
import sys

FIELD_START = 0
IN_UNQUOTED = 1
IN_QUOTED = 2
AFTER_CLOSING_QUOTE = 3


def main() -> None:
    line = sys.stdin.readline()
    if line.endswith("\n"):
        line = line[:-1]
    out = []
    state = FIELD_START
    if line:
        for ch in line:
            if state == FIELD_START:
                if ch == '"':
                    state = IN_QUOTED
                elif ch == ",":
                    out.append("\n")
                else:
                    out.append(ch)
                    state = IN_UNQUOTED
            elif state == IN_UNQUOTED:
                if ch == ",":
                    out.append("\n")
                    state = FIELD_START
                else:
                    out.append(ch)
            elif state == IN_QUOTED:
                if ch == '"':
                    state = AFTER_CLOSING_QUOTE
                else:
                    out.append(ch)
            elif state == AFTER_CLOSING_QUOTE:
                if ch == '"':
                    out.append('"')
                    state = IN_QUOTED
                elif ch == ",":
                    out.append("\n")
                    state = FIELD_START
        out.append("\n")
    sys.stdout.write("".join(out))


if __name__ == "__main__":
    main()
Design notes
Algorithm, failure modes, cross-language parity, and where Zero needed a workaround. From corpus/007-csv-line-tokenize/notes.md.
Algorithm
Explicit four-state machine over input bytes:
- `FIELD_START` (0): about to start a new field
- `IN_UNQUOTED` (1): inside an unquoted field
- `IN_QUOTED` (2): inside a quoted field
- `AFTER_CLOSING_QUOTE` (3): saw a `"` inside a quoted field; deciding whether it was an escape (`""`) or the close of the field
Transitions:
| state | `"` | `,` | other |
|---|---|---|---|
| FIELD_START | → IN_QUOTED | emit `\n` | emit char, → IN_UNQUOTED |
| IN_UNQUOTED | (does not occur per spec) | emit `\n`, → FIELD_START | emit char |
| IN_QUOTED | → AFTER_CLOSING_QUOTE | emit char | emit char |
| AFTER_CLOSING_QUOTE | emit `"`, → IN_QUOTED | emit `\n`, → FIELD_START | (malformed, ignored) |
After the loop ends, if any byte was processed, emit one final \n (the
last field's terminator).
Why a state machine, not Python's csv module
The Python reference deliberately re-implements the state machine even
though csv.reader is available. The whole point of the AgentLang Index
is to make every reference do the same byte-level work so a model writing
each language has the same shape of code to discover. Hiding the
state machine behind a stdlib reader in Python would let TS/Rust/Go/Zero
diverge silently when a model misreads "" as "escape" vs "close + new
quoted field."
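For contrast, the stdlib route that the Python reference deliberately avoids might look like the sketch below. `tokenize_with_csv` is a hypothetical name, and the sketch assumes the default `csv.reader` dialect, which treats `""` inside a quoted field as an escaped quote.

```python
import csv


def tokenize_with_csv(line: str) -> str:
    """Split one CSV line with csv.reader and terminate each field with \\n."""
    # csv.reader accepts any iterable of lines; an empty line yields an empty row.
    rows = list(csv.reader([line]))
    fields = rows[0] if rows else []
    return "".join(field + "\n" for field in fields)
```

This handles the published examples in three lines, which is exactly why it is not used: none of the byte-level decisions are visible, so it offers no shape for the other four language references to mirror.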
Edge cases the test set captures
- Empty input → zero fields → no output.
- `,,` → three empty fields → three newlines.
- `"a,b",c` → comma inside quoted field is literal.
- `"a""b","c"` → `""` is a literal `"` inside a quoted field.
- `1,"hello, world",foo` → mixed quoted/unquoted with embedded comma and space.
The single comma case (`,`) is not in the published set but is the
canonical trap: it produces two empty fields, not one. The state
machine handles it: `FIELD_START` sees `,`, emits `\n` and stays in
`FIELD_START`; loop ends; emit final `\n` → output is `\n\n`.
Zero-specific notes
- argv[1] is the line.
- No `match`/`switch` in Zero 0.1.2 direct backend, so the state transitions are nested `if` blocks (state == 0 vs state == 1 etc.).
- Output buffer is `[1200]u8`. Each input byte emits at most one output byte, plus one final newline, so the spec's 1000-char input cap bounds the output at 1001 bytes. The all-commas worst case (`,` × 1000) is 1001 empty fields, i.e. 1001 newlines, which still fits.
- Output is built once and written with a single `world.out.write` call on a slice of the buffer; no per-field allocations.
Cross-implementation parity
All five references produce byte-exact output on every case in both stdin (TS/Rust/Go/Python) and argv (Zero) input modes.
Cost
| Model | Prompt tokens | Completion tokens | API ms |
|---|---|---|---|
| gpt-4o | 3,145 | 1,353 | 14,505 |
| gpt-4o-mini | 3,145 | 1,243 | 22,324 |
| gpt-5 | 3,140 | 29,364 | 304,479 |
Tokens and API ms are summed across the five languages this model attempted for this task.