CSV line tokenizer (RFC 4180 subset)

007-csv-line-tokenize. Read one line of CSV input from standard input.

Prompt

This is the natural-language brief given to every model, verbatim. The harness prefixes a language-specific calling-convention block and suffixes a "return only the source code" instruction. Nothing else.

## Task: CSV line tokenizer (RFC 4180 subset)

Read one line of CSV input from standard input. Split it into fields per
a simplified RFC 4180 grammar and write each field on its own line of
stdout terminated by `\n`.

### Grammar

- Fields are separated by commas (`,`).
- A field may be **quoted** by enclosing the field in double quotes (`"`).
  Inside a quoted field, a literal `"` is represented by two consecutive
  double quotes (`""`), and commas are literal.
- An **unquoted** field contains no commas and no double quotes.
- The input is at most 1000 printable-ASCII characters and is terminated
  by `\n`. The input is well-formed; you do not need to handle malformed
  CSV.
- If the input line is empty, write nothing (zero fields, zero newlines).

### Output

Each field, in input order, followed by exactly one `\n`. Exit with status 0.

## Examples

| input             | output                |
| ----------------- | --------------------- |
| `a,b,c`           | `a\nb\nc\n`           |
| `"a,b",c`         | `a,b\nc\n`            |
| `,,`              | `\n\n\n`              |
| (empty)           | (empty)               |
| `"a""b","c"`      | `a"b\nc\n`            |
| `1,"hello, world",foo` | `1\nhello, world\nfoo\n` |

Two consecutive commas `,,` parse as three empty fields. A line with just
a single comma `,` parses as two empty fields. A pair of double quotes
`""` is one empty field.

Acceptance

A task counts as passed only when every public and hidden test case agrees on these fields. No fuzzy matching, no "off by one trailing newline is fine."

stdout (byte-exact, per case)	true
stderr (exact bytes)	""
exit code	0
wall time max (ms)	5000
tags	parsing, strings, state-machine
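
The criteria above can be read as a single predicate over a captured run. The following is a minimal sketch, not the harness's actual code: the `CaseResult` record, its field names, and the assumption that the wall-time limit applies per case are all hypothetical.

```python
from dataclasses import dataclass

WALL_TIME_MAX_MS = 5000  # limit taken from the acceptance table


@dataclass
class CaseResult:
    """Captured outcome of one test case (hypothetical record shape)."""
    stdout: bytes
    stderr: bytes
    exit_code: int
    wall_ms: float


def case_passes(result: CaseResult, expected_stdout: bytes) -> bool:
    """Apply the acceptance criteria to a single case."""
    return (
        result.stdout == expected_stdout        # byte-exact, no trimming
        and result.stderr == b""                # stderr must be empty
        and result.exit_code == 0
        and result.wall_ms <= WALL_TIME_MAX_MS  # assumed to be a per-case limit
    )


def task_passes(results, expected_outputs) -> bool:
    """A task passes only if every public and hidden case passes."""
    return len(results) == len(expected_outputs) and all(
        case_passes(r, e) for r, e in zip(results, expected_outputs)
    )
```

Comparing raw bytes rather than decoded, stripped strings is what rules out the "off by one trailing newline" class of near-misses.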

Results

Each cell is one attempt. Pass means stdout matched byte-exact on every test case, stderr empty, exit zero.

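| Model       | Zero         | TypeScript | Rust  | Go    | Python |
| ----------- | ------------ | ---------- | ----- | ----- | ------ |
| gpt-4o      | compile      | ✓          | other | ✓     | ✓      |
| gpt-4o-mini | compile      | ✓          | ✓     | other | other  |
| gpt-5       | wrong output | ✓          | ✓     | ✓     | ✓      |
| opus        | compile      | other      | ✓     | ✓     | other  |
| sonnet      | wrong output | ✓          | ✓     | ✓     | ✓      |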

Failure excerpts

10 of 25 attempts failed. Each card is one attempt, with the captured first line of the diagnostic.

gpt-4o Zero compile

ref.zero:1:1 IMP001: unknown package-local import 'std'

gpt-4o Rust other
```
(no diagnostic captured)
```

gpt-4o-mini Zero compile

ref.zero:1:15 PAR100: expected expression

gpt-4o-mini Go other
```
# command-line-arguments
```

gpt-4o-mini Python other
```
(no diagnostic captured)
```

gpt-5 Zero wrong output
```
(no diagnostic captured)
```

opus Zero compile

ref.zero:1:13 PAR100: unexpected character '@'

opus TypeScript other
```
(no diagnostic captured)
```

opus Python other
```
(no diagnostic captured)
```

sonnet Zero wrong output
```
(no diagnostic captured)
```

Reference implementations

The hand-written reference each language ships with. Every reference passes the same public and hidden test suite under the pinned toolchain before any model touches the task.

Zero n/a

No reference implementation in this language.

TypeScript 56 lines

// CSV line tokenizer (RFC 4180 subset), TypeScript reference.
// Reads one line of CSV from stdin and writes each field on its own line.

import { readFileSync } from "node:fs";

const FIELD_START = 0;
const IN_UNQUOTED = 1;
const IN_QUOTED = 2;
const AFTER_CLOSING_QUOTE = 3;

function main(): void {
  let line = readFileSync(0, "utf8");
  if (line.endsWith("\n")) line = line.slice(0, -1);
  const out: string[] = [];
  let state = FIELD_START;
  if (line.length > 0) {
    for (const ch of line) {
      if (state === FIELD_START) {
        if (ch === '"') {
          state = IN_QUOTED;
        } else if (ch === ",") {
          out.push("\n");
        } else {
          out.push(ch);
          state = IN_UNQUOTED;
        }
      } else if (state === IN_UNQUOTED) {
        if (ch === ",") {
          out.push("\n");
          state = FIELD_START;
        } else {
          out.push(ch);
        }
      } else if (state === IN_QUOTED) {
        if (ch === '"') {
          state = AFTER_CLOSING_QUOTE;
        } else {
          out.push(ch);
        }
      } else if (state === AFTER_CLOSING_QUOTE) {
        if (ch === '"') {
          out.push('"');
          state = IN_QUOTED;
        } else if (ch === ",") {
          out.push("\n");
          state = FIELD_START;
        }
      }
    }
    out.push("\n");
  }
  process.stdout.write(out.join(""));
}

main();

Rust 64 lines

// CSV line tokenizer (RFC 4180 subset), Rust reference.

use std::io::{self, Read, Write};

const FIELD_START: u8 = 0;
const IN_UNQUOTED: u8 = 1;
const IN_QUOTED: u8 = 2;
const AFTER_CLOSING_QUOTE: u8 = 3;

fn main() {
    let mut input = String::new();
    io::stdin().read_to_string(&mut input).unwrap();
    if input.ends_with('\n') {
        input.pop();
    }
    let mut out: Vec<u8> = Vec::with_capacity(input.len() + 64);
    let mut state = FIELD_START;
    if !input.is_empty() {
        for ch in input.bytes() {
            match state {
                FIELD_START => {
                    if ch == b'"' {
                        state = IN_QUOTED;
                    } else if ch == b',' {
                        out.push(b'\n');
                    } else {
                        out.push(ch);
                        state = IN_UNQUOTED;
                    }
                }
                IN_UNQUOTED => {
                    if ch == b',' {
                        out.push(b'\n');
                        state = FIELD_START;
                    } else {
                        out.push(ch);
                    }
                }
                IN_QUOTED => {
                    if ch == b'"' {
                        state = AFTER_CLOSING_QUOTE;
                    } else {
                        out.push(ch);
                    }
                }
                AFTER_CLOSING_QUOTE => {
                    if ch == b'"' {
                        out.push(b'"');
                        state = IN_QUOTED;
                    } else if ch == b',' {
                        out.push(b'\n');
                        state = FIELD_START;
                    }
                }
                _ => unreachable!(),
            }
        }
        out.push(b'\n');
    }
    let stdout = io::stdout();
    let mut h = stdout.lock();
    h.write_all(&out).unwrap();
}

Go 64 lines

// CSV line tokenizer (RFC 4180 subset), Go reference.

package main

import (
	"bufio"
	"os"
	"strings"
)

const (
	fieldStart        = 0
	inUnquoted        = 1
	inQuoted          = 2
	afterClosingQuote = 3
)

func main() {
	reader := bufio.NewReader(os.Stdin)
	line, _ := reader.ReadString('\n')
	line = strings.TrimRight(line, "\n")
	var out strings.Builder
	state := fieldStart
	if len(line) > 0 {
		for i := 0; i < len(line); i++ {
			ch := line[i]
			switch state {
			case fieldStart:
				if ch == '"' {
					state = inQuoted
				} else if ch == ',' {
					out.WriteByte('\n')
				} else {
					out.WriteByte(ch)
					state = inUnquoted
				}
			case inUnquoted:
				if ch == ',' {
					out.WriteByte('\n')
					state = fieldStart
				} else {
					out.WriteByte(ch)
				}
			case inQuoted:
				if ch == '"' {
					state = afterClosingQuote
				} else {
					out.WriteByte(ch)
				}
			case afterClosingQuote:
				if ch == '"' {
					out.WriteByte('"')
					state = inQuoted
				} else if ch == ',' {
					out.WriteByte('\n')
					state = fieldStart
				}
			}
		}
		out.WriteByte('\n')
	}
	os.Stdout.WriteString(out.String())
}

Python 56 lines

"""CSV line tokenizer (RFC 4180 subset), Python reference.

Reads one line of CSV from stdin and writes each field on its own line.
Uses an explicit state machine rather than the csv module to keep parity
with the other language references (and to make the trap shape - empty
input vs ",," vs '""' - explicit).
"""
import sys

FIELD_START = 0
IN_UNQUOTED = 1
IN_QUOTED = 2
AFTER_CLOSING_QUOTE = 3


def main() -> None:
    line = sys.stdin.readline()
    if line.endswith("\n"):
        line = line[:-1]
    out = []
    state = FIELD_START
    if line:
        for ch in line:
            if state == FIELD_START:
                if ch == '"':
                    state = IN_QUOTED
                elif ch == ",":
                    out.append("\n")
                else:
                    out.append(ch)
                    state = IN_UNQUOTED
            elif state == IN_UNQUOTED:
                if ch == ",":
                    out.append("\n")
                    state = FIELD_START
                else:
                    out.append(ch)
            elif state == IN_QUOTED:
                if ch == '"':
                    state = AFTER_CLOSING_QUOTE
                else:
                    out.append(ch)
            elif state == AFTER_CLOSING_QUOTE:
                if ch == '"':
                    out.append('"')
                    state = IN_QUOTED
                elif ch == ",":
                    out.append("\n")
                    state = FIELD_START
        out.append("\n")
    sys.stdout.write("".join(out))


if __name__ == "__main__":
    main()

Design notes

Algorithm, failure modes, cross-language parity, and where Zero needed a workaround. From corpus/007-csv-line-tokenize/notes.md.

Algorithm

Explicit four-state machine over input bytes:

FIELD_START (0): about to start a new field
IN_UNQUOTED (1): inside an unquoted field
IN_QUOTED (2): inside a quoted field
AFTER_CLOSING_QUOTE (3): saw a " inside a quoted field; deciding whether it was an escape ("") or the close of the field

Transitions:

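| state                 | `"`                       | `,`                         | other                      |
| --------------------- | ------------------------- | --------------------------- | -------------------------- |
| `FIELD_START`         | → `IN_QUOTED`             | emit `\n`                   | emit char, → `IN_UNQUOTED` |
| `IN_UNQUOTED`         | (does not occur per spec) | emit `\n`, → `FIELD_START`  | emit char                  |
| `IN_QUOTED`           | → `AFTER_CLOSING_QUOTE`   | emit char                   | emit char                  |
| `AFTER_CLOSING_QUOTE` | emit `"`, → `IN_QUOTED`   | emit `\n`, → `FIELD_START`  | (malformed, ignored)       |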

After the loop ends, if any byte was processed, emit one final \n (the last field's terminator).
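
The same transitions can also be written as a lookup table instead of nested branches. This is an illustrative sketch only, not one of the shipped references: the `TRANSITIONS` dict, the `"other"` character class, and the `tokenize` helper are names introduced here for the example.

```python
FIELD_START, IN_UNQUOTED, IN_QUOTED, AFTER_CLOSING_QUOTE = range(4)

# (state, char class) -> (text to emit, next state).
# An emit value of None means "emit the input character itself".
TRANSITIONS = {
    (FIELD_START, '"'): ("", IN_QUOTED),
    (FIELD_START, ","): ("\n", FIELD_START),
    (FIELD_START, "other"): (None, IN_UNQUOTED),
    # A quote inside an unquoted field does not occur per spec; the
    # references fall through to "emit char", so this sketch does too.
    (IN_UNQUOTED, '"'): (None, IN_UNQUOTED),
    (IN_UNQUOTED, ","): ("\n", FIELD_START),
    (IN_UNQUOTED, "other"): (None, IN_UNQUOTED),
    (IN_QUOTED, '"'): ("", AFTER_CLOSING_QUOTE),
    (IN_QUOTED, ","): (None, IN_QUOTED),
    (IN_QUOTED, "other"): (None, IN_QUOTED),
    (AFTER_CLOSING_QUOTE, '"'): ('"', IN_QUOTED),
    (AFTER_CLOSING_QUOTE, ","): ("\n", FIELD_START),
    # Any other character after a closing quote is malformed and ignored.
    (AFTER_CLOSING_QUOTE, "other"): ("", AFTER_CLOSING_QUOTE),
}


def tokenize(line: str) -> str:
    """Return the newline-terminated fields for one CSV line (no trailing newline in `line`)."""
    out = []
    state = FIELD_START
    for ch in line:
        char_class = ch if ch in ('"', ",") else "other"
        emit, state = TRANSITIONS[(state, char_class)]
        out.append(ch if emit is None else emit)
    if line:
        out.append("\n")  # terminator for the last field
    return "".join(out)


if __name__ == "__main__":
    for sample in ["a,b,c", '"a,b",c', ",,", "", '"a""b","c"', ",", '""']:
        print(repr(sample), "->", repr(tokenize(sample)))
```

Keeping every cell of the table as an explicit dictionary entry makes it easy to diff the code against the table above, at the cost of one lookup per character.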

Why a state machine, not Python's `csv` module

The Python reference deliberately re-implements the state machine even though csv.reader is available. The whole point of the AgentLang Index is to make every reference do the same byte-level work so a model writing each language has the same shape of code to discover. Hiding the state machine behind a stdlib reader in Python would let TS/Rust/Go/Zero diverge silently when a model misreads "" as "escape" vs "close + new quoted field."
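
For contrast, here is roughly what the avoided stdlib version might look like. It is a sketch for comparison only, assuming well-formed input and default `csv.reader` dialect settings; `tokenize_with_csv` is a hypothetical helper name and is not part of the corpus.

```python
import csv


def tokenize_with_csv(line: str) -> str:
    """Render one CSV line as newline-terminated fields using csv.reader."""
    # csv.reader accepts any iterable of lines; an empty line contributes no fields.
    rows = list(csv.reader([line]))
    fields = rows[0] if rows else []
    return "".join(field + "\n" for field in fields)


if __name__ == "__main__":
    for sample in ["a,b,c", '"a,b",c', ",,", "", '"a""b","c"', ",", '""']:
        print(repr(sample), "->", repr(tokenize_with_csv(sample)))
```

The quote handling is entirely inside `csv.reader` here, which is exactly the logic the other languages have to spell out by hand.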

Edge cases the test set captures

Empty input → zero fields → no output.
`,,` → three empty fields → three newlines.
`"a,b",c` → comma inside quoted field is literal.
`"a""b","c"` → `""` is a literal `"` inside a quoted field.
`1,"hello, world",foo` → mixed quoted/unquoted with embedded comma and space.

The single comma case (`,`) is not in the published set but is the canonical trap: it produces two empty fields, not one. The state machine handles it: `FIELD_START` sees the `,`, emits `\n`, and stays in `FIELD_START`; the loop ends; the final `\n` is emitted, so the output is `\n\n`.

Zero-specific notes

argv[1] is the line.
No match/switch in Zero 0.1.2 direct backend, so the state transitions are nested if blocks (state == 0 vs state == 1 etc.).
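Output buffer is [1200]u8. Each input byte emits at most one output byte, plus one final \n, so the worst case under the spec's 1000-char cap is 1001 bytes (for example, 1000 commas yield 1001 empty fields and therefore 1001 newlines).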
Output is built once and written with a single world.out.write call on a slice of the buffer; no per-field allocations.

Cross-implementation parity

All five references produce byte-exact output on every case in both stdin (TS/Rust/Go/Python) and argv (Zero) input modes.

Cost

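| Model       | Prompt tokens | Completion tokens | API ms  |
| ----------- | ------------- | ----------------- | ------- |
| gpt-4o      | 3,145         | 1,353             | 14,505  |
| gpt-4o-mini | 3,145         | 1,243             | 22,324  |
| gpt-5       | 3,140         | 29,364            | 304,479 |
| opus        | 14            | 1,890             | 80,296  |
| sonnet      | 12            | 1,738             | 151,379 |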

Tokens and API ms are summed across the five languages this model attempted for this task.

Truffle · 007-csv-line-tokenize · run captured 2026-06-27