agentlang-index · task easy

CSV line tokenizer (RFC 4180 subset)

007-csv-line-tokenize. Read one line of CSV input from standard input.

Prompt

This is the natural-language brief given to every model, verbatim. The harness prefixes a language-specific calling-convention block and suffixes a "return only the source code" instruction. Nothing else.

## Task: CSV line tokenizer (RFC 4180 subset)

Read one line of CSV input from standard input. Split it into fields per
a simplified RFC 4180 grammar and write each field on its own line of
stdout terminated by `\n`.

### Grammar

- Fields are separated by commas (`,`).
- A field may be **quoted** by enclosing the field in double quotes (`"`).
  Inside a quoted field, a literal `"` is represented by two consecutive
  double quotes (`""`), and commas are literal.
- An **unquoted** field contains no commas and no double quotes.
- The input is at most 1000 printable-ASCII characters and is terminated
  by `\n`. The input is well-formed; you do not need to handle malformed
  CSV.
- If the input line is empty, write nothing (zero fields, zero newlines).

### Output

Each field, in input order, followed by exactly one `\n`. Exit with status 0.

## Examples

| input             | output                |
| ----------------- | --------------------- |
| `a,b,c`           | `a\nb\nc\n`           |
| `"a,b",c`         | `a,b\nc\n`            |
| `,,`              | `\n\n\n`              |
| (empty)           | (empty)               |
| `"a""b","c"`      | `a"b\nc\n`            |
| `1,"hello, world",foo` | `1\nhello, world\nfoo\n` |

Two consecutive commas `,,` parse as three empty fields. A line with just
a single comma `,` parses as two empty fields. A pair of double quotes
`""` is one empty field.

## Language scaffold

{language_scaffold}

Acceptance

A task counts as passed only when every public and hidden test case agrees on these fields. No fuzzy matching, no "off by one trailing newline is fine."

| field | value |
| ----- | ----- |
| stdout (byte-exact, per case) | true |
| stderr (exact bytes) | `""` |
| exit code | 0 |
| wall time max (ms) | 5000 |
| tags | parsing, strings, state-machine |
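
The acceptance rule can be illustrated with a small checker. This is a minimal sketch, not the harness's actual code: the function name `passes` and its parameters are hypothetical, and it assumes the candidate reads its input from stdin (the Zero reference takes the line as argv[1] instead).

```python
import subprocess
import sys


def passes(cmd, stdin_bytes, expected_stdout, timeout_ms=5000):
    """Return True only if stdout is byte-exact, stderr is empty, and exit code is 0.

    `cmd` is the argv list of the candidate program (hypothetical interface).
    """
    try:
        proc = subprocess.run(
            cmd,
            input=stdin_bytes,
            capture_output=True,
            timeout=timeout_ms / 1000,  # subprocess expects seconds
        )
    except subprocess.TimeoutExpired:
        return False  # exceeded the wall-time limit
    return (
        proc.stdout == expected_stdout
        and proc.stderr == b""
        and proc.returncode == 0
    )


if __name__ == "__main__":
    # Stand-in candidate: a Python one-liner that copies stdin to stdout.
    echo = [
        sys.executable,
        "-c",
        "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
    ]
    print(passes(echo, b"a\n", b"a\n"))
    print(passes(echo, b"a\n", b"a"))
```

Comparing raw bytes rather than decoded text is what rules out the "off by one trailing newline" tolerance mentioned above.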

Results

Each cell is one attempt. Pass means stdout matched byte-exact on every test case, stderr empty, exit zero.

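| Model       | Zero         | TypeScript | Rust  | Go    | Python |
| ----------- | ------------ | ---------- | ----- | ----- | ------ |
| gpt-4o      | compile      | pass       | other | pass  | pass   |
| gpt-4o-mini | compile      | pass       | pass  | other | other  |
| gpt-5       | wrong output | pass       | pass  | pass  | pass   |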

Failure excerpts

6 of 15 attempts failed. Each card is one attempt, with the captured first line of the diagnostic.

  1. gpt-4o Zero compile
    ref.zero:1:1 IMP001: unknown package-local import 'std'
  2. gpt-4o Rust other
    (no diagnostic captured)
  3. gpt-4o-mini Zero compile
    ref.zero:1:15 PAR100: expected expression
  4. gpt-4o-mini Go other
    # command-line-arguments
  5. gpt-4o-mini Python other
    (no diagnostic captured)
  6. gpt-5 Zero wrong output
    (no diagnostic captured)

Reference implementations

The hand-written reference each language ships with. Every reference passes the same public and hidden test suite under the pinned toolchain before any model touches the task.


Zero 90 lines
// CSV line tokenizer (RFC 4180 subset), Zero 0.1.2 direct backend.
//
// argv[1] is the CSV line (Zero 0.1.2 has no exposed stdin capability).
// All logic stays inside `pub fun main` to avoid Span/MutSpan parameter
// restrictions.
//
// State machine over input bytes:
//   0 FIELD_START          - start of a field, may begin quoted or not
//   1 IN_UNQUOTED          - inside an unquoted field
//   2 IN_QUOTED            - inside a quoted field
//   3 AFTER_CLOSING_QUOTE  - saw a `"` in a quoted field; deciding
//                            whether it was an escape ("") or the close
//
// Output is written byte-by-byte into a stack [1200]u8. Each input byte
// emits at most one output byte and one final newline follows, so 1000
// input chars yield at most 1001 output bytes; 1200 leaves headroom.
pub fun main(world: World) -> Void raises {
    let line_opt = std.args.get(1)
    let mut bytes: Span<u8> = std.mem.span("")
    if line_opt.has {
        bytes = std.mem.span(line_opt.value)
    }
    let n: usize = std.mem.len(bytes)

    let mut out_buf: [1200]u8 = [0_u8; 1200]
    let mut written: usize = 0

    if n > 0 {
        let mut state: u32 = 0_u32
        let mut i: usize = 0
        while i < n {
            let ch: u8 = bytes[i]
            if state == 0_u32 {
                if ch == 34_u8 {
                    state = 2_u32
                } else {
                    if ch == 44_u8 {
                        out_buf[written] = 10_u8
                        written = written + 1
                    } else {
                        out_buf[written] = ch
                        written = written + 1
                        state = 1_u32
                    }
                }
            } else {
                if state == 1_u32 {
                    if ch == 44_u8 {
                        out_buf[written] = 10_u8
                        written = written + 1
                        state = 0_u32
                    } else {
                        out_buf[written] = ch
                        written = written + 1
                    }
                } else {
                    if state == 2_u32 {
                        if ch == 34_u8 {
                            state = 3_u32
                        } else {
                            out_buf[written] = ch
                            written = written + 1
                        }
                    } else {
                        // state == 3 (AFTER_CLOSING_QUOTE)
                        if ch == 34_u8 {
                            out_buf[written] = 34_u8
                            written = written + 1
                            state = 2_u32
                        } else {
                            if ch == 44_u8 {
                                out_buf[written] = 10_u8
                                written = written + 1
                                state = 0_u32
                            }
                        }
                    }
                }
            }
            i = i + 1
        }
        out_buf[written] = 10_u8
        written = written + 1
    }

    let out: Span<u8> = out_buf[0..written]
    check world.out.write(out)
    return
}
TypeScript 56 lines
// CSV line tokenizer (RFC 4180 subset), TypeScript reference.
// Reads one line of CSV from stdin and writes each field on its own line.

import { readFileSync } from "node:fs";

const FIELD_START = 0;
const IN_UNQUOTED = 1;
const IN_QUOTED = 2;
const AFTER_CLOSING_QUOTE = 3;

function main(): void {
  let line = readFileSync(0, "utf8");
  if (line.endsWith("\n")) line = line.slice(0, -1);
  const out: string[] = [];
  let state = FIELD_START;
  if (line.length > 0) {
    for (const ch of line) {
      if (state === FIELD_START) {
        if (ch === '"') {
          state = IN_QUOTED;
        } else if (ch === ",") {
          out.push("\n");
        } else {
          out.push(ch);
          state = IN_UNQUOTED;
        }
      } else if (state === IN_UNQUOTED) {
        if (ch === ",") {
          out.push("\n");
          state = FIELD_START;
        } else {
          out.push(ch);
        }
      } else if (state === IN_QUOTED) {
        if (ch === '"') {
          state = AFTER_CLOSING_QUOTE;
        } else {
          out.push(ch);
        }
      } else if (state === AFTER_CLOSING_QUOTE) {
        if (ch === '"') {
          out.push('"');
          state = IN_QUOTED;
        } else if (ch === ",") {
          out.push("\n");
          state = FIELD_START;
        }
      }
    }
    out.push("\n");
  }
  process.stdout.write(out.join(""));
}

main();
Rust 64 lines
// CSV line tokenizer (RFC 4180 subset), Rust reference.

use std::io::{self, Read, Write};

const FIELD_START: u8 = 0;
const IN_UNQUOTED: u8 = 1;
const IN_QUOTED: u8 = 2;
const AFTER_CLOSING_QUOTE: u8 = 3;

fn main() {
    let mut input = String::new();
    io::stdin().read_to_string(&mut input).unwrap();
    if input.ends_with('\n') {
        input.pop();
    }
    let mut out: Vec<u8> = Vec::with_capacity(input.len() + 64);
    let mut state = FIELD_START;
    if !input.is_empty() {
        for ch in input.bytes() {
            match state {
                FIELD_START => {
                    if ch == b'"' {
                        state = IN_QUOTED;
                    } else if ch == b',' {
                        out.push(b'\n');
                    } else {
                        out.push(ch);
                        state = IN_UNQUOTED;
                    }
                }
                IN_UNQUOTED => {
                    if ch == b',' {
                        out.push(b'\n');
                        state = FIELD_START;
                    } else {
                        out.push(ch);
                    }
                }
                IN_QUOTED => {
                    if ch == b'"' {
                        state = AFTER_CLOSING_QUOTE;
                    } else {
                        out.push(ch);
                    }
                }
                AFTER_CLOSING_QUOTE => {
                    if ch == b'"' {
                        out.push(b'"');
                        state = IN_QUOTED;
                    } else if ch == b',' {
                        out.push(b'\n');
                        state = FIELD_START;
                    }
                }
                _ => unreachable!(),
            }
        }
        out.push(b'\n');
    }
    let stdout = io::stdout();
    let mut h = stdout.lock();
    h.write_all(&out).unwrap();
}
Go 64 lines
// CSV line tokenizer (RFC 4180 subset), Go reference.

package main

import (
	"bufio"
	"os"
	"strings"
)

const (
	fieldStart        = 0
	inUnquoted        = 1
	inQuoted          = 2
	afterClosingQuote = 3
)

func main() {
	reader := bufio.NewReader(os.Stdin)
	line, _ := reader.ReadString('\n')
	line = strings.TrimRight(line, "\n")
	var out strings.Builder
	state := fieldStart
	if len(line) > 0 {
		for i := 0; i < len(line); i++ {
			ch := line[i]
			switch state {
			case fieldStart:
				if ch == '"' {
					state = inQuoted
				} else if ch == ',' {
					out.WriteByte('\n')
				} else {
					out.WriteByte(ch)
					state = inUnquoted
				}
			case inUnquoted:
				if ch == ',' {
					out.WriteByte('\n')
					state = fieldStart
				} else {
					out.WriteByte(ch)
				}
			case inQuoted:
				if ch == '"' {
					state = afterClosingQuote
				} else {
					out.WriteByte(ch)
				}
			case afterClosingQuote:
				if ch == '"' {
					out.WriteByte('"')
					state = inQuoted
				} else if ch == ',' {
					out.WriteByte('\n')
					state = fieldStart
				}
			}
		}
		out.WriteByte('\n')
	}
	os.Stdout.WriteString(out.String())
}
Python 56 lines
"""CSV line tokenizer (RFC 4180 subset), Python reference.

Reads one line of CSV from stdin and writes each field on its own line.
Uses an explicit state machine rather than the csv module to keep parity
with the other language references (and to make the trap shape - empty
input vs ",," vs '""' - explicit).
"""
import sys

FIELD_START = 0
IN_UNQUOTED = 1
IN_QUOTED = 2
AFTER_CLOSING_QUOTE = 3


def main() -> None:
    line = sys.stdin.readline()
    if line.endswith("\n"):
        line = line[:-1]
    out = []
    state = FIELD_START
    if line:
        for ch in line:
            if state == FIELD_START:
                if ch == '"':
                    state = IN_QUOTED
                elif ch == ",":
                    out.append("\n")
                else:
                    out.append(ch)
                    state = IN_UNQUOTED
            elif state == IN_UNQUOTED:
                if ch == ",":
                    out.append("\n")
                    state = FIELD_START
                else:
                    out.append(ch)
            elif state == IN_QUOTED:
                if ch == '"':
                    state = AFTER_CLOSING_QUOTE
                else:
                    out.append(ch)
            elif state == AFTER_CLOSING_QUOTE:
                if ch == '"':
                    out.append('"')
                    state = IN_QUOTED
                elif ch == ",":
                    out.append("\n")
                    state = FIELD_START
        out.append("\n")
    sys.stdout.write("".join(out))


if __name__ == "__main__":
    main()

Design notes

Algorithm, failure modes, cross-language parity, and where Zero needed a workaround. From corpus/007-csv-line-tokenize/notes.md.

Algorithm

Explicit four-state machine over input bytes:

  • FIELD_START (0): about to start a new field
  • IN_UNQUOTED (1): inside an unquoted field
  • IN_QUOTED (2): inside a quoted field
  • AFTER_CLOSING_QUOTE (3): saw a " inside a quoted field; deciding whether it was an escape ("") or the close of the field

Transitions:

| state | `"` | `,` | other |
| ----- | --- | --- | ----- |
| FIELD_START | → IN_QUOTED | emit `\n` | emit char, → IN_UNQUOTED |
| IN_UNQUOTED | (does not occur per spec) | emit `\n`, → FIELD_START | emit char |
| IN_QUOTED | → AFTER_CLOSING_QUOTE | emit char | emit char |
| AFTER_CLOSING_QUOTE | emit `"`, → IN_QUOTED | emit `\n`, → FIELD_START | (malformed, ignored) |

After the loop ends, if any byte was processed, emit one final \n (the last field's terminator).
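
The transition table above can also be written as data rather than nested branches. The following is a minimal sketch under stated assumptions, not one of the shipped references: the names `TABLE`, `CHAR`, and `char_class` are hypothetical, and the cell marked "does not occur per spec" is filled with "emit char", which is what the `else` branches in the references do.

```python
FIELD_START, IN_UNQUOTED, IN_QUOTED, AFTER_CLOSING_QUOTE = range(4)

CHAR = object()  # sentinel: "emit the input character itself"


def char_class(ch):
    """Map a character to a table column: 0 = '"', 1 = ',', 2 = other."""
    return 0 if ch == '"' else 1 if ch == "," else 2


# TABLE[state][column] = (what to emit, next state); None means emit nothing.
TABLE = {
    FIELD_START: [(None, IN_QUOTED), ("\n", FIELD_START), (CHAR, IN_UNQUOTED)],
    # A '"' here is excluded by the spec; emitting it mirrors the references.
    IN_UNQUOTED: [(CHAR, IN_UNQUOTED), ("\n", FIELD_START), (CHAR, IN_UNQUOTED)],
    IN_QUOTED: [(None, AFTER_CLOSING_QUOTE), (CHAR, IN_QUOTED), (CHAR, IN_QUOTED)],
    AFTER_CLOSING_QUOTE: [
        ('"', IN_QUOTED),
        ("\n", FIELD_START),
        (None, AFTER_CLOSING_QUOTE),  # malformed input, ignored
    ],
}


def tokenize(line):
    """Return the stdout text for one CSV line (without its trailing newline)."""
    out = []
    state = FIELD_START
    for ch in line:
        emit, state = TABLE[state][char_class(ch)]
        if emit is CHAR:
            out.append(ch)
        elif emit is not None:
            out.append(emit)
    if line:
        out.append("\n")  # terminator of the last field
    return "".join(out)


if __name__ == "__main__":
    print(repr(tokenize('"a""b","c"')))
```

Keeping the table as data makes each cell directly comparable with the prose table, at the cost of diverging from the branch-per-state shape the references share.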

Why a state machine, not Python's csv module

The Python reference deliberately re-implements the state machine even though csv.reader is available. The whole point of the AgentLang Index is to make every reference do the same byte-level work so a model writing each language has the same shape of code to discover. Hiding the state machine behind a stdlib reader in Python would let TS/Rust/Go/Zero diverge silently when a model misreads "" as "escape" vs "close + new quoted field."
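
For contrast, here is roughly what the stdlib route would look like. It is a sketch for cross-checking well-formed lines only, not a proposed reference: `tokenize_with_csv` is a hypothetical helper, and `csv.reader` is more lenient than the task grammar on malformed input, so agreement is only meaningful for inputs the spec allows.

```python
import csv


def tokenize_with_csv(line):
    """Tokenize one well-formed CSV line (no trailing newline) via csv.reader."""
    if line == "":
        return ""  # empty input: zero fields, zero newlines
    # The default "excel" dialect uses ',' as delimiter, '"' as quote
    # character, and treats "" inside a quoted field as a literal quote.
    row = next(csv.reader([line]))
    return "".join(field + "\n" for field in row)


if __name__ == "__main__":
    for sample in ["a,b,c", '"a,b",c', ",,", '"a""b","c"']:
        print(repr(sample), "->", repr(tokenize_with_csv(sample)))
```

The whole state machine disappears into one library call, which is exactly the parity problem described above.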

Edge cases the test set captures

  • Empty input → zero fields → no output.
  • ,, → three empty fields → three newlines.
  • "a,b",c → comma inside quoted field is literal.
  • "a""b","c" → "" is a literal " inside a quoted field.
  • 1,"hello, world",foo → mixed quoted/unquoted with embedded comma and space.

The single-comma case (,) is not in the published set but is the canonical trap: it produces two empty fields, not one. The state machine handles it: FIELD_START sees the comma, emits \n, and stays in FIELD_START; the loop ends; the final \n is emitted, so the output is \n\n.
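
The single-comma walk-through can be reproduced with a tracing variant of the state machine. This is an illustrative sketch: `trace` and `NAMES` are hypothetical names, and the branch structure simply mirrors the references.

```python
NAMES = ["FIELD_START", "IN_UNQUOTED", "IN_QUOTED", "AFTER_CLOSING_QUOTE"]


def trace(line):
    """Return (steps, output).

    Each step is (state_before, char, emitted_text, state_after).
    """
    steps, out, state = [], [], 0
    for ch in line:
        before, emitted = state, ""
        if state == 0:  # FIELD_START
            if ch == '"':
                state = 2
            elif ch == ",":
                emitted = "\n"
            else:
                emitted, state = ch, 1
        elif state == 1:  # IN_UNQUOTED
            if ch == ",":
                emitted, state = "\n", 0
            else:
                emitted = ch
        elif state == 2:  # IN_QUOTED
            if ch == '"':
                state = 3
            else:
                emitted = ch
        else:  # AFTER_CLOSING_QUOTE
            if ch == '"':
                emitted, state = '"', 2
            elif ch == ",":
                emitted, state = "\n", 0
        out.append(emitted)
        steps.append((NAMES[before], ch, emitted, NAMES[state]))
    if line:
        out.append("\n")  # final terminator for the last field
    return steps, "".join(out)


if __name__ == "__main__":
    steps, output = trace(",")
    for step in steps:
        print(step)
    print(repr(output))
```

For `,` the trace has a single step that starts and ends in FIELD_START, and the two newlines in the output come from that step plus the final terminator.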

Zero-specific notes

  • argv[1] is the line.
  • No match/switch in Zero 0.1.2 direct backend, so the state transitions are nested if blocks (state == 0 vs state == 1 etc.).
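  • Output buffer is [1200]u8. Every input byte emits at most one output byte and one final newline follows, so the 1000-char input cap bounds the output at 1001 bytes (for example, , × 1000 is 1001 empty fields, hence 1001 newlines).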
  • Output is built once and written with a single world.out.write call on a slice of the buffer; no per-field allocations.
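
One way to sanity-check the buffer size is to count output bytes directly and compare against the input length. The sketch below is an assumption-laden illustration, not part of the corpus: `output_len` and `worst_case_growth` are hypothetical helpers that mirror the references' four-state logic and brute-force every short string over a small alphabet, malformed ones included.

```python
from itertools import product


def output_len(line):
    """Count the bytes the four-state tokenizer would write for `line`."""
    written, state = 0, 0
    for ch in line:
        if state == 0:  # FIELD_START
            if ch == '"':
                state = 2
            else:
                written += 1  # newline for ',' or the character itself
                if ch != ",":
                    state = 1
        elif state == 1:  # IN_UNQUOTED
            written += 1  # newline for ',' or the character itself
            if ch == ",":
                state = 0
        elif state == 2:  # IN_QUOTED
            if ch == '"':
                state = 3
            else:
                written += 1
        else:  # AFTER_CLOSING_QUOTE
            if ch == '"':
                written += 1
                state = 2
            elif ch == ",":
                written += 1
                state = 0
    return written + (1 if line else 0)  # final newline for non-empty input


def worst_case_growth(max_len, alphabet='a,"'):
    """Largest value of output_len(s) - len(s) over all strings up to max_len."""
    return max(
        output_len("".join(chars)) - n
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
    )


if __name__ == "__main__":
    print(worst_case_growth(6))
    print(output_len("," * 1000))
```

If the growth never exceeds one byte, a 1000-character input needs at most 1001 bytes of output, which fits in the [1200]u8 buffer.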

Cross-implementation parity

All five references produce byte-identical output on every test case, whether the line arrives on stdin (TS/Rust/Go/Python) or as argv[1] (Zero).


Cost

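| Model       | Prompt tokens | Completion tokens | API ms  |
| ----------- | ------------- | ----------------- | ------- |
| gpt-4o      | 3,145         | 1,353             | 14,505  |
| gpt-4o-mini | 3,145         | 1,243             | 22,324  |
| gpt-5       | 3,140         | 29,364            | 304,479 |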

Tokens and API ms are summed across the five languages this model attempted for this task.

