Square 1:1 format. Camera mounted directly overhead, viewing the cup ...
The prompt
Square 1:1 format. Camera mounted directly overhead, viewing the cup surface from above. Macro close-up of a barista pouring steamed milk into a ceramic cup of espresso, a rosetta leaf pattern blooming face-on in the caramel crema, wisps of steam rising, warm cafe light, ultra realistic top-down footage
What the agent wrote
Overhead macro shot, camera mounted directly above a small white ceramic cup filled with dark espresso, looking straight down at the crema. A thin stream of steamed milk pours steadily from the upper edge into the center, the white milk swirling and feathering outward as the barista wiggles and pulls, a rosetta leaf pattern blooming face-on across the caramel crema in real time. Delicate wisps of steam curl upward through warm tungsten cafe light, the surface rippling gently. Static top-down camera, shallow focus, photorealistic, soft golden highlights on the glossy crema.
Why this framing: The cup viewed dead-on from above is a centered, radially symmetric subject, so a square 1:1 frame keeps the rosetta perfectly composed as requested.
The eval receipt
Every Take clip ships with the measurements that admitted it. Deterministic gates and CV lanes run first; the VLM judge only rules on what those lanes cannot measure, and it must justify each level before naming it.
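As a rough illustration of that ordering, here is a minimal sketch of a three-stage cascade. The function name `run_cascade`, the receipt layout, and the choice to stop as soon as a gate fails are all assumptions made for the example; the page only states that gates and lanes run before the judge.

```python
def run_cascade(clip, gates, lanes, judge):
    """Run gates, then CV lanes, then the judge, and collect a receipt.

    `gates` and `lanes` map a name to a callable taking the clip.
    `judge` takes the clip plus the lane readings. All hypothetical.
    """
    receipt = {"gates": {}, "lanes": {}, "verdict": None}

    # L0: cheap deterministic checks, each returning True or False.
    for name, gate in gates.items():
        receipt["gates"][name] = gate(clip)

    # Assumption: a failed gate ends the cascade before any model runs.
    if not all(receipt["gates"].values()):
        return receipt

    # L1: numeric CV measurements.
    for name, lane in lanes.items():
        receipt["lanes"][name] = lane(clip)

    # L2: the judge sees the lane readings and rules on the rest.
    receipt["verdict"] = judge(clip, receipt["lanes"])
    return receipt
```

Passing the stages in as callables keeps the sketch independent of any particular gate or model.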
L0 · Deterministic gates
| Gate | Result | Reading | Check |
|---|---|---|---|
| decodes | pass | h264 | ffprobe parsed a video stream |
| duration | pass | 5.04s | expected 5s ± 1s |
| resolution | pass | 960x960 | expected height ~960 (±48) |
| framerate | pass | 24.00 fps | sane range 12-60 |
| not black | pass | mean luma 93.4 | 0/121 frames under 12.0 |
| not frozen | pass | mean frame diff 1.789 | static-cheat detector, threshold 0.35 |
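The checks in the table can be read as simple threshold comparisons. The sketch below assumes the measurements have already been extracted (for example with ffprobe) into a dictionary; the key names, the `run_gates` function, and the `MAX_DARK_FRACTION` limit are hypothetical, and the pass direction for the frozen check (above the threshold) is inferred from the reading in the table.

```python
# Assumption: the page does not say how many dark frames are tolerated.
MAX_DARK_FRACTION = 0.5

def run_gates(m):
    """Return {gate name: passed} for one clip's measurements."""
    dark_fraction = m["dark_frames"] / m["total_frames"]
    return {
        # A video stream was parsed at all.
        "decodes": m["codec"] is not None,
        # Expected 5 s, plus or minus 1 s.
        "duration": abs(m["duration_s"] - 5.0) <= 1.0,
        # Expected height around 960, plus or minus 48.
        "resolution": abs(m["height"] - 960) <= 48,
        # Sane range 12 to 60 fps.
        "framerate": 12.0 <= m["fps"] <= 60.0,
        # Frames with mean luma under 12.0 count as dark.
        "not_black": dark_fraction <= MAX_DARK_FRACTION,
        # Static-cheat detector: mean frame diff must exceed 0.35.
        "not_frozen": m["mean_frame_diff"] > 0.35,
    }
```

Fed the readings from the table (5.04 s, height 960, 24 fps, 0 of 121 dark frames, mean frame diff 1.789), every gate in this sketch passes.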
L1 · CV lanes
mean abs luma diff between consecutive frames
How much each frame differs from the next, checked on every frame. A high average usually just means a busy scene; sudden spikes far above the average indicate strobing.
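A dependency-free sketch of this lane, assuming each frame is already a flat list of luma values. The names `frame_diffs` and `find_spikes` and the `spike_factor` of 3 are illustrative assumptions; the page does not state how far above the average a spike must be.

```python
from statistics import mean

def frame_diffs(frames):
    """Mean absolute luma difference for each consecutive frame pair."""
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        diffs.append(mean(abs(a - b) for a, b in zip(prev, curr)))
    return diffs

def find_spikes(diffs, spike_factor=3.0):
    """Indices of frame pairs whose diff is far above the average.

    spike_factor is a hypothetical cutoff, not a documented one.
    """
    avg = mean(diffs)
    return [i for i, d in enumerate(diffs) if d > spike_factor * avg]
```

A single flashed frame shows up as two adjacent spikes, one going into the flash and one coming out of it.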
Farneback flow magnitude, motion energy per frame pair
How far pixels move between frames: one number for how much is actually happening. Under 0.3 is basically a still image; 2 to 8 is normal motion.
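To show how a dense flow field becomes "one number", the sketch below averages the length of each pixel's motion vector and maps the result onto the two bands quoted above. It does not compute the flow itself; it assumes a Farneback implementation has already produced `(dx, dy)` vectors. Averaging the magnitudes is an assumption about how the lane reduces the field, and the "unlabeled" band covers values the page does not name.

```python
import math

def motion_energy(flow):
    """Mean flow magnitude over a list of per-pixel (dx, dy) vectors."""
    return sum(math.hypot(dx, dy) for dx, dy in flow) / len(flow)

def motion_band(energy):
    """Map a motion energy reading onto the bands quoted on the page."""
    if energy < 0.3:
        return "still"
    if 2.0 <= energy <= 8.0:
        return "normal"
    # The page gives no label for 0.3 to 2 or for values above 8.
    return "unlabeled"
```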
cosine of CLIP ViT-B/32 prompt and frame embeddings
How well the frames match what you typed, scored by CLIP. 0.30 and up is well on-prompt; below 0.24 the model probably wandered off.
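The score itself is a plain cosine similarity between two embedding vectors. The sketch below assumes the prompt and frame embeddings have already been produced by CLIP and are passed in as lists of floats; the "borderline" label for scores between 0.24 and 0.30 is an assumption, since the page only names the bands on either side.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def prompt_band(score):
    """Map a prompt-to-frame cosine onto the bands quoted on the page."""
    if score >= 0.30:
        return "on-prompt"
    if score < 0.24:
        return "off-prompt"
    # Assumed label: the page does not name this middle range.
    return "borderline"
```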
DINOv2 cosine of each sampled frame against frame 0
Whether the subject stays the same subject, with every frame compared against the first. Watch the min: a single frame below 0.5 means identity broke.
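A minimal sketch of the "watch the min" rule, assuming the DINOv2 embeddings for the sampled frames are already available as lists of floats with frame 0 first. The function names are hypothetical; the 0.5 floor is the one quoted above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identity_scores(embeddings):
    """Cosine of every sampled frame against frame 0."""
    reference = embeddings[0]
    return [cosine(reference, e) for e in embeddings]

def identity_broke(embeddings, floor=0.5):
    """True if any single frame drops below the floor."""
    return min(identity_scores(embeddings)) < floor
```

Using the minimum rather than the mean means one bad frame cannot be averaged away by many good ones.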
These bands come from a 14-clip calibration set we ran before launch. See the full thresholds and the clips that set them.
L2 · Judge verdict
A convincing overhead latte-art pour: a rosetta leaf blooms face-on across warm caramel crema under golden tungsten light, with steady normal motion and rock-solid scene consistency. Visual quality is high with only mild softness in the milk feathering, and all prompt elements are clearly depicted.
What the judge saw
The timestamped contact sheet, exactly as handed to the judge. ffmpeg pulled eight evenly spaced frames from the clip, stamped each with its timestamp, and tiled them into this one grid; it is the only image the judge reads.
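The sampling and layout arithmetic behind such a sheet can be sketched as follows. This does not call ffmpeg; it only computes where eight evenly spaced frames would fall and where each would sit in the grid. Sampling at the midpoint of eight equal segments and using a 4-column layout are both assumptions, as the page does not state either.

```python
def sample_timestamps(duration_s, count=8):
    """Timestamps at the midpoints of `count` equal segments.

    Midpoint sampling is an assumption; it avoids the very first
    and very last frame of the clip.
    """
    step = duration_s / count
    return [round((i + 0.5) * step, 3) for i in range(count)]

def tile_positions(count=8, columns=4):
    """(row, column) grid cell for each sampled frame, in reading order."""
    return [(i // columns, i % columns) for i in range(count)]
```

For the 5.04 s clip above, this spacing puts one sample every 0.63 s.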