Take · Research notes
How do you grade one video?
Take's evaluation cascade is not a vibe. Each level exists because a specific published result says the obvious alternative fails. This page walks the evidence: why benchmark metrics can't judge a single clip, why a frontier vision-language model scores far below humans on temporal questions, why CLIP and DINO answer different questions, and how a 14-clip calibration set turned the papers into the exact thresholds running in production. Every claim links to a primary source. Where something is synthesis or practitioner lore rather than a citable result, it says so.
In plain words
When you press the button, a video model renders your clip. Then, before you ever see it, four layers of checking run. First, cheap sanity checks: does the file actually play, is it the right length, is it black or frozen. Second, two small vision models and some pixel math measure things a human would eyeball: how much is moving, whether the picture matches what you typed, whether the subject stays the same subject. Third, a large AI model looks at a grid of frames pulled from the clip and writes a graded review, like a film critic with a rubric. Fourth, a simple rule decides: good enough to show you, or roll again (at most three tries). The rest of this page explains why each layer exists.
1. Two different questions, two different toolboxes
Video evaluation research answers two questions that look similar and are not.
“Is the model good?” is a distributional question. You generate a corpus of videos, take a corpus of real ones, embed both, and measure the distance between distributions. That is FVD [FVD]: embed both sets with an I3D action classifier, fit Gaussians, compute the Fréchet distance. Its successor JEDi [JEDi] swaps in V-JEPA features and a kernel distance that drops the Gaussian assumption, and needs roughly a sixth of the samples.
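For reference, the Fréchet distance between two fitted Gaussians has a closed form. The symbols are introduced here for illustration: $\mu_r, \Sigma_r$ are the mean and covariance of the real-video embeddings, and $\mu_g, \Sigma_g$ those of the generated set.

```latex
d^2\big(\mathcal{N}(\mu_r,\Sigma_r),\,\mathcal{N}(\mu_g,\Sigma_g)\big)
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\big(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\big)
```

Both the mean and the covariance are statistics of a whole set of embeddings, so the quantity is only defined once there are enough samples to estimate them.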
“Is this clip good?” is a per-clip, reference-free, online question. A set of one video has no distribution, so the entire FVD family is structurally useless here. This is the question Take has to answer every time you press the button, and it needs a completely different toolbox: deterministic gates, per-clip metrics, and a judge.
There is a second reason not to lean on FVD-style thinking even for regression testing: it barely measures time. A CVPR 2024 study [FVD-bias] showed FVD is close to insensitive to temporal corruption and can be gamed by resampling from motion-free videos. The root cause is that I3D features are appearance-dominated; most Kinetics action classes are recognizable from single frames. Hold that thought, because the same failure reappears in a different costume two sections down.
2. What “good video” decomposes into
Nobody serious scores video with one number. VBench [VBench] decomposes quality into sixteen disentangled dimensions, each with its own purpose-built evaluator: subject consistency is DINO cosine across frames, background consistency is CLIP cosine, temporal flickering is frame differencing, dynamic degree is optical-flow magnitude, aesthetics is a learned predictor, and so on. VBench-2.0 [VBench-2.0] extends this to intrinsic faithfulness: human fidelity, controllability, creativity, physics, commonsense. EvalCrafter [EvalCrafter] found that a fitted combination of seventeen metrics tracks human opinion better than naive averaging. T2V-CompBench [T2V-CompBench] makes the sharpest version of the point: for compositional skills, the evaluator type has to be matched per dimension. MLLM judges for some, object detectors for others, trackers for motion binding. No single judge covers everything.
Two findings from this literature shaped Take directly.
First, physics is the open frontier. On VideoPhy and VideoPhy-2 [VideoPhy] [VideoPhy-2], the best video generators score roughly 22–40% on physical commonsense. Shadows that detach, liquids that flow uphill, hands that pass through cups. That's why physics is one of Take's six judge axes, and why it's restricted to violations visible in sampled frames.
Second, models cheat by standing still. DEVIL [DEVIL] documents that generators inflate quality scores by under-animating: a nearly static clip flickers less, drifts less, and scores higher on consistency. A motion-quantity check is therefore an anti-gaming signal, not a nice-to-have. Take measures optical flow on every clip and the judge is told explicitly that near-zero flow is a defect unless the prompt asked for stillness.
Synthesizing across VBench, EvalCrafter, T2V-CompBench, VideoPhy and the reward-model literature, six axes keep reappearing: per-frame fidelity, aesthetics, temporal consistency, motion quantity and plausibility, prompt semantics, and physics. Models trade these off against each other, which is exactly why Take reports all six per-axis instead of collapsing to a single score and calling it objective.
3. The judge's blind spot: time
The tempting architecture is one big multimodal model: hand it the video, ask “is this good?”, done. The published evidence says no, in a very specific way.
TempCompass [TempCompass] showed that video LLMs lean on single frames plus language priors and fail on speed, direction, and event-order variants of otherwise identical content. Vinoground [Vinoground] is the cleaner kill: temporal-counterfactual pairs on short videos, where the same events happen in a different order. Humans score about 90%. GPT-4o, the best model tested, scores about 50%. Open-source multimodal models and CLIP-based models do worse still, mostly at random chance. A vision-language judge cannot reliably verify “X happens, then Y” even on a ten-second clip.
There is also a sampling-theory problem (this one is synthesis, not a single citation): a judge that sees 6–16 sampled frames cannot detect 8–24 Hz flicker, because it aliases away between samples. It is the FVD content-bias failure again, in a different costume: an appearance-dominated evaluator confidently grading a temporal property it physically cannot observe.
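The aliasing argument can be checked with a few lines of arithmetic. This is a minimal sketch, assuming uniform sampling and a 5-second clip; the function names and the clip length are illustrative, not taken from Take's code.

```python
def judge_sampling(frames_seen: int, clip_seconds: float) -> tuple[float, float]:
    """Return (effective sampling rate in Hz, Nyquist limit in Hz)."""
    rate_hz = frames_seen / clip_seconds
    return rate_hz, rate_hz / 2.0


def observable(flicker_hz: float, frames_seen: int, clip_seconds: float) -> bool:
    """A periodic signal is recoverable only below half the sampling rate."""
    _, nyquist_hz = judge_sampling(frames_seen, clip_seconds)
    return flicker_hz < nyquist_hz


if __name__ == "__main__":
    # Assumed clip length of 5 seconds; frame counts span the 6-16 range.
    for frames in (6, 8, 16):
        rate, nyquist = judge_sampling(frames, clip_seconds=5.0)
        print(f"{frames} frames: {rate:.1f} Hz sampling, Nyquist {nyquist:.1f} Hz")
    print("8 Hz flicker observable from 16 frames:", observable(8.0, 16, 5.0))
```

Even at 16 frames, the sampled sheet resolves changes slower than about 1.6 Hz, several times below the bottom of the 8–24 Hz flicker range.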
Take's answer is a strict division of labor. Everything temporal is owned by deterministic code that sees every frame at full framerate: flicker by frame differencing, motion quantity by optical flow, identity drift by per-frame embeddings. The judge is explicitly instructed not to grade flicker, strobing, speed, or ordering from the contact sheet, and to read the deterministic measurements for those axes instead. The judge grades only what is actually visible in static frames: anatomy, artifacts, composition, whether the required scene elements exist, whether a shadow points the wrong way.
4. The deterministic lanes, and why these four
CLIPScore: what cosine similarity actually measures
CLIP [CLIP] trains two encoders, an image ViT and a text Transformer, on ~400M web image-text pairs with a contrastive objective: in every batch, maximize the cosine similarity of correct image-text pairs and minimize all the wrong pairings. The training objective literally is “cosine is high if and only if the text describes the image.” That is the entire reason CLIPScore [CLIPScore] works: cosine in CLIP space is a direct readout of the quantity the model was trained to encode. Take computes it between your prompt and each sampled frame, then averages.
The same objective dictates the weaknesses. Features only need to be good enough to discriminate captions within a batch, so they are semantic and categorical: word order, attribute binding, counting, and spatial relations are largely invisible to them. CLIPScore ranks on-prompt versus off-prompt usefully, but it does not calibrate to an absolute “goodness,” and it cannot tell “a red cube on a blue sphere” from the reverse. Those compositional checks belong to the judge's semantics axis.
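The prompt-versus-frame averaging can be sketched as follows. The embeddings are assumed to come from a CLIP text encoder and image encoder; here they are plain lists of floats, and `clip_alignment` is a hypothetical name. The original CLIPScore paper additionally rescales the cosine by a constant and clips negatives at zero; whether Take applies that rescaling is not stated, so this sketch returns the raw mean cosine.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def clip_alignment(prompt_embedding: Vector,
                   frame_embeddings: Sequence[Vector]) -> float:
    """Mean cosine between the prompt embedding and each sampled frame."""
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    scores = [cosine(prompt_embedding, frame) for frame in frame_embeddings]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy 2-d vectors standing in for real CLIP embeddings.
    prompt = [1.0, 0.0]
    frames = [[1.0, 0.0], [0.0, 1.0]]
    print(clip_alignment(prompt, frames))
```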
In plain words
CLIP is a model that studied 400 million internet pictures together with their captions until it could tell which caption belongs to which picture. We use it as a matching meter: we hand it your prompt and a frame from the video, and it returns a number for “how much does this picture look like that sentence.” High number, the video is about what you asked for. Low number, the model probably wandered off. What CLIP can't do is fine print: it knows “dog on a beach,” but it can't reliably count three dogs or tell left from right. That fine print is the judge's job.
DINO, not CLIP, for identity
Why use a second embedding model for subject consistency? Because CLIP is invariant to exactly the thing we need to detect. The argument is from DreamBooth [DreamBooth] and is now the VBench convention: CLIP features are class-level, so swapping in a different dog of the same breed barely moves CLIP similarity. DINO [DINO] [DINOv2] is trained with self-distillation and no labels: a student network must match a teacher's output across different crops of the same image. The only learning signal is “these views are the same instance,” so the features are forced to encode what makes this particular object itself: texture, parts, geometry. That makes DINO instance-discriminative where CLIP is class-discriminative. Take embeds every sampled frame with DINOv2 and tracks cosine similarity against frame zero. If the subject quietly morphs into a different subject by frame forty, this lane catches it.
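A minimal sketch of the frame-zero comparison, assuming the per-frame DINOv2 embeddings are already computed and passed in as lists of floats. The function name and the returned fields are illustrative, not Take's API.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def identity_drift(frame_embeddings: Sequence[Vector]) -> dict:
    """Compare every later frame against frame zero."""
    if len(frame_embeddings) < 2:
        raise ValueError("need frame zero plus at least one later frame")
    anchor = frame_embeddings[0]
    sims = [cosine(anchor, frame) for frame in frame_embeddings[1:]]
    worst = min(range(len(sims)), key=sims.__getitem__)
    return {
        "per_frame": sims,               # cosine for frames 1..N-1
        "mean": sum(sims) / len(sims),
        "min": sims[worst],
        "worst_frame": worst + 1,        # index into the sampled frames
    }


if __name__ == "__main__":
    # Toy embeddings: the third sampled frame no longer matches frame zero.
    frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    print(identity_drift(frames))
```

Reporting the minimum alongside the mean keeps a single broken frame from being averaged away.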
In plain words
CLIP thinks in categories: to CLIP, one golden retriever looks much like another golden retriever. DINO thinks in individuals: it learned, without any labels, to recognize that two different views show the same exact thing, so its features capture the fur pattern, the proportions, the specific face. AI video models have a famous failure where the subject slowly turns into a slightly different subject as the clip plays. We take a fingerprint of the very first frame with DINO and compare every later frame against it. If any frame stops matching, the dog stopped being that dog, and we catch it.
Optical flow: motion as a number
VBench's dynamic degree and DEVIL's dynamics scores are statistics over optical-flow magnitude, with RAFT [RAFT] as the standard estimator. Per-clip, the readings are diagnostic at both extremes: near-zero flow is the frozen “cheat” output DEVIL warns about, and extreme chaotic flow usually means the clip fell apart. Take runs on CPU, so it uses OpenCV's Farneback estimator as a proxy; the bands below were calibrated against that estimator, not RAFT, which is the honest way to use a substitute.
Flicker: full-framerate frame differencing
Mean absolute pixel difference between consecutive frames, computed over every transition in the clip, not the sampled subset. This is VBench's temporal-flickering measure, and it exists in Take precisely because of the sampling blind spot in section 3: it is the lane the judge can never own.
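As a sketch, the measure reduces to one number per transition, summarized over the clip. Frames are assumed here to be flattened grayscale intensity lists of equal length; the real pipeline's frame format is not specified, and the names below are hypothetical.

```python
from typing import Sequence

Frame = Sequence[float]  # flattened pixel intensities, assumed 0-255


def frame_diffs(frames: Sequence[Frame]) -> list[float]:
    """Mean absolute pixel difference for every consecutive frame pair."""
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        if len(prev) != len(cur):
            raise ValueError("all frames must have the same number of pixels")
        diffs.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(prev))
    return diffs


def flicker_stats(frames: Sequence[Frame]) -> dict:
    """Summarize the per-transition differences over the whole clip."""
    diffs = frame_diffs(frames)
    if not diffs:
        raise ValueError("need at least two frames")
    return {"mean": sum(diffs) / len(diffs), "max": max(diffs)}


if __name__ == "__main__":
    # Toy 2-pixel frames: gentle change, no change, then one large jump.
    clip = [[0, 0], [10, 10], [10, 10], [110, 110]]
    print(frame_diffs(clip))
    print(flicker_stats(clip))
```

Keeping both the mean and the maximum matters: a single large jump shows up as a maximum far above the mean, while continuous change raises both together.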
In plain words
Optical flow asks: between this frame and the next one, how far did the pixels move? Average that over the whole clip and you get one honest number for “how much is actually happening.” Near zero means the video is basically a photograph, which is a known trick models use to look better. Flicker is even simpler: we subtract each frame from the next and measure how different they are. A smooth clip changes gently; a strobing, glitchy clip jumps. Both checks look at every single frame, which is exactly what the big AI judge can't do.
Why not a learned video reward model?
The strongest per-clip graders in the literature are learned reward models: VideoScore [VideoScore] (Spearman 77.1 against human ratings), VideoScore2 [VideoScore2] (chain-of-thought before scoring), VisionReward [VisionReward] (a hierarchical checklist of binary judgments, +17.2% over VideoScore, and notably resistant to reward hacking because each judgment is interpretable), and VideoReward [VideoReward]. They are all open-weights and single-GPU-class. Take's pipeline runs on a CPU-only box today, so they are a v2 lane, not a v1 lane. The cascade is designed so a GPU reward model can slot in as an additional L1 lane without changing the architecture.
5. How the judge is built
Take's L2 judge is a frontier vision-language model reading a timestamped contact sheet: a single grid image of frames sampled uniformly across the clip, first and last frames always included. Each design choice has a paper behind it.
The contact sheet itself. IG-VLM [IG-VLM] showed that tiling sampled frames into one grid and prompting a single-image VLM beats dedicated video-LM methods on 9 of 10 zero-shot video-QA benchmarks. The grid preserves coarse temporal order through reading order, and it costs one model call instead of one per frame.
In plain words: what ffmpeg does here
ffmpeg is the open-source Swiss army knife of video; nearly every video pipeline on the internet runs on it. Take uses it for all the physical handling of the file: first to verify the clip actually decodes and has the right length, size, and framerate; then to walk through every frame for the motion and flicker math; then to pull eight evenly spaced frames out of the clip, stamp each one with its timestamp, and tile them into a single grid image, like a strip of film laid flat on a light table. That grid, the contact sheet, is the one picture the judge actually looks at. So when the verdict says “what the judge saw,” that image is literally it: eight moments from your clip, in order, made by ffmpeg.
Uniform sampling, first and last pinned. Uniform is the published default for short clips (VideoScore, VBench, IG-VLM all use it, typically 6–16 frames). Adaptive keyframe sampling [AKS] matters for long video, but for a single-shot 5–10 second clip it degenerates to: spread uniformly, and always include the first and last frames. One nuance worth flagging honestly: codec keyframes (I-frames) are a compression artifact, placed by encoder budget, and are semantically meaningless for sampling decisions. Content keyframes, by contrast, mark scene cuts [TransNet V2], and a generated single-shot clip should have zero of them; an unexpected hard cut inside one is itself a defect.
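Uniform sampling with pinned endpoints is a one-line index computation. A minimal sketch, assuming the clip's total frame count and framerate are known; `sample_indices` and `timestamps` are hypothetical helpers, and the default of eight frames follows the contact-sheet description above.

```python
def sample_indices(total_frames: int, k: int = 8) -> list[int]:
    """Pick k frame indices spread uniformly, always including first and last."""
    if total_frames < 1:
        raise ValueError("clip has no frames")
    if k < 2:
        raise ValueError("need at least two samples to pin both ends")
    if total_frames <= k:
        # Fewer frames than requested samples: take them all.
        return list(range(total_frames))
    last = total_frames - 1
    return [round(i * last / (k - 1)) for i in range(k)]


def timestamps(indices: list[int], fps: float) -> list[float]:
    """Convert frame indices to seconds for stamping onto the sheet."""
    return [index / fps for index in indices]


if __name__ == "__main__":
    # A 5-second clip at an assumed 24 fps has 120 frames.
    picks = sample_indices(120, k=8)
    print(picks)
    print([f"{t:.2f}s" for t in timestamps(picks, fps=24.0)])
```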
Discrete levels, not decimals. The judge outputs excellent / good / fair / poor / bad per axis. This is the Q-Align result [Q-Align]: language models score visual quality more reliably through discrete text-defined levels than through raw numbers, because levels are how human raters actually annotate. A “7.3/10” from a language model is fake precision.
Rationale before rating. Each axis requires the judge to write its reasoning first and commit to a level second. LiFT [LiFT] showed rationale-supervised critics generalize better, and VideoScore2 [VideoScore2] trains chain-of-thought-then-score explicitly. Asking for the level first and the justification second invites post-hoc rationalization; ordering it rationale-first is free accuracy.
Decomposed rubric, not a holistic score. This is the most convergent finding in the whole dossier: VBench's disentangled dimensions expose differences a single score hides, EvalCrafter's fitted combination beats naive averaging, T2V-CompBench matches evaluator types per dimension, and VisionReward's interpretable checklist beats a monolithic scorer by double digits. No credible paper argues the reverse.
Bounded retakes. The accept-or-retake decision at L3 is deliberately deterministic: a fixed level-to-points mapping, a fixed threshold (no axis at poor or below), at most three attempts. The judge's retake advice is folded into the next composition, which makes the loop a crude best-of-N search guided by a critic, the same shape the reward-model alignment literature uses [VideoReward], with the budget capped so a bad prompt fails fast instead of burning money.
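The decision rule is small enough to sketch in full. The five level names, the "no axis at poor or below" threshold, and the three-attempt cap come from the text; the specific point values and the `render_and_judge` callback are assumptions for illustration, and folding the judge's retake advice into the next attempt is left out.

```python
from typing import Callable

LEVELS = ("bad", "poor", "fair", "good", "excellent")
# Hypothetical point values; the text says only that the mapping is fixed.
POINTS = {level: i + 1 for i, level in enumerate(LEVELS)}
MAX_ATTEMPTS = 3

Verdict = dict[str, str]  # axis name -> level


def accept(verdict: Verdict) -> bool:
    """Accept only if no axis is graded 'poor' or below."""
    return all(POINTS[level] > POINTS["poor"] for level in verdict.values())


def roll_until_accepted(
    render_and_judge: Callable[[int], Verdict],
) -> tuple[bool, int, Verdict]:
    """Retry up to MAX_ATTEMPTS; return (accepted, attempts used, last verdict)."""
    verdict: Verdict = {}
    for attempt in range(1, MAX_ATTEMPTS + 1):
        verdict = render_and_judge(attempt)
        if accept(verdict):
            return True, attempt, verdict
    return False, MAX_ATTEMPTS, verdict


if __name__ == "__main__":
    # Stub standing in for render + judge: the first take has a physics defect.
    def fake_take(attempt: int) -> Verdict:
        physics = "poor" if attempt == 1 else "good"
        return {"fidelity": "good", "semantics": "excellent", "physics": physics}

    print(roll_until_accepted(fake_take))
```

Because the rule is deterministic, the same verdict always produces the same decision, and a prompt that never clears the bar stops after the third attempt.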
6. The cascade as economics
The four levels are ordered by cost per verdict.
- L0 gates (ffprobe decode, duration, resolution, framerate, then black-frame and frozen-frame pixel checks) cost milliseconds and catch outright failures. A clip that fails here never wastes anything downstream.
- L1 lanes (flicker, flow, CLIPScore, DINOv2 drift) cost CPU seconds and own every full-framerate temporal question.
- L2 judge costs a frontier-model call and owns the things only a generalist can see: anatomy, composition, composition-level semantics, visible physics.
- L3 decision costs nothing and exists so the expensive parts run at most three times.
This shape, deterministic filters in front of learned evaluators in front of a bounded controller, is not novel; it is what the benchmark suites institutionalize internally. What Take adds is showing you the whole trace live, per clip, while it runs.
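The cost ordering can be pictured as a short-circuiting pipeline. This is a sketch under stated assumptions: the gate, lane, and judge callables are stand-ins, and `Trace` and `run_cascade` are hypothetical names rather than Take's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Trace:
    """Everything recorded for one clip as it moves through the levels."""
    gates_passed: bool = False
    lanes: dict = field(default_factory=dict)
    verdict: Optional[dict] = None
    stopped_at: str = ""


def run_cascade(
    clip: object,
    gates: list[tuple[str, Callable[[object], bool]]],
    lanes: list[tuple[str, Callable[[object], float]]],
    judge: Callable[[object, dict], dict],
) -> Trace:
    trace = Trace()
    # L0: cheapest checks first; any failure stops before spending more.
    for name, gate in gates:
        if not gate(clip):
            trace.stopped_at = f"L0:{name}"
            return trace
    trace.gates_passed = True
    # L1: deterministic measurements over the clip.
    trace.lanes = {name: lane(clip) for name, lane in lanes}
    # L2: the expensive call, given the lane readings as context.
    trace.verdict = judge(clip, trace.lanes)
    trace.stopped_at = "L2"
    return trace


if __name__ == "__main__":
    gates = [("decodes", lambda c: True), ("not_black", lambda c: False)]
    lanes = [("flow", lambda c: 3.1)]
    judge = lambda c, readings: {"fidelity": "good"}
    print(run_cascade("clip.mp4", gates, lanes, judge))
```

The L3 decision then operates on `trace.verdict` alone, which is why it adds no cost of its own.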
7. First-party calibration: 14 clips, real thresholds
Papers give the architecture; they do not give thresholds for this generator, this flow estimator, and this CLIP checkpoint. So before launch we ran a 14-clip reference set through the full pipeline: prompts deliberately chosen to produce good clips, ambient near-static clips, busy multi-subject scenes, strobing, identity breaks, frozen output, and pure darkness. The numbers below are our measurements, and they are the exact bands the production judge reads.
| Lane | Band | Reading | Evidence from the reference set |
|---|---|---|---|
| flow (Farneback magnitude) | < 0.3 | effectively static | frozen still-wall clip measured 0.152 |
| | 0.3 – 1.5 | ambient motion, often intentional | a good cloud timelapse measured 0.969; penalizing it would be wrong |
| | 2 – 8 | normal motion | bulk of accepted clips |
| | > 15 | violent motion | scrutinize plausibility |
| clipscore (ViT-B/32) | 0.30 – 0.46 | good alignment | best clip in the set peaked at 0.458; nothing approaches the folklore “0.5+” |
| | 0.24 – 0.30 | gray zone, judge scrutinizes semantics | a fine minimalist cat clip sat at 0.276; a strobing mess sat at 0.237 |
| | < 0.24 | likely off-prompt | |
| dino_drift (DINOv2 cosine vs frame 0) | > 0.90 | same subject throughout | |
| | 0.80 – 0.90 | noticeable drift | |
| | min < 0.5 | at least one frame broke identity | a neon-sign clip hit min 0.144 and a spilling-glass clip min 0.279 while their means looked survivable; the minimum is the tell |
| flicker (mean abs frame diff) | mean > 8 | usually just a busy scene | two-dogs clip 13.4 and station-crowd clip 10.3, both fine |
| | max » mean | strobe | spikes, not a high average, are the strobing signature |
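Read as code, the bands amount to a handful of lookups. This is a sketch, not the production logic. The cut-offs are the ones in the table; the handling of exact boundary values, the "unbanded" label for readings that fall between published bands, and the assumption that the 0.80 and 0.90 DINO bands apply to the mean are all illustrative choices.

```python
def flow_band(magnitude: float) -> str:
    if magnitude < 0.3:
        return "effectively static"
    if magnitude <= 1.5:
        return "ambient motion"
    if 2.0 <= magnitude <= 8.0:
        return "normal motion"
    if magnitude > 15.0:
        return "violent motion"
    # The table leaves 1.5-2 and 8-15 unlabelled; do not invent a reading.
    return "unbanded"


def clipscore_band(score: float) -> str:
    if score < 0.24:
        return "likely off-prompt"
    if score < 0.30:
        return "gray zone"
    if score <= 0.46:
        return "good alignment"
    return "unbanded"  # above anything seen in the reference set


def dino_band(mean_cosine: float, min_cosine: float) -> str:
    # The minimum is checked first: one broken frame overrides a good mean.
    if min_cosine < 0.5:
        return "identity break"
    if mean_cosine > 0.90:
        return "same subject"
    if mean_cosine >= 0.80:
        return "noticeable drift"
    return "unbanded"


if __name__ == "__main__":
    print(flow_band(0.969), "|", clipscore_band(0.276), "|", dino_band(0.85, 0.144))
```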
Three lessons from the calibration that no paper handed us:
Read the minimum, not the mean. Identity breaks are single-frame events. Two clips in the set had means that looked acceptable while one sampled frame had completely lost the subject. The production judge is instructed to find and explain any frame where the per-frame DINO cosine drops below 0.5.
High flicker is not strobe. Busy multi-subject scenes produce large frame-to-frame differences continuously; strobe produces spikes. Mean-versus-max separates them cleanly, and a judge told only “flicker is high” would have failed two perfectly good clips.
Some separations belong in the judge, not the gate. Our frozen-frame gate threshold sits at a frame-diff of 0.35. The degenerate still-wall clip measured 0.70 and the genuinely good cloud timelapse measured 0.97. Raising the gate to catch the first would put the second at risk, and a gate is a blind instrument. We left the gate alone and gave the judge the ambient-flow band instead: a clip with flow 0.3–1.5 is graded against intent, not against a motion quota. Gates should only kill what is unambiguously dead.
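The margin argument is easy to verify with the three numbers quoted above. In this sketch the 0.35 threshold and the two frame-diff readings come from the text; the alternative threshold of 0.80 is a hypothetical value chosen only to show what raising the gate would do, and treating "at or above the threshold" as a pass is an assumption.

```python
FROZEN_GATE = 0.35  # frame-diff threshold quoted in the text

CLIPS = {
    "still-wall (degenerate)": 0.70,
    "cloud timelapse (good)": 0.97,
}


def gate_passes(frame_diff: float, threshold: float = FROZEN_GATE) -> bool:
    """Assumed convention: a clip passes if its frame-diff reaches the threshold."""
    return frame_diff >= threshold


def margins(threshold: float) -> dict[str, float]:
    """Distance of each reference clip above (+) or below (-) the threshold."""
    return {name: round(value - threshold, 2) for name, value in CLIPS.items()}


if __name__ == "__main__":
    for threshold in (FROZEN_GATE, 0.80):  # 0.80 is hypothetical
        results = {name: gate_passes(v, threshold) for name, v in CLIPS.items()}
        print(threshold, results, margins(threshold))
```

At the hypothetical 0.80 the still-wall clip is caught, but the good timelapse survives by only 0.17, which is the risk the paragraph describes.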
Darkness, for the record, dies at L0: the pure-darkness clip failed both the black-frame and frozen-frame gates and never reached a judge.
8. Honest limits
The cascade has known blind spots, and pretending otherwise would defeat the point of publishing the receipts.
Event-order verification is unsolved at every level: the lanes measure how much things change, not what happened in what order, and Vinoground says the judge can't do it either. Compositional binding is covered only by the judge's semantics axis, which is better than CLIPScore but not a detector-grade check of counts and spatial relations the way T2V-CompBench does it. Physics is graded only where a violation is visible in sampled frames. The DINO identity lane is meaningless for crowd scenes, where frame zero anchors nothing, and the judge is told to ignore it there. And the learned reward models in section 4 are absent purely for hardware reasons. Each of these is an upgrade path, not a surprise.
One more disclosure: the claim that VLM judges miss high-frequency flicker is our synthesis from sampling theory plus the FVD content-bias mechanism, not a single published experiment, and the observation that generated clips degrade toward their final frames is practitioner lore. Both informed the design anyway; both are labeled.
The best way to understand the cascade is to watch it run. Roll a take and the page will show you every gate, every lane reading, and the judge thinking in real time.
References
- [CLIP] Radford et al., Learning Transferable Visual Models From Natural Language Supervision. arxiv.org/abs/2103.00020
- [CLIPScore] Hessel et al., CLIPScore: A Reference-free Evaluation Metric for Image Captioning. arxiv.org/abs/2104.08718
- [DINO] Caron et al., Emerging Properties in Self-Supervised Vision Transformers. arxiv.org/abs/2104.14294
- [DINOv2] Oquab et al., DINOv2: Learning Robust Visual Features without Supervision. arxiv.org/abs/2304.07193
- [DreamBooth] Ruiz et al., DreamBooth (CLIP-I vs DINO for subject identity). arxiv.org/abs/2208.12242
- [FVD] Unterthiner et al., Towards Accurate Generative Models of Video. arxiv.org/abs/1812.01717
- [FVD-bias] Ge et al., On the Content Bias in Fréchet Video Distance (CVPR 2024). arxiv.org/abs/2404.12391
- [JEDi] Luo et al., Beyond FVD: An Enhanced Evaluation Metric for Video Generation. arxiv.org/abs/2410.05203
- [VBench] Huang et al., VBench: Comprehensive Benchmark Suite for Video Generative Models (CVPR 2024). arxiv.org/abs/2311.17982
- [VBench-2.0] Zheng et al., VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness. arxiv.org/abs/2503.21755
- [EvalCrafter] Liu et al., EvalCrafter: Benchmarking and Evaluating Large Video Generation Models. arxiv.org/abs/2310.11440
- [T2V-CompBench] Sun et al., T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-Video Generation (CVPR 2025). arxiv.org/abs/2407.14505
- [VideoPhy] Bansal et al., VideoPhy: Evaluating Physical Commonsense for Video Generation. arxiv.org/abs/2406.03520
- [VideoPhy-2] Bansal et al., VideoPhy-2: Challenging Action-Centric Physical Commonsense Evaluation. arxiv.org/abs/2503.06800
- [DEVIL] Liao et al., Evaluation of Text-to-Video Generation Models: A Dynamics Perspective. arxiv.org/abs/2407.01094
- [RAFT] Teed & Deng, RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. arxiv.org/abs/2003.12039
- [DOVER] Wu et al., Exploring Video Quality Assessment on User Generated Contents (ICCV 2023). arxiv.org/abs/2211.04894
- [Q-Align] Wu et al., Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels (ICML 2024). arxiv.org/abs/2312.17090
- [IG-VLM] Kim et al., An Image Grid Can Be Worth a Video. arxiv.org/abs/2403.18406
- [AKS] Tang et al., Adaptive Keyframe Sampling for Long Video Understanding (CVPR 2025). arxiv.org/abs/2502.21271
- [TempCompass] Liu et al., TempCompass: Do Video LLMs Really Understand Videos? arxiv.org/abs/2403.00476
- [Vinoground] Zhang et al., Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos. arxiv.org/abs/2410.02763
- [VideoScore] He et al., VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback (EMNLP 2024). arxiv.org/abs/2406.15252
- [VideoScore2] He et al., VideoScore2: Think Before You Score. arxiv.org/abs/2509.22799
- [VisionReward] Xu et al., VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning. arxiv.org/abs/2412.21059
- [LiFT] Wang et al., LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment. arxiv.org/abs/2412.04814
- [VideoReward] Liu et al., Improving Video Generation with Human Feedback (VideoReward / VideoAlign). arxiv.org/abs/2501.13918
- [TransNet V2] Souček & Lokoč, TransNet V2: An Effective Deep Network Architecture for Fast Shot Transition Detection. arxiv.org/abs/2008.04838