Cut · Research notes

How do you cut a film nobody shot?

Cut assembles 10–30 second films from shots that never existed until the moment they were sampled. That changes what editing means. There is no set, no coverage, no second camera, and no footage bin; there is a timeline document, a chain of conditioning edges, and a stack of checks at every junction. This page walks the lineage: fifty years of edit decision lists, a century of cutting-room grammar, the published work on frame-conditioned generation, the unglamorous mechanics of ffmpeg concat, and a build-system trick that makes revisions nearly free. Where a claim has a paper, it links to one. Where it is cutting-room or pipeline lore, it says so.

In plain words

You type a story. An AI plans it as a list of 5-second shots, like a tiny storyboard, plus a shared visual style and a note about how each shot should hand off to the next. Each shot is rendered by a video model, and each new shot starts from the final frame of the one before it, so the world carries over instead of resetting. Every shot is graded before it is accepted, with up to three tries. Then the accepted shots are spliced together, every splice point is photographed and inspected, and a final judge reviews the whole piece. Afterward you can ask for changes in chat; most changes turn out to be edits to the plan, not to the footage, so they cost almost nothing. The rest of this page explains why each piece works the way it does.

1. The timeline is data: from the CMX 600 to OTIO

In 1971 the CMX 600 introduced computer-assisted editing, and with it an idea that outlived every machine that implemented it: the edit is not the footage. The edit is a list. An edit decision list records which source goes where, with what in and out points, joined by what transition. The CMX3600 EDL format, a plain-text descendant of that system, is still accepted by editing software half a century later, which is the kind of longevity pixels never get.

The modern form of the idea is OpenTimelineIO [OTIO], the interchange format Pixar built and open-sourced, now an Academy Software Foundation project [ASWF]. OTIO models a cut as a typed tree: tracks contain clips, clips reference media, transitions and effects annotate the joins, and arbitrary metadata rides along. The point of the design is that the composition is inspectable and editable as a document, separately from any rendered output.

Cut's composition document is in this lineage: an EDL/OTIO-inspired JSON timeline holding the shot list, the shared style block, the transitions, the conditioning edges between shots, and the assembly metadata. It is the source of truth for the piece. The rendered video is a projection of it, the way a compiled binary is a projection of source code. Composition-as-data buys three things composition-as-pixels cannot: the timeline is re-renderable (lose every video file and the piece can be rebuilt), diffable (two versions of an edit compare as structured data, not as two opaque MP4s), and revisable (a change request becomes a patch against a document rather than surgery on a flattened render). Section 7 cashes in all three.
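As a rough illustration of what composition-as-data makes possible, the sketch below builds a small timeline document and diffs two versions of it. Every field name here (`style`, `shots`, `edges`, `transitions`) is a hypothetical stand-in, since this page does not publish Cut's actual schema, and `diff_timelines` is an illustrative helper rather than anything from OTIO.

```python
import json

# Hypothetical field names throughout; Cut's real schema is not shown on this page.
timeline = {
    "title": "Night ride",
    "style": "35mm, sodium-vapor streetlight, shallow depth of field",
    "shots": [
        {"id": "s1", "prompt": "a cyclist pushes off from a kerb", "seconds": 5},
        {"id": "s2", "prompt": "the cyclist crosses a wet junction", "seconds": 5},
        {"id": "s3", "prompt": "the cyclist coasts downhill", "seconds": 5},
    ],
    # Conditioning edges: each shot is seeded by the last frame of another.
    "edges": [
        {"from": "s1", "to": "s2", "kind": "last_frame"},
        {"from": "s2", "to": "s3", "kind": "last_frame"},
    ],
    # One transition per junction, in timeline order.
    "transitions": [
        {"after": "s1", "type": "cut"},
        {"after": "s2", "type": "cut"},
    ],
}


def diff_timelines(old, new, path=""):
    """Return the document paths at which two timeline versions differ."""
    if isinstance(old, dict) and isinstance(new, dict):
        changed = []
        for key in sorted(set(old) | set(new)):
            changed += diff_timelines(old.get(key), new.get(key), f"{path}/{key}")
        return changed
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        changed = []
        for i, (a, b) in enumerate(zip(old, new)):
            changed += diff_timelines(a, b, f"{path}/{i}")
        return changed
    return [] if old == new else [path]


revised = json.loads(json.dumps(timeline))  # deep copy via a JSON round trip
revised["transitions"][1]["type"] = "crossfade"
print(diff_timelines(timeline, revised))  # ['/transitions/1/type']
```

The diff names one junction and nothing else, which is the property two flattened MP4s cannot offer.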

2. Why multi-shot is not N times single-shot

The naive view of a 30-second piece is six 5-second renders in a row. The naive view fails because the hard part of a film is not inside the shots. It is between them.

A century of continuity editing exists to manage that gap. The 180-degree rule keeps cameras on one side of the action line so screen direction survives a cut. Eyeline matches make two separate shots read as one exchange of glances. Match cuts carry a shape or a motion across the join. Walter Murch's working rule [Murch] orders the priorities of a cut: emotion first, story second, rhythm third, with eye-trace and spatial continuity below them, and his observation is that a cut works when the audience's attention is already prepared to jump. None of this grammar is decoration. It is the machinery that lets a sequence of discontinuous images read as one continuous world.

Film production earns its continuity physically. The set persists between takes. The same actor wears the same costume under the same light, and coverage means the editor chooses among shots of the same real event. Generative video has none of that. Each text-to-video sample is drawn fresh from the model; describe the same character twice and you get two castings, two wardrobes, two color sciences. Every cut between independently sampled shots is a potential identity break, and the audience's tolerance for identity breaks at a cut is approximately zero, because a century of cinema trained them to read a cut as “same world, new angle.”

So a multi-shot system has to manufacture, by construction, the continuity that a film set provides for free. Cut attacks the problem at three layers: at generation time (section 3), at assembly time (section 4), and at judgment time (sections 5 and 6).

3. Continuity at generation time: chaining the last frame

The weakest tool for cross-shot consistency is prose. Repeating “the same red-jacketed cyclist” in six prompts does not bind six samples to one cyclist; text conditions the distribution, and the distribution contains every red-jacketed cyclist the model can imagine. The published lever that actually binds appearance is image conditioning. Stable Video Diffusion [SVD] is the canonical open example: the image-to-video variant conditions generation on a supplied frame, so the output inherits that frame's subjects, palette, and layout rather than re-rolling them. Luma's Ray generation API exposes the same capability as keyframe conditioning, a start_frame the clip must grow out of.

Cut chains on it. Shot 1 renders from text alone. Shot 2 is seeded with the final frame of the accepted shot 1, plus its own prompt and the shared style block. Shot 3 seeds from shot 2, and so on down the timeline. The world state at each junction (who is in frame, what they wear, where the light falls) is carried by actual pixels rather than by a description of pixels. Continuity is generated, not faked in post. The conditioning edges live in the composition document as first-class data, which matters later: they are what makes invalidation cascade correctly in section 7.

Chaining has a known failure mode: drift. Each hop conditions on the output of the previous hop, so small errors compound, the same error-accumulation shape that limits long autoregressive generation. VideoPoet [VideoPoet], which generates video autoregressively and extends clips by predicting from the last second of output, sits squarely in this regime, and OpenAI's Sora report [Sora] treats long-horizon temporal consistency as a headline challenge rather than a solved one. A chain is a ratchet: it can carry identity forward, but it can also carry a defect forward. Cut bounds the chain at 2–6 shots, re-asserts the style block in every shot's prompt as a corrective pressure, and, crucially, only chains off accepted frames: every shot passes Take's full evaluation cascade (deterministic gates, perceptual metrics, frame-grid judge, at most three attempts) before its last frame is allowed to seed the next shot. A flawed frame that gets rejected never becomes anyone's ancestor.
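The control flow described above can be sketched in a few lines. This is a minimal illustration, not Cut's implementation: `render` and `accept` are hypothetical callables standing in for the video model and the evaluation cascade, and a clip is assumed to be a dict with a `last_frame` entry.

```python
MAX_ATTEMPTS = 3  # the per-shot bound stated on this page


def render_chain(prompts, style, render, accept):
    """Render shots in order, seeding each from the last *accepted* frame.

    `render(prompt, seed_frame)` and `accept(clip)` are hypothetical
    stand-ins for the video model and the evaluation cascade.
    """
    accepted = []
    seed = None  # shot 1 renders from text alone
    for prompt in prompts:
        full_prompt = f"{prompt}. Style: {style}"  # style re-asserted every shot
        for _ in range(MAX_ATTEMPTS):
            clip = render(full_prompt, seed)
            if accept(clip):
                break
        else:
            # No attempt passed: fail fast rather than chain off a bad frame.
            raise RuntimeError(f"shot failed after {MAX_ATTEMPTS} attempts: {prompt!r}")
        accepted.append(clip)
        seed = clip["last_frame"]  # only accepted frames become ancestors
    return accepted


# Toy run: the second render call is rejected, so shot 2 is retried
# from the same accepted seed frame.
seeds_seen = []


def fake_render(prompt, seed):
    seeds_seen.append(seed)
    n = len(seeds_seen)
    return {"last_frame": f"frame{n}", "ok": n != 2}


clips = render_chain(["shot a", "shot b"], "noir", fake_render, lambda c: c["ok"])
print(seeds_seen)                        # [None, 'frame1', 'frame1']
print([c["last_frame"] for c in clips])  # ['frame1', 'frame3']
```

Note that the rejected clip's frame (`frame2`) never appears as a seed: the retry starts again from `frame1`.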

In plain words

Telling the model “same character as before” is like describing a stranger to a sketch artist twice and hoping for identical drawings. Handing it the last frame of the previous shot is like handing the artist a photograph. Cut always hands over the photograph. The risk is the photocopy-of-a-photocopy effect, where tiny flaws build up shot after shot, so Cut keeps the chain short, re-states the visual style every time, and never passes a frame forward until that shot has been inspected and accepted.

4. Mechanical assembly: what concat actually requires

This section is practitioner lore, in the same sense the Take page uses the term: it is documented behavior and hard-won pipeline habit, not a research result. It is also where naive multi-shot pipelines actually die.

ffmpeg offers two main paths for joining clips [ffmpeg-concat]. The concat demuxer splices files without re-encoding, which is fast and lossless, but it is only valid when every input carries the same codec, resolution, framerate, pixel format, and timebase; feed it mismatched streams and the output ranges from stuttering timestamps to a file that simply does not play. The concat filter re-encodes and is forgiving, at the price of one more generation of encoding loss and some CPU time. Either way, the reliable recipe is the same: normalize before you concatenate. Cut decodes each accepted shot and re-encodes it to one canonical profile (same codec, resolution, constant framerate, pixel format, and timebase) before any joining happens. Generated video makes this non-optional: clips arriving from a rendering API at different times are not guaranteed to share identical encoder settings, and normalization-before-concat is the difference between a video and a corrupted file.
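A sketch of the normalize-then-concat recipe, expressed as the ffmpeg command lines it would produce. The profile values (1280x720, 30 fps, yuv420p, libx264, a 15360 track timescale) and the choice to drop audio with `-an` are assumptions for illustration; the page does not state Cut's actual canonical profile. The code only builds the commands and does not invoke ffmpeg.

```python
import shlex

# Assumed canonical profile; the real values Cut uses are not stated on this page.
PROFILE = {"width": 1280, "height": 720, "fps": 30,
           "pix_fmt": "yuv420p", "timescale": 15360}


def normalize_cmd(src, dst, p=PROFILE):
    """Build an ffmpeg command that re-encodes one shot to the canonical profile."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale={p['width']}:{p['height']},fps={p['fps']}",
        "-c:v", "libx264", "-pix_fmt", p["pix_fmt"],
        "-video_track_timescale", str(p["timescale"]),
        "-an",  # assumption: shots carry no audio worth keeping
        dst,
    ]


def concat_list(paths):
    """Text for the concat demuxer: one `file '...'` line per input.

    Paths containing single quotes would need escaping; omitted here.
    """
    return "".join(f"file '{p}'\n" for p in paths)


def concat_cmd(list_file, dst):
    """Stream-copy join; only valid because every input was normalized first."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", dst]


shots = ["s1.mp4", "s2.mp4"]
normalized = [f"norm_{s}" for s in shots]
for src, dst in zip(shots, normalized):
    print(shlex.join(normalize_cmd(src, dst)))
print(concat_list(normalized), end="")
print(shlex.join(concat_cmd("shots.txt", "piece.mp4")))
```

Because every input shares one profile, the final join can use `-c copy` and stay a lossless splice.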

Transitions are an assembly-time choice. A hard cut is free: it is the absence of processing, two normalized streams butted together. A crossfade is not: ffmpeg's xfade filter [xfade] overlaps the tail of one shot with the head of the next for a stated duration and offset, consuming timeline from both sides and forcing a re-encode of the junction. The composition document records the transition per junction, which is exactly what lets a “make that cut a crossfade” note in section 7 be a re-stitch instead of a re-render.
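The timeline cost of a crossfade is simple arithmetic: each crossfade of duration f overlaps two shots and shortens the piece by f, while a hard cut shortens it by nothing. The sketch below computes where each junction starts on the output timeline and the resulting total length. The function name and the 0.5-second fade are illustrative assumptions; `transition`, `duration`, and `offset` are the xfade filter's own option names.

```python
def plan_junctions(durations, transitions, fade=0.5):
    """Return (offsets, total_seconds) for a sequence of shots.

    durations:   seconds per shot, in timeline order
    transitions: "cut" or "crossfade" for each junction (len(durations) - 1)
    offsets[i]:  time on the output timeline at which junction i begins
    """
    if len(transitions) != len(durations) - 1:
        raise ValueError("need exactly one transition per junction")
    running = float(durations[0])  # length of the piece assembled so far
    offsets = []
    for kind, nxt in zip(transitions, durations[1:]):
        overlap = fade if kind == "crossfade" else 0.0
        offsets.append(running - overlap)
        running += nxt - overlap  # a crossfade consumes `fade` seconds overall
    return offsets, running


offsets, total = plan_junctions([5, 5, 5], ["cut", "crossfade"], fade=0.5)
print(offsets, total)  # [5.0, 9.5] 14.5

# The crossfaded junction as an xfade filter expression:
print(f"xfade=transition=fade:duration=0.5:offset={offsets[1]}")
```

Three 5-second shots with one half-second crossfade run 14.5 seconds, not 15, which is why the transition has to be recorded in the document rather than inferred after the fact.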

5. Judging the seams: shot-boundary research, inverted

There is a mature research line on finding cuts. Shot transition detection networks like TransNet V2 [TransNet V2] are trained to scan footage and locate the frames where one shot ends and another begins, learning the visual signature of a hard cut or a dissolve. Cut's situation is the mirror image: it does not need to find its cuts, it placed them. The question inverts from “where is the cut?” to “does the cut I made read as intentional?” A junction can be mechanically perfect and still be a bad edit: a subject who teleports, a palette that lurches, an exposure jump that reads as a glitch instead of a scene change.

Cut inspects every junction two ways. Deterministically, it measures the color and luminance delta across the join, last frame of shot N against first frame of shot N+1; a junction that chained correctly at generation time should change little unless the timeline says it should. And as evidence for the judge, it renders a seam contact sheet: at every junction, the outgoing frame and the incoming frame side by side, labeled, in order. The sheet does for cuts what Take's contact sheet does for clips: it turns a temporal claim into a static image that both a model and a human can actually check. If shot 3 hands a cyclist to shot 4 and shot 4 opens on a different bicycle, the seam sheet shows it in one glance, with no video playback and no temporal reasoning required.
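A minimal sketch of the deterministic half of that check, assuming frames are available as rows of 8-bit RGB tuples. The metric (difference of mean color and of mean luma, using Rec. 709 weights on the encoded values as an approximation) and both thresholds are illustrative assumptions, not Cut's published measurements.

```python
def mean_rgb(frame):
    """Average (R, G, B) over a frame given as rows of (r, g, b) tuples."""
    pixels = [px for row in frame for px in row]
    n = len(pixels)
    return tuple(sum(px[c] for px in pixels) / n for c in range(3))


def luma(rgb):
    """Approximate luma using Rec. 709 weights."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def seam_delta(outgoing, incoming):
    """Compare the last frame of shot N with the first frame of shot N+1."""
    a, b = mean_rgb(outgoing), mean_rgb(incoming)
    color = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return {"luma": abs(luma(a) - luma(b)), "color": color}


def needs_review(delta, intent, luma_max=20.0, color_max=30.0):
    """Flag a large delta only where the timeline asked for continuity.

    `intent` and both thresholds are hypothetical.
    """
    big = delta["luma"] > luma_max or delta["color"] > color_max
    return intent == "continuous" and big


dusk = [[(40, 30, 60)] * 2] * 2      # tiny 2x2 stand-in frames
noon = [[(200, 210, 230)] * 2] * 2

d = seam_delta(dusk, noon)
print(needs_review(d, "continuous"))    # True: a sky that jumps from dusk to noon
print(needs_review(d, "scene_change"))  # False: the timeline asked for the jump
```

Mean-color deltas are deliberately crude; they catch palette and exposure lurches cheaply and leave identity questions to the seam contact sheet.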

In plain words

Imagine printing the very last frame of each shot next to the very first frame of the shot that follows it, like pairs of photographs taped to a wall at every splice. That is the seam contact sheet. Most continuity mistakes, a character whose jacket changes color, a sky that jumps from dusk to noon, are obvious the instant the two frames sit side by side, even though they hide easily inside a playing video.

6. The final judge: a sequence is not a long clip

Judging an assembled piece is a different problem from judging one shot, and the temptation is the same one Take's research notes document: hand a vision-language model the whole video and ask if it is good. The evidence against that is unchanged. TempCompass [TempCompass] showed video LLMs lean on single frames plus language priors and fail on speed, direction, and event order; Vinoground [Vinoground] showed that even GPT-4o, the strongest model tested, reached only about 50% on temporal counterfactuals over short videos, far below the human baseline, with most open models near chance. A multi-shot piece is strictly worse terrain: the property under judgment, whether six shots read as one continuous story, is temporal twice over, ordering within shots and ordering across them.

So Cut's final judge never watches video. It reads two static artifacts. The first is a frame grid sampled across the whole piece, every shot represented, in timeline order, the IG-VLM result [IG-VLM] applied at sequence scale: tiling frames into one image preserves coarse order through reading order and beats dedicated video-LM approaches on most zero-shot video QA. The second is the seam contact sheet from section 5, which pre-localizes exactly the junction comparisons a sampled grid might straddle or miss. Everything the judge would be bad at verifying temporally has been converted into something it is good at: side-by-side comparison of frames it can actually see.
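One way to guarantee that every shot is represented in timeline order is to sample a fixed number of frames per shot rather than a fixed number per piece. The sketch below picks sample times at the midpoints of equal slices of each shot and lays them out in reading order. The per-shot count and the grid width are assumptions; the page does not state the grid dimensions Cut uses.

```python
def grid_times(durations, per_shot=2):
    """Sample times (seconds) covering every shot, in timeline order.

    Each shot is split into `per_shot` equal slices and sampled at the
    midpoint of each slice, so no sample lands exactly on a junction.
    """
    times = []
    start = 0.0
    for d in durations:
        for k in range(per_shot):
            times.append(start + d * (k + 0.5) / per_shot)
        start += d
    return times


def grid_cells(count, cols):
    """(row, col) for each sampled frame, filled left to right, top to bottom."""
    return [divmod(i, cols) for i in range(count)]


times = grid_times([5, 5, 5], per_shot=2)
print(times)                      # [1.25, 3.75, 6.25, 8.75, 11.25, 13.75]
print(grid_cells(len(times), 3))  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```

Midpoint sampling keeps the grid away from the junctions by design, which is exactly the gap the seam contact sheet is there to cover.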

The judge rules on seven axes. Six are inherited from Take's per-clip rubric (fidelity, aesthetics, consistency, motion, semantics, physics) and applied at the piece level. The seventh is cross-cut continuity: does each junction preserve identity, palette, and light where the timeline says it should, and change them where the timeline says it should? That last clause matters. A deliberate scene change is supposed to produce a large junction delta, so the judge reads the composition document's transition and style intent alongside the seam sheet, rather than penalizing every big delta as a defect. Verdicts at the piece level feed bounded repair, the same shape as Take's bounded retakes: re-shoot the named offending shot, not the piece, and not forever.

7. Edit routing: most notes touch the timeline, not the pixels

A finished piece accepts up to five chat revisions. The design insight is in what people actually ask for. “Swap shots 2 and 3.” “Make the last cut a crossfade.” “Change the title.” These notes are edits to the composition document. No pixel they care about needs to be re-generated; the shots already exist, normalized, on disk. Charging a full re-render for a reorder would be like recompiling a program because someone renamed the README.

Cut therefore routes every revision note through a classifier with two outcomes. Deterministic notes (reorder, transition change, trim, retitle) are applied to the timeline and the piece is re-stitched from cached shots: ffmpeg time, seconds, approximately free. Generative notes (“re-shoot shot 4 with rain,” “make the whole thing warmer”) genuinely change what some shot must contain, so those shots, and only those shots, go back to the renderer.
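The two-outcome routing can be sketched as follows, assuming an upstream step has already parsed the chat note into a structured operation. The `op` names, the note fields, and the simplified timeline shape are all hypothetical; the page does not describe how Cut's classifier represents notes.

```python
import copy

# Hypothetical operation names for notes that only touch the document.
DETERMINISTIC_OPS = {"reorder", "set_transition", "trim", "retitle"}


def route(note):
    """Return "restitch" for document-only edits, "rerender" otherwise."""
    return "restitch" if note["op"] in DETERMINISTIC_OPS else "rerender"


def apply_deterministic(timeline, note):
    """Apply a document-only edit and return a new timeline (input untouched)."""
    new = copy.deepcopy(timeline)
    if note["op"] == "retitle":
        new["title"] = note["title"]
    elif note["op"] == "set_transition":
        new["transitions"][note["junction"]] = note["type"]
    elif note["op"] == "reorder":
        new["shots"] = [timeline["shots"][i] for i in note["order"]]
    else:
        raise ValueError(f"not handled in this sketch: {note['op']}")
    return new


timeline = {"title": "Night ride",
            "shots": ["s1", "s2", "s3"],
            "transitions": ["cut", "cut"]}

crossfade = {"op": "set_transition", "junction": 1, "type": "crossfade"}
reshoot = {"op": "reshoot", "shot": 2, "prompt": "the same shot, in rain"}

print(route(crossfade), route(reshoot))  # restitch rerender
print(apply_deterministic(timeline, crossfade)["transitions"])  # ['cut', 'crossfade']
```

Anything not on the deterministic list falls through to the renderer, so an unrecognized note errs toward re-rendering rather than silently skipping work.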

The mechanism that makes “only those shots” safe rather than hopeful is content addressing, borrowed from build systems. Each shot carries a content hash over everything that determines its pixels: its prompt, the shared style block, and its conditioning fingerprint, the identity of the frame it was seeded with. This is the Merkle hash-tree idea [Merkle] applied to a chain: because the conditioning fingerprint is part of the hash, each shot's hash transitively commits to its entire ancestry. Nix [Nix] built a whole operating system on the principle that an artifact's address is the hash of everything that produced it, so identical inputs are never rebuilt; Bazel's remote cache [Bazel] applies the same memoization to compilation at scale. Cut applies it to shots: a revision re-renders exactly the shots whose hash changed and serves every other shot from cache.

The chain does the dependency analysis for free. Re-shoot shot 2 and its last frame changes, which changes shot 3's conditioning fingerprint, which changes shot 3's hash, and so on downstream: invalidation cascades along the conditioning edges automatically, with no hand-maintained dependency list to forget. Upstream shots are untouched; a style-block change, by contrast, lands in every shot's hash and honestly invalidates everything. The hash does not decide what to rebuild. It is the decision.
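The cascade can be demonstrated with a few lines of hashing. This is a sketch under one loud assumption: the parent shot's hash is used as a stand-in for the seed frame's identity, which treats rendering as a pure function of its inputs in the Nix style. A real system might hash the seed frame's pixels instead. SHA-256 and the JSON payload layout are also illustrative choices.

```python
import hashlib
import json


def shot_hash(prompt, style, seed_fingerprint):
    """Hash everything assumed to determine a shot's pixels."""
    payload = json.dumps(
        {"prompt": prompt, "style": style, "seed": seed_fingerprint},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def chain_hashes(prompts, style):
    """Hash each shot in order; each commits to its parent, hence its ancestry."""
    hashes = []
    parent = None  # shot 1 has no seed frame
    for prompt in prompts:
        parent = shot_hash(prompt, style, parent)
        hashes.append(parent)
    return hashes


def stale(old, new):
    """Indices of shots whose hash changed and must be re-rendered."""
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]


prompts = ["push off", "cross junction", "coast downhill", "arrive"]
before = chain_hashes(prompts, "35mm, dusk")

# Re-shoot the second shot (index 1) with rain: it and everything downstream go stale.
edited = prompts[:1] + ["cross junction, in rain"] + prompts[2:]
print(stale(before, chain_hashes(edited, "35mm, dusk")))   # [1, 2, 3]

# A style change lands in every hash.
print(stale(before, chain_hashes(prompts, "35mm, noon")))  # [0, 1, 2, 3]
```

No dependency list appears anywhere in the code; the downstream invalidation falls out of including the parent in each hash.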

In plain words

Every shot gets a fingerprint made from its instructions: what was asked for, what style applies, and which frame it grew out of. When you request a change, Cut recomputes the fingerprints. Shots whose fingerprint did not change are reused from storage at no cost; only the changed ones are re-filmed. And because each shot's fingerprint includes the frame it started from, re-filming one shot automatically flags the shots after it, while everything before it stays untouched. Software build tools have worked this way for years; Cut just treats shots as build artifacts.

8. The arithmetic

The whole design lands in one comparison. A 6-shot piece with 5 revisions, handled naively, is 30 shot renders on top of the original 6, before counting retakes. With edit routing and content addressing, a typical revision pass re-renders 0 shots (the note was deterministic) or 1–2 (one shot re-shot, occasionally with a cascade hop), so 5 revisions usually cost a handful of renders, not thirty.
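The comparison above, worked through in code. The per-revision re-render counts are a hypothetical log chosen to match the "0, or 1–2" pattern described; only the shot and revision counts come from the text.

```python
SHOTS = 6
REVISIONS = 5

# Naive handling: every revision re-renders every shot.
naive_extra = SHOTS * REVISIONS

# Routed handling: hypothetical log of shots re-rendered per revision
# (three deterministic notes, one single re-shoot, one re-shoot with a cascade hop).
rerenders_per_revision = [0, 0, 1, 0, 2]
routed_extra = sum(rerenders_per_revision)

print(naive_extra, routed_extra)  # 30 3
```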

Per shot: at most 3 render attempts, gated by the full eval cascade. A shot that cannot pass in three tries fails fast instead of burning budget.
Per junction: deterministic delta checks plus a seam contact sheet, CPU seconds and one image.
Per version: one stitch, one frame grid, one final-judge call, and a bounded repair pass.
Per revision: a router call, then a re-stitch (seconds) or a targeted re-render of only the hash-invalidated shots.

Every bound is enforced, not aspirational, and they compose into a hard ceiling of $4.50 per piece, revisions included. The numbers are ordinary engineering. What makes them possible is the oldest idea on this page: the CMX 600's insight that the edit is a list. Because Cut's film is a document, changing the film is mostly changing the document, and documents are cheap.

The best way to understand the pipeline is to watch it plan, shoot, grade, and splice. Cut a film and the page shows the shot list, every junction, and the judge's ruling as they happen.

References

[OTIO] Academy Software Foundation, OpenTimelineIO: Open Source API and Interchange Format for Editorial Timeline Information github.com/AcademySoftwareFoundation/OpenTimelineIO
[ASWF] Academy Software Foundation, hosted projects (OpenTimelineIO adopted 2021) www.aswf.io/projects/
[Murch] Walter Murch, In the Blink of an Eye: A Perspective on Film Editing (Silman-James Press) en.wikipedia.org/wiki/In_the_Blink_of_an_Eye_(Murch_book)
[SVD] Blattmann et al., Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets arxiv.org/abs/2311.15127
[VideoPoet] Kondratyuk et al., VideoPoet: A Large Language Model for Zero-Shot Video Generation arxiv.org/abs/2312.14125
[Sora] OpenAI, Video Generation Models as World Simulators (Sora technical report) openai.com/index/video-generation-models-as-world-simulators/
[ffmpeg-concat] FFmpeg Wiki, Concatenating media files (demuxer vs filter, stream requirements) trac.ffmpeg.org/wiki/Concatenate
[xfade] FFmpeg Filters Documentation, xfade (cross fade between two input videos) ffmpeg.org/ffmpeg-filters.html#xfade
[TransNet V2] Souček & Lokoč, TransNet V2: An Effective Deep Network Architecture for Fast Shot Transition Detection arxiv.org/abs/2008.04838
[TempCompass] Liu et al., TempCompass: Do Video LLMs Really Understand Videos? arxiv.org/abs/2403.00476
[Vinoground] Zhang et al., Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos arxiv.org/abs/2410.02763
[IG-VLM] Kim et al., An Image Grid Can Be Worth a Video (IG-VLM) arxiv.org/abs/2403.18406
[Merkle] Merkle, A Digital Signature Based on a Conventional Encryption Function (CRYPTO '87, hash trees) link.springer.com/chapter/10.1007/3-540-48184-2_32
[Nix] Dolstra, Löh & Pierron, NixOS: A Purely Functional Linux Distribution (JFP 2010) edolstra.github.io/pubs/nixos-jfp-final.pdf
[Bazel] Bazel documentation, Remote Caching (content-addressed action and artifact cache) bazel.build/remote/caching