Paper Review: LongLive: Real-time Interactive Long Video Generation


Real-time interactive video generation — write a prompt to change the video immediately!

Paper

Project

Code

LongLive is a frame-level autoregressive model for real-time, interactive long video generation. It combines high quality with fast inference through a KV-recache mechanism that updates cached states for smooth prompt transitions, a streaming long tuning approach that aligns training with long-sequence inference, and short-window attention with a frame sink to preserve consistency over time. A 1.3B-parameter short-clip model is fine-tuned for minute-long generation in 32 GPU-days, reaching 20.7 FPS and supporting videos up to 240 seconds on a single H100 GPU. It also supports INT8 quantized inference with minimal quality loss.

The approach

KV Recache

Causal autoregressive models struggle with prompt switching: clearing the KV cache causes abrupt visual jumps, while keeping it prevents quick adaptation because residual information from the old prompt lingers in the cache, where previous prompt signals were repeatedly injected through cross-attention. KV recache solves this by rebuilding the cache at each prompt switch from the already generated frames and the new prompt, preserving motion continuity while aligning semantics to the new instruction. The technique is also integrated into training to match inference conditions: when a training iteration includes a prompt switch, recaching is triggered, the rollout continues with the updated cache, and the teacher model is conditioned on the new prompt, which improves temporal smoothness and fast adaptation. Recaching adds minimal computational cost and generalizes to multiple prompt switches at inference by refreshing the cache at each boundary, yielding smooth transitions and prompt-aligned generation throughout.
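To make the idea concrete, here is a minimal sketch of what recaching at a prompt boundary could look like. This is an illustration, not the authors' code: `model`, `encode_prompt`, and the cache interface are assumed placeholders.

```python
import torch

def recache_on_prompt_switch(model, generated_frames, new_prompt):
    """Rebuild the KV cache from already generated frames and the new prompt.

    Rather than clearing the cache (abrupt jump) or keeping it (stale
    old-prompt signal), a prefill pass re-encodes the visual history under
    the new instruction.
    """
    text_emb = encode_prompt(new_prompt)   # hypothetical text encoder
    cache = model.init_cache()             # start from an empty cache
    with torch.no_grad():
        # Prefill: the cache now carries motion context from the past frames,
        # but its stored states reflect only the new prompt.
        model.prefill(generated_frames, text_emb=text_emb, cache=cache)
    return cache

# At each prompt boundary during interactive generation:
# cache = recache_on_prompt_switch(model, frames_so_far, user_prompt)
# next_frames = model.generate(text_emb=encode_prompt(user_prompt), cache=cache)
```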

Streaming long tuning

Frame-level autoregressive video models are usually trained on short clips, but during inference, they generate long videos by repeatedly conditioning on their own outputs. This causes accumulated errors, noise, and content drift because the model never sees such degraded, self-generated inputs during training. LongLive solves this with a train-long–test-long approach, where the model learns on extended sequences generated from its own predictions, improving fidelity and long-term consistency.

To make this feasible, the authors introduce a streaming long tuning procedure. Instead of generating and backpropagating through an entire long video at once, the model produces it incrementally in short segments. Each new segment is conditioned on the cached context from previous ones, and supervision is applied only to the current clip, which keeps memory usage low and supervision reliable. This rolling training process mirrors inference behavior, aligns training with deployment conditions, and avoids out-of-memory issues. Tuning on long sequences is also a prerequisite for the efficient inference techniques described next, short-window attention and the frame sink.
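A schematic of this rolling procedure is sketched below, assuming a student/teacher distillation setup; `student`, `teacher`, and `distill_loss` are placeholder interfaces, not the released LongLive code.

```python
import torch

def streaming_long_tuning_step(student, teacher, prompt, num_clips, optimizer):
    """Roll out a long video clip by clip, supervising only the newest clip."""
    cache = student.init_cache()
    for _ in range(num_clips):
        # Generate one short clip conditioned on the cached history,
        # mirroring how the model conditions on its own outputs at inference.
        clip, cache = student.generate_clip(prompt, cache)

        # Supervision is applied only to the current clip; earlier clips
        # serve as frozen context, so memory stays flat as the video grows.
        loss = distill_loss(student, teacher, clip, prompt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Detach cached states so gradients never flow through past clips.
        cache = [(k.detach(), v.detach()) for k, v in cache]
    return cache
```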

Efficient long inference

Dense causal attention becomes too expensive for long video generation because its cost scales quadratically with sequence length. To address this, LongLive uses short-window attention, restricting attention to a fixed number of recent frames. This reduces computation and memory since complexity and KV cache size depend on the window rather than the total sequence. Smaller windows improve efficiency but can hurt temporal consistency because distant context is lost.
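As a toy illustration of the windowing idea (one token per frame, a simplification of the real frame chunks), a short-window causal mask can be built as follows:

```python
import torch

def short_window_causal_mask(num_frames: int, window: int) -> torch.Tensor:
    """Boolean mask where frame i may attend only to frames (i - window, i]."""
    idx = torch.arange(num_frames)
    rel = idx[:, None] - idx[None, :]      # query index minus key index
    return (rel >= 0) & (rel < window)     # causal and within the window

mask = short_window_causal_mask(num_frames=8, window=3)
print(mask.int())
# Each row contains at most `window` ones, so per-step compute and KV-cache
# size are bounded by the window rather than the full sequence length.
```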

To mitigate this trade-off, the authors use a frame sink: a small set of global tokens (from the first video frames) is kept permanently in the KV cache and remains accessible to all layers, providing stable long-range context even with local attention. This preserves high long-video quality while lowering compute time by 28% and peak memory by 17% on a single H100 GPU.
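Combined with the sliding window, the cache-eviction rule could be sketched as below; shapes and names are illustrative only, not the paper's implementation.

```python
import torch

def evict_kv(cache_k: torch.Tensor, cache_v: torch.Tensor,
             sink: int, window: int):
    """cache_k, cache_v: [frames, heads, dim] KV entries, oldest first.

    Keep the first `sink` frames permanently (the frame sink) plus the most
    recent `window` frames; drop everything in between.
    """
    total = cache_k.shape[0]
    if total <= sink + window:
        return cache_k, cache_v                     # nothing to evict yet
    keep_sink = slice(0, sink)                      # global sink tokens
    keep_recent = slice(total - window, total)      # local attention window
    k = torch.cat([cache_k[keep_sink], cache_k[keep_recent]], dim=0)
    v = torch.cat([cache_v[keep_sink], cache_v[keep_recent]], dim=0)
    return k, v

k, v = torch.randn(12, 8, 64), torch.randn(12, 8, 64)
k, v = evict_kv(k, v, sink=2, window=6)
print(k.shape)   # torch.Size([8, 8, 64]): 2 sink frames + 6 recent frames
```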

Short-window attention and frame sinks are integrated into streaming training so the model learns under the same conditions used at inference. Only recent frames and the current supervised clip are used in gradient computation, and sink tokens are never evicted. This approach prevents memory growth with video length, stabilizes identity and scene information, and allows efficient KV recaching from only the latest frames, preserving continuity while reducing cost.

Experiments

LongLive matches or surpasses existing video generation models across short, long, and interactive scenarios while remaining highly efficient. On short videos, it achieves quality and stability comparable to the best models and is the fastest, reaching 20.7 FPS. For long single-prompt videos, it delivers state-of-the-art quality and consistency while outperforming others in speed.

In interactive multi-prompt settings, it shows strong semantic adherence, smooth transitions, and high consistency, outperforming Self-Forcing and SkyReels-V2, while being over 41 times faster than SkyReels-V2 and slightly faster than Self-Forcing.

Ablation studies confirm the effectiveness of its components. KV recache maintains temporal continuity while adapting quickly to new prompts, avoiding the abrupt changes seen when the cache is cleared and the inertia seen when it is retained. Window-size ablations reveal a trade-off between quality and efficiency: larger windows improve consistency but increase cost. The frame sink mitigates this, allowing small windows to achieve near-large-window consistency while maintaining speed and low memory usage.

