BlockVid: Block Diffusion for High-Fidelity and Coherent Minute-Long Video Generation

Zeyu Zhang1 Shuning Chang1 Yuanyu He12 Yizeng Han1 Jiasheng Tang13* Fan Wang1† Bohan Zhuang12*†
1DAMO Academy, Alibaba Group 2ZIP Lab, Zhejiang University 3Hupan Lab
*Project leads. †Corresponding authors.

TL;DR: BlockVid is a semi-AR block diffusion framework equipped with semantic sparse KV caching, block forcing, and noise scheduling. Furthermore, LV-Bench is a fine-grained benchmark for minute-long videos with dedicated metrics to evaluate long-range coherence.


Visualization Results
Scene 1 / Shot 1: forest flying view
Scene 2 / Shot 2: low-angle view
Scene 3 / Shot 3: get lost in forest
Scene 4 / Shot 4: girl walks out
Scene 5 / Shot 5: walks towards camera
Scene 6 / Shot 6: POV of forest
Scene 7 / Shot 7: meet and discuss
Scene 8 / Shot 8: close-up to serious face
Scene 9 / Shot 9: worrying expression
Scene 10 / Shot 10: continue exploration
Scene 11 / Shot 11: high-angle view
Scene 12 / Shot 12: wide shot
Scene 13 / Shot 13: arrive at a clearing
Scene 14 / Shot 14: find an abandoned house
Scene 15 / Shot 15: focus on the house
Scene 16 / Shot 16: cautiously discuss
Scene 17 / Shot 17: close-up to the door
Scene 18 / Shot 18: tries to open the door
Scene 19 / Shot 19: interior of the house
Scene 20 / Shot 20: enter and explore
Scene 21 / Shot 21: continue exploration
Scene 22 / Shot 22: into a dusty room
Scene 23 / Shot 23: showing room's layout
Scene 24 / Shot 24: close-up to the shelf
Scene 25 / Shot 25: find something on desk
Scene 26 / Shot 26: magic ball appears
Scene 27 / Shot 27: their shocked faces
Visualization Comparison
MAGI-1
Self Forcing
PAVDM
FramePack
SkyReels-V2
BlockVid (Ours)
Method

Overview of the BlockVid semi-AR framework: The generation of chunk c+1 is conditioned on both a local KV cache and a globally retrieved context. The global context is dynamically assembled by retrieving top-l semantically similar KV chunks via prompt embedding similarity. Upon generation, the bank is updated with the new chunk's most salient KV tokens.

Architecture diagram
Architecture Designs for BlockVid.

BlockVid introduces a semi-AR block diffusion architecture. During training, we are given a single-shot long video \( V = \{V_1, V_2, V_3, \ldots, V_n\} \), where each video chunk \( V_i \in \mathbb{R}^{(1+T) \times H \times W \times 3} \) has \(1+T\) frames of height \(H\) and width \(W\) with 3 RGB channels. We also have the corresponding chunk-level prompts \( \mathcal{Y} = \{y_i\}_{i=1}^{n} \), with \(y_i\) conditioning \(V_i\). The first frame serves as the image guidance. The 3D causal VAE compresses the spatio-temporal dimensions to \([(1+T/4),\, H/8,\, W/8]\) while expanding the number of channels to 16, resulting in the latent representation \( Z \in \mathbb{R}^{(1 + T/4) \times H/8 \times W/8 \times 16} \). The first frame is compressed only spatially, which is why the temporal dimension is \(1 + T/4\) rather than \((1+T)/4\), to better handle the image guidance.
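As a quick check of the shape arithmetic above, the following minimal sketch computes the latent shape from a chunk's dimensions. The function name `latent_shape` and the example resolution are illustrative assumptions, not part of the BlockVid code; only the compression factors (4 in time, 8 in space, 16 latent channels) come from the text.

```python
from typing import Tuple


def latent_shape(t: int, h: int, w: int, latent_channels: int = 16) -> Tuple[int, int, int, int]:
    """Latent shape for a chunk of (1 + t) frames at resolution h x w.

    The first frame is compressed only spatially, so it contributes one
    latent frame; the remaining t frames are compressed 4x in time.
    """
    if t % 4 != 0 or h % 8 != 0 or w % 8 != 0:
        raise ValueError("t must be divisible by 4, and h and w by 8")
    return (1 + t // 4, h // 8, w // 8, latent_channels)


# Example with an assumed chunk of 1 + 80 frames at 480 x 832:
print(latent_shape(80, 480, 832))  # → (21, 60, 104, 16)
```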

During post-training, we introduce Block Forcing, a training strategy that stabilizes long video generation by jointly integrating Velocity Forcing and Self Forcing objectives. Velocity Forcing aligns predicted dynamics with semantic history to prevent drift, while Self Forcing closes the training–inference gap by exposing the model to its own roll-outs and enforcing sequence-level realism.

In the latent space, the representation \( Z \) is first processed by the block diffusion denoiser to produce the denoised latent \( \tilde{Z} \). During this procedure, the semantic sparse KV cache is dynamically constructed and preserved as a compact memory of salient keys and values, serving as semantic guidance for subsequent chunk generation. Subsequently, the denoised latent \( \tilde{Z} \) is projected back into the video space \( \tilde{X} \).
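The retrieve-then-update behavior of the semantic sparse KV cache can be illustrated with a small, framework-free sketch. It assumes cosine similarity between prompt embeddings as the retrieval score and an externally supplied per-token saliency score for the update; the function names, the bank layout, and the saliency input are hypothetical and are not taken from the BlockVid implementation.

```python
import math
from typing import Any, List, Sequence, Tuple

# Hypothetical bank entry: (prompt embedding of a past chunk, its cached KV payload).
BankEntry = Tuple[Sequence[float], Any]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_l(query_embedding: Sequence[float], bank: List[BankEntry], l: int) -> List[int]:
    """Indices of the l bank entries most similar to the current prompt embedding."""
    scores = [cosine_similarity(query_embedding, emb) for emb, _ in bank]
    order = sorted(range(len(bank)), key=lambda i: scores[i], reverse=True)
    return order[:l]


def keep_salient(tokens: Sequence[Any], saliency: Sequence[float], k: int) -> List[Any]:
    """Keep the k tokens with the highest saliency, preserving their original order."""
    top = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


# Toy bank with 2-D prompt embeddings and string placeholders for KV payloads.
bank = [([1.0, 0.0], "kv_chunk_0"), ([0.0, 1.0], "kv_chunk_1"), ([0.9, 0.1], "kv_chunk_2")]
print(retrieve_top_l([1.0, 0.0], bank, l=2))  # → [0, 2]
```

In this sketch the bank stores only a sparse subset of each chunk's tokens, so memory grows with the number of chunks rather than with the total number of frames.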

In addition, we design a noise scheduling strategy that operates during both training and inference to stabilize long video generation. During training, progressive noise scheduling gradually increases noise levels across chunks. During inference, noise shuffling introduces local randomness at chunk boundaries to smooth transitions and maintain coherence.
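To make the idea of progressive noise scheduling concrete, here is a minimal sketch that assigns one noise level per chunk using a linear ramp. The linear form, the function name, and the `sigma_min`/`sigma_max` parameters are assumptions made for illustration; the text only states that noise levels increase gradually across chunks.

```python
from typing import List


def progressive_noise_levels(num_chunks: int, sigma_min: float = 0.1, sigma_max: float = 1.0) -> List[float]:
    """One noise level per chunk, increasing linearly from sigma_min to sigma_max.

    The linear ramp is an assumed schedule, chosen only for illustration.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    if sigma_max < sigma_min:
        raise ValueError("sigma_max must not be smaller than sigma_min")
    if num_chunks == 1:
        return [sigma_min]
    step = (sigma_max - sigma_min) / (num_chunks - 1)
    return [sigma_min + i * step for i in range(num_chunks)]


# Later chunks receive more noise than earlier ones.
levels = progressive_noise_levels(num_chunks=5)
print(levels)
```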

LV-Bench

Dataset: To tackle the challenge of minute-long video generation, we curate a dataset of 1000 videos from diverse open-source datasets and annotate them in detail. We collect high-quality video chunks at least 50 seconds long from DanceTrack, GOT-10k, HD-VILA-100M, and ShareGPT4V. To obtain high-quality annotations, we employ GPT-4o as a data engine to generate fine-grained captions for every 2–3 seconds of each video. Human-in-the-loop validation consists of manual visual checks at every stage of data production (data sourcing, chunk splitting, and captioning). In the data sourcing stage, human annotators select high-quality videos and determine whether each raw video is suitable for inclusion. In the chunk splitting stage, they examine samples to verify that each chunk is free from errors such as incorrect transitions. In the captioning stage, they review the generated descriptions to ensure semantic accuracy and coherence. At least two human annotators participate in each stage to ensure inter-rater reliability. We then randomly divide LV-Bench into an 8:2 split for training and evaluation.
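A random 8:2 split of this kind can be reproduced with a seeded shuffle. The sketch below is only an illustration: the function name, the integer video identifiers, and the seed are assumptions, and the actual split used for LV-Bench is not specified in the text.

```python
import random
from typing import Iterable, List, Tuple


def split_train_eval(video_ids: Iterable[int], train_ratio: float = 0.8, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle the ids with a fixed seed and cut them into train and eval lists."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)  # local RNG, so the global random state is untouched
    cut = int(round(len(ids) * train_ratio))
    return ids[:cut], ids[cut:]


train_ids, eval_ids = split_train_eval(range(1000))
print(len(train_ids), len(eval_ids))  # → 800 200
```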




Metrics: Drift penalties have been widely adopted to address information dilution and degradation in long video generation. For example, IP-FVR focuses on preserving identity consistency, while MoCA employs an identity perceptual loss to penalize frame-to-frame identity drift. Inspired by the Mean Absolute Percentage Error (MAPE) and the Weighted Mean Absolute Percentage Error (WMAPE), we propose a new metric called Video Drift Error (VDE) to measure changes in video quality over time. We further design five long video generation metrics based on VDE. The core idea is to divide a long video into multiple smaller segments and evaluate each segment with a specific quality metric (such as clarity or motion smoothness). Specifically, (1) VDE Clarity measures temporal drift in image sharpness, where creeping blur increases the score and a low value indicates stable clarity over time. (2) VDE Motion measures drift in motion smoothness, where a low score indicates consistent dynamics without jitter or freezing. (3) VDE Aesthetic measures drift in visual appeal, where a low score indicates sustained and coherent aesthetics over time. (4) VDE Background measures background stability, where a low score indicates a consistent setting without drift or flicker over time. (5) VDE Subject tracks identity drift, where a low score indicates the subject remains consistently recognizable over time. Following previous works, we also include five complementary metrics from VBench. The details are included in the Appendix.
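Since the exact VDE formula is deferred to the Appendix, the following is only a MAPE/WMAPE-style sketch of the segment-based idea. It assumes the first segment's score is the reference, measures the relative absolute deviation of each later segment from that reference, and averages the deviations with optional weights. The function name, the choice of reference, and the weighting scheme are assumptions, not the paper's definition.

```python
from typing import List, Optional, Sequence


def video_drift_error(segment_scores: Sequence[float], weights: Optional[Sequence[float]] = None) -> float:
    """Weighted mean relative deviation of later segments from the first segment.

    segment_scores: one quality score per segment (for example clarity).
    weights: optional weight per later segment; uniform if omitted.
    """
    if len(segment_scores) < 2:
        raise ValueError("need at least two segments")
    reference = segment_scores[0]
    if reference == 0:
        raise ValueError("reference score must be non-zero")
    later = list(segment_scores[1:])
    w: List[float] = [1.0] * len(later) if weights is None else list(weights)
    if len(w) != len(later):
        raise ValueError("need one weight per later segment")
    total_weight = sum(w)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    drift = sum(wi * abs(s - reference) / abs(reference) for wi, s in zip(w, later))
    return drift / total_weight


# A video whose per-segment score never changes has zero drift.
print(video_drift_error([0.8, 0.8, 0.8]))  # → 0.0
```

Under these assumptions, a lower value means the segment-level score stays closer to its initial value, which matches the "lower is better" reading of the five VDE metrics.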