BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation

Zeyu Zhang¹ Shuning Chang¹ Yuanyu He¹² Yizeng Han¹ Jiasheng Tang¹³* Fan Wang¹ Bohan Zhuang¹²*

¹DAMO Academy, Alibaba Group ²ZIP Lab, Zhejiang University ³Hupan Lab

*Corresponding authors.

TL;DR: BlockVid is a semi-autoregressive (semi-AR) block diffusion framework equipped with semantic sparse KV caching, Block Forcing, and noise scheduling. We also introduce LV-Bench, a fine-grained benchmark for minute-long videos with dedicated metrics for evaluating long-range coherence.

Visualization Results

Shot 1: forest flying view
Shot 2: low-angle view
Shot 3: get lost in forest
Shot 4: girl walks out
Shot 5: walks towards camera
Shot 6: POV of forest
Shot 7: meet and discuss
Shot 8: close-up to serious face
Shot 9: worrying expression
Shot 10: continue exploration
Shot 11: high-angle view
Shot 12: wide shot
Shot 13: arrive at a clearing
Shot 14: find an abandoned house
Shot 15: focus on the house
Shot 16: cautiously discuss
Shot 17: close-up to the door
Shot 18: tries to open the door
Shot 19: interior of the house
Shot 20: enter and explore
Shot 21: continue exploration
Shot 22: into a dusty room
Shot 23: showing room's layout
Shot 24: close-up to the shelf
Shot 25: find something on desk
Shot 26: magic ball appears
Shot 27: their shocked faces

Visualization Comparison

Compared methods: MAGI-1, Self Forcing, PAVDM, FramePack, SkyReels-V2, and BlockVid (Ours).

Method

Overview of the BlockVid semi-AR framework: The generation of chunk \(c+1\) is conditioned on both a local KV cache and a globally retrieved context. The global context is assembled dynamically by retrieving the top-\(l\) semantically similar KV chunks from a KV bank, ranked by prompt-embedding similarity. After each chunk is generated, the bank is updated with that chunk's most salient KV tokens.
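The retrieval step in the overview can be illustrated with a small sketch. It assumes the bank stores one prompt embedding per cached KV chunk and that similarity is cosine similarity; the names `cosine`, `retrieve_top_l`, and the `(embedding, chunk_id)` bank layout are hypothetical and not taken from the BlockVid implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_l(query_emb, bank, l):
    """Return the ids of the l bank entries whose prompt embedding is most
    similar to query_emb. `bank` is a list of (prompt_embedding, chunk_id)
    pairs; this layout is an assumption made for illustration."""
    ranked = sorted(bank, key=lambda entry: cosine(query_emb, entry[0]), reverse=True)
    return [chunk_id for _, chunk_id in ranked[:l]]

# Toy bank with 2-D embeddings standing in for real prompt embeddings.
bank = [
    ([1.0, 0.0], "chunk_0"),
    ([0.0, 1.0], "chunk_1"),
    ([0.9, 0.1], "chunk_2"),
]
top = retrieve_top_l([1.0, 0.0], bank, l=2)
```

In this toy example the two entries closest in direction to the query are selected, and the orthogonal entry is left out of the global context.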

Architecture Designs for BlockVid.

BlockVid introduces a semi-AR block diffusion architecture. During training, we are given a single-shot long video \( V = \{V_1, V_2, V_3, \ldots, V_n\} \), where each video chunk \( V_i \in \mathbb{R}^{(1+T) \times H \times W \times 3} \) consists of one guidance frame followed by \(T\) frames, with height \(H\), width \(W\), and 3 RGB channels. We also have the corresponding chunk-level prompts \( \mathcal{Y} = \{y_i\}_{i=1}^{n} \), with \(y_i\) conditioning \(V_i\). The first frame of each chunk serves as the image guidance. The 3D causal VAE compresses each chunk's spatio-temporal dimensions to \([(1+T/4),\, H/8,\, W/8]\) while expanding the number of channels to 16, resulting in the latent representation \( Z \in \mathbb{R}^{(1 + T/4) \times H/8 \times W/8 \times 16} \). The first frame is compressed only spatially, not temporally, to better handle the image guidance.
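As a quick check of the shape arithmetic above, the following sketch computes the latent shape from a chunk's dimensions. The function name `latent_shape` and the example resolution are assumptions chosen for illustration; only the compression factors (4 in time, 8 in space, 16 channels) come from the text.

```python
def latent_shape(T, H, W, latent_channels=16):
    """Latent shape for a chunk of 1 + T frames at H x W resolution.

    The first frame is kept as a single latent frame (spatial compression
    only); the remaining T frames are compressed 4x in time. Height and
    width are compressed 8x.
    """
    if T % 4 or H % 8 or W % 8:
        raise ValueError("T must be divisible by 4, and H and W by 8")
    return (1 + T // 4, H // 8, W // 8, latent_channels)

# Example: 1 + 80 frames at 480 x 832 (an illustrative resolution).
shape = latent_shape(T=80, H=480, W=832)
# shape == (21, 60, 104, 16)
```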

During post-training, we introduce Block Forcing, a training strategy that stabilizes long video generation by jointly integrating Velocity Forcing and Self Forcing objectives. Velocity Forcing aligns predicted dynamics with semantic history to prevent drift, while Self Forcing closes the training–inference gap by exposing the model to its own roll-outs and enforcing sequence-level realism.

In the latent space, the representation \( Z \) is first processed by the block diffusion denoiser to produce the denoised latent \( \tilde{Z} \). During this procedure, the semantic sparse KV cache is dynamically constructed and preserved as a compact memory of salient keys and values, serving as semantic guidance for subsequent chunk generation. Subsequently, the denoised latent \( \tilde{Z} \) is projected back into the video space \( \tilde{X} \).
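To make the idea of a compact memory of salient keys and values concrete, here is a minimal sketch of top-k token selection. It assumes a per-token saliency score is already available; how BlockVid scores saliency is not specified in this section, so `scores` and `select_salient` are hypothetical.

```python
def select_salient(tokens, scores, k):
    """Keep the k tokens with the highest saliency score, preserving their
    original order. `scores` is an assumed, externally supplied saliency
    value per token."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

# Toy example: string labels stand in for (key, value) pairs.
kept = select_salient(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], k=2)
# kept == ["b", "d"]
```

Keeping the survivors in their original order is a design choice of this sketch, so that any positional information attached to the tokens stays monotonic.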

We also design a noise scheduling strategy that operates during both training and inference to stabilize long video generation. During training, progressive noise scheduling gradually increases noise levels across chunks, while during inference, noise shuffling introduces local randomness at chunk boundaries to smooth transitions and maintain coherence.
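The two scheduling ideas can be sketched as follows. This is only one possible reading: it assumes a linear ramp of noise levels across chunks and a shuffle of per-frame noise within a small window around each chunk boundary. The function names, the `low`/`high` bounds, and the `window` parameter are hypothetical and are not taken from the BlockVid implementation.

```python
import random

def progressive_noise_levels(n_chunks, low=0.1, high=0.9):
    """Noise level per chunk, increasing linearly from low to high (assumed ramp)."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be positive")
    if n_chunks == 1:
        return [low]
    step = (high - low) / (n_chunks - 1)
    return [low + i * step for i in range(n_chunks)]

def shuffle_at_boundaries(noise_frames, chunk_len, window, rng):
    """Shuffle noise frames inside a +/- window around each chunk boundary,
    leaving all other frames in place."""
    out = list(noise_frames)
    for boundary in range(chunk_len, len(out), chunk_len):
        lo = max(0, boundary - window)
        hi = min(len(out), boundary + window)
        segment = out[lo:hi]
        rng.shuffle(segment)
        out[lo:hi] = segment
    return out

levels = progressive_noise_levels(5)
rng = random.Random(0)
# Integers stand in for per-frame noise tensors.
shuffled = shuffle_at_boundaries(list(range(12)), chunk_len=4, window=1, rng=rng)
```

With `chunk_len=4` and `window=1`, only the frames at positions 3 to 4 and 7 to 8 can move; every other frame keeps its position.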

LV-Bench

Dataset: To tackle the challenge of minute-long video generation, we curate a dataset of 1000 videos from diverse open-source datasets and annotate them in detail. We collect high-quality video chunks at least 50 seconds long from DanceTrack, GOT-10k, HD-VILA-100M, and ShareGPT4V. To obtain high-quality annotations, we employ GPT-4o as a data engine to generate fine-grained captions for every 2–3 seconds of each video. Human-in-the-loop validation consists of manual visual checks at every stage of data production: data sourcing, chunk splitting, and captioning. In the data sourcing stage, human annotators select high-quality videos and determine whether each raw video is suitable for inclusion. In the chunk splitting stage, they examine samples to verify that each chunk is free from errors such as incorrect transitions. In the captioning stage, they review the generated descriptions to ensure semantic accuracy and coherence. At least two human annotators participate in each stage to provide inter-rater reliability. We then randomly divide LV-Bench into training and evaluation sets with an 8:2 split.
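A random 8:2 split of this kind can be reproduced with a seeded shuffle. The sketch below is illustrative only: the video identifiers, the seed, and the function name `split_dataset` are assumptions, not the authors' actual split procedure.

```python
import random

def split_dataset(video_ids, train_ratio=0.8, seed=0):
    """Shuffle ids with a fixed seed and cut them into train/eval lists."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

# 1000 placeholder ids standing in for the LV-Bench videos.
train_ids, eval_ids = split_dataset(range(1000))
# len(train_ids) == 800, len(eval_ids) == 200
```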

Metrics: Drift penalties have been widely adopted to address information dilution and degradation in long video generation. For example, IP-FVR focuses on preserving identity consistency, while MoCA employs an identity perceptual loss to penalize frame-to-frame identity drift. Inspired by the Mean Absolute Percentage Error (MAPE) and the Weighted Mean Absolute Percentage Error (WMAPE), we propose a new metric, Video Drift Error (VDE), to measure changes in video quality over time. We further design five long video generation metrics based on VDE. The core idea is to divide a long video into multiple smaller segments and evaluate each segment with a specific quality metric (such as clarity or motion smoothness). Specifically, (1) VDE Clarity measures temporal drift in image sharpness: creeping blur increases the score, while a low score indicates stable clarity over time. (2) VDE Motion measures drift in motion smoothness: a low score indicates consistent dynamics without jitter or freezing. (3) VDE Aesthetic measures drift in visual appeal: a low score indicates sustained and coherent aesthetics over time. (4) VDE Background measures background stability: a low score indicates a consistent setting without drift or flicker over time. (5) VDE Subject tracks identity drift: a low score indicates that the subject remains consistently recognizable over time. Following previous works, we also include five complementary metrics from VBench. The details are included in the Appendix.
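The segment-based idea behind VDE can be sketched in code. The exact definition is given in the Appendix, which is not reproduced here, so the following is only one MAPE-style reading: it assumes that drift is the weighted mean of each later segment's relative deviation from the first segment's score. The function name `video_drift_error` and the uniform default weights are assumptions.

```python
def video_drift_error(segment_scores, weights=None):
    """Weighted mean relative deviation of later segments from the first one.

    segment_scores: per-segment quality scores (e.g. clarity), in temporal order.
    weights: optional per-segment weights for segments 2..n (uniform if omitted).
    This form is an assumption modeled on MAPE/WMAPE, not the paper's
    confirmed formula.
    """
    if len(segment_scores) < 2:
        raise ValueError("need at least two segments")
    ref = segment_scores[0]
    if ref == 0:
        raise ValueError("reference score must be non-zero")
    rel = [abs(s - ref) / abs(ref) for s in segment_scores[1:]]
    if weights is None:
        weights = [1.0] * len(rel)
    if len(weights) != len(rel):
        raise ValueError("need one weight per non-reference segment")
    return sum(w * r for w, r in zip(weights, rel)) / sum(weights)

stable = video_drift_error([0.8, 0.8, 0.8])    # no drift
drifting = video_drift_error([1.0, 0.5, 0.5])  # quality halves after segment 1
```

Under this reading, a video whose per-segment score never changes gets a drift of zero, and larger values indicate stronger degradation relative to the opening segment.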