FlashBlock: Attention Caching for Efficient
Long-Context Block Diffusion

Zhuokun Chen1,*
Jianfei Cai1
Bohan Zhuang2
1Monash University
2Zhejiang University

Abstract

Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over an ever-growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, substantially reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy. When integrated, it substantially improves model accuracy under aggressive sparsification by offsetting much of the performance loss induced by sparsity. Experiments on diffusion language models and video generation demonstrate up to 1.44× higher token throughput and up to 1.6× reduction in attention time, with negligible impact on generation quality.

Cross-Step Redundancy in Block Diffusion

We empirically analyze attention behavior in block diffusion and find a clear separation: block-external attention (over tokens already generated) remains largely stable across diffusion steps, while block-internal attention (over tokens currently being denoised) varies significantly.

Figure 1: Cross-step stability of block-external vs. block-internal attention. Top: Block-external attention ($A_{\mathrm{out}}$) shows high similarity across steps. Bottom: Block-internal attention ($A_{\mathrm{in}}$) varies substantially.
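One way to quantify this kind of cross-step stability is to compare the attention output of the same tokens at two consecutive diffusion steps. The sketch below uses mean per-token cosine similarity, which is an assumption (the page does not name the metric). The function names and the random toy tensors are hypothetical, and the toy data is not meant to reproduce the figure, only to show how $A_{\mathrm{out}}$ and $A_{\mathrm{in}}$ would be compared.

```python
import numpy as np

def partial_attention(q, k, v):
    """Softmax attention of queries q over a single key/value subset (k, v)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def mean_cosine_similarity(a, b):
    """Mean per-token cosine similarity between two (tokens, dim) arrays."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float((num / den).mean())

rng = np.random.default_rng(0)
d, n_ext, n_blk = 16, 64, 4

# Block-external keys/values come from the KV cache and are identical at both steps.
k_ext = rng.normal(size=(n_ext, d))
v_ext = rng.normal(size=(n_ext, d))

# Toy stand-ins for two consecutive diffusion steps of the current block:
# the in-block queries, keys and values all drift between steps.
q_t, k_in_t, v_in_t = (rng.normal(size=(n_blk, d)) for _ in range(3))
q_t1, k_in_t1, v_in_t1 = (x + 0.3 * rng.normal(size=x.shape)
                          for x in (q_t, k_in_t, v_in_t))

# A_out: attention over block-external tokens; A_in: over block-internal tokens.
sim_out = mean_cosine_similarity(partial_attention(q_t, k_ext, v_ext),
                                 partial_attention(q_t1, k_ext, v_ext))
sim_in = mean_cosine_similarity(partial_attention(q_t, k_in_t, v_in_t),
                                partial_attention(q_t1, k_in_t1, v_in_t1))
print(f"A_out similarity across steps: {sim_out:.3f}")
print(f"A_in  similarity across steps: {sim_in:.3f}")
```

On a real model, the two calls would be replaced by attention outputs recorded at consecutive diffusion steps of the same block.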

Method: Block-External Attention Caching

Method overview. We explicitly decompose attention into block-internal and block-external components. At the first diffusion step of a block, we compute and cache the attention output ($A_{\mathrm{out}}$) and log-normalizer ($L_{\mathrm{out}}$) over block-external tokens. In subsequent diffusion steps within the same block, we recompute attention only over the block-internal tokens ($A_{\mathrm{in}}$) and combine it with the cached block-external statistics in log-space.
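The decomposition rests on a standard softmax identity, written out below for a single query $q_i$. The score notation $s_{ij}$, the index sets $\mathcal{O}$ (block-external) and $\mathcal{I}$ (block-internal), and the symbol $L_{\mathrm{in}}$ are assumptions introduced here for illustration; whether FlashBlock's kernels use exactly this form is not stated on this page.

```latex
\begin{aligned}
s_{ij} &= \frac{q_i^{\top} k_j}{\sqrt{d}}, \\
L_{\mathrm{out}} &= \log \sum_{j \in \mathcal{O}} e^{s_{ij}}, &
A_{\mathrm{out}} &= \sum_{j \in \mathcal{O}} e^{s_{ij} - L_{\mathrm{out}}}\, v_j, \\
L_{\mathrm{in}} &= \log \sum_{j \in \mathcal{I}} e^{s_{ij}}, &
A_{\mathrm{in}} &= \sum_{j \in \mathcal{I}} e^{s_{ij} - L_{\mathrm{in}}}\, v_j, \\
A &= \frac{e^{L_{\mathrm{out}}} A_{\mathrm{out}} + e^{L_{\mathrm{in}}} A_{\mathrm{in}}}
          {e^{L_{\mathrm{out}}} + e^{L_{\mathrm{in}}}}
   = \sigma(L_{\mathrm{out}} - L_{\mathrm{in}})\, A_{\mathrm{out}}
   + \sigma(L_{\mathrm{in}} - L_{\mathrm{out}})\, A_{\mathrm{in}}.
\end{aligned}
```

Here $\sigma$ is the logistic sigmoid. The identity is exact when all four quantities are computed with the same query. Reusing $A_{\mathrm{out}}$ and $L_{\mathrm{out}}$ from the first diffusion step makes the result an approximation at later steps, whose quality depends on the cross-step stability described above.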

Figure 2: Overview of FlashBlock. We reuse cached block-external attention and recompute only block-internal attention, significantly reducing KV cache access and attention computation.
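The caching scheme can be sketched in a few lines of NumPy. This is a minimal single-head illustration under stated assumptions, not the authors' implementation: `attention_with_lse`, `merge`, and `BlockExternalCache` are hypothetical names, and the tensors are random toy data.

```python
import numpy as np

def attention_with_lse(q, k, v):
    """Return the softmax attention output and per-query log-normalizer over (k, v)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    m = scores.max(axis=-1, keepdims=True)
    w = np.exp(scores - m)
    z = w.sum(axis=-1, keepdims=True)
    return (w / z) @ v, m + np.log(z)  # shapes: (n_q, d), (n_q, 1)

def merge(a_out, l_out, a_in, l_in):
    """Combine two partial attention results in log-space."""
    l_all = np.logaddexp(l_out, l_in)
    return np.exp(l_out - l_all) * a_out + np.exp(l_in - l_all) * a_in

class BlockExternalCache:
    """Hypothetical helper that holds (A_out, L_out) for the current block."""

    def __init__(self):
        self.a_out = None
        self.l_out = None

    def step(self, q, k_ext, v_ext, k_in, v_in):
        if self.a_out is None:
            # First diffusion step of the block: touch the KV cache once.
            self.a_out, self.l_out = attention_with_lse(q, k_ext, v_ext)
        # Every step: recompute only the block-internal part.
        a_in, l_in = attention_with_lse(q, k_in, v_in)
        return merge(self.a_out, self.l_out, a_in, l_in)

rng = np.random.default_rng(0)
d, n_ext, n_blk = 16, 128, 4
k_ext, v_ext = rng.normal(size=(n_ext, d)), rng.normal(size=(n_ext, d))
q, k_in, v_in = (rng.normal(size=(n_blk, d)) for _ in range(3))

cache = BlockExternalCache()
first_step = cache.step(q, k_ext, v_ext, k_in, v_in)
full, _ = attention_with_lse(q, np.concatenate([k_ext, k_in]),
                             np.concatenate([v_ext, v_in]))
print("first step matches full attention:", np.allclose(first_step, full))

# A later step with drifted in-block tensors reuses the cached (A_out, L_out).
q2, k_in2, v_in2 = (x + 0.1 * rng.normal(size=x.shape) for x in (q, k_in, v_in))
later_step = cache.step(q2, k_ext, v_ext, k_in2, v_in2)
```

At the first step the merged result equals full attention up to floating-point error. At later steps the cached block-external term is an approximation, and the work per step no longer depends on the length of the KV cache.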

Experimental Results

Efficiency

FlashBlock consistently reduces per-step inference latency relative to the baseline, and the gap widens as context length increases. As the context grows, latency rises at only about half the rate of the original model, which implies a theoretical speedup upper bound of roughly 2×.

Figure 3: Per-step inference latency. Our method (orange) vs. Baseline (blue).
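To see where a bound of about 2× comes from, consider a toy linear latency model in which per-step latency is a fixed cost plus a term proportional to context length, and the cached variant halves only the proportional term. The linear form and both constants are assumptions chosen for illustration, not measurements.

```python
def speedup(context_len, fixed_cost=1.0, slope=0.001):
    """Baseline / cached latency when the context-dependent slope is halved."""
    baseline = fixed_cost + slope * context_len
    cached = fixed_cost + 0.5 * slope * context_len
    return baseline / cached

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"context {n:>9,d}: speedup {speedup(n):.3f}x")
```

Under this model the ratio increases with context length and approaches, but never reaches, 2.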

Mathematical Reasoning & Coding

| Method | Block Size | TPS | GSM8K | MATH500 | AIME | MBPP | HumanEval |
|---|---|---|---|---|---|---|---|
| Trado-8B-Thinking | 4 | 312 | 93.25 | 86.00 | 33.33 | 25.60 | 50.61 |
| Trado-8B + FlashBlock | 4 | 451 | 93.12 | 85.80 | 33.33 | 33.60 | 51.22 |
| Trado-8B-Thinking | 8 | 532 | 91.74 | 82.00 | 26.67 | 29.00 | 54.27 |
| Trado-8B + FlashBlock | 8 | 674 | 90.22 | 81.80 | 26.67 | 32.00 | 53.66 |
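The throughput gain per block size follows directly from the TPS column. The short calculation below uses only the four TPS values reported above; the dictionary layout is just a convenient way to hold them.

```python
# TPS values copied from the table above, keyed by block size.
tps = {
    4: {"baseline": 312, "flashblock": 451},
    8: {"baseline": 532, "flashblock": 674},
}

speedups = {bs: row["flashblock"] / row["baseline"] for bs, row in tps.items()}
for bs, s in speedups.items():
    print(f"block size {bs}: {s:.3f}x throughput")
```

The block-size-4 ratio is the one that corresponds to the 1.44× throughput figure quoted in the abstract; the gain at block size 8 is smaller.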

Combination with Sparse Attention

FlashBlock is orthogonal to sparse attention. When combined, it significantly improves performance by recovering information lost due to sparsification (Table 2 of paper).

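| Method | GSM8K Acc. | Δ vs. SparseD | MATH500 Acc. | Δ vs. SparseD | HumanEval Pass@1 | Δ vs. SparseD |
|---|---|---|---|---|---|---|
| SparseD (d=20%) | 34.72 | | 39.40 | | 23.78 | |
| SparseD + Ours | 42.68 | +7.96 | 46.80 | +7.40 | 33.54 | +9.76 |
| SparseD (d=30%) | 68.61 | | 59.20 | | 29.88 | |
| SparseD + Ours | 72.25 | +3.64 | 64.60 | +5.40 | 36.59 | +6.71 |
| SparseD (d=40%) | 84.61 | | 66.20 | | 39.02 | |
| SparseD + Ours | 87.26 | +2.65 | 69.40 | +3.20 | 44.51 | +5.49 |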
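Each Δ entry is the difference between the SparseD + Ours score and the SparseD score at the same setting of d. The check below recomputes those differences from the reported scores; the variable names are illustrative.

```python
benchmarks = ("GSM8K", "MATH500", "HumanEval")

# Scores copied from the table above, in benchmark order.
results = {
    "d=20%": {"sparse": (34.72, 39.40, 23.78), "ours": (42.68, 46.80, 33.54)},
    "d=30%": {"sparse": (68.61, 59.20, 29.88), "ours": (72.25, 64.60, 36.59)},
    "d=40%": {"sparse": (84.61, 66.20, 39.02), "ours": (87.26, 69.40, 44.51)},
}

deltas = {
    setting: tuple(round(o - s, 2) for s, o in zip(row["sparse"], row["ours"]))
    for setting, row in results.items()
}
for setting, row in deltas.items():
    print(setting, dict(zip(benchmarks, row)))
```

On all three benchmarks the gain is largest at d=20%, the setting where SparseD alone scores lowest, and shrinks as d increases.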

Video Generation

Qualitative results show that FlashBlock maintains generation quality and temporal consistency while improving efficiency.

Video Demos

Each prompt below is rendered once with the baseline and once with FlashBlock (Ours).

Complex Landscape

The camera begins in a vast grassland, where the lush green grass sways gently in the breeze, the air fresh, and the soft rustling of the leaves fills the space. As the camera move
The camera gently descends, passing through the layers of waves, entering the deep underwater world. The surrounding coral reefs are vibrant and colorful, with a variety of tropica
The camera moves through the vast expanse of the universe, where stars twinkle against the dark backdrop, the Milky Way arcing like a silver river across the sky. Nebulae slowly ro
The camera starts at a tranquil coastline, where the waves gently crash against the rocks, the salty sea breeze fills the air, and the atmosphere feels fresh and alive. As the came

Human Identity

A man is dancing.
A man is playing badminton.
A man is playing ping-pong.
A woman is playing basketball.

Motion Rationality

A person is drinking coffee from a cup.
A person is eating a slice of pizza.
A person is eating hamburger.
A person is eating ice cream.