DFlash: Block Diffusion for Flash Speculative Decoding

Jian Chen, Zhijian Liu

Paper Coming Soon
Code

TL;DR: In this work, we introduce DFlash, a method that uses a lightweight block diffusion model for drafting in speculative decoding. This enables efficient and high-quality parallel drafting, pushing the limits of speculative decoding. DFlash achieves up to 6.17× lossless acceleration for Qwen3-8B, nearly 2.5× faster than the state-of-the-art speculative decoding method EAGLE-3, as shown in Figure 1.

Hugging Face Models: [Qwen3-4B-DFlash-b16] [Qwen3-8B-DFlash-b16]

Demo video of DFlash. DFlash achieves fast decoding with lossless generation, significantly outperforming EAGLE-3 in speed. We use JetLM/SDAR-8B-Chat-b16 with confidence-threshold sampling (0.9) for standalone block diffusion, and RedHatAI/Qwen3-8B-speculator.eagle3 with a speculation length of 7 for EAGLE-3.

Figure 1. Speedup comparison of DFlash and EAGLE-3 over autoregressive decoding. Overall, DFlash achieves more than 2.5× the speedup of EAGLE-3.

Quick Start

pip install transformers==4.57.3 torch==2.9.0 accelerate
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# 1. Load the DFlash Draft Model
model = AutoModel.from_pretrained(
    "z-lab/Qwen3-8B-DFlash-b16", 
    trust_remote_code=True, 
    dtype="auto", 
    device_map="cuda:0"
).eval()

# 2. Load the Target Model
target = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", 
    dtype="auto", 
    device_map="cuda:0"
).eval()

# 3. Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
tokenizer.add_special_tokens({"mask_token": "<|MASK|>"})

# 4. Prepare Input
prompt = "How many positive whole-number divisors does 196 have?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# 5. Run Speculative Decoding
generate_ids = model.spec_generate(
    input_ids=model_inputs["input_ids"], 
    max_new_tokens=2048, 
    temperature=0.0, 
    target=target, 
    mask_token_id=tokenizer.mask_token_id, 
    stop_token_ids=[tokenizer.eos_token_id]
)

print(tokenizer.decode(generate_ids[0], skip_special_tokens=True))

Why DFlash?

Autoregressive Large Language Models (LLMs) have transformed the AI landscape, but their sequential nature creates a bottleneck: inference is slow, and GPU compute is often under-utilized.

Speculative decoding addresses this bottleneck by using a small draft model to generate tokens that the target LLM verifies in parallel. While effective, state-of-the-art methods like EAGLE-3 still rely on autoregressive drafting. This serial drafting process is inefficient and prone to error accumulation, effectively capping speedups at roughly 2–3× for popular models like the Qwen3 series.

Diffusion LLMs (dLLMs) offer parallel text generation and bidirectional context modeling, presenting a promising alternative to autoregressive LLMs. However, current dLLMs still suffer from performance degradation compared to their autoregressive counterparts. Furthermore, their requirement for a large number of denoising steps to maintain generation quality limits their raw inference speed [1].

This presents a clear trade-off: AR models are performant but slow, while dLLMs allow for fast, parallel generation but often suffer from lower accuracy. Can we combine the strengths of both while avoiding their respective weaknesses? The natural solution is to utilize diffusion for drafting, taking advantage of parallelism, while relying on the AR model for verification.

How DFlash Works

However, using diffusion for drafting is non-trivial:

  1. Existing Diffusion Speculators Are Impractical: Methods like DiffuSpec [2] and SpecDiff-2 [3] rely on massive 7B-parameter draft models. This high memory footprint makes them prohibitively expensive for real-world serving and limits speedups to roughly 3–4×.
  2. Small Diffusion Models Don’t Work: Simply shrinking the diffusion drafter fails. We trained a lightweight 5-layer block diffusion model (block size 16) from scratch on data generated by Qwen3-4B and performed speculative decoding for Qwen3-4B on math benchmarks. As shown in the table below, on its own, the small model lacks the reasoning capability to align with the target, resulting in limited speedup.
| Temp | GSM8K Speedup / τ | Math500 Speedup / τ | AIME24 Speedup / τ | AIME25 Speedup / τ |
|------|-------------------|---------------------|--------------------|--------------------|
| 0 | 2.83 / 3.38 | 3.73 / 4.61 | 3.43 / 4.12 | 3.35 / 4.07 |
| 1 | 2.76 / 3.29 | 3.31 / 4.12 | 2.66 / 3.23 | 2.65 / 3.24 |

Is there no free lunch? Can we build a drafter that is both small (fast) and accurate (high acceptance)?

The Key Insight: The Target Knows Best

We demonstrate that a free lunch does exist. Our key insight is that the large AR target model’s hidden features implicitly contain information about future tokens, a phenomenon also observed by [4].

Instead of asking a tiny diffusion model to reason from scratch, DFlash conditions the draft model on context features extracted from the target model. This fuses the deep reasoning capabilities of the large model with the parallel generation speed of the small diffusion drafter.

Design

Figure 2 illustrates the system design of DFlash.

  1. Feature Fusion: After the prefill or verification steps, we extract and fuse the hidden features from the target model.
  2. Conditioning: These features are fed directly into the Key/Value (K,V) projections of the draft model layers and stored in the draft model’s KV cache.
  3. Parallel Drafting: Conditioned on this rich context (and the last verified token), the drafter predicts the next block of tokens in parallel using diffusion.

To minimize overhead, the draft model reuses the embedding and LM head layers from the target model, and only the intermediate layers are trained. We set the number of draft layers to 5, striking a balance between draft quality and speed.
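
To make these steps concrete, below is a minimal sketch of a single draft-and-verify iteration under greedy decoding. It is not the actual implementation behind spec_generate: target_step and draft_block are hypothetical callables standing in for the target forward pass (returning logits plus fused hidden features) and the feature-conditioned block diffusion drafter, and the real system additionally manages KV caches for both models.

import torch

@torch.no_grad()
def dflash_step(input_ids, context_features, target_step, draft_block,
                mask_token_id, block_size=16):
    # 1. Parallel drafting: fill a block with mask tokens and denoise it in a
    #    single step, conditioned on the fused target features from the last
    #    prefill/verify pass and on the last verified token.
    masked = torch.full((1, block_size), mask_token_id,
                        dtype=torch.long, device=input_ids.device)
    draft_logits = draft_block(input_ids[:, -1:], masked, context_features)
    draft_tokens = draft_logits.argmax(dim=-1)  # (1, block_size)

    # 2. Verification: one target forward pass over the drafted block, then
    #    accept the longest prefix where draft and target tokens agree.
    verify_logits, new_features = target_step(
        torch.cat([input_ids, draft_tokens], dim=-1))
    target_tokens = verify_logits[:, -block_size - 1:-1].argmax(dim=-1)
    matches = (draft_tokens == target_tokens)[0].long()
    n_accept = int(matches.cumprod(dim=0).sum())

    # 3. Commit the accepted prefix plus one bonus token from the target, so
    #    the output matches plain greedy decoding of the target model.
    bonus_pos = input_ids.shape[1] + n_accept - 1
    bonus = verify_logits[:, bonus_pos].argmax(dim=-1, keepdim=True)
    new_ids = torch.cat([input_ids, draft_tokens[:, :n_accept], bonus], dim=-1)
    return new_ids, new_features

Conceptually, this is the loop that spec_generate in the Quick Start repeats until max_new_tokens or a stop token is reached; the more draft tokens the target accepts per step, the fewer target forward passes are needed.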


Figure 2. The overall design of DFlash. We extract and fuse the hidden context features from the target model, feeding these features into each draft layer to perform conditional speculation.

Benchmark Results

We evaluate DFlash against the state-of-the-art speculative decoding method, EAGLE-3. The tables below compare the decoding speedup and acceptance length across various benchmarks. For DFlash, the drafting block size is 16, and the number of denoising steps is 1. For EAGLE-3, we use the pretrained model RedHatAI/Qwen3-8B-speculator.eagle3 for all experiments, with a speculation length of 7.

We set the maximum generation length to 2048 for each task. All tests disable the “thinking” mode of the Qwen3 models. Acceptance is determined using direct token matching between the draft and target sampled tokens. Check out our GitHub repository to see how to reproduce the results.
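
As a concrete reading of the τ columns in the tables below, here is a small, hypothetical sketch of how acceptance length could be computed from per-step accept counts. The convention that the bonus token drawn from the target after each verification also counts toward τ is our assumption for illustration, not necessarily the repository's exact bookkeeping.

# Hypothetical sketch: average acceptance length (tau) from a per-step log of
# how many draft tokens the target accepted.
def acceptance_length(accepted_per_step):
    # Assumption: each verification step also commits one bonus token
    # sampled from the target itself, and that token counts toward tau.
    committed = [n + 1 for n in accepted_per_step]
    return sum(committed) / len(committed)

# Example: three steps accepting 6, 8, and 7 of the 16 drafted tokens
print(acceptance_length([6, 8, 7]))  # tau = 8.0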

Math Benchmarks

| Method | Temp | GSM8K Speedup / τ | Math500 Speedup / τ | AIME24 Speedup / τ | AIME25 Speedup / τ | Average Speedup / τ |
|--------|------|-------------------|---------------------|--------------------|--------------------|----------------------|
| Qwen3-8B-speculator.eagle3 | 0 | 2.13x / 2.89 | 2.18x / 2.94 | 2.25x / 3.04 | 2.18x / 2.93 | 2.19x / 2.95 |
| Qwen3-4B-DFlash-b16 | 0 | 5.17x / 6.50 | 6.19x / 7.84 | 6.00x / 7.47 | 5.79x / 7.28 | 5.79x / 7.27 |
| Qwen3-8B-DFlash-b16 | 0 | 5.20x / 6.55 | 6.17x / 7.87 | 5.91x / 7.48 | 5.85x / 7.31 | 5.78x / 7.30 |
| Qwen3-8B-speculator.eagle3 | 1 | 2.07x / 2.79 | 2.03x / 2.75 | 1.88x / 2.54 | 1.81x / 2.44 | 1.95x / 2.63 |
| Qwen3-4B-DFlash-b16 | 1 | 4.73x / 5.98 | 5.14x / 6.67 | 3.84x / 4.97 | 3.89x / 5.01 | 4.40x / 5.66 |
| Qwen3-8B-DFlash-b16 | 1 | 4.78x / 6.04 | 5.02x / 6.57 | 3.87x / 5.06 | 3.84x / 5.03 | 4.38x / 5.68 |

Code Benchmarks

| Method | Temp | HumanEval Speedup / τ | MBPP Speedup / τ | LiveCodeBench Speedup / τ | SWE-Bench Speedup / τ | Average Speedup / τ |
|--------|------|------------------------|-------------------|----------------------------|------------------------|----------------------|
| Qwen3-8B-speculator.eagle3 | 0 | 2.48x / 3.36 | 2.27x / 3.08 | 2.24x / 3.16 | 1.90x / 2.55 | 2.22x / 3.04 |
| Qwen3-4B-DFlash-b16 | 0 | 5.26x / 6.63 | 4.87x / 6.19 | 5.41x / 6.97 | 2.97x / 3.70 | 4.63x / 5.87 |
| Qwen3-8B-DFlash-b16 | 0 | 5.20x / 6.55 | 4.75x / 6.00 | 5.43x / 7.12 | 2.92x / 3.69 | 4.58x / 5.84 |
| Qwen3-8B-speculator.eagle3 | 1 | 2.30x / 3.11 | 2.15x / 2.92 | 2.17x / 3.00 | 1.66x / 2.21 | 2.07x / 2.81 |
| Qwen3-4B-DFlash-b16 | 1 | 4.80x / 6.05 | 4.35x / 5.55 | 5.00x / 6.60 | 2.51x / 3.09 | 4.17x / 5.32 |
| Qwen3-8B-DFlash-b16 | 1 | 4.35x / 5.40 | 4.07x / 5.17 | 5.15x / 6.79 | 2.30x / 2.82 | 3.97x / 5.05 |

Chat Benchmarks

| Method | Temp | MT-Bench Speedup / τ | Alpaca Speedup / τ | Average Speedup / τ |
|--------|------|-----------------------|---------------------|----------------------|
| Qwen3-8B-speculator.eagle3 | 0 | 1.94x / 2.72 | 1.88x / 2.68 | 1.91x / 2.70 |
| Qwen3-4B-DFlash-b16 | 0 | 2.87x / 4.35 | 2.23x / 3.10 | 2.55x / 3.73 |
| Qwen3-8B-DFlash-b16 | 0 | 2.79x / 4.25 | 2.27x / 3.16 | 2.53x / 3.71 |
| Qwen3-8B-speculator.eagle3 | 1 | 1.81x / 2.55 | 1.79x / 2.56 | 1.80x / 2.56 |
| Qwen3-4B-DFlash-b16 | 1 | 2.63x / 4.03 | 2.16x / 2.99 | 2.40x / 3.51 |
| Qwen3-8B-DFlash-b16 | 1 | 2.50x / 3.74 | 2.11x / 2.88 | 2.31x / 3.31 |

Conclusion

DFlash demonstrates the promise of applying diffusion during the drafting stage of speculative decoding, pushing the speed limits of autoregressive LLMs. While dLLMs offer parallel generation, they often suffer from quality degradation compared to state-of-the-art autoregressive models. DFlash shows that by utilizing dLLMs strictly for drafting, we can fully leverage their parallel efficiency without sacrificing output quality. Crucially, by conditioning the drafter on context features from the capable target LLM, DFlash maintains high acceptance rates.

DFlash establishes a new direction for the development of diffusion LLMs. Rather than struggling to train dLLMs to match the accuracy of autoregressive models, we can instead deploy them as specialized drafters. This approach allows us to safely reduce the number of denoising steps, fully utilizing parallel generation, while relying on verification to prevent quality loss. Furthermore, training a lightweight dLLM for drafting requires significantly less compute than training a large, standalone dLLM.

We are currently working on integrating DFlash into popular serving frameworks and plan to support a wider range of models, including large MoE models. The current results are based purely on block diffusion with a block size of 16; DFlash is also compatible with various inference-time acceleration techniques for dLLMs, such as those introduced in DiffuSpec [2] and SpecDiff-2 [3], which can further improve speedups, and we plan to support these integrations soon.

Reference

[1] Qian Y-Y, Su J, Hu L, Zhang P, Deng Z, Zhao P, Zhang H. d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation[J]. arXiv preprint, 2025.

[2] Li G, Fu Z, Fang M, et al. DiffuSpec: Unlocking diffusion language models for speculative decoding[J]. arXiv preprint arXiv:2510.02358, 2025.

[3] Sandler J, Christopher J K, Hartvigsen T, et al. SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding[J]. arXiv preprint arXiv:2511.00606, 2025.

[4] Samragh M, Kundu A, Harrison D, et al. Your LLM knows the future: Uncovering its multi-token prediction potential[J]. arXiv preprint arXiv:2507.11851, 2025.

Citation

If you find DFlash useful for your research or applications, please cite our project. The full paper is coming soon!

@article{chen2026dflash,
  title   = {DFlash: Block Diffusion for Flash Speculative Decoding},
  author  = {Chen, Jian and Liu, Zhijian},
  journal = {arXiv preprint},
  year    = {2026},
  url     = {https://github.com/z-lab/dflash},
  note    = {Paper coming soon}
}