FlashDrive: Flash Vision-Language-Action Inference For Autonomous Driving
Zekai Li*, Yihao Liang*, Hongfei Zhang*, Jian Chen, Zhijian Liu
This is an early preview. The paper and additional results will be available shortly.
Traditional autonomous driving systems separate perception and planning, which limits their ability to handle rare, complex scenarios: the “long tail” of edge cases that real-world driving demands. Vision-Language-Action (VLA) models offer a fundamentally different approach: by integrating chain-of-thought reasoning into end-to-end driving, they can think through novel situations step by step, producing both explicit reasoning traces and trajectory predictions. At CES 2026, NVIDIA released Alpamayo, the industry’s first open-source reasoning VLA model for autonomous driving, marking a major milestone for this paradigm.
But reasoning takes time. Alpamayo 1 (10B parameters, built on Qwen3-VL) takes 788ms per step on an NVIDIA RTX PRO 6000, roughly 1.3 Hz, far short of the real-time requirements for safe driving. We profiled the full pipeline and found no single dominant bottleneck: encode, prefill, decode, and action each consume substantial time.
FlashDrive is an algorithm-system co-design framework that attacks all four stages, reducing end-to-end latency to 176ms, a 4.5× speedup with negligible accuracy loss. With W4A16 quantization on top, latency drops further to 158ms (4.9×), and the model fits on an RTX 4090 where the baseline cannot even load.
The Bottleneck Is Everywhere
A typical VLA driving model’s inference breaks into four stages: vision encoding, prompt prefilling, reasoning token decoding, and action generation via flow matching. We profiled Alpamayo 1 and found that latency is spread across all four stages with no single dominant bottleneck. Getting close to real-time requires optimizing the entire stack.
Streaming Inference
Unlike a chatbot VLM that processes a single image per request, a driving VLA must ingest a continuous multi-camera video stream. At every step, the model processes a sliding window of temporal frames across multiple camera views (e.g., 4 frames × 4 views). But consecutive time steps overlap by 75%: three out of four frames are identical. Re-encoding the full window from scratch every step wastes computation on frames the model has already seen.
We introduce a streaming inference strategy that processes only the new frame:
- KV cache reuse from the three previously encoded frames eliminates 75% of vision computation.
- Pre-RoPE key caching with on-the-fly rotary embeddings handles dynamic position shifts as old frames are evicted and new ones arrive.
- A custom streaming attention mask accommodates view-major token ordering across cameras, ensuring each new frame attends only to frames from the current and previous views while remaining causal within itself.
This reduces the effective sequence length by 75%, significantly accelerating the encode and prefill stages.
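As a toy sketch of the KV-reuse idea (the shapes, single-head attention, and frame-level cache here are illustrative, not Alpamayo's implementation), each step encodes only the newest frame while serving keys and values for the previous three frames from a cache:

```python
# Streaming vision KV reuse, toy version: with a 4-frame window overlapping
# 75% between steps, only the newest frame is encoded; K/V for the other
# three frames come from the cache. All names and sizes are hypothetical.
import numpy as np

D, TOK = 8, 6  # head dim, tokens per frame (toy sizes)
rng = np.random.default_rng(0)
Wk, Wv = rng.standard_normal((D, D)), rng.standard_normal((D, D))

def encode_frame(x, cache_k, cache_v):
    """Encode one new frame; attend to cached frames plus itself."""
    k, v = x @ Wk, x @ Wv
    K = np.concatenate(cache_k + [k], axis=0)  # (n_cached*TOK + TOK, D)
    V = np.concatenate(cache_v + [v], axis=0)
    scores = x @ K.T / np.sqrt(D)              # queries only for new tokens
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    return attn @ V, k, v

cache_k, cache_v = [], []
for step in range(6):                          # simulated camera stream
    frame = rng.standard_normal((TOK, D))
    out, k, v = encode_frame(frame, cache_k, cache_v)
    cache_k.append(k); cache_v.append(v)
    if len(cache_k) > 3:                       # evict the oldest frame
        cache_k.pop(0); cache_v.pop(0)
print(out.shape)  # only the new frame's TOK tokens are computed per step
```

The per-view streaming mask and on-the-fly RoPE from the bullets above are omitted here; the point is only that queries are computed for the new frame alone, so attention cost scales with one frame rather than four.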
There is a subtlety. The streaming KV cache is an approximation: cached keys and values were computed under a different attention context than the current frame would produce in a full forward pass. This degrades accuracy. The obvious fix, finetuning the full VLM on streaming inputs, actually makes things worse. Why? Reasoning tokens are generated autoregressively and attend mainly to recent tokens, making them robust to stale cache entries. The action expert, by contrast, integrates information across the entire KV cache through cross-attention to produce continuous trajectories, amplifying even small distributional mismatches.
This asymmetry suggests a targeted fix: freeze the VLM and finetune only the action expert. We expose the expert to the compounding approximation errors it will encounter at deployment by rolling out multiple streaming steps to populate the KV cache (no gradients), then enabling gradients at the final step. This cleanly recovers accuracy to near-baseline.
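A minimal sketch of this training recipe, with a linear layer standing in for the frozen VLM backbone and another for the action expert (all module names and shapes are hypothetical stand-ins):

```python
# Expert-only streaming finetuning, toy version: roll out several streaming
# steps WITHOUT gradients so the cache carries the same approximation errors
# seen at deployment, then enable gradients only at the final step, so only
# the action expert is updated. Modules here are illustrative stand-ins.
import torch
import torch.nn as nn

vlm = nn.Linear(8, 8)       # stand-in for the VLM backbone
expert = nn.Linear(8, 4)    # stand-in for the action expert
for p in vlm.parameters():  # freeze the VLM
    p.requires_grad_(False)
opt = torch.optim.SGD(expert.parameters(), lr=1e-2)

def streaming_step(x, cache):
    h = vlm(x)
    cache.append(h)          # stale entries accumulate across steps
    return torch.cat(cache, dim=0)

frames = [torch.randn(2, 8) for _ in range(4)]
cache = []
with torch.no_grad():        # rollout: populate the cache, no gradients
    for f in frames[:-1]:
        streaming_step(f, cache)
kv = streaming_step(frames[-1], cache)        # final step
traj = expert(kv.mean(dim=0, keepdim=True))   # expert reads the whole cache
loss = traj.pow(2).mean()                     # placeholder trajectory loss
loss.backward()                               # gradients reach only the expert
opt.step()
```

The key property is that the cache the expert trains against was produced by the same approximate streaming rollout it will see at inference, while the VLM's weights (and hence the reasoning behavior) are untouched.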
| Method | ADE@6.4s (m) ↓ | minADE@6.4s (m) ↓ |
|---|---|---|
| Baseline (no streaming) | 1.85 | 0.80 |
| + Streaming | 2.30 | 1.07 |
| + Streaming, finetune VLM | 4.97 | 3.38 |
| + Streaming, finetune expert | 1.93 | 0.87 |
Speculative Reasoning
The reasoning capability that makes VLA models powerful for long-tail scenarios comes at a cost: the model must generate explicit reasoning tokens (e.g., chain-of-causation traces) before producing an action. Autoregressive decoding produces these tokens one at a time, making this the single largest bottleneck in the pipeline.
But driving-domain reasoning is unusually easy to draft. The reasoning sequences are short (~16 tokens), follow a highly structured template, and are conditioned on rich visual context that already determines most of the content. This makes the per-token entropy substantially lower than in open-ended language generation, creating an opportunity for speculative decoding with high acceptance rates.
We use DFlash, our block diffusion model, as a parallel drafter. Instead of drafting tokens one at a time like conventional speculative methods, DFlash generates an entire block of candidates in a single forward pass, naturally capturing the intra-block correlations present in structured reasoning. Because speculative verification guarantees the output distribution is identical to standard autoregressive decoding, this acceleration comes with zero quality loss.
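The draft-then-verify loop can be sketched as follows. This greedy toy uses a stub drafter and a deterministic stand-in target model, and omits the rejection-sampling step that makes real speculative decoding distribution-preserving; it only illustrates block proposal plus prefix acceptance:

```python
# Toy block drafting + verification. "Target" and "drafter" are stand-in
# functions over integer token ids, not real models; the greedy acceptance
# rule below simplifies the exact-distribution verification used in practice.
def target_next(prefix):
    # stand-in target model: deterministic rule over token ids
    return (sum(prefix) * 3 + 1) % 7

def draft_block(prefix, block=4):
    # stand-in block drafter: mostly agrees with the target, wrong once
    out, p = [], list(prefix)
    for i in range(block):
        t = target_next(p)
        if i == 2:
            t = (t + 1) % 7        # inject one drafting error
        out.append(t)
        p.append(t)
    return out

def speculative_step(prefix, block=4):
    draft = draft_block(prefix, block)   # one drafter pass proposes a block
    accepted, p = [], list(prefix)
    for t in draft:                      # one target pass verifies the block
        true_t = target_next(p)
        if t != true_t:
            accepted.append(true_t)      # correct the first mismatch, stop
            break
        accepted.append(t)
        p.append(t)
    return accepted

out = speculative_step([1, 2])           # accepted prefix + 1 corrected token
```

Because driving reasoning traces are short and templated, the accepted prefix is long on average, so each verify pass commits several tokens at the cost of roughly one.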
Adaptive-Step Flow Matching
VLA models need to bridge the gap between language-level reasoning and continuous vehicle control. This is typically done through a flow-matching head that converts the model’s reasoning into trajectory waypoints. The standard approach uses 10 denoising steps, but are they all necessary?
The naive solution is to simply use fewer uniformly-spaced steps. But this hurts quality, because the velocity field is not uniform across the denoising trajectory. We profiled it and found a striking U-shaped pattern: velocity changes sharply at the first and last steps but is nearly constant through the middle. The endpoints matter most; the middle is redundant.
This non-uniformity has a clear physical interpretation: the early steps establish the coarse trajectory structure (lane choice, turn direction), the final steps snap the prediction onto the manifold of physically plausible trajectories (satisfying kinematic constraints and road geometry), and the intermediate steps perform only minor refinements to an already well-determined path. The endpoints carry the signal; the middle carries the inertia.
We exploit this by caching the velocity at middle steps and reusing it in lieu of recomputation. This concentrates compute on the steps that shape the trajectory the most, cutting action generation time while preserving trajectory quality.
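A sketch of the caching schedule (the recompute indices and the toy straight-line velocity field are illustrative, not the released configuration): recompute the velocity only near the endpoints and reuse the cached value through the middle:

```python
# Adaptive-step flow matching via velocity caching, toy version: the velocity
# field is recomputed at the endpoint steps and reused through the middle,
# where profiling shows it is nearly constant. The field below is a toy
# straight-line flow, not the learned action expert.
import numpy as np

calls = {"n": 0}

def velocity(x, t, target):
    calls["n"] += 1            # count expensive expert forward passes
    return target - x          # toy field: flow straight toward the target

def sample_cached(target, steps=10, recompute=(0, 1, 8, 9)):
    x = np.zeros_like(target)
    dt, v = 1.0 / steps, None
    for i in range(steps):
        if i in recompute or v is None:
            v = velocity(x, i * dt, target)  # fresh forward pass
        x = x + dt * v                       # middle steps reuse cached v
    return x

target = np.array([1.0, 2.0])
out = sample_cached(target)
print(calls["n"])   # 4 forward passes instead of 10
```

The integration still takes all 10 Euler steps; only the expensive velocity evaluations are skipped, which is why quality holds when the skipped region is the flat middle of the U-shape.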
Quantization
VLA models inherit the large parameter counts of their VLM backbones, making them far heavier than traditional perception-only driving models. A 10B-parameter VLA model in FP16 exceeds the memory of most consumer-grade GPUs. We apply AWQ to quantize model weights to W4A16 format, roughly halving the memory footprint. This not only enables deployment on devices like the RTX 4090 (where the FP16 model cannot even load) but also further reduces latency through lower memory bandwidth pressure.
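For intuition, here is a generic symmetric int4 group-quantization sketch. It omits AWQ's activation-aware scale search, which is what preserves accuracy in practice; it only shows the W4A16 storage format and the back-of-envelope memory math:

```python
# Generic W4A16 weight quantization, toy version (NOT AWQ's scale search):
# weights are stored as 4-bit integers with one FP16 scale per group and
# dequantized on the fly. At group size 32 this is 4 + 16/32 = 4.5 bits per
# weight versus 16, roughly a 3.6x reduction in weight memory.
import numpy as np

def quantize_w4(w, group=32):
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # symmetric int4 range
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, s = quantize_w4(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()   # bounded by half a quantization step
```

Activations stay in 16-bit (the "A16" part), so only the weight loads shrink; that is enough to fit the 10B model on a 24GB RTX 4090 and to relieve memory bandwidth during decode.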
System Optimizations
The VLA pipeline is uniquely heterogeneous: vision encoding, language processing, autoregressive decoding, and diffusion-based action generation each have different compute profiles. Algorithmic improvements alone leave performance on the table without tight system engineering:
- CUDA Graphs. Autoregressive generation launches many small kernels with high CPU dispatch overhead. Capturing the full four-stage pipeline as CUDA graphs and replaying them eliminates this overhead.
- Kernel Fusion. We fuse Q/K/V projections into a single kernel launch and merge the gate and up-projections within MLP layers. Combined with max-autotune compilation for element-wise and reduction operations, this eliminates memory round-trips and launch gaps.
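The QKV fusion can be illustrated in a few lines: three matmuls over separate weights collapse into one matmul over a concatenated weight, followed by a cheap split. The numpy version below is a generic illustration of the identity, not the fused CUDA kernel itself:

```python
# QKV fusion illustrated generically: one GEMM on a concatenated weight
# replaces three separate GEMMs (and three kernel launches), producing
# identical results. Sizes are arbitrary toy values.
import numpy as np

rng = np.random.default_rng(1)
D = 16
Wq = rng.standard_normal((D, D)).astype(np.float32)
Wk = rng.standard_normal((D, D)).astype(np.float32)
Wv = rng.standard_normal((D, D)).astype(np.float32)
x = rng.standard_normal((4, D)).astype(np.float32)

# unfused: three launches, three passes over x
q0, k0, v0 = x @ Wq, x @ Wk, x @ Wv

# fused: one launch over the concatenated weight, then a cheap view/split
W_qkv = np.concatenate([Wq, Wk, Wv], axis=1)   # (D, 3D)
qkv = x @ W_qkv
q1, k1, v1 = np.split(qkv, 3, axis=1)

assert np.allclose(q0, q1) and np.allclose(k0, k1) and np.allclose(v0, v1)
```

The same merge applies to the gate and up-projections in the MLP: the savings come from reading the activations once and avoiding per-launch gaps, not from fewer FLOPs.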
Results
We evaluate FlashDrive on Alpamayo 1 running on an RTX PRO 6000. FlashDrive reduces end-to-end latency from 788ms to 176ms, a 4.5× speedup; every stage of the pipeline contributes, and no single optimization would have been sufficient on its own. With W4A16 quantization on top, latency drops further to 158ms (4.9×).
Conclusion
VLA inference is not a monolithic bottleneck but a cascade of four stages, each dominated by a different form of redundancy. By matching each bottleneck to a lightweight algorithmic shortcut and layering them on system-level compilation and fusion, the speedups compound to 4.9× with negligible accuracy loss.
This approach extends beyond autonomous driving to any VLA deployment where latency is the binding constraint. Sub-200ms inference on consumer-grade hardware brings reasoning-capable driving models into the range where real-time deployment becomes viable, without sacrificing the chain-of-thought that makes VLA models powerful.
Citation
@article{li2026flashdrive,
title = {{FlashDrive: Flash Vision-Language-Action Inference For Autonomous Driving}},
author = {Li, Zekai and Liang, Yihao and Zhang, Hongfei and Chen, Jian and Liu, Zhijian},
year = {2026}
}
Z Lab