FlashDrive: Flash Vision-Language-Action Inference For Autonomous Driving
Zekai Li*, Yihao Liang*, Hongfei Zhang, Jian Chen, Zhijian Liu
This is an early preview. The paper and additional results will be available shortly.
Traditional autonomous driving systems separate perception and planning, which leaves them brittle on the “long tail” of rare, complex scenarios that real-world driving demands. Vision-Language-Action (VLA) models take a fundamentally different approach: by integrating chain-of-thought reasoning into end-to-end driving, they can think through novel situations step by step, producing explicit reasoning traces alongside trajectory predictions. This year, NVIDIA released Alpamayo 1 and Alpamayo 1.5, the industry’s first open-source reasoning VLA models for autonomous driving.
But reasoning takes time. Alpamayo 1.5 (10B parameters, built on Qwen3-VL) takes 716ms per step on an NVIDIA RTX PRO 6000, roughly 1.4 Hz, far short of the real-time requirements for safe driving. FlashDrive is an algorithm-system co-design framework that attacks all four stages (encode, prefill, decode, and action), reducing end-to-end latency to 159ms, a 4.5× speedup with negligible accuracy loss.
The Bottleneck Is Everywhere
A typical VLA driving model’s inference breaks into four stages: vision encoding, prompt prefilling, reasoning token decoding, and action generation via flow matching. We profiled Alpamayo 1.5 and found that latency is spread across all four stages with no single dominant bottleneck. Getting close to real-time requires optimizing the entire stack.
Streaming Inference
Unlike a chatbot VLM that processes a single image per request, a driving VLA must ingest a continuous multi-camera video stream. At every step, the model processes a sliding window of temporal frames across multiple camera views (e.g., 4 frames × 4 views). But consecutive time steps overlap by 75%: three out of four frames are identical. Re-encoding the full window from scratch every step wastes computation on frames the model has already seen.
We introduce a streaming inference strategy that processes only the new frame:
- KV cache reuse from the three previously encoded frames eliminates 75% of vision computation.
- Pre-RoPE key caching with on-the-fly rotary embeddings handles dynamic position shifts as old frames are evicted and new ones arrive.
- A custom streaming attention mask accommodates view-major token ordering across cameras, ensuring each new frame attends only to frames from the current and previous views while remaining causal within itself.
This reduces the effective sequence length by 75%, accelerating the encode and prefill stages.
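The eviction-and-append pattern behind streaming inference can be sketched with a toy sliding-window cache. This is illustrative only (the shapes and class names are hypothetical, not the FlashDrive implementation); the key point is that keys are stored *pre-RoPE* so rotary embeddings can be re-applied on the fly as positions shift when old frames are evicted:

```python
import numpy as np

class StreamingKVCache:
    """Toy sliding-window KV cache: only the newest frame is encoded.

    Keys are stored pre-RoPE so rotary embeddings can be re-applied
    at the shifted positions after old frames are evicted.
    """

    def __init__(self, window: int):
        self.window = window
        self.keys, self.values = [], []  # one (tokens, dim) array per frame

    def step(self, new_k: np.ndarray, new_v: np.ndarray) -> float:
        if len(self.keys) == self.window:  # evict the oldest frame
            self.keys.pop(0)
            self.values.pop(0)
        self.keys.append(new_k)            # encode only the new frame
        self.values.append(new_v)
        # fraction of the window served from cache instead of re-encoded
        return (len(self.keys) - 1) / len(self.keys)

cache = StreamingKVCache(window=4)
reused = 0.0
for t in range(6):  # simulate six time steps of a 4-frame window
    k, v = np.zeros((256, 64)), np.zeros((256, 64))
    reused = cache.step(k, v)
print(reused)  # 0.75 once the window is full: 3 of 4 frames come from cache
```
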
There’s a subtlety. The streaming KV cache is an approximation: cached keys and values were computed under a different attention context than the current frame would produce in a full forward pass. This degrades accuracy. The obvious fix, fine-tuning the full VLM on streaming inputs, actually makes things worse. Why? Reasoning tokens are generated autoregressively and attend mainly to recent tokens, making them robust to stale cache entries. The action expert, by contrast, integrates information across the entire KV cache through cross-attention to produce continuous trajectories, amplifying even small distributional mismatches.
This asymmetry suggests a targeted fix: freeze the VLM and fine-tune only the action expert. We expose the expert to the compounding approximation errors it will encounter at deployment by rolling out multiple streaming steps to populate the KV cache (no gradients), then enabling gradients at the final step. This cleanly recovers accuracy to near-baseline.
| | ADE@6.4s (m) ↓ | minADE@6.4s (m) ↓ |
|---|---|---|
| Baseline (no streaming) | 1.85 | 0.80 |
| + Streaming | 2.30 | 1.07 |
| + Streaming, fine-tune VLM | 4.97 | 3.38 |
| + Streaming, fine-tune expert | 1.93 | 0.87 |
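The training loop for expert-only fine-tuning can be sketched as follows, assuming PyTorch. The `vlm` and `expert` modules here are toy linear layers standing in for the frozen VLM and the action expert, and the loss is a placeholder, not the actual Alpamayo objective:

```python
import torch

vlm = torch.nn.Linear(8, 8)     # stand-in for the frozen VLM backbone
expert = torch.nn.Linear(8, 2)  # stand-in for the action expert
for p in vlm.parameters():
    p.requires_grad_(False)     # freeze the VLM: only the expert trains
opt = torch.optim.Adam(expert.parameters(), lr=1e-3)

frames = [torch.randn(1, 8) for _ in range(4)]
with torch.no_grad():
    # Roll out streaming steps to populate the KV cache, accumulating the
    # same compounding approximation errors the expert sees at deployment.
    cache = [vlm(f) for f in frames[:-1]]
feat = vlm(frames[-1])          # gradients enabled only at the final step
context = torch.cat(cache + [feat], dim=0).mean(0, keepdim=True)
loss = expert(context).pow(2).mean()  # placeholder trajectory loss
loss.backward()                 # gradients reach only the expert
opt.step()
```
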
Speculative Reasoning
The reasoning capability that makes VLA models powerful for long-tail scenarios comes at a cost: the model must generate explicit reasoning tokens (e.g., chain-of-causation traces) before producing an action. Autoregressive decoding produces these tokens one at a time, making this the largest bottleneck in the pipeline.
But driving-domain reasoning is unusually easy to draft. The reasoning sequences are short (~16 tokens), follow a highly structured template, and are conditioned on rich visual context that already determines most of the content. This makes the per-token entropy substantially lower than in open-ended language generation, creating an opportunity for speculative decoding with high acceptance rates.
We use DFlash, our block diffusion model, as a parallel drafter. Instead of drafting tokens one at a time like conventional speculative methods, DFlash generates an entire block of candidates in a single forward pass, naturally capturing the intra-block correlations present in structured reasoning. Because speculative verification guarantees an output distribution identical to standard autoregressive decoding, this acceleration comes with zero quality loss.
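The verification step can be sketched in its greedy special case: the target model scores the whole drafted block in one forward pass and we accept the longest prefix matching its own greedy choice. This is a simplification (the names are hypothetical, and full speculative verification compares against the target's distribution rather than its argmax), but when decoding is deterministic the greedy variant preserves the same output-equivalence guarantee:

```python
import numpy as np

def verify_block(target_logits, draft_block):
    """Accept the longest draft prefix the target model agrees with.

    `target_logits` has one row of logits per drafted position, all
    produced by a single (parallel) target forward pass.
    """
    accepted = []
    for pos, token in enumerate(draft_block):
        target_token = int(np.argmax(target_logits[pos]))
        if target_token == token:
            accepted.append(token)       # draft token verified
        else:
            accepted.append(target_token)  # first mismatch: correct and stop
            break
    return accepted

# Target "prefers" tokens 5, 7, 2, 9 at the four positions.
logits = np.full((4, 16), -1.0)
for pos, tok in enumerate([5, 7, 2, 9]):
    logits[pos, tok] = 1.0
print(verify_block(logits, draft_block=[5, 7, 3, 9]))  # -> [5, 7, 2]
```

Every accepted token costs only the single parallel verification pass, so a high acceptance rate on structured reasoning traces translates directly into fewer sequential decode steps.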
Adaptive-Step Flow Matching
VLA models must bridge language-level reasoning and continuous vehicle control. This is typically done through a flow-matching head that converts the model’s reasoning into trajectory waypoints. The standard approach uses 10 denoising steps, but are all of them necessary?
The naive solution is to use fewer uniformly-spaced steps. But this hurts quality, because the velocity field is not uniform across the denoising trajectory. We profiled it and found a striking U-shaped pattern: velocity changes sharply at the first and last steps but is nearly constant through the middle. The endpoints matter most; the middle is redundant.
This non-uniformity has a clear physical interpretation: the early steps establish the coarse trajectory structure (lane choice, turn direction), the final steps snap the prediction onto the manifold of physically plausible trajectories (satisfying kinematic constraints and road geometry), and the intermediate steps perform only minor refinements to an already well-determined path. The endpoints carry the signal; the middle carries the inertia.
We exploit this by caching the velocity at middle steps and reusing it instead of recomputing. This concentrates compute on the steps that shape the trajectory the most, cutting action generation time while preserving trajectory quality.
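The idea can be sketched as an Euler sampler that evaluates the velocity network only at the endpoint steps and reuses the cached velocity through the middle. The step indices below are hypothetical, chosen to illustrate the U-shaped schedule rather than to reproduce the paper's configuration:

```python
import numpy as np

def adaptive_flow_sample(velocity_fn, x0, n_steps=10, recompute={0, 1, 8, 9}):
    """Euler flow-matching sampler with mid-trajectory velocity caching.

    `recompute` lists the steps where the velocity field is actually
    evaluated; at the remaining (middle) steps the last computed
    velocity is reused, since it is nearly constant there.
    """
    x, v, dt = x0, None, 1.0 / n_steps
    evals = 0
    for step in range(n_steps):
        t = step * dt
        if step in recompute or v is None:
            v = velocity_fn(x, t)  # full network evaluation
            evals += 1
        x = x + dt * v             # cached velocity reused at middle steps
    return x, evals

x, evals = adaptive_flow_sample(lambda x, t: -x, np.ones(4))
print(evals)  # 4 network evaluations instead of 10
```
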
Quantization
Quantization compresses model weights and activations to lower precision, trading numerical headroom for speed. But there’s a design choice. Standard methods like AWQ quantize only the weights to 4-bit (W4A16): this helps memory-bound decoding by shrinking the data the GPU must load per token, but leaves the compute-bound prefill stage untouched. For a chatbot LLM where decoding dominates, that trade-off is acceptable. For a VLA model with thousands of vision tokens in every prompt, prefill is too expensive to ignore.
W4A8 quantization targets both regimes: 4-bit weights cut memory bandwidth for decoding, while 8-bit activations unlock faster INT8 matrix multiplies for the compute-heavy prefill. One format, two bottlenecks addressed.
The harder question is which W4A8 method. VLA reasoning generates chain-of-thought tokens (~16 per step), and each feeds back into the model, so quantization error compounds at every token. Methods like AWQ leave weight outliers partially intact; over a full reasoning trace, those residual errors accumulate into measurable trajectory drift. We use ParoQuant, our quantization method, whose scaled pairwise rotation suppresses outliers far more thoroughly, keeping the compounding error in check.
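To make the W4A8 format concrete, here is a naive fake-quantization sketch: symmetric per-output-channel 4-bit weights and per-tensor 8-bit activations. This is round-to-nearest only, with no outlier handling, so it illustrates the format rather than AWQ or ParoQuant; real kernels also keep the operands in integer form to use INT8 tensor-core matmuls instead of dequantizing first:

```python
import numpy as np

def quantize_w4a8(w, x):
    """Naive symmetric W4A8 fake-quantization (illustrative only).

    Weights: per-output-channel 4-bit in [-8, 7].
    Activations: per-tensor 8-bit in [-128, 127].
    """
    w_scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    w_q = np.clip(np.round(w / w_scale), -8, 7)
    x_scale = np.abs(x).max() / 127.0
    x_q = np.clip(np.round(x / x_scale), -128, 127)
    # Integer-domain matmul, then rescale back to floating point.
    return (x_q @ w_q.T) * (x_scale * w_scale.T)

rng = np.random.default_rng(0)
w, x = rng.standard_normal((8, 16)), rng.standard_normal((4, 16))
err = np.abs(quantize_w4a8(w, x) - x @ w.T).max()
assert err < 2.0  # coarse but bounded error for well-behaved weights
```

The 4-bit rounding error is what outlier channels inflate: a single large weight stretches `w_scale` for its whole row, coarsening every other weight in it. This is the failure mode that ParoQuant's rotations are designed to suppress.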
System Optimizations
The VLA pipeline is unusually heterogeneous: vision encoding, language processing, autoregressive decoding, and diffusion-based action generation each have different compute profiles. Algorithmic improvements alone leave performance on the table without tight system engineering:
- CUDA Graphs. Autoregressive generation launches many small kernels with high CPU dispatch overhead. Compiling the full four-stage pipeline into CUDA graphs eliminates this overhead.
- Kernel Fusion. We fuse Q/K/V projections into a single kernel launch and merge the gate and up-projections within MLP layers. Combined with max-autotune compilation for element-wise and reduction operations, this eliminates memory round-trips and launch gaps.
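The QKV-fusion idea reduces to simple linear algebra: concatenating the three projection weights offline lets one matmul (one kernel launch) replace three, followed by a cheap view-split. The NumPy sketch below stands in for the fused CUDA kernel; shapes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
x = rng.standard_normal((5, d))  # 5 tokens of dimension d

w_qkv = np.concatenate([wq, wk, wv], axis=1)  # fuse the weights offline
q, k, v = np.split(x @ w_qkv, 3, axis=1)      # one matmul, then a view-split

# The fused path is numerically identical to three separate projections.
assert np.allclose(q, x @ wq)
assert np.allclose(k, x @ wk)
assert np.allclose(v, x @ wv)
```

The same trick applies to the gate and up-projections in the MLP layers: because the fused matrix is built once at load time, the fusion is free at inference and saves one launch plus one memory round-trip per pair of projections.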
Results
On an RTX PRO 6000, algorithmic and system optimizations cut latency from 716ms to 159ms (4.5×). Every technique targets a different stage, so the gains compound rather than saturate: no single optimization accounts for more than half the total speedup.
The same optimizations transfer across NVIDIA platforms, from the in-car Jetson Thor to datacenter workstation GPUs, with per-device speedups ranging from 4.0× to 5.7×.
| | Jetson Thor | RTX 3090 | RTX 4090 | RTX 5090 | RTX PRO 6000 |
|---|---|---|---|---|---|
| Alpamayo 1.5 (ms) ↓ | 3770 | 1788 | 1187 | 986 | 716 |
| + FlashDrive (ms) ↓ | 944 | 363 | 209 | 192 | 159 |
| Speedup | 4.0× | 4.9× | 5.7× | 5.1× | 4.5× |
Conclusion
VLA inference is not a monolithic bottleneck but a cascade of stages, each hiding a different form of redundancy. Temporal overlap in vision, low entropy in reasoning, velocity smoothness in flow matching, numerical headroom in weights: each yields to a targeted shortcut, and because the redundancies are orthogonal, the speedups compound to 4.5× with negligible accuracy loss.
This extends beyond driving to any VLA deployment where latency is the binding constraint. Sub-200ms inference on a single GPU brings reasoning-capable VLA models into the range where real-time deployment becomes viable, without sacrificing the chain-of-thought that makes them powerful.
Citation
@article{li2026flashdrive,
  title  = {{FlashDrive: Flash Vision-Language-Action Inference For Autonomous Driving}},
  author = {Li, Zekai and Liang, Yihao and Zhang, Hongfei and Chen, Jian and Liu, Zhijian},
  year   = {2026}
}
Z Lab