FlashDrive: Flash Vision-Language-Action Inference For Autonomous Driving

Zekai Li*, Yihao Liang*, Hongfei Zhang, Jian Chen, Zhijian Liu

Preview

This is an early preview. The paper and additional results will be available shortly.

Traditional autonomous driving systems separate perception and planning, which leaves them brittle on the “long tail” of rare, complex scenarios that real-world driving demands. Vision-Language-Action (VLA) models take a fundamentally different approach: by integrating chain-of-thought reasoning into end-to-end driving, they can think through novel situations step by step, producing explicit reasoning traces alongside trajectory predictions. This year, NVIDIA released Alpamayo 1 and Alpamayo 1.5, the industry’s first open-source reasoning VLA models for autonomous driving.

But reasoning takes time. Alpamayo 1.5 (10B parameters, built on Qwen3-VL) takes 716ms per step on an NVIDIA RTX PRO 6000, roughly 1.4 Hz, far short of the real-time requirements for safe driving. FlashDrive is an algorithm-system co-design framework that attacks all four stages (encode, prefill, decode, and action), reducing end-to-end latency to 159ms, a 4.5× speedup with negligible accuracy loss.

The Bottleneck Is Everywhere

A typical VLA driving model’s inference breaks into four stages: vision encoding, prompt prefilling, reasoning token decoding, and action generation via flow matching. We profiled Alpamayo 1.5 and found that latency is spread across all four stages with no single dominant bottleneck. Getting close to real-time requires optimizing the entire stack.

Stage          Encode   Prefill   Decode   Action   Total
Latency (ms)       88     177.2    263.8    187.4     716
Decode and action together account for nearly two-thirds of the 716ms total, but encode and prefill are large enough that no single-stage fix suffices.

Streaming Inference

Unlike a chatbot VLM that processes a single image per request, a driving VLA must ingest a continuous multi-camera video stream. At every step, the model processes a sliding window of temporal frames across multiple camera views (e.g., 4 frames × 4 views). But consecutive time steps overlap by 75%: three out of four frames are identical. Re-encoding the full window from scratch every step wastes computation on frames the model has already seen.

We introduce a streaming inference strategy that processes only the new frame:

  • KV cache reuse from the three previously encoded frames eliminates 75% of vision computation.
  • Pre-RoPE key caching with on-the-fly rotary embeddings handles dynamic position shifts as old frames are evicted and new ones arrive.
  • A custom streaming attention mask accommodates view-major token ordering across cameras, ensuring each new frame attends only to frames from the current and previous views while remaining causal within itself.

This reduces the effective sequence length by 75%, accelerating the encode and prefill stages.
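As a rough illustration, the pre-RoPE key caching idea above can be sketched as follows. All names, shapes, and the toy `rope` implementation here are illustrative stand-ins, not Alpamayo's actual code: keys are cached before rotary embedding, then re-rotated on the fly with their current positions after the window slides.

```python
import numpy as np

D = 8          # head dimension (even, for rotary pairs)
TOKENS = 4     # tokens per frame
WINDOW = 4     # frames kept in the sliding window

def rope(x, pos):
    """Apply rotary position embedding to x of shape (TOKENS, D)
    at integer positions pos of shape (TOKENS,)."""
    half = D // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))      # (half,)
    angles = pos[:, None] * freqs[None, :]                 # (TOKENS, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

class StreamingKVCache:
    """Keeps keys *before* RoPE so they can be re-rotated after the window
    slides and every cached frame's position shifts down by one frame."""
    def __init__(self):
        self.pre_rope_keys = []   # one (TOKENS, D) array per cached frame
        self.values = []

    def step(self, new_k, new_v):
        self.pre_rope_keys.append(new_k)
        self.values.append(new_v)
        if len(self.pre_rope_keys) > WINDOW:   # evict the oldest frame
            self.pre_rope_keys.pop(0)
            self.values.pop(0)
        # Rotate all cached keys with their *current* positions on the fly.
        keys = [rope(k, np.arange(TOKENS) + f * TOKENS)
                for f, k in enumerate(self.pre_rope_keys)]
        return np.concatenate(keys), np.concatenate(self.values)

cache = StreamingKVCache()
for t in range(6):   # six incoming frames; only the newest is encoded per step
    k = np.random.default_rng(t).normal(size=(TOKENS, D))
    keys, values = cache.step(k, k)
print(keys.shape)    # (WINDOW * TOKENS, D)
```

Caching keys post-RoPE would bake stale positions into the cache; storing them pre-RoPE keeps eviction a cheap list operation.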

There’s a subtlety. The streaming KV cache is an approximation: cached keys and values were computed under a different attention context than the current frame would produce in a full forward pass. This degrades accuracy. The obvious fix, fine-tuning the full VLM on streaming inputs, actually makes things worse. Why? Reasoning tokens are generated autoregressively and attend mainly to recent tokens, making them robust to stale cache entries. The action expert, by contrast, integrates information across the entire KV cache through cross-attention to produce continuous trajectories, amplifying even small distributional mismatches.

This asymmetry suggests a targeted fix: freeze the VLM and fine-tune only the action expert. We expose the expert to the compounding approximation errors it will encounter at deployment by rolling out multiple streaming steps to populate the KV cache (no gradients), then enabling gradients at the final step. This cleanly recovers accuracy to near-baseline.

                                 ADE@6.4s (m) ↓   minADE@6.4s (m) ↓
Baseline (no streaming)                    1.85                0.80
+ Streaming                                2.30                1.07
+ Streaming, fine-tune VLM                 4.97                3.38
+ Streaming, fine-tune expert              1.93                0.87
Streaming alone degrades accuracy (2.30m vs 1.85m ADE). Fine-tuning the VLM makes it worse (4.97m). Fine-tuning only the action expert recovers to near-baseline (1.93m). Results obtained on Alpamayo 1.

Speculative Reasoning

The reasoning capability that makes VLA models powerful for long-tail scenarios comes at a cost: the model must generate explicit reasoning tokens (e.g., chain-of-causation traces) before producing an action. Autoregressive decoding produces these tokens one at a time, making this the largest bottleneck in the pipeline.

But driving-domain reasoning is unusually easy to draft. The reasoning sequences are short (~16 tokens), follow a highly structured template, and are conditioned on rich visual context that already determines most of the content. This makes the per-token entropy substantially lower than in open-ended language generation, creating an opportunity for speculative decoding with high acceptance rates.

We use DFlash, our block diffusion model, as a parallel drafter. Instead of drafting tokens one at a time like conventional speculative methods, DFlash generates an entire block of candidates in a single forward pass, naturally capturing the intra-block correlations present in structured reasoning. Because speculative verification guarantees an output distribution identical to standard autoregressive decoding, this acceleration comes with zero quality loss.
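The verification step is simple to state. The sketch below uses a synthetic drafter standing in for DFlash and synthetic peaked distributions standing in for the low-entropy reasoning tokens; it shows the standard accept/resample rule that makes the emitted tokens exactly distributed as the target model's samples.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, BLOCK = 16, 8

def draft_block(q_dist):
    """Draft a whole block of tokens in one shot from the drafter's
    per-position distributions q_dist of shape (BLOCK, VOCAB)."""
    return np.array([rng.choice(VOCAB, p=q) for q in q_dist])

def verify(tokens, p_dist, q_dist):
    """Speculative verification: accept each drafted token with probability
    min(1, p/q); on the first rejection, resample from max(p - q, 0) and stop.
    The emitted sequence is distributed exactly as sampling from p."""
    out = []
    for i, t in enumerate(tokens):
        p, q = p_dist[i], q_dist[i]
        if rng.uniform() < min(1.0, p[t] / q[t]):
            out.append(int(t))                      # accepted
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(VOCAB, p=residual)))
            break                                   # stop after the correction
    return out

# Peaked (low-entropy) distributions, mimicking templated reasoning traces;
# the drafter is a slightly noisy copy of the target.
logits = rng.normal(size=(BLOCK, VOCAB))
p_dist = np.exp(4 * logits); p_dist /= p_dist.sum(-1, keepdims=True)  # target
q_dist = np.exp(4 * logits + 0.1 * rng.normal(size=logits.shape))
q_dist /= q_dist.sum(-1, keepdims=True)                               # drafter

tokens = draft_block(q_dist)
accepted = verify(tokens, p_dist, q_dist)
print(len(accepted), "tokens emitted from one draft/verify round")
```

The lower the per-token entropy, the closer drafter and target agree, and the more of the block survives verification in a single round.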

Adaptive-Step Flow Matching

VLA models must bridge language-level reasoning and continuous vehicle control. This is typically done through a flow-matching head that converts the model’s reasoning into trajectory waypoints. The standard approach uses 10 denoising steps, but are all of them necessary?

The naive solution is to use fewer uniformly-spaced steps. But this hurts quality, because the velocity field is not uniform across the denoising trajectory. We profiled it and found a striking U-shaped pattern: velocity changes sharply at the first and last steps but is nearly constant through the middle. The endpoints matter most; the middle is redundant.

[Figure: relative velocity difference ‖vᵢ₊₁ − vᵢ‖ / ‖vᵢ‖ (%) across denoising steps.] Velocity changes drop from 27% at step 0→1 to under 6% in the middle, then rise again at the end.
[Figure: cosine similarity between consecutive velocities.] Middle steps reach cosine similarity above 0.99, confirming they are nearly redundant.

This non-uniformity has a clear physical interpretation: the early steps establish the coarse trajectory structure (lane choice, turn direction), the final steps snap the prediction onto the manifold of physically plausible trajectories (satisfying kinematic constraints and road geometry), and the intermediate steps perform only minor refinements to an already well-determined path. The endpoints carry the signal; the middle carries the inertia.

We exploit this by caching the velocity at middle steps and reusing it instead of recomputing. This concentrates compute on the steps that shape the trajectory the most, cutting action generation time while preserving trajectory quality.
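A minimal sketch of such a caching schedule, with a toy analytic velocity field standing in for the action expert's network (the dynamics, step count, and choice of which middle steps to cache are illustrative, not the paper's exact schedule):

```python
import numpy as np

STEPS = 10
CACHED = set(range(3, 8))        # middle steps reuse the last computed velocity

calls = 0
def velocity(x, t):
    """Stand-in velocity field; in the real model this is a network forward."""
    global calls
    calls += 1
    return -x + np.array([1.0, 0.5]) * (1 - t)   # illustrative smooth dynamics

def sample(x0, cache_middle=True):
    """Euler sampler that skips velocity recomputation at the cached steps."""
    global calls
    calls = 0
    x, v = x0.copy(), None
    for i in range(STEPS):
        t = i / STEPS
        if not (cache_middle and i in CACHED):
            v = velocity(x, t)   # recompute only at the endpoint steps
        x = x + v / STEPS        # Euler update with a possibly cached velocity
    return x

x0 = np.array([0.8, -0.3])
full = sample(x0, cache_middle=False)
full_calls = calls               # 10 network calls
cached = sample(x0, cache_middle=True)
print(f"network calls: {full_calls} -> {calls}, "
      f"trajectory drift: {np.linalg.norm(full - cached):.3f}")
```

Because the cached steps sit exactly where the profiled velocity field is nearly constant, halving the network calls perturbs the endpoint only slightly.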

Quantization

Quantization compresses model weights and activations to lower precision, trading numerical headroom for speed. But there’s a design choice. Standard methods like AWQ quantize only the weights to 4-bit (W4A16): this helps memory-bound decoding by shrinking the data the GPU must load per token, but leaves the compute-bound prefill stage untouched. For a chatbot LLM where decoding dominates, that trade-off is acceptable. For a VLA model with thousands of vision tokens in every prompt, prefill is too expensive to ignore.

W4A8 quantization targets both regimes: 4-bit weights cut memory bandwidth for decoding, while 8-bit activations unlock faster INT8 matrix multiplies for the compute-heavy prefill. One format, two bottlenecks addressed.
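A minimal fake-quantization sketch of the W4A8 idea, with symmetric scales per output channel for weights and per token for activations. The integer matmul here emulates in numpy what INT8 tensor cores execute natively; sizes and the quantization granularity are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits, axis):
    """Symmetric quantization to signed `bits`-wide integers along `axis`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int32), scale

W = rng.normal(size=(64, 64)) * 0.05        # weight matrix
X = rng.normal(size=(8, 64))                # activations for 8 tokens

Wq, w_scale = quantize(W, bits=4, axis=1)   # W4: per-output-channel scales
Xq, x_scale = quantize(X, bits=8, axis=1)   # A8: per-token scales

# Integer matmul (the fast path on INT8 hardware), then dequantize.
Y_int = Xq @ Wq.T
Y = Y_int * x_scale * w_scale.T

err = np.abs(Y - X @ W.T).max()
print(f"max abs error vs. FP matmul: {err:.4f}")
```

Weight-only W4A16 would dequantize `Wq` back to floats before the matmul, so the multiply itself stays in floating point; keeping activations at 8-bit is what lets prefill run on the integer pipelines.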

The harder question is which W4A8 method. VLA reasoning generates chain-of-thought tokens (~16 per step), and each feeds back into the model, so quantization error compounds at every token. Methods like AWQ leave weight outliers partially intact; over a full reasoning trace, those residual errors accumulate into measurable trajectory drift. We use ParoQuant, our quantization method whose scaled pairwise rotation suppresses outliers far more thoroughly, keeping the compounding error in check.

System Optimizations

The VLA pipeline is unusually heterogeneous: vision encoding, language processing, autoregressive decoding, and diffusion-based action generation each have different compute profiles. Algorithmic improvements alone leave performance on the table without tight system engineering:

  • CUDA Graphs. Autoregressive generation launches many small kernels with high CPU dispatch overhead. Compiling the full four-stage pipeline into CUDA graphs eliminates this overhead.
  • Kernel Fusion. We fuse Q/K/V projections into a single kernel launch and merge the gate and up-projections within MLP layers. Combined with max-autotune compilation for element-wise and reduction operations, this eliminates memory round-trips and launch gaps.
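The QKV fusion is easy to illustrate: concatenating the three projection matrices turns three matmuls into a single one, followed by a cheap split. The sketch below uses toy sizes with numpy standing in for the fused CUDA kernels; the gate/up merge in the MLP layers follows the same pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32
x = rng.normal(size=(5, D))                      # 5 tokens of hidden size D
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))

# Unfused: three separate projections, i.e. three kernel launches.
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Fused: one launch on the concatenated weight, then an essentially free split.
W_qkv = np.concatenate([Wq, Wk, Wv], axis=1)     # (D, 3D)
qf, kf, vf = np.split(x @ W_qkv, 3, axis=1)

print(np.allclose(q, qf) and np.allclose(k, kf) and np.allclose(v, vf))
```

The arithmetic is identical; the win is fewer kernel launches and one pass over `x` instead of three, which matters most when per-launch CPU overhead rivals the kernel's own runtime.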

Results

Latency (ms)     Encode   Prefill   Decode   Action   Total
Alpamayo 1.5         88     177.2    263.8    187.4     716
+ FlashDrive       12.5      52.5     48.2     46.2     159
Per-stage latency on RTX PRO 6000. FlashDrive cuts every stage for a 4.5× speedup.

                    Alpamayo 1.5   + FlashDrive
ADE@6.4s (m) ↓              1.72           1.56   (a slight accuracy gain)
minADE@6.4s (m) ↓           0.77           0.84   (within 0.1m)

On an RTX PRO 6000, algorithmic and system optimizations cut latency from 716ms to 159ms (4.5×). Every technique targets a different stage, so the gains compound rather than saturate: no single optimization accounts for more than half the total speedup.

The same optimizations transfer across NVIDIA platforms, from the in-car Jetson Thor to datacenter workstation GPUs, with per-device speedups ranging from 4.0× to 5.7×.

                       Jetson Thor   RTX 3090   RTX 4090   RTX 5090   RTX PRO 6000
Alpamayo 1.5 (ms) ↓           3770       1788       1187        986            716
+ FlashDrive (ms) ↓            944        363        209        192            159
Speedup                       4.0×       4.9×       5.7×       5.1×           4.5×
End-to-end latency across five NVIDIA platforms, from in-car Jetson Thor to datacenter RTX PRO 6000. A single FlashDrive implementation delivers a consistent 4.0–5.7× speedup.

Conclusion

VLA inference is not a monolithic bottleneck but a cascade of stages, each hiding a different form of redundancy. Temporal overlap in vision, low entropy in reasoning, velocity smoothness in flow matching, numerical headroom in weights: each yields to a targeted shortcut, and because the redundancies are orthogonal, the speedups compound to 4.5× with negligible accuracy loss.

This extends beyond driving to any VLA deployment where latency is the binding constraint. Sub-200ms inference on a single GPU brings reasoning-capable VLA models into the range where real-time deployment becomes viable, without sacrificing the chain-of-thought that makes them powerful.

Citation

@article{li2026flashdrive,
  title   = {{FlashDrive: Flash Vision-Language-Action Inference For Autonomous Driving}},
  author  = {Li, Zekai and Liang, Yihao and Zhang, Hongfei and Chen, Jian and Liu, Zhijian},
  year    = {2026}
}