NVILA: Efficient Frontier Visual Language Models

Zhijian Liu*, Ligeng Zhu*, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin†, Song Han†, Yao Lu†

Over a year out, most of NVILA’s specific numbers have been surpassed. The principles behind them have not. For the full technical details, see the official blog. What follows is a retrospective on what lasted.

NVILA treated efficiency as a first-class design goal across the full VLM lifecycle — training, fine-tuning, and deployment. At the time, the field optimized for accuracy first and dealt with cost later. We thought that was backwards: cost determines who can actually use a VLM, not just who can benchmark it.

Scale Then Compress

The core idea is counterintuitive: to make a model cheaper, first make it more expensive. Scale up resolution, frame count, and data to raise the accuracy ceiling, then compress tokens to bring cost back down. This works because scaling introduces massive redundancy — and redundancy is much easier to remove than information is to recover.

What surprised us most was how little the compression mechanism mattered. A 2×2 spatial-to-channel reshape — literally rearranging spatial tokens into channels — beat learned compressors like TokenLearner and Perceiver Resampler. Temporal averaging (the simplest possible pooling) beat attention-based poolers. A loss-based heuristic (DeltaLoss) pruned half the training data with no accuracy drop.
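Both of the simple winners fit in a few lines. Here is a minimal NumPy sketch, assuming a ViT-style token grid; the shapes and the 2×2 / pairwise ratios are illustrative, not NVILA's exact configuration:

```python
import numpy as np

def space_to_channel(x, r=2):
    """Fold each r x r block of spatial tokens into the channel dimension.

    x: (H, W, C) grid of visual tokens -> (H/r, W/r, r*r*C).
    Token count drops by r*r with no information loss: values are
    only rearranged, never discarded.
    """
    H, W, C = x.shape
    x = x.reshape(H // r, r, W // r, r, C)
    x = x.transpose(0, 2, 1, 3, 4)   # group each r x r block together
    return x.reshape(H // r, W // r, r * r * C)

def temporal_average(frames, group=2):
    """Average every `group` consecutive frames of tokens.

    frames: (T, N, C) -> (T/group, N, C).
    """
    T, N, C = frames.shape
    return frames.reshape(T // group, group, N, C).mean(axis=1)

tokens = np.random.randn(24, 24, 1024)    # 576 tokens from a vision encoder
merged = space_to_channel(tokens)         # (12, 12, 4096): 4x fewer tokens
```

The reshape version has no parameters to learn and nothing to tune, which is much of its appeal: the LLM sees fewer, wider tokens and recovers the structure itself.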

Don’t start with a clever compressor — start with more signal. Rich inputs with naive compression consistently beat limited inputs with sophisticated compression. The information content of the input matters more than the elegance of the bottleneck.

Full-Lifecycle Co-Design

Most work on VLM efficiency picks one stage to optimize. This makes sense for a paper, but it leaves a surprising amount on the table, because efficiency gains across stages multiply — they don’t add.

Training. FP8 precision (COAT) doubled throughput; data pruning halved the steps. Either alone is useful. Together: ~4.5×.

Fine-tuning. We found something practitioners care about more than researchers: the vision encoder and LLM want very different learning rates (5–50× apart). Get this wrong and fine-tuning either diverges or stalls. LayerNorm-only tuning matched LoRA while being faster, bringing fine-tuning down to a single 24 GB GPU — the difference between “you need a cluster” and “a single consumer GPU is enough.”
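A framework-agnostic sketch of both ideas: split parameters into vision and LLM groups with separate learning rates, and optionally keep only the normalization layers trainable. The parameter names (`vision.*`, a `"norm"` substring), the ratio, and the param-group dict format (which mirrors common optimizer APIs) are illustrative assumptions, not NVILA's exact recipe:

```python
def build_param_groups(named_params, base_lr=1e-4, vision_lr_ratio=0.1,
                       norm_only=False):
    """Return optimizer param groups with a smaller LR for the vision encoder.

    named_params: iterable of (name, param) pairs, as most frameworks expose.
    vision_lr_ratio: the encoder typically wants a 5-50x smaller LR than the LLM.
    norm_only: if True, train only normalization-layer parameters, a cheap
    alternative to LoRA.
    """
    vision, llm = [], []
    for name, param in named_params:
        if norm_only and "norm" not in name:
            continue  # everything except norm layers stays frozen
        (vision if name.startswith("vision") else llm).append(param)
    return [
        {"params": vision, "lr": base_lr * vision_lr_ratio},
        {"params": llm, "lr": base_lr},
    ]
```

With a torch-style model this plugs straight into an optimizer constructor, e.g. `AdamW(build_param_groups(model.named_parameters()))`.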

Deployment. The two inference phases have opposite bottlenecks, and treating them uniformly is a common mistake. Prefilling is compute-bound (the vision tower dominates after token compression), so it wants W8A8. Decoding is memory-bound (the LLM dominates), so it wants W4A16. One quantization strategy for both leaves performance on the table.
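A back-of-the-envelope way to see the asymmetry is to count only the FLOPs and weight traffic of a single square linear layer (activations, KV cache, and attention are deliberately ignored; the numbers are illustrative, not measurements):

```python
def flops_per_weight_byte(n_tokens, bytes_per_weight):
    """Arithmetic intensity of y = x @ W for a d x d weight matrix.

    FLOPs: 2 * n_tokens * d^2; weight traffic: d^2 * bytes_per_weight.
    The d^2 cancels, so intensity depends only on how many tokens
    share one read of the weights, and how wide those weights are.
    """
    return 2 * n_tokens / bytes_per_weight

# Prefilling: thousands of tokens hit the weights at once -> compute-bound,
# so W8A8 helps by raising the math rate over FP16.
prefill = flops_per_weight_byte(n_tokens=2048, bytes_per_weight=1.0)   # 4096.0

# Decoding: one token per step re-reads every weight -> memory-bound,
# so W4A16 helps by halving the bytes each step must move.
decode = flops_per_weight_byte(n_tokens=1, bytes_per_weight=0.5)       # 4.0
```

Modern accelerators need on the order of a few hundred FLOPs per byte to stay compute-bound (the exact knee varies by hardware), so decoding sits far below it and prefilling far above — which is exactly why one quantization scheme cannot serve both phases well.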

What Lasted

The VLM field moves fast. What lasted was less the specific techniques and more the bets behind them: that simple token compression beats learned compression, that efficiency should be co-designed across the full lifecycle, and that dataset quality dominates quantity. Multiple teams converged on these ideas independently, which is the strongest evidence they were right.

If there’s one principle I’d distill from the whole project, it’s this: design for the ceiling, then optimize for the floor. The instinct in efficiency research is to start from constraints — limited compute, limited memory, limited data — and engineer around them. NVILA went the other direction: start with the richest possible representation (higher resolution, more frames, more data), then figure out what’s safe to throw away. It turns out that most of what you add is safe to throw away, because scaling introduces massive redundancy. But you can only discover that redundancy by scaling first. Starting small and trying to be clever is a trap — you end up optimizing for a ceiling that’s too low.

Citation

@inproceedings{liu2025nvila,
  title     = {{NVILA: Efficient Frontier Visual Language Models}},
  author    = {Liu, Zhijian and Zhu, Ligeng and Shi, Baifeng and Zhang, Zhuoyang and Lou, Yuming and Yang, Shang and Xi, Haocheng and Cao, Shiyi and Gu, Yuxian and Li, Dacheng and Li, Xiuyu and Fang, Yunhao and Chen, Yukang and Hsieh, Cheng-Yu and Huang, De-An and Cheng, An-Chieh and Nath, Vishwesh and Hu, Jinyi and Liu, Sifei and Krishna, Ranjay and Xu, Daguang and Wang, Xiaolong and Molchanov, Pavlo and Kautz, Jan and Yin, Hongxu and Han, Song and Lu, Yao},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025}
}