VideoNSA Introduces Native Sparse Attention for Scalable Video Understanding
Enxin Song and colleagues released a preprint on arXiv on October 2, 2025 (revised January 30, 2026) describing VideoNSA, a method that integrates Native Sparse Attention into video-language models to improve long-video comprehension.
Limitations of Existing Video‑Language Models
Current multimodal models often struggle with extended context lengths, missing critical transition frames and failing to maintain coherence across lengthy video sequences. These constraints hinder performance on tasks that require temporal reasoning and spatial understanding.
Native Sparse Attention Adaptation
VideoNSA adapts the Qwen2.5-VL architecture, applying dense attention to text tokens while routing video tokens through a hardware-aware hybrid sparse attention mechanism. This split keeps computation tractable at long context lengths without sacrificing accuracy.
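The paper's kernel design is not reproduced here, but the modality split can be illustrated with a minimal PyTorch sketch: text queries use standard dense attention, while video queries are restricted to a sparse pattern (simplified below to a local sliding window as a stand-in for the hybrid sparse branches). All function and parameter names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of modality-aware attention routing:
# text tokens take a dense path, video tokens a simplified sparse path.
import torch
import torch.nn.functional as F

def modality_routed_attention(q, k, v, is_video, window=256):
    """q, k, v: (batch, heads, seq, dim); is_video: (seq,) bool mask."""
    seq = q.shape[-2]
    # Dense path: every query may attend to every key.
    dense_out = F.scaled_dot_product_attention(q, k, v)

    # Sparse path (stand-in for the hybrid sparse branches): restrict each
    # video query to a local window of keys via a boolean attention mask.
    idx = torch.arange(seq)
    local_mask = (idx[None, :] - idx[:, None]).abs() <= window  # (seq, seq)
    sparse_out = F.scaled_dot_product_attention(q, k, v, attn_mask=local_mask)

    # Route per query token: text -> dense, video -> sparse.
    route = is_video.view(1, 1, seq, 1)
    return torch.where(route, sparse_out, dense_out)

# Toy usage: 8 text tokens followed by 56 video tokens.
q = k = v = torch.randn(1, 4, 64, 32)
is_video = torch.arange(64) >= 8
out = modality_routed_attention(q, k, v, is_video, window=16)
print(out.shape)  # torch.Size([1, 4, 64, 32])
```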
Training Corpus and Procedure
The team trained the model end-to-end on a curated dataset of 216 K video-instruction pairs, applying the same token budget across modalities. The training pipeline incorporates a global-local attention allocation strategy to balance fine-grained detail with an overall view of the video.
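The abstract does not spell out how that budget is divided, so the sketch below is a hypothetical illustration of a fixed-budget global-local split: part of each query's key budget goes to evenly strided "global" positions for overview, and the rest to the most recent "local" positions for detail. The names `budget` and `alpha` are assumptions for illustration only.

```python
# Hedged sketch of a fixed-budget global-local key allocation for one query.
def allocate_key_indices(query_pos, budget=1024, alpha=0.5):
    """Return key indices for the query at position query_pos."""
    n_global = int(budget * alpha)
    n_local = budget - n_global
    # Global branch: evenly strided positions over the whole prefix (overview).
    stride = max(1, (query_pos + 1) // max(1, n_global))
    global_idx = list(range(0, query_pos + 1, stride))[:n_global]
    # Local branch: the most recent positions (fine-grained detail).
    local_start = max(0, query_pos + 1 - n_local)
    local_idx = list(range(local_start, query_pos + 1))
    # Merge the two branches, dropping duplicates while preserving order.
    seen, merged = set(), []
    for i in global_idx + local_idx:
        if i not in seen:
            seen.add(i)
            merged.append(i)
    return merged

# Roughly `budget` keys are selected even for a very long prefix.
print(len(allocate_key_indices(query_pos=100_000, budget=1024, alpha=0.5)))
```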
Performance Gains
Benchmark evaluations show that VideoNSA outperforms token‑compression techniques and training‑free sparse baselines on long‑video understanding, temporal reasoning, and spatial tasks. Notably, the model scales reliably to 128 K tokens.
Ablation Insights
Four key findings emerged from ablation studies: (1) reliable scaling to 128 K tokens; (2) an optimal global‑local attention split at a fixed budget; (3) task‑dependent branch usage patterns; and (4) learnable combined sparse attention that creates dynamic attention sinks.
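Finding (4) refers to "attention sinks": positions that absorb a disproportionate share of attention mass. As a rough, hypothetical illustration of how such sinks can be spotted in an attention map (not the authors' analysis code), one can compare each key's average received attention against the uniform baseline:

```python
# Illustrative sink detection: flag keys whose mean received attention
# exceeds a multiple of the uniform share. Threshold value is an assumption.
import torch

def find_attention_sinks(attn, threshold=4.0):
    """attn: (queries, keys) row-stochastic attention weights."""
    per_key_mass = attn.mean(dim=0)          # average attention each key receives
    uniform = 1.0 / attn.shape[1]            # mass under uniform attention
    return (per_key_mass > threshold * uniform).nonzero(as_tuple=True)[0]

attn = torch.softmax(torch.randn(32, 64), dim=-1)
attn[:, 0] += 0.5                            # artificially boost token 0
attn = attn / attn.sum(dim=-1, keepdim=True) # renormalize rows
print(find_attention_sinks(attn))            # likely includes index 0
```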
Implications and Future Directions
The results suggest that native sparse attention can be a practical solution for extending the context window of video‑language models. The authors plan to explore further scaling, multimodal integration, and real‑world deployment scenarios.
This report is based on the abstract of the research paper, an open-access preprint; the full text is available via arXiv.