Trainable Sparse Attention Mechanism Increases LLM Decoding Speed
Researchers at Tsinghua University’s Natural Language Processing group have introduced a new sparse attention mechanism, called NOSA, alongside an inference system named NOSI, to accelerate decoding in large language models while managing GPU memory constraints. The work, posted on arXiv in October 2025, targets the key‑value (KV) cache bottleneck that limits batch size and throughput during inference.
Memory Bottleneck in LLM Decoding
During generation, the KV cache stores intermediate representations for each token, consuming a substantial portion of GPU memory. Larger inference batches, which could improve throughput, are often restricted by this memory usage.
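To make the scale of this bottleneck concrete, here is a back-of-envelope sizing sketch. The model shape below (32 layers, 8 KV heads with grouped-query attention, head dimension 128, fp16) is a hypothetical 8B-class configuration for illustration, not a figure from the paper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2 tensors (K and V) per layer, each of shape
    # [batch, kv_heads, seq_len, head_dim]
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative: 32k-token context, batch of 16, fp16 (2 bytes/value)
gb = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                    seq_len=32_768, batch=16) / 2**30
print(f"{gb:.1f} GiB")  # → 64.0 GiB
```

Even with grouped-query attention, the cache for this configuration alone exceeds the memory of most single GPUs, which is why batch size is capped long before compute is saturated.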
Limitations of Existing Offloading Techniques
Previous training‑free KV cache offloading methods move redundant context to the CPU and retrieve a sparse subset for attention. While they reduce memory pressure, they can degrade quality on long‑generation tasks because the sparse patterns differ between training and inference. Additionally, trainable sparse attention approaches have struggled to integrate with offloading due to unpredictable KV accesses that increase CPU‑to‑GPU data transfers.
Introducing NOSA: Constrained Sparse Attention
NOSA is designed to be trainable while explicitly limiting the volume of KV data transferred between CPU and GPU. By enforcing a predefined sparsity pattern, the mechanism reduces communication overhead and preserves the benefits of offloading without sacrificing model performance.
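The core idea of capping transfer volume can be illustrated with a minimal sketch: select a fixed budget of cached positions per decoding step, so the CPU-to-GPU transfer size is bounded regardless of context length. This is an assumption-laden illustration of budget-constrained KV selection in general, not the paper's actual algorithm; `select_kv_indices` and its scoring input are hypothetical.

```python
import numpy as np

def select_kv_indices(scores: np.ndarray, budget: int) -> np.ndarray:
    """Pick the `budget` highest-scoring cached positions.

    scores: per-position relevance to the current query (e.g. a
    query-key similarity proxy). Returning at most `budget` indices
    caps the CPU->GPU transfer at `budget` KV entries per step.
    """
    top = np.argpartition(scores, -budget)[-budget:]
    return np.sort(top)  # ascending order for contiguous-friendly reads

rng = np.random.default_rng(0)
scores = rng.standard_normal(10_000)      # 10k cached positions
idx = select_kv_indices(scores, budget=256)
print(len(idx))                           # → 256, independent of context size
```

A fixed budget like this makes per-step communication predictable, which is the property the authors argue prior trainable sparse attention lacked.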
NOSI: An Inference System Built for NOSA
The accompanying system, NOSI, implements the constrained KV cache offloading strategy, ensuring that the sparse attention operations of NOSA are executed efficiently. Together, they form a pipeline that keeps most KV data on the CPU and only moves the necessary subset for each attention step.
Empirical Performance Gains
Experiments on 1‑billion, 3‑billion, and 8‑billion parameter language models show that NOSA outperforms existing KV offloading baselines on standard, long‑input, and long‑generation benchmarks. Decoding throughput increased by up to 5.04× over FullAttn, 1.92× over InfLLMv2, and 1.83× over ShadowKV, according to the authors’ reported results.
Open Source Release
The research team has made the code for both NOSA and NOSI publicly available on GitHub, inviting further evaluation and integration by the broader AI community.
This report is based on the abstract of the research paper, an open-access preprint posted on arXiv; the full text is available via arXiv.