Trainable Sparse Attention Mechanism Increases LLM Decoding Speed
Researchers at Tsinghua University’s Natural Language Processing group have introduced a new sparse attention mechanism, called NOSA, alongside an inference system named NOSI, to accelerate decoding in large language models while managing GPU memory constraints. The work, posted on arXiv in October 2025, targets the key‑value (KV) cache bottleneck that limits batch size and throughput during inference.
Memory Bottleneck in LLM Decoding
During generation, the KV cache stores intermediate representations for each token, consuming a substantial portion of GPU memory. Larger inference batches, which could improve throughput, are often restricted by this memory usage.
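To make the scale of this bottleneck concrete, here is a back-of-envelope sizing sketch. The model shape below (32 layers, 8 KV heads with grouped-query attention, head dimension 128, fp16) is a hypothetical 8B-class configuration for illustration, not a figure from the paper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2 tensors (K and V) per layer, each of shape
    # [batch, kv_heads, seq_len, head_dim]
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative: 32k-token context, batch of 16, fp16 (2 bytes/value)
gb = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                    seq_len=32_768, batch=16) / 2**30
print(f"{gb:.1f} GiB")  # → 64.0 GiB
```

Even with grouped-query attention, the cache for this configuration alone exceeds the memory of most single GPUs, which is why batch size is capped long before compute is saturated.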
Limitations of Existing Offloading Techniques
Previous training‑free KV cache offloading methods move redundant context to the CPU and retrieve a sparse subset for attention. While they reduce memory pressure, they can degrade quality on long‑generation tasks because the sparse patterns differ between training and inference. Additionally, trainable sparse attention approaches have struggled to integrate with offloading due to unpredictable KV accesses that increase CPU‑to‑GPU data transfers.
Introducing NOSA: Constrained Sparse Attention
NOSA is designed to be trainable while explicitly limiting the volume of KV data transferred between CPU and GPU. By enforcing a predefined sparsity pattern, the mechanism reduces communication overhead and preserves the benefits of offloading without sacrificing model performance.
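The core idea of capping transfer volume can be illustrated with a minimal sketch: select a fixed budget of cached positions per decoding step, so the CPU-to-GPU transfer size is bounded regardless of context length. This is an assumption-laden illustration of budget-constrained KV selection in general, not the paper's actual algorithm; `select_kv_indices` and its scoring input are hypothetical.

```python
import numpy as np

def select_kv_indices(scores: np.ndarray, budget: int) -> np.ndarray:
    """Pick the `budget` highest-scoring cached positions.

    scores: per-position relevance to the current query (e.g. a
    query-key similarity proxy). Returning at most `budget` indices
    caps the CPU->GPU transfer at `budget` KV entries per step.
    """
    top = np.argpartition(scores, -budget)[-budget:]
    return np.sort(top)  # ascending order for contiguous-friendly reads

rng = np.random.default_rng(0)
scores = rng.standard_normal(10_000)      # 10k cached positions
idx = select_kv_indices(scores, budget=256)
print(len(idx))                           # → 256, independent of context size
```

A fixed budget like this makes per-step communication predictable, which is the property the authors argue prior trainable sparse attention lacked.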
NOSI: An Inference System Built for NOSA
The accompanying system, NOSI, implements the constrained KV cache offloading strategy, ensuring that the sparse attention operations of NOSA are executed efficiently. Together, they form a pipeline that keeps most KV data on the CPU and only moves the necessary subset for each attention step.
Empirical Performance Gains
Experiments on 1‑billion, 3‑billion, and 8‑billion parameter language models show that NOSA outperforms existing KV offloading baselines on standard, long‑input, and long‑generation benchmarks. Decoding throughput increased by up to 5.04× over FullAttn, 1.92× over InfLLMv2, and 1.83× over ShadowKV, according to the authors’ reported results.
Open Source Release
The research team has made the code for both NOSA and NOSI publicly available on GitHub, inviting further evaluation and integration by the broader AI community.
This report is based on the abstract of the research paper, an open-access preprint posted on arXiv; the full text is available via arXiv.