Researchers Identify Secondary Attention Sinks in Transformer Models
Discovery of Secondary Attention Sinks
Researchers have uncovered a new class of attention sinks—termed secondary sinks—in large transformer architectures, expanding the understanding of how attention mass is allocated across tokens. Unlike the well‑studied primary sinks that dominate attention from the beginning‑of‑sequence token, these secondary sinks emerge later in the processing pipeline and still attract a notable share of attention despite limited semantic relevance.
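The core measurement behind such findings is the share of attention mass each key position receives. As a minimal sketch (the function names, threshold, and array layout are illustrative assumptions, not the paper's method), a sink candidate can be flagged as any position whose average incoming attention is disproportionately high:

```python
import numpy as np

def attention_mass_per_token(attn: np.ndarray) -> np.ndarray:
    """Average attention mass each key position receives.

    attn: (heads, query_len, key_len) array of row-stochastic
    attention weights from a single layer.
    """
    # Mean over heads and queries -> one score per key position.
    return attn.mean(axis=(0, 1))

def find_sinks(attn: np.ndarray, threshold: float = 0.3) -> list[int]:
    """Key positions whose average incoming attention exceeds `threshold`.

    The threshold is an arbitrary choice for this sketch; a position
    that clears it despite low semantic relevance behaves as a sink.
    """
    mass = attention_mass_per_token(attn)
    return [i for i, m in enumerate(mass) if m > threshold]
```

For a primary sink, position 0 (the beginning-of-sequence token) typically dominates this statistic; secondary sinks would surface as additional positions clearing the threshold in middle layers.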
Distinguishing Primary and Secondary Sinks
The study differentiates primary sinks, which appear in early layers, persist throughout the network, and command a large proportion of attention, from secondary sinks that arise chiefly in middle layers, may persist for a variable number of layers, and draw a smaller yet significant amount of attention mass. Both types function as “attention sinks,” but their temporal and spatial characteristics within the model differ markedly.
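The distinction above rests on two observable properties: the layer at which a sink first appears and how long it persists. A simple heuristic classifier along those lines might look as follows (the thresholds and the "early layer" cutoff are assumptions for illustration, not values from the study):

```python
def classify_sink(layer_mass: list[float], n_layers: int,
                  early_frac: float = 0.25, threshold: float = 0.1) -> str:
    """Classify a token's sink type from its per-layer attention mass.

    layer_mass: attention mass the token receives at each layer.
    Primary: emerges in early layers and persists to the final layer.
    Secondary: emerges later and/or fades before the network's end.
    Heuristic sketch only; thresholds are illustrative assumptions.
    """
    active = [l for l, m in enumerate(layer_mass) if m > threshold]
    if not active:
        return "none"
    emerges_early = active[0] < early_frac * n_layers
    persists_to_end = active[-1] == n_layers - 1
    if emerges_early and persists_to_end:
        return "primary"
    return "secondary"
```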
Experimental Scope
To characterize these phenomena, the authors conducted extensive experiments across eleven distinct model families, systematically probing the layers where secondary sinks appear and measuring their influence on attention distribution. The analysis spanned a range of model sizes, enabling comparison between modest and large‑scale architectures.
Mechanisms of Sink Formation
Findings indicate that secondary sinks are generated by specific middle‑layer MLP modules. These modules map token representations onto vectors that align with the direction of the primary sink for the corresponding layer. The ℓ₂‑norm of these vectors determines each secondary sink’s score and dictates how many subsequent layers the sink will influence, thereby shaping its impact on the attention mechanism.
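The score described here can be sketched as the ℓ₂-norm of the component of an MLP output that lies along the primary-sink direction (the function name and the projection-based decomposition are assumptions of this sketch, not the paper's exact formulation):

```python
import numpy as np

def sink_score(mlp_output: np.ndarray, sink_direction: np.ndarray) -> float:
    """l2-norm of the MLP output's component along the sink direction.

    Per the summary, middle-layer MLPs write vectors aligned with the
    layer's primary-sink direction, and the l2-norm of those vectors
    acts as the secondary sink's score. Larger scores correspond to
    influence over more subsequent layers.
    """
    unit = sink_direction / np.linalg.norm(sink_direction)
    component = np.dot(mlp_output, unit) * unit  # projection onto the sink axis
    return float(np.linalg.norm(component))
```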
Effects on Attention Distribution
The emergence of secondary sinks coincides with a weakening of the primary sink in middle layers, suggesting a redistribution of attention mass as processing progresses. Although secondary sinks capture less attention than primary sinks, their presence nonetheless alters the overall attention landscape, potentially affecting downstream token representations.
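The redistribution described here can be made visible by tracking, layer by layer, the attention mass received at the primary-sink and candidate secondary-sink positions. A minimal sketch, assuming per-layer attention arrays are available and that the primary sink sits at key position 0:

```python
import numpy as np

def per_layer_sink_mass(attns: list[np.ndarray],
                        sink_positions: list[int]) -> np.ndarray:
    """Average attention mass at given key positions, one row per layer.

    attns: list of (heads, query_len, key_len) attention arrays.
    sink_positions: key indices to track, e.g. [0] for a BOS-token
    primary sink (the indexing convention is an assumption here).
    Returns an (n_layers, n_positions) array; a declining first column
    alongside a rising second would show mass shifting from the
    primary sink to a secondary one in middle layers.
    """
    return np.stack([a.mean(axis=(0, 1))[sink_positions] for a in attns])
```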
Scale‑Dependent Patterns
In larger models, the location and longevity of sinks—referred to as sink levels—exhibit more deterministic and frequent patterns. The researchers identified three distinct sink levels in the QwQ‑32B model and six levels in the Qwen3‑14B model, highlighting a scaling relationship between model size and sink behavior.
Potential Implications
Understanding secondary attention sinks may inform future model design and interpretability efforts, offering insights into how attention mechanisms evolve across layers and how they might be optimized or mitigated in applications requiring precise token weighting.

This report is based on the abstract of the research paper, an academic preprint available via open access; the full text is available on arXiv.