New Study Reveals Vulnerabilities in Mixture-of-Experts Language Models via Training-Free Attack
In a paper posted to arXiv in December 2025, researchers introduced GateBreaker, a training‑free, lightweight framework that can compromise the safety alignment of modern mixture‑of‑experts (MoE) large language models (LLMs) at inference time. The work, authored by an unnamed team of AI safety scholars, assesses how sparsely activated architectures handle harmful inputs and highlights weaknesses that could affect downstream applications worldwide.
Background on MoE Architectures
Mixture‑of‑Experts models have become popular for scaling LLMs because they activate only a small subset of parameters for each query, reducing computational cost while maintaining performance. Despite their growing deployment in critical domains, most safety research has focused on dense LLMs, leaving the unique characteristics of MoE safety largely unexplored.
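To illustrate the sparse routing at the heart of these architectures, the following is a minimal sketch of a top‑k MoE gate in PyTorch. The layer sizes, expert count, and top‑k value are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    """Minimal top-k gate: routes each token to k of n experts."""
    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network

    def forward(self, x: torch.Tensor):
        # x: (batch, d_model); the router scores every expert per token
        logits = self.router(x)
        weights, expert_ids = torch.topk(logits, self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # normalize over chosen experts
        return weights, expert_ids  # only these k experts run per token

# Example: 4 tokens routed among 8 experts, 2 active per token
gate = TopKGate(d_model=16, n_experts=8, k=2)
w, ids = gate(torch.randn(4, 16))
print(ids)  # which 2 of the 8 experts each token activates
```

Because only a few experts fire per input, safety behavior can end up concentrated in a small, identifiable slice of the network, which is the property GateBreaker exploits.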
GateBreaker Methodology
The proposed attack proceeds in three stages. First, gate‑level profiling identifies "safety experts": the experts to which the router disproportionately sends potentially harmful prompts. Second, expert‑level localization pinpoints the specific neurons within those experts that implement the safety mechanism. Finally, targeted safety removal disables the identified neurons, weakening the model’s alignment without broadly degrading utility.
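The abstract does not give implementation details, so the pipeline below is only a schematic sketch over synthetic statistics: the profiling signals (routing frequency ratios, activation differences), the 2x threshold, and the per‑expert masking scheme are all assumptions standing in for the paper's actual procedure. Only the roughly 3% ablation fraction echoes a figure reported in the results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: per-expert routing mass on harmful vs. benign prompts,
# plus per-neuron activation statistics inside each expert. All numbers are
# synthetic placeholders for real profiling runs.
n_experts, n_neurons = 8, 256
route_harmful = rng.dirichlet(np.ones(n_experts))  # routing mass, harmful prompts
route_benign = rng.dirichlet(np.ones(n_experts))   # routing mass, benign prompts

# Stage 1: gate-level profiling -- flag "safety experts" that receive
# disproportionate routing mass on harmful inputs (threshold is illustrative).
ratio = route_harmful / (route_benign + 1e-9)
safety_experts = np.where(ratio > 2.0)[0]

# Stage 2: expert-level localization -- inside each flagged expert, rank
# neurons by how much more they activate on harmful than on benign prompts.
act_harmful = rng.random((n_experts, n_neurons))
act_benign = rng.random((n_experts, n_neurons))
score = act_harmful - act_benign

# Stage 3: targeted safety removal -- zero-mask the top ~3% of neurons in
# the flagged experts, matching the fraction the paper reports disabling.
mask = np.ones((n_experts, n_neurons))
k = int(0.03 * n_neurons)
for e in safety_experts:
    top = np.argsort(score[e])[-k:]
    mask[e, top] = 0.0  # applied at inference, this ablates those neurons

print(f"flagged experts: {safety_experts}, neurons disabled per expert: {k}")
```

The key design point the sketch captures is that the attack requires no gradient updates or fine‑tuning; it only profiles routing statistics and applies a static mask at inference, which is why the authors describe it as training‑free and lightweight.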
Empirical Findings
Experimental evaluation on eight recently aligned MoE LLMs showed that disabling roughly 3% of neurons in the targeted expert layers raised the average attack success rate (ASR) from 7.4% to 64.9%, while only modestly affecting overall model performance. The authors note that the safety‑related neurons are concentrated and coordinated by the sparse routing process.
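ASR here follows the standard definition: the fraction of harmful prompts for which the model produces a compliant rather than refused response. A minimal computation, assuming a per‑prompt boolean success judgment (the paper's actual judging setup is not described in the abstract):

```python
def attack_success_rate(judgments: list[bool]) -> float:
    """Fraction of harmful prompts that elicited a compliant response."""
    return sum(judgments) / len(judgments)

# The reported jump from 7.4% to 64.9% means the compliant fraction rose
# from roughly 7 to 65 out of every 100 harmful prompts.
print(attack_success_rate([True, False, False, True]))  # 0.5
```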
Transferability Across Models
When the same set of compromised neurons was transferred to other models within the same family, a one‑shot attack increased ASR from 17.9% to 67.7%, indicating that safety structures can be shared across related architectures.
Extension to Vision‑Language Models
GateBreaker was also applied to five MoE vision‑language models, achieving a 60.9% ASR on unsafe image inputs, suggesting that the vulnerability extends beyond text‑only systems.
Implications for AI Safety
The findings underscore the need for dedicated safety evaluations of sparsely activated models and may prompt developers to redesign routing mechanisms or incorporate additional safeguards. As MoE architectures continue to scale, understanding their alignment properties will be essential for responsible deployment.
This report is based on the abstract of the research paper, an open‑access preprint on arXiv; the full text is available via arXiv.