New Study Reveals Vulnerabilities in Mixture-of-Experts Language Models via Training-Free Attack
In a paper posted to arXiv in December 2025, researchers introduced GateBreaker, a training‑free, lightweight framework that can compromise the safety alignment of modern mixture‑of‑experts (MoE) large language models (LLMs) at inference time. The work, authored by an unnamed team of AI safety scholars, assesses how sparsely activated architectures handle harmful inputs and highlights weaknesses that could affect downstream applications worldwide.
Background on MoE Architectures
Mixture‑of‑Experts models have become popular for scaling LLMs because they activate only a small subset of parameters for each query, reducing computational cost while maintaining performance. Despite their growing deployment in critical domains, most safety research has focused on dense LLMs, leaving the unique characteristics of MoE safety largely unexplored.
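To illustrate the sparse routing at the heart of these architectures, the following is a minimal sketch of a top‑k MoE gate in PyTorch. The layer sizes, expert count, and top‑k value are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    """Minimal top-k gate: routes each token to k of n experts."""
    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network

    def forward(self, x: torch.Tensor):
        # x: (batch, d_model); the router scores every expert per token
        logits = self.router(x)
        weights, expert_ids = torch.topk(logits, self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # normalize over chosen experts
        return weights, expert_ids  # only these k experts run per token

# Example: 4 tokens routed among 8 experts, 2 active per token
gate = TopKGate(d_model=16, n_experts=8, k=2)
w, ids = gate(torch.randn(4, 16))
print(ids)  # which 2 of the 8 experts each token activates
```

Because only a few experts fire per input, safety behavior can end up concentrated in a small, identifiable slice of the network, which is the property GateBreaker exploits.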
GateBreaker Methodology
The proposed attack proceeds in three stages. First, gate‑level profiling identifies "safety experts": the experts to which the router disproportionately sends potentially harmful prompts. Second, expert‑level localization pinpoints the specific neurons within those experts that implement the safety mechanism. Finally, targeted safety removal disables the identified neurons, weakening the model’s alignment without broadly degrading utility.
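The abstract does not give implementation details, so the pipeline below is only a schematic sketch over synthetic statistics: the profiling signals (routing frequency ratios, activation differences), the 2x threshold, and the per‑expert masking scheme are all assumptions standing in for the paper's actual procedure. Only the roughly 3% ablation fraction echoes a figure reported in the results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: per-expert routing mass on harmful vs. benign prompts,
# plus per-neuron activation statistics inside each expert. All numbers are
# synthetic placeholders for real profiling runs.
n_experts, n_neurons = 8, 256
route_harmful = rng.dirichlet(np.ones(n_experts))  # routing mass, harmful prompts
route_benign = rng.dirichlet(np.ones(n_experts))   # routing mass, benign prompts

# Stage 1: gate-level profiling -- flag "safety experts" that receive
# disproportionate routing mass on harmful inputs (threshold is illustrative).
ratio = route_harmful / (route_benign + 1e-9)
safety_experts = np.where(ratio > 2.0)[0]

# Stage 2: expert-level localization -- inside each flagged expert, rank
# neurons by how much more they activate on harmful than on benign prompts.
act_harmful = rng.random((n_experts, n_neurons))
act_benign = rng.random((n_experts, n_neurons))
score = act_harmful - act_benign

# Stage 3: targeted safety removal -- zero-mask the top ~3% of neurons in
# the flagged experts, matching the fraction the paper reports disabling.
mask = np.ones((n_experts, n_neurons))
k = int(0.03 * n_neurons)
for e in safety_experts:
    top = np.argsort(score[e])[-k:]
    mask[e, top] = 0.0  # applied at inference, this ablates those neurons

print(f"flagged experts: {safety_experts}, neurons disabled per expert: {k}")
```

The key design point the sketch captures is that the attack requires no gradient updates or fine‑tuning; it only profiles routing statistics and applies a static mask at inference, which is why the authors describe it as training‑free and lightweight.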
Empirical Findings
Experimental evaluation on eight recently aligned MoE LLMs showed that disabling roughly 3% of neurons in the targeted expert layers raised the average attack success rate (ASR) from 7.4% to 64.9%, while only modestly affecting overall model performance. The authors note that the safety‑related neurons are concentrated and coordinated by the sparse routing process.
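ASR here follows the standard definition: the fraction of harmful prompts for which the model produces a compliant rather than refused response. A minimal computation, assuming a per‑prompt boolean success judgment (the paper's actual judging setup is not described in the abstract):

```python
def attack_success_rate(judgments: list[bool]) -> float:
    """Fraction of harmful prompts that elicited a compliant response."""
    return sum(judgments) / len(judgments)

# The reported jump from 7.4% to 64.9% means the compliant fraction rose
# from roughly 7 to 65 out of every 100 harmful prompts.
print(attack_success_rate([True, False, False, True]))  # 0.5
```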
Transferability Across Models
When the same set of compromised neurons was transferred to other models within the same family, a one‑shot attack increased ASR from 17.9% to 67.7%, indicating that safety structures can be shared across related architectures.
Extension to Vision‑Language Models
GateBreaker was also applied to five MoE vision‑language models, achieving a 60.9% ASR on unsafe image inputs, suggesting that the vulnerability extends beyond text‑only systems.
Implications for AI Safety
The findings underscore the need for dedicated safety evaluations of sparsely activated models and may prompt developers to redesign routing mechanisms or incorporate additional safeguards. As MoE architectures continue to scale, understanding their alignment properties will be essential for responsible deployment.
This report is based on the abstract of the research paper, an open‑access preprint on arXiv; the full text is available via arXiv.