Spatial Token Rearrangement Enables High-Rate LLM Jailbreaks
A new study released in January 2026 demonstrates that large language models (LLMs) can be coaxed into producing disallowed content by rearranging tokens in spatial patterns, a technique the authors label “SpatialJB.” The research, posted on the preprint server arXiv, shows success rates approaching 100% against several leading commercial LLMs, even when the models employ advanced moderation tools such as OpenAI’s Moderation API.
Attack Methodology
SpatialJB exploits the autoregressive, token-by-token inference of Transformers by redistributing tokens across rows, columns, or diagonals within the input prompt. The rearranged prompt remains interpretable to the model, which can reconstruct the intended meaning, but its surface form no longer matches the patterns that existing guardrails are trained to flag.
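To make the mechanism concrete, the following is a minimal sketch of one possible grid-based rearrangement: tokens are written into a grid row by row and read back column by column, so that tokens which were adjacent in the original prompt end up far apart. The function name and the specific column-wise scheme are illustrative assumptions; the paper's exact SpatialJB encoding is not specified in the abstract.

```python
# Illustrative sketch of a spatial token rearrangement (an assumption,
# not the paper's exact scheme): row-major write, column-major read.
import math

def spatial_rearrange(tokens, n_cols):
    """Write tokens into an n_cols-wide grid row by row, then read them
    back column by column, yielding a spatially permuted sequence."""
    n_rows = math.ceil(len(tokens) / n_cols)
    # Pad with empty strings so the grid is rectangular.
    padded = tokens + [""] * (n_rows * n_cols - len(tokens))
    grid = [padded[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]
    # Column-major read-out separates previously adjacent tokens.
    return [grid[r][c] for c in range(n_cols)
            for r in range(n_rows) if grid[r][c]]

tokens = "how to build a better mousetrap safely".split()
print(spatial_rearrange(tokens, 3))
# → ['how', 'a', 'safely', 'to', 'better', 'build', 'mousetrap']
```

A permutation like this preserves every token, so a model that infers the layout can still recover the request, while a filter scanning the surface order sees a scrambled word sequence.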
Experimental Results
The authors evaluated the attack on a suite of prominent LLMs, reporting an average attack success rate (ASR) of 98% to 100% without additional defenses. When the OpenAI Moderation API was enabled, SpatialJB still achieved an ASR exceeding 75%, outperforming previously reported jailbreak techniques.
Implications for LLM Guardrails
Findings suggest that current moderation strategies, which primarily focus on lexical and semantic filters, may be insufficient against attacks that manipulate the spatial layout of tokens. Consequently, developers of LLM‑based applications may need to reconsider the underlying assumptions of their safety mechanisms.
Proposed Defenses
To counteract SpatialJB, the paper outlines baseline defense strategies, including input normalization that reorders tokens into a canonical sequence before processing and enhanced detection models that analyze token arrangement patterns. Preliminary evaluations indicate modest reductions in ASR, though the authors acknowledge that more robust solutions are required.
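The input-normalization idea can be sketched as follows, under the assumption that the attack lays tokens out in a rectangular grid: the defense generates candidate canonical readings of the prompt (the flat row-major text plus a column-major read-out) and runs moderation on each. The function names and the toy blocklist filter are hypothetical stand-ins, not the paper's implementation.

```python
# Minimal sketch of the input-normalization defense, assuming a
# rectangular-grid attack layout; names and filter are illustrative.
def readings(prompt: str):
    """Yield candidate canonical readings of a possibly grid-shaped
    prompt: the flat row-major text, then a column-major read-out."""
    rows = [line.split() for line in prompt.splitlines() if line.strip()]
    yield " ".join(tok for r in rows for tok in r)
    if len(rows) >= 2:
        width = max(len(r) for r in rows)
        cells = [r[c] for c in range(width) for r in rows if c < len(r)]
        # Fuse single-character cells so words split across the grid
        # are rejoined before moderation sees them.
        yield ("".join(cells) if all(len(x) == 1 for x in cells)
               else " ".join(cells))

def moderate(prompt: str, blocklist=("forbidden",)) -> bool:
    """Toy lexical filter applied to every candidate reading."""
    return any(term in reading.lower()
               for reading in readings(prompt) for term in blocklist)

# The word "forbidden" written down the columns of a 3x3 grid evades a
# naive left-to-right scan but is caught after canonicalization.
print(moderate("f b d\no i e\nr d n"))  # → True
print(moderate("hello world"))          # → False
```

In practice, a real defense would need to enumerate many more candidate layouts (diagonals, irregular grids), which is likely why the authors report only modest ASR reductions for such baselines.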
Future Directions
The authors call for further research into spatial semantics and its impact on model robustness, emphasizing the importance of integrating such considerations into the design of next‑generation LLM safety frameworks.
This report is based on the abstract of the paper, posted to arXiv as an open-access preprint; the full text is available via arXiv.