New Multi-Turn Jailbreak Framework ICON Achieves 97.1% Success Against Leading LLMs
Researchers have unveiled ICON, an automated multi‑turn jailbreak system that constructs authoritative‑style contexts to bypass safety mechanisms in large language models (LLMs). The framework, described in a recent arXiv preprint, routes malicious intent to semantically congruent contexts and iteratively refines prompts, achieving an average attack success rate of 97.1% across eight state‑of‑the‑art LLMs.
Background on LLM Jailbreaks
Recent advances in LLM capabilities have been accompanied by a rise in jailbreak techniques that exploit conversational dynamics to elicit prohibited content. Traditional approaches often rely on incremental prompt adjustments, which demand extensive model interaction and frequently become trapped in suboptimal attack trajectories.
Intent‑Context Coupling Phenomenon
The authors identify a pattern they term “Intent‑Context Coupling,” where safety constraints relax when a malicious intent aligns with a context that appears authoritative or semantically coherent, such as scientific discourse. This observation underpins the design of ICON’s context‑generation strategy.
The ICON Framework
ICON begins by mapping a given malicious intent to a matching context pattern through a prior‑guided semantic routing process. It then instantiates this pattern into a sequence of prompts that progressively build an authoritative‑style narrative, ultimately guiding the LLM to produce the targeted prohibited output.
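The abstract does not disclose how this routing is implemented. As a purely conceptual sketch, one way to realize prior-guided semantic routing is nearest-neighbor matching in an embedding space; everything below, from the function names to the pattern library, is a hypothetical illustration rather than the authors' method.

    import hashlib
    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Stand-in for a real sentence-embedding model (hypothetical)."""
        seed = int(hashlib.sha256(text.encode()).hexdigest(), 16) % (2**32)
        v = np.random.default_rng(seed).normal(size=384)
        return v / np.linalg.norm(v)

    # Hypothetical pattern library; "scientific discourse" is the one
    # example the abstract itself mentions.
    CONTEXT_PATTERNS = {
        "scientific_discourse": "framing the topic as academic research discussion",
        "historical_analysis": "framing the topic as a retrospective case study",
        "journalistic_report": "framing the topic as investigative reporting",
    }

    def route_intent(intent: str) -> str:
        """Routing reduced to nearest-neighbor search over pattern embeddings."""
        intent_vec = embed(intent)
        return max(CONTEXT_PATTERNS,
                   key=lambda name: float(intent_vec @ embed(CONTEXT_PATTERNS[name])))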
Hierarchical Optimization Strategy
To avoid stagnation in ineffective contexts, ICON employs a hierarchical optimization that combines local prompt refinement with global context switching. This dual‑level approach enables the system to adjust both the fine‑grained wording of prompts and the broader thematic direction of the conversation.
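The abstract leaves the optimization details unspecified, but the dual-level idea corresponds to a familiar search pattern: local hill-climbing combined with global restarts when progress stalls. The toy sketch below illustrates that structure on an abstract objective; it is not the authors' implementation, and all names are hypothetical.

    import random

    def hierarchical_optimize(candidates, objective, refine,
                              max_steps=50, patience=5):
        """Local refinement with global switching: improve the current
        candidate incrementally, and jump to a different candidate when
        progress stalls (the 'stagnation' case described above)."""
        current = random.choice(candidates)
        best, best_score = current, objective(current)
        stall = 0
        for _ in range(max_steps):
            proposal = refine(current)          # local, fine-grained adjustment
            score = objective(proposal)
            if score > best_score:
                best, best_score, current, stall = proposal, score, proposal, 0
            else:
                stall += 1
            if stall >= patience:               # global switch to a new context
                current = random.choice(candidates)
                stall = 0
        return best, best_score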
Experimental Evaluation
In evaluations across eight leading LLMs, including models from major AI labs, the framework consistently outperformed existing jailbreak methods, reaching an average attack success rate of 97.1%. The results suggest that exploiting intent-context coupling substantially enhances the effectiveness of multi-turn attacks.
Implications and Future Work
The findings highlight a new vector for adversarial exploitation of LLM safety controls and underscore the need for more robust alignment techniques. The authors propose further research into detection mechanisms that can identify and mitigate intent‑context coupling patterns before they result in policy violations.
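To make that proposed direction concrete, a hypothetical detector might flag conversations in which the latest request is semantically close to a known sensitive intent yet distant from the authoritative framing built up in earlier turns. The sketch below is illustrative only, assumes unit-normalized embeddings, and is not drawn from the paper.

    from typing import Callable, Sequence
    import numpy as np

    def coupling_risk(turns: Sequence[str],
                      embed: Callable[[str], np.ndarray],
                      sensitive_vecs: Sequence[np.ndarray],
                      threshold: float = 0.35) -> bool:
        """Hypothetical guardrail: flag a conversation when the latest request
        is close in embedding space to a known sensitive intent while being
        semantically distant from the framing established in earlier turns,
        i.e., a candidate intent-context coupling."""
        latest = embed(turns[-1])
        context = embed(" ".join(turns[:-1])) if len(turns) > 1 else latest
        intent_risk = max(float(latest @ v) for v in sensitive_vecs)
        framing_gap = 1.0 - float(latest @ context)
        return intent_risk > threshold and framing_gap > threshold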
This report is based on the abstract of the research paper, an open-access preprint whose full text is available on arXiv.