NeoChainDaily
30.01.2026 • 05:25 • Artificial Intelligence & Ethics

Untargeted Gradient-Based Jailbreak Attack Shows Over 80% Success Against Safety-Aligned LLMs

Researchers have introduced a new untargeted jailbreak method for large language models (LLMs) in a paper posted to arXiv in October 2025. The technique, termed Untargeted Jailbreak Attack (UJA), seeks to increase the likelihood that an LLM produces unsafe content without targeting a specific response. By focusing on an untargeted safety probability objective, the authors aim to overcome the constraints of prior gradient‑based attacks that rely on fixed target outputs. The work was conducted by an unnamed research team and is intended to highlight vulnerabilities in safety‑aligned LLMs.

Limitations of Existing Targeted Attacks

Previous gradient‑based jailbreak strategies typically optimize adversarial suffixes to coerce an LLM into generating a predefined target response. This fixed‑target formulation narrows the adversarial search space, which can reduce overall effectiveness. Moreover, bridging the large gap between the model's original output (typically a refusal) and the predefined target often requires many optimization iterations, resulting in low efficiency.
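
To make the fixed‑target formulation concrete, the sketch below computes the kind of loss such attacks minimize: the cross‑entropy of one predefined target string given the prompt plus an adversarial suffix, with the gradient taken at the suffix embeddings to rank candidate token substitutions. This is a generic illustration of prior suffix‑search attacks, not the paper's code; the model (gpt2) and all strings are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

def targeted_loss(prompt: str, suffix: str, target: str) -> torch.Tensor:
    # Tokenize the pieces separately so the suffix span is known exactly.
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    suffix_ids = tok(suffix, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, suffix_ids, target_ids], dim=1)

    # Score only the target tokens; ignore prompt and suffix positions.
    labels = input_ids.clone()
    n_prefix = prompt_ids.shape[1] + suffix_ids.shape[1]
    labels[:, :n_prefix] = -100

    # Differentiate w.r.t. the input embeddings: the gradient at the
    # suffix positions is what suffix-search attacks use to rank swaps.
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    loss = model(inputs_embeds=embeds, labels=labels).loss
    loss.backward()
    suffix_grad = embeds.grad[:, prompt_ids.shape[1]:n_prefix]  # used to pick token substitutions
    return loss.detach()

print(targeted_loss("Tell me how to do X.", " ! ! ! !", " Sure, here is how").item())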

Untargeted Objective Expands Search Space

UJA replaces the fixed‑target formulation with an untargeted objective that maximizes the probability of producing any unsafe output. By removing the constraint of a specific response pattern, the method broadens the set of exploitable prompts, allowing more flexible exploration of model vulnerabilities.
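
The abstract does not give the objective in closed form; as a hedged sketch, one natural reading is a Monte‑Carlo estimate of the probability that the model's response is judged unsafe, maximized over prompts. The sampler and judge below are hypothetical stand‑ins for whatever components the paper actually uses.

from typing import Callable, List

def untargeted_objective(
    prompt: str,
    sample_responses: Callable[[str, int], List[str]],
    judge_unsafe_prob: Callable[[str], float],
    n_samples: int = 8,
) -> float:
    # Monte-Carlo estimate of the probability mass the model places on
    # unsafe responses; an untargeted attack maximizes this over prompts.
    responses = sample_responses(prompt, n_samples)
    return sum(judge_unsafe_prob(r) for r in responses) / len(responses)

# Toy demo: a real attack would plug in an LLM sampler and a learned
# safety classifier that returns P(unsafe | response).
demo_sampler = lambda p, n: [f"response {i} to: {p}" for i in range(n)]
demo_judge = lambda r: 0.0
print(untargeted_objective("example prompt", demo_sampler, demo_judge))

Because the objective rewards any unsafe completion rather than one fixed string, the optimizer is free to follow whichever failure mode is nearest for a given model.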

Decomposed Optimization Framework

To keep the optimization tractable, the authors decompose the untargeted objective into two differentiable sub‑objectives: one that searches for the most harmful potential response and another that identifies the adversarial prompt leading to that response. The paper includes a theoretical analysis that validates this decomposition and demonstrates how gradient information can be leveraged for both sub‑tasks.
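
A minimal skeleton of such an alternating scheme is sketched below, assuming the two sub‑objectives are optimized in turn; the step functions are hypothetical placeholders for gradient‑based optimizers, not the authors' actual procedure.

from typing import Callable, Tuple

def decomposed_attack(
    init_prompt: str,
    init_response: str,
    response_step: Callable[[str, str], str],  # sub-objective 1: more harmful response
    prompt_step: Callable[[str, str], str],    # sub-objective 2: prompt eliciting it
    n_iters: int = 100,                        # abstract reports ~100 iterations
) -> Tuple[str, str]:
    prompt, response = init_prompt, init_response
    for _ in range(n_iters):
        response = response_step(prompt, response)
        prompt = prompt_step(prompt, response)
    return prompt, response

# Toy demo with no-op steps; real steps would backpropagate through a
# safety judge (stage 1) and the target LLM's likelihood (stage 2).
p, r = decomposed_attack("seed prompt", "seed response",
                         lambda p, r: r, lambda p, r: p, n_iters=3)
print(p, r)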

Performance Evaluation

According to the abstract, extensive experiments show that UJA achieves attack success rates exceeding 80% against recent safety‑aligned LLMs within only 100 optimization iterations, surpassing the best previously reported gradient‑based attacks by more than 30%.

Implications for AI Safety

The findings suggest that untargeted adversarial objectives can substantially increase the risk of unsafe content generation, even in models that have undergone safety alignment. The authors note that defenders may need to consider broader threat models that account for non‑specific safety violations.

Future Research Directions

The study recommends further investigation into defensive techniques that can detect or mitigate untargeted jailbreak attempts, as well as deeper analysis of how different model architectures respond to such attacks. Continued collaboration between safety researchers and developers is emphasized to improve robustness.

This report is based on the abstract of a research paper posted to arXiv (academic preprint / open access). The full text is available via arXiv.

End of transmission
