Black-Box Evasion Attacks Threaten In-Context Learning Classifiers

Global: Study Reveals Black-Box Evasion Attacks Threaten In-Context Learning Classifiers

Researchers have introduced a new adversarial framework, ICL-Evader, that targets large language models (LLMs) used for in-context learning (ICL) text classification. The work, presented in an arXiv preprint, outlines a zero‑query threat model that requires no access to model parameters, gradients, or query‑based feedback during attack generation.

Zero‑Query Threat Model

The proposed threat model operates under highly practical constraints, allowing attackers to craft evasion inputs without interacting with the target classifier. By relying solely on publicly available knowledge of LLM behavior, the approach sidesteps traditional requirements for probing or gradient information.

Novel Attack Techniques

ICL-Evader comprises three distinct attacks—Fake Claim, Template, and Needle‑in‑a‑Haystack—that exploit limitations in how LLMs process in‑context prompts. Each method manipulates the prompt structure to induce misclassification while remaining invisible to standard detection mechanisms.

Empirical Evaluation

Experiments spanning sentiment analysis, toxicity detection, and illicit promotion tasks demonstrate that the attacks can achieve success rates as high as 95.3%. The results markedly surpass those of conventional natural‑language‑processing attacks, which perform poorly under the same zero‑query constraints.

Defense Strategies

The authors systematically assess a range of defensive measures and identify a combined defense recipe that mitigates all three attacks with less than 5% degradation in classification accuracy. This joint approach balances robustness with minimal impact on utility.

Tool Release and Open Resources

To facilitate broader adoption of the defensive insights, the team has released an automated tool that proactively fortifies standard ICL prompts against adversarial evasion. The source code and evaluation datasets are publicly accessible via a GitHub repository.

Implications for Secure AI Deployment

These findings highlight a previously underexplored vulnerability in ICL‑based systems and suggest that practitioners should incorporate the proposed defenses when deploying LLM‑driven classifiers in real‑world settings.

This report is based on information from arXiv, licensed under Academic Preprint / Open Access. Based on the abstract of the research paper. Full text available via ArXiv.

Study Reveals Black-Box Evasion Attacks Threaten In-Context Learning Classifiers