Study Reveals Black-Box Evasion Attacks Threaten In-Context Learning Classifiers
Global: Study Reveals Black-Box Evasion Attacks Threaten In-Context Learning Classifiers
Researchers have introduced a new adversarial framework, ICL-Evader, that targets large language models (LLMs) used for in-context learning (ICL) text classification. The work, presented in an arXiv preprint, outlines a zero‑query threat model that requires no access to model parameters, gradients, or query‑based feedback during attack generation.
Zero‑Query Threat Model
The proposed threat model operates under highly practical constraints, allowing attackers to craft evasion inputs without interacting with the target classifier. By relying solely on publicly available knowledge of LLM behavior, the approach sidesteps traditional requirements for probing or gradient information.
Novel Attack Techniques
ICL-Evader comprises three distinct attacks—Fake Claim, Template, and Needle‑in‑a‑Haystack—that exploit limitations in how LLMs process in‑context prompts. Each method manipulates the prompt structure to induce misclassification while remaining invisible to standard detection mechanisms.
Empirical Evaluation
Experiments spanning sentiment analysis, toxicity detection, and illicit promotion tasks demonstrate that the attacks can achieve success rates as high as 95.3%. The results markedly surpass those of conventional natural‑language‑processing attacks, which perform poorly under the same zero‑query constraints.
Defense Strategies
The authors systematically assess a range of defensive measures and identify a combined defense recipe that mitigates all three attacks with less than 5% degradation in classification accuracy. This joint approach balances robustness with minimal impact on utility.
Tool Release and Open Resources
To facilitate broader adoption of the defensive insights, the team has released an automated tool that proactively fortifies standard ICL prompts against adversarial evasion. The source code and evaluation datasets are publicly accessible via a GitHub repository.
Implications for Secure AI Deployment
These findings highlight a previously underexplored vulnerability in ICL‑based systems and suggest that practitioners should incorporate the proposed defenses when deploying LLM‑driven classifiers in real‑world settings.
This report is based on information from arXiv, licensed under Academic Preprint / Open Access. Based on the abstract of the research paper. Full text available via ArXiv.
Ende der Übertragung