Guided Perturbation Sensitivity Detects Adversarial Text with High Accuracy
Researchers have presented a new detection framework, Guided Perturbation Sensitivity (GPS), for identifying adversarially modified text inputs. The approach measures how embedding representations shift when words deemed important are masked, allowing the system to distinguish crafted perturbations from naturally salient terms. The work appears in an arXiv preprint (arXiv:2508.11667v2) and targets transformer‑based language models commonly used for natural‑language processing tasks. By focusing on detection rather than model retraining, the method aims to provide a lightweight safeguard against a range of text‑based attacks.
Method Overview
GPS first ranks words in a sentence using importance heuristics, such as gradient‑based scores, to identify the top‑k critical tokens. Each selected token is then masked, and the resulting change in the model’s embedding space is measured. The pattern of sensitivity across the masked tokens is fed into a bidirectional LSTM detector, which classifies the input as either benign or adversarial.
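The masking‑sensitivity step can be sketched as follows. This is a minimal illustration, not the paper's implementation: `token_vec` is a deterministic stand‑in for a real contextual embedding, and the ranked indices would in practice come from gradient‑based importance scores rather than being supplied by hand.

```python
import zlib
import numpy as np

def token_vec(tok, dim=16):
    # Stand-in for a contextual token embedding: a deterministic
    # pseudo-random vector seeded by the token's CRC32 (illustration only).
    rng = np.random.default_rng(zlib.crc32(tok.encode()))
    return rng.standard_normal(dim)

def sent_embed(tokens):
    # Toy sentence embedding: mean of token vectors.
    return np.mean([token_vec(t) for t in tokens], axis=0)

def masking_sensitivity(tokens, ranked_idx, k=3, mask="[MASK]"):
    """For each of the top-k ranked tokens, mask it and measure the
    cosine distance between the original and masked sentence embeddings.
    The resulting sequence of shifts is what a downstream detector
    (a BiLSTM in the paper) would classify as benign vs. adversarial."""
    base = sent_embed(tokens)
    scores = []
    for i in ranked_idx[:k]:
        masked = list(tokens)
        masked[i] = mask
        shifted = sent_embed(masked)
        cos = np.dot(base, shifted) / (
            np.linalg.norm(base) * np.linalg.norm(shifted)
        )
        scores.append(1.0 - cos)  # larger shift -> higher sensitivity
    return scores

sentence = ["the", "movie", "was", "absoluteley", "terrible"]
# Hypothetical importance ranking (most important first).
scores = masking_sensitivity(sentence, ranked_idx=[3, 4, 1], k=3)
print(scores)
```

Adversarially substituted tokens (like the misspelled "absoluteley" above) tend to produce larger embedding shifts when masked, which is the signal the detector learns to separate from the shifts caused by naturally important words.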
Detection Accuracy
Experimental evaluation across three benchmark datasets, three distinct attack algorithms, and two victim transformer models shows that GPS attains detection accuracies exceeding 85%. The results indicate that adversarially perturbed words produce markedly higher masking sensitivity than naturally important words, a distinction the BiLSTM classifier exploits effectively.
Computational Efficiency
Compared with several state‑of‑the‑art defenses that require extensive model fine‑tuning or ensemble inference, GPS operates with lower computational overhead. The masking and embedding‑difference calculations are lightweight, and the BiLSTM detector adds modest processing time, making the framework suitable for real‑time deployment.
Word Ranking and NDCG Analysis
The study employs Normalized Discounted Cumulative Gain (NDCG) to assess the quality of word‑ranking strategies. Gradient‑based ranking achieves the highest NDCG scores, outperforming attention‑based, hybrid, and random selections. A strong correlation (ρ = 0.65) is observed between ranking quality and overall detection performance for word‑level attacks, underscoring the importance of accurate importance estimation.
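NDCG rewards rankings that place the truly perturbed words near the top, with a logarithmic discount for lower positions. A minimal sketch of the metric, using a binary relevance labeling (1 = actually perturbed word, 0 = clean word) as an assumed evaluation setup:

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: 1-indexed positions, log2 discount.
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))

def ndcg(ranked_rel):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = dcg(sorted(ranked_rel, reverse=True))
    return dcg(ranked_rel) / ideal if ideal > 0 else 0.0

# A ranking that surfaces both perturbed words first is ideal:
print(ndcg([1, 1, 0, 0]))  # 1.0
# Burying them lower in the ranking lowers the score:
print(ndcg([0, 1, 0, 1]))
```

Under this metric, the reported ρ = 0.65 correlation suggests that the better an importance heuristic is at surfacing perturbed words (higher NDCG), the better the downstream detector performs on word‑level attacks.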
Generalization to Unseen Scenarios
GPS demonstrates robust generalization when applied to datasets, attack types, and victim models that were not part of the training set. The detector maintains comparable accuracy without requiring retraining, suggesting that the sensitivity patterns it captures are broadly applicable across diverse adversarial contexts.
Broader Implications
By providing an attack‑agnostic, detection‑only solution, GPS contributes a practical tool for enhancing the security of NLP systems against adversarial manipulation. Future research may explore extending the framework to multilingual settings, integrating additional importance heuristics, or combining detection with adaptive mitigation strategies.
This report is based on the abstract of the research paper, available as an open‑access preprint via arXiv.