New White-Box Adversarial Attack Leverages SHAP Values to Challenge Vision Models
A paper submitted to arXiv on January 15, 2026 introduces a white‑box adversarial evasion technique that exploits SHAP (SHapley Additive exPlanations) values to manipulate computer‑vision classifiers. The authors—Frank Mollard, Marcus Becker, and Florian Roehrbein—report that the method can lower confidence scores or cause misclassifications while remaining visually imperceptible.
Attack Overview
The proposed approach computes SHAP values for each pixel at inference time, quantifying each pixel's contribution to the model's output. By selectively perturbing pixels with high positive SHAP influence, the attack crafts adversarial examples that steer the model toward an incorrect label.
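The core idea can be sketched on a toy model. For a linear classifier, the exact Shapley value of pixel i reduces to w[i] * (x[i] - baseline[i]), which avoids any external SHAP library here; the function names and the top-k selection rule below are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Toy "classifier": a linear score over 16 flattened pixels.  For a linear
# model the exact Shapley value of pixel i is w[i] * (x[i] - baseline[i]).
rng = np.random.default_rng(0)
w = rng.normal(size=16)        # model weights
x = rng.uniform(size=16)       # input "image"
baseline = np.zeros(16)        # all-zero reference input

def shap_linear(w, x, baseline):
    """Exact per-pixel Shapley values for a linear score."""
    return w * (x - baseline)

def perturb_top_shap(x, w, shap_vals, eps=0.05, k=4):
    """Illustrative rule: nudge the k most positively influential pixels
    against the weight direction, bounded by eps per pixel."""
    adv = x.copy()
    idx = np.argsort(shap_vals)[-k:]       # highest positive SHAP influence
    adv[idx] -= eps * np.sign(w[idx])      # each step lowers w[i] * adv[i]
    return adv

s = shap_linear(w, x, baseline)
adv = perturb_top_shap(x, w, s)
# The class score strictly decreases while only k pixels moved by at most eps.
```

On a deep network the Shapley values would instead come from an estimator such as the `shap` package, and the selection/step rule would follow the paper's optimization routine.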
Methodological Details
According to the authors, the technique operates in a white‑box setting, with full access to model parameters and gradients. The perturbations are constrained to a small L∞ norm so that the changes remain largely imperceptible to the human eye. The paper outlines an optimization routine that iteratively adjusts pixel values based on the gradient of a SHAP‑derived loss.
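The iterative, L∞-bounded loop can be sketched as a projected-gradient-style update. This is a minimal sketch under strong assumptions: the toy score is linear (so its input gradient is just the weight vector), and the step size, iteration count, and budget are arbitrary placeholders rather than the paper's settings.

```python
import numpy as np

# Sketch of an iterative perturbation loop constrained to an L-infinity ball.
rng = np.random.default_rng(1)
w = rng.normal(size=16)   # toy linear score: f(x) = w @ x
x = rng.uniform(size=16)
eps = 0.03                # L-inf budget (placeholder value)
alpha = 0.01              # step size (placeholder value)

adv = x.copy()
for _ in range(10):
    grad = w                              # d(score)/d(input) for a linear model
    adv = adv - alpha * np.sign(grad)     # step to lower the class score
    adv = np.clip(adv, x - eps, x + eps)  # project back into the L-inf ball
```

In the paper's setting the gradient would be that of the SHAP-derived loss with respect to the input, obtained via backpropagation, but the project-and-step structure is the same.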
Performance Comparison
The researchers compare the SHAP‑based attack with the Fast Gradient Sign Method (FGSM), a widely cited baseline. Their experiments suggest that the SHAP attack achieves a higher success rate in “gradient hiding” scenarios, where traditional gradient‑based attacks struggle.
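For reference, the FGSM baseline is a single signed-gradient step of size eps; the toy linear score below stands in for a real loss gradient, which is an assumption for illustration.

```python
import numpy as np

# Minimal FGSM sketch: one step of size eps along the sign of the gradient.
rng = np.random.default_rng(2)
w = rng.normal(size=16)   # toy linear score: f(x) = w @ x
x = rng.uniform(size=16)
eps = 0.03                # perturbation budget

grad = w                           # gradient of the score w.r.t. the input
adv = x - eps * np.sign(grad)      # one FGSM step to lower the score
```

FGSM's reliance on a usable input gradient is exactly what "gradient hiding" defenses target, which is where the authors report their SHAP-guided attack retains an advantage.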
Security Implications
If validated, the findings could indicate a new vulnerability vector for deep‑learning vision systems, particularly those deployed in safety‑critical environments. The authors note that the attack’s reliance on model interpretability tools may broaden the attack surface for systems that expose SHAP explanations to end users.
Defensive Considerations
Potential mitigations mentioned include limiting access to SHAP explanations, incorporating adversarial training with SHAP‑generated examples, and employing detection mechanisms that monitor abnormal SHAP value distributions.
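The last mitigation, monitoring SHAP value distributions, could be sketched as a simple anomaly score. The mean-absolute-deviation statistic, the z-score threshold, and the simulated SHAP maps below are all illustrative assumptions, not mechanisms from the paper.

```python
import numpy as np

def shap_anomaly_score(shap_vals, ref_mean, ref_std):
    """Standardized deviation of mean |SHAP| from a clean-data baseline
    (an illustrative detection statistic, not the paper's)."""
    stat = np.mean(np.abs(shap_vals))
    return abs(stat - ref_mean) / ref_std

rng = np.random.default_rng(3)
clean = rng.normal(0.0, 1.0, size=(100, 16))   # simulated clean SHAP maps
ref = np.mean(np.abs(clean), axis=1)           # per-map mean |SHAP|
ref_mean, ref_std = ref.mean(), ref.std()

suspicious = rng.normal(0.0, 4.0, size=16)     # exaggerated SHAP magnitudes
score = shap_anomaly_score(suspicious, ref_mean, ref_std)
# A score far above ~3 standard deviations would flag the input for review.
```

A deployed detector would calibrate the reference statistics on the target model's own SHAP explanations over held-out clean data.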
Research Landscape
This work adds to a growing body of literature examining the security ramifications of model‑explainability techniques. As AI interpretability gains traction, the study underscores the need for balanced deployment strategies that weigh transparency against adversarial risk.
This report is based on the abstract of the research paper; the full text is available as an open‑access preprint on arXiv.