Study Reveals High Transferability of Adversarial Videos Across Multimodal Language Models
Researchers have introduced a novel Image-to-Video MLLM (I2V-MLLM) attack that significantly improves the transferability of adversarial video samples across video‑based multimodal large language models (V‑MLLMs). In black‑box experiments using BLIP‑2 as a surrogate, the method achieved average attack success rates of 57.98% on the MSVD‑QA benchmark and 58.26% on MSRVTT‑QA for zero‑shot video question‑answering tasks.
Background
While V‑MLLMs have demonstrated susceptibility to adversarial examples in controlled settings, the ability of such attacks to generalize to unseen models—a realistic threat scenario—has remained largely unexplored. Existing attack strategies often falter in black‑box contexts, limiting their practical relevance.
Limitations of Prior Approaches
Previous methods are constrained by three primary shortcomings: they lack generalization when perturbing video features, they concentrate on sparse key‑frames rather than the full temporal sequence, and they fail to incorporate multimodal information that is essential for disrupting video representations.
Methodology
The I2V‑MLLM attack addresses these gaps by leveraging an image‑based multimodal large language model (I‑MLLM) as a surrogate to craft adversarial videos. The approach integrates multimodal interactions and spatiotemporal cues to manipulate latent video representations, and it introduces a perturbation propagation technique to accommodate unknown frame‑sampling strategies employed by target models.
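The perturbation propagation idea can be illustrated with a toy sketch: perturbations crafted on a handful of key frames are spread to every frame by temporal interpolation, so that whatever frames an unknown target model samples, it still sees an adversarial signal. The paper's exact propagation rule is not specified in the abstract, so the linear-interpolation scheme, the `eps` budget, and all names below are illustrative assumptions.

```python
import numpy as np

def propagate_perturbation(video, key_idx, key_pert, eps=8 / 255):
    """Spread key-frame perturbations to all frames by temporal linear
    interpolation, then project into an L-inf eps-ball and the valid
    pixel range. Illustrative sketch only, not the paper's method."""
    T = video.shape[0]
    full = np.empty_like(video)
    for t in range(T):
        j = int(np.searchsorted(key_idx, t))   # first key index >= t
        if j == 0:                             # at/before the first key frame
            p = key_pert[0]
        elif j == len(key_idx):                # after the last key frame
            p = key_pert[-1]
        elif key_idx[j] == t:                  # exactly on a key frame
            p = key_pert[j]
        else:                                  # blend the two neighbours
            l, r = key_idx[j - 1], key_idx[j]
            w = (t - l) / (r - l)
            p = (1 - w) * key_pert[j - 1] + w * key_pert[j]
        full[t] = np.clip(p, -eps, eps)        # stay inside the eps budget
    return np.clip(video + full, 0.0, 1.0)     # keep pixels in [0, 1]

# Toy usage: a 16-frame video, perturbations crafted on 4 key frames
rng = np.random.default_rng(0)
video = rng.random((16, 3, 8, 8)).astype(np.float32)
key_idx = [0, 5, 10, 15]
key_pert = rng.uniform(-8 / 255, 8 / 255, (4, 3, 8, 8)).astype(np.float32)
adv = propagate_perturbation(video, key_idx, key_pert)
```

Because every frame carries an (interpolated) perturbation, the attack does not depend on which frames the target model's sampler happens to pick.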
Experimental Evaluation
Experiments were conducted on two widely used video‑text multimodal benchmarks, MSVD‑QA and MSRVTT‑QA, under zero‑shot video QA conditions. The black‑box attacks were benchmarked against white‑box attacks on the same target models, allowing a direct assessment of transferability.
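Attack success rate in settings like this is typically the fraction of originally correct answers that the adversarial input flips to an incorrect one; the abstract does not spell out the paper's exact definition, so the function below is a common-convention sketch with hypothetical names.

```python
def attack_success_rate(clean_answers, adv_answers, ground_truth):
    """Fraction of questions the model answered correctly on clean video
    but incorrectly on the adversarial video. Common-convention sketch;
    the paper's precise metric definition may differ."""
    # indices the model got right before the attack
    correct = [i for i, a in enumerate(clean_answers) if a == ground_truth[i]]
    if not correct:
        return 0.0
    # of those, how many the adversarial video flipped to a wrong answer
    flipped = sum(adv_answers[i] != ground_truth[i] for i in correct)
    return flipped / len(correct)

# Toy usage: 3 correct clean answers, 2 of which the attack flips
asr = attack_success_rate(["cat", "dog", "car"],
                          ["cat", "bus", "sky"],
                          ["cat", "dog", "car"])  # -> 2/3
```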
Findings
The I2V‑MLLM attack achieved average attack success rates of 57.98% on MSVD‑QA and 58.26% on MSRVTT‑QA, performance that is competitive with white‑box attacks despite the lack of direct access to target model parameters. These results demonstrate strong cross‑model transferability of adversarial video samples.
Implications
The study highlights a pressing need for more robust defense mechanisms against adversarial video inputs in V‑MLLMs. By exposing vulnerabilities that persist in black‑box scenarios, the findings encourage further research into detection, mitigation, and model hardening techniques for multimodal AI systems.
This report is based on the abstract of a research paper posted to arXiv as an open-access preprint; the full text is available via arXiv.