Study Reveals High Transferability of Adversarial Videos Across Multimodal Language Models
Researchers have introduced a novel Image-to-Video MLLM (I2V-MLLM) attack that significantly improves the transferability of adversarial video samples across video‑based multimodal large language models (V‑MLLMs). In black‑box experiments using BLIP‑2 as a surrogate, the method achieved average attack success rates of 57.98% on the MSVD‑QA benchmark and 58.26% on MSRVTT‑QA for zero‑shot video question‑answering tasks.
Background
While V‑MLLMs have demonstrated susceptibility to adversarial examples in controlled settings, the ability of such attacks to generalize to unseen models—a realistic threat scenario—has remained largely unexplored. Existing attack strategies often falter in black‑box contexts, limiting their practical relevance.
Limitations of Prior Approaches
Previous methods are constrained by three primary shortcomings: they lack generalization when perturbing video features, they concentrate on sparse key‑frames rather than the full temporal sequence, and they fail to incorporate multimodal information that is essential for disrupting video representations.
Methodology
The I2V‑MLLM attack addresses these gaps by leveraging an image‑based multimodal large language model (I‑MLLM) as a surrogate to craft adversarial videos. The approach integrates multimodal interactions and spatiotemporal cues to manipulate latent video representations, and it introduces a perturbation propagation technique to accommodate unknown frame‑sampling strategies employed by target models.
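The perturbation propagation idea can be illustrated with a toy sketch: perturbations crafted on a handful of key frames are spread to every frame by temporal interpolation, so that whatever frames an unknown target model samples, it still sees an adversarial signal. The paper's exact propagation rule is not specified in the abstract, so the linear-interpolation scheme, the `eps` budget, and all names below are illustrative assumptions.

```python
import numpy as np

def propagate_perturbation(video, key_idx, key_pert, eps=8 / 255):
    """Spread key-frame perturbations to all frames by temporal linear
    interpolation, then project into an L-inf eps-ball and the valid
    pixel range. Illustrative sketch only, not the paper's method."""
    T = video.shape[0]
    full = np.empty_like(video)
    for t in range(T):
        j = int(np.searchsorted(key_idx, t))   # first key index >= t
        if j == 0:                             # at/before the first key frame
            p = key_pert[0]
        elif j == len(key_idx):                # after the last key frame
            p = key_pert[-1]
        elif key_idx[j] == t:                  # exactly on a key frame
            p = key_pert[j]
        else:                                  # blend the two neighbours
            l, r = key_idx[j - 1], key_idx[j]
            w = (t - l) / (r - l)
            p = (1 - w) * key_pert[j - 1] + w * key_pert[j]
        full[t] = np.clip(p, -eps, eps)        # stay inside the eps budget
    return np.clip(video + full, 0.0, 1.0)     # keep pixels in [0, 1]

# Toy usage: a 16-frame video, perturbations crafted on 4 key frames
rng = np.random.default_rng(0)
video = rng.random((16, 3, 8, 8)).astype(np.float32)
key_idx = [0, 5, 10, 15]
key_pert = rng.uniform(-8 / 255, 8 / 255, (4, 3, 8, 8)).astype(np.float32)
adv = propagate_perturbation(video, key_idx, key_pert)
```

Because every frame carries an (interpolated) perturbation, the attack does not depend on which frames the target model's sampler happens to pick.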
Experimental Evaluation
Experiments were conducted on two widely used video‑text multimodal benchmarks, MSVD‑QA and MSRVTT‑QA, under zero‑shot video QA conditions. The black‑box attacks were benchmarked against white‑box attacks on the same target models, allowing a direct assessment of transferability.
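Attack success rate in settings like this is typically the fraction of originally correct answers that the adversarial input flips to an incorrect one; the abstract does not spell out the paper's exact definition, so the function below is a common-convention sketch with hypothetical names.

```python
def attack_success_rate(clean_answers, adv_answers, ground_truth):
    """Fraction of questions the model answered correctly on clean video
    but incorrectly on the adversarial video. Common-convention sketch;
    the paper's precise metric definition may differ."""
    # indices the model got right before the attack
    correct = [i for i, a in enumerate(clean_answers) if a == ground_truth[i]]
    if not correct:
        return 0.0
    # of those, how many the adversarial video flipped to a wrong answer
    flipped = sum(adv_answers[i] != ground_truth[i] for i in correct)
    return flipped / len(correct)

# Toy usage: 3 correct clean answers, 2 of which the attack flips
asr = attack_success_rate(["cat", "dog", "car"],
                          ["cat", "bus", "sky"],
                          ["cat", "dog", "car"])  # -> 2/3
```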
Findings
The I2V‑MLLM attack achieved average attack success rates of 57.98% on MSVD‑QA and 58.26% on MSRVTT‑QA, performance that is competitive with white‑box attacks despite the lack of direct access to target model parameters. These results demonstrate strong cross‑model transferability of adversarial video samples.
Implications
The study highlights a pressing need for more robust defense mechanisms against adversarial video inputs in V‑MLLMs. By exposing vulnerabilities that persist in black‑box scenarios, the findings encourage further research into detection, mitigation, and model hardening techniques for multimodal AI systems.
This report is based on the abstract of a research paper posted to arXiv as an open-access preprint; the full text is available via arXiv.