Study Finds Visual Prompt Benchmarks Sensitive to Minor Design Choices Across Vision-Language Models
A research paper posted to arXiv in December 2025 reports that minor variations in visual prompting, such as marker color, marker size, or image compression level, can substantially alter the performance rankings of vision-language models (VLMs). The authors evaluated nine widely used open- and closed-source VLMs on two visually prompted tasks and observed that seemingly trivial changes to the benchmark setup produced large shifts in leaderboard positions.
Background
The paper notes that recent benchmarks like BLINK assess visual perception by pairing questions with explicit visual markers placed directly on images. While these benchmarks aim to isolate visual reasoning from textual priors, the study highlights that the design of these markers has been largely overlooked.
Key Findings
According to the authors, altering a marker’s color from red to blue was enough to reverse the relative rankings of several models. Similarly, modest adjustments to marker size lifted the open‑source InternVL3‑8B model to a standing comparable with larger proprietary systems such as Gemini 2.5 Pro. The analysis also revealed that JPEG compression levels applied during API calls could further reshuffle model line‑ups.
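To make the reported perturbations concrete, the sketch below shows one plausible way such variations could be implemented with Pillow: drawing a circular marker of configurable color and size on an image, then round-tripping the result through JPEG at a chosen quality to mimic compression applied during API calls. The helper name `apply_visual_prompt`, the coordinates, and the specific parameter values are illustrative assumptions, not the paper's actual pipeline.

```python
from io import BytesIO
from PIL import Image, ImageDraw

def apply_visual_prompt(image, xy, color="red", radius=12, jpeg_quality=95):
    """Draw a circular marker at `xy`, then round-trip the image through
    JPEG at the given quality to mimic compression during API calls.
    (Hypothetical helper; the paper's exact pipeline is not public
    in the abstract.)"""
    marked = image.convert("RGB")
    draw = ImageDraw.Draw(marked)
    x, y = xy
    draw.ellipse([x - radius, y - radius, x + radius, y + radius],
                 outline=color, width=3)
    buf = BytesIO()
    marked.save(buf, format="JPEG", quality=jpeg_quality)
    buf.seek(0)
    return Image.open(buf)

# The same question image under two "trivial" configurations that,
# per the paper, can reorder model rankings. "example.jpg" and the
# marker position are placeholders.
img = Image.open("example.jpg")
variant_a = apply_visual_prompt(img, (128, 96), color="red", radius=12, jpeg_quality=95)
variant_b = apply_visual_prompt(img, (128, 96), color="blue", radius=18, jpeg_quality=75)
```

Under the paper's findings, a model leaderboard computed on `variant_a` images need not agree with one computed on `variant_b`, even though the question content is identical.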
Implications for Model Ranking
These results suggest that visual prompting benchmarks are more vulnerable to low‑level implementation details than conventional semantic VLM evaluations. Consequently, the authors caution that leaderboard outcomes may reflect benchmark configuration rather than intrinsic model capabilities.
Introducing VPBench
To address the observed instability, the researchers curated a larger benchmark named VPBench, which incorporates 16 distinct visual marker variants. VPBench is intended to provide a more robust assessment framework by systematically varying marker attributes and dataset size.
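The abstract states that VPBench spans 16 marker variants but does not enumerate them. One way such a suite could be constructed is as a factorial grid over marker attributes, as in the illustrative sketch below; the specific colors, shapes, and sizes are assumptions, chosen only so the combinations total 16.

```python
from itertools import product

# Hypothetical factorial grid of marker attributes. VPBench's actual
# 16 variants are not listed in the abstract, so this 4-color x
# 2-shape x 2-size layout is purely illustrative.
COLORS = ["red", "blue", "green", "yellow"]
SHAPES = ["circle", "box"]
SIZES = [8, 16]  # marker radius / half-width in pixels

variants = [
    {"color": c, "shape": s, "size": z}
    for c, s, z in product(COLORS, SHAPES, SIZES)
]
assert len(variants) == 16
```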
Availability and Future Work
The VPBench dataset and the accompanying analysis framework have been released publicly at https://lisadunlap.github.io/vpbench/. The authors encourage the community to adopt the benchmark for more reliable VLM evaluation and to explore additional factors that may influence model performance.
This report is based on the abstract of the research paper, an open-access academic preprint; the full text is available via arXiv.