ViPer Framework Achieves Near-Human Success Against Major Visual Reasoning CAPTCHAs
Researchers introduced ViPer, a unified attack framework for Visual Reasoning CAPTCHAs (VRCs), in a paper posted to arXiv in January 2026. The system combines structured multi‑object visual perception with adaptive large‑language‑model (LLM) reasoning to solve VRC challenges. It was evaluated across six leading providers—VTT, Geetest, NetEase, Dingxiang, Shumei, and Xiaodun—demonstrating success rates up to 93.2%, a level comparable to human performance.
Background on Visual Reasoning CAPTCHAs
VRCs present users with complex visual scenes accompanied by natural‑language queries that require compositional inference over objects, attributes, and spatial relationships. Existing solvers fall into two main paradigms: vision‑centric approaches that rely on template‑specific detectors but struggle with novel layouts, and reasoning‑centric methods that leverage LLMs yet lack fine‑grained visual perception.
ViPer Architecture
ViPer addresses these gaps by parsing visual layouts to identify objects and their attributes, grounding the extracted information to the semantics of the accompanying question, and then using a modular LLM component to infer the target coordinates. The pipeline is designed to be adaptable, allowing substitution of different LLM backbones without significant performance loss.
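The three stages described above (perceive, ground, reason) can be illustrated with a minimal Python sketch. All names here (`DetectedObject`, `solve_vrc`, the `perceive` and `llm` callables) are hypothetical placeholders chosen for illustration; the paper's actual implementation is not public in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class DetectedObject:
    label: str        # e.g. "cube"
    attributes: dict  # e.g. {"color": "red", "size": "large"}
    center: Tuple[int, int]  # (x, y) pixel coordinates

def solve_vrc(image, question: str,
              perceive: Callable[[object], List[DetectedObject]],
              llm: Callable[[str], str]) -> Tuple[int, int]:
    """Hypothetical three-stage pipeline: perceive -> ground -> reason."""
    # Stage 1: structured multi-object visual perception.
    objects = perceive(image)
    # Stage 2: ground the extracted layout to the question by
    # serializing objects and attributes into a textual scene description.
    scene = "; ".join(
        f"object {i}: {o.label} {o.attributes} at {o.center}"
        for i, o in enumerate(objects))
    prompt = (f"Scene: {scene}\nQuestion: {question}\n"
              f"Answer with the index of the target object.")
    # Stage 3: modular LLM reasoning; any backbone can be substituted
    # here without changing the perception front-end.
    idx = int(llm(prompt).strip())
    return objects[idx].center  # coordinates to click
```

The key design choice this sketch captures is the decoupling of perception from reasoning: because the LLM only sees a serialized scene description, swapping the reasoning backbone leaves the rest of the pipeline untouched.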
Performance Evaluation
Across the six VRC providers, ViPer achieved success rates ranging from 89.5% to 93.2%, outperforming prior solvers such as GraphNet (83.2%), Oedipus (65.8%), and the Holistic approach (89.5%). The results indicate that ViPer matches or exceeds existing solvers across heterogeneous VRC deployments.
Robustness to LLM Variants
The framework was tested with multiple LLMs, including GPT, Grok, DeepSeek, and Kimi. In each case, ViPer maintained accuracy above 90%, indicating that its visual‑perception component effectively supports diverse reasoning engines.
Defensive Countermeasure: Template‑Space Randomization
To anticipate defensive needs, the authors propose Template‑Space Randomization (TSR), a lightweight technique that perturbs the linguistic templates of VRCs without changing the underlying task semantics. Empirical results show that TSR reduces the success rate of ViPer and similar solvers, highlighting a potential path toward more human‑solvable but machine‑resistant CAPTCHAs.
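TSR's core idea, varying the surface wording of a challenge while keeping its task semantics fixed, can be illustrated with a toy sketch. The template strings and the `randomize_prompt` helper below are hypothetical illustrations, not the authors' actual mechanism.

```python
import random

# Hypothetical paraphrase bank: each entry asks for the same target
# but varies the surface wording a solver might have memorized.
TEMPLATES = [
    "Click the {attr} {obj}.",
    "Please select the {obj} that is {attr}.",
    "Find and tap the {attr} {obj} in the image.",
]

def randomize_prompt(obj: str, attr: str, rng=None) -> str:
    """Sample one semantically equivalent phrasing of the challenge."""
    rng = rng or random
    return rng.choice(TEMPLATES).format(obj=obj, attr=attr)
```

Because every template denotes the same target object, the challenge stays equally solvable for humans, while a solver whose grounding step was tuned to one fixed phrasing sees a shifted input distribution.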
Implications for Future CAPTCHA Design
The study underscores the importance of jointly advancing visual perception and language reasoning in both attack and defense contexts. By demonstrating a near‑human level of success, ViPer sets a benchmark for future research on robust, user‑friendly CAPTCHA mechanisms.
This report is based on the abstract of the research paper, an open-access preprint; the full text is available via arXiv.