NeoChainDaily
01.01.2026 • 05:41 Research & Innovation

New RAIR Benchmark Provides Standardized Evaluation for LLM and VLM Search Relevance

A new benchmark called Rule-Aware benchmark with Image for Relevance assessment (RAIR) has been introduced to evaluate the search relevance performance of large language models (LLMs) and visual language models (VLMs) in Chinese e‑commerce scenarios. The dataset, derived from real‑world cases, was released by researchers on arXiv in December 2025. It provides a standardized framework and a set of universal rules for relevance assessment. Experiments covered 14 open‑ and closed‑source models, with GPT‑5 achieving the highest scores. The benchmark is publicly available for industry and academic use.

Benchmark Structure

RAIR is organized around three distinct subsets that together address fundamental, challenging, and multimodal aspects of relevance assessment. Each subset follows the same universal rule set, enabling consistent comparison across model families.

General Subset

The general subset contains industry‑balanced samples intended to measure baseline competencies of relevance models. It reflects typical e‑commerce queries and product listings, offering a broad view of model performance under ordinary conditions.

Long‑Tail Hard Subset

The long‑tail hard subset focuses on rare or ambiguous cases that test the limits of current models. By emphasizing difficult queries, it reveals performance gaps that may be obscured in more balanced evaluations.

Visual Salience Subset

The visual salience subset adds image data to the relevance task, allowing assessment of multimodal understanding. This component evaluates how well visual language models integrate visual cues with textual information to determine relevance.
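Because all three subsets share the same universal rule set, results can be reported per subset under one scoring procedure. The sketch below illustrates how per-subset accuracy for such a benchmark might be computed; the field names (`id`, `subset`, `label`) and subset identifiers are assumptions for illustration, not the actual RAIR schema.

```python
# Hypothetical per-subset scoring for a relevance benchmark like RAIR.
# The data layout here is an assumption, not the published RAIR format.
from collections import defaultdict

def score_by_subset(examples, predictions):
    """Return accuracy per subset.

    examples: list of dicts with 'id', 'subset', and gold 'label'
    predictions: dict mapping example id -> predicted label
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        subset = ex["subset"]
        total[subset] += 1
        if predictions.get(ex["id"]) == ex["label"]:
            correct[subset] += 1
    return {s: correct[s] / total[s] for s in total}

# Toy data using the three subset names described in the article
examples = [
    {"id": 1, "subset": "general", "label": "relevant"},
    {"id": 2, "subset": "general", "label": "irrelevant"},
    {"id": 3, "subset": "long_tail_hard", "label": "relevant"},
    {"id": 4, "subset": "visual_salience", "label": "relevant"},
]
predictions = {1: "relevant", 2: "irrelevant", 3: "irrelevant", 4: "relevant"}
print(score_by_subset(examples, predictions))
```

Reporting accuracy per subset rather than as a single pooled number is what lets a hard split such as the long-tail subset expose weaknesses that a balanced aggregate would hide.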

Evaluation Findings

Researchers applied the RAIR framework to 14 models, ranging from open‑source LLMs to proprietary systems. Across all subsets, GPT‑5 recorded the top performance, yet the results indicated that even the most advanced model faced notable challenges, particularly on the long‑tail hard subset.

Industry Impact

By establishing a common set of rules and a publicly available dataset, RAIR aims to foster more transparent and comparable relevance testing across the sector. Stakeholders can use the benchmark to identify strengths and weaknesses in their models and to guide future development.

Access and Future Directions

The RAIR dataset and accompanying evaluation scripts are hosted on an open‑access repository linked from the arXiv paper. The authors suggest expanding the benchmark to additional languages and domains to further broaden its applicability.

This report is based on the abstract of the research paper, an open-access academic preprint. The full text is available via arXiv.
