NeoChainDaily
12.01.2026 • 05:45 • Research & Innovation

Study Introduces RxnBench to Test Multimodal LLMs on Chemical Reaction Understanding

In December 2025, a team of researchers announced RxnBench, a benchmark designed to evaluate how well multimodal large language models (MLLMs) understand chemical reactions presented in authentic scientific literature. The initiative seeks to fill a gap in assessing AI-driven discovery tools, which must interpret dense, graphical reaction information spread across PDFs.

Benchmark Structure and Objectives

RxnBench consists of two distinct tasks. The Single‑Figure Question‑Answer (SF‑QA) component presents 1,525 questions derived from 305 curated reaction schematics, probing fine‑grained visual perception and mechanistic reasoning. The Full‑Document Question‑Answer (FD‑QA) task draws on 108 full‑text articles, requiring models to integrate text, diagrams, and tables to answer complex queries.

Dataset Composition

Each SF‑QA item pairs a reaction figure with a targeted question that tests the model’s ability to identify reagents, intermediates, and mechanistic steps. FD‑QA questions span multiple pages, demanding cross‑modal synthesis of information such as experimental conditions, kinetic data, and structural annotations. All source material originates from peer‑reviewed chemistry publications.
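The pairing of a figure, a targeted question, and a skill tag described above can be sketched as a simple data structure. This is a hypothetical schema for illustration only; the field names and the example values are assumptions, not taken from the RxnBench release.

```python
from dataclasses import dataclass

@dataclass
class SFQAItem:
    """Hypothetical shape of one SF-QA benchmark item (illustrative only)."""
    figure_id: str   # the curated reaction schematic the question refers to
    question: str    # targeted query about reagents, intermediates, or steps
    answer: str      # human-verified reference answer
    skill: str       # e.g. "text_extraction" | "visual_recognition" | "reasoning"

# An invented example item; the content is not from the actual dataset.
item = SFQAItem(
    figure_id="scheme_042",
    question="What intermediate forms after the first addition step?",
    answer="the enolate",
    skill="reasoning",
)
```

An FD-QA item would differ mainly in scope, referencing a full article rather than a single figure, so its questions can require synthesizing text, diagrams, and tables across pages.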

Evaluation Methodology

The authors evaluated several state‑of‑the‑art MLLMs, including models with standard visual encoders and variants that incorporate inference‑time reasoning modules. Accuracy was measured against a human‑generated answer key, with performance reported separately for text extraction, visual recognition, and logical reasoning sub‑components.
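The scoring setup described above, accuracy against a human answer key with results broken out by sub-component, can be illustrated with a minimal sketch. The category names, the normalization by lowercasing, and all example data here are assumptions for demonstration, not the authors' actual evaluation code.

```python
from collections import defaultdict

def accuracy_by_category(predictions, answer_key):
    """Compute per-category accuracy against a reference answer key.

    predictions: dict mapping question id -> model answer (str)
    answer_key:  dict mapping question id -> (gold answer, category)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qid, (gold, category) in answer_key.items():
        total[category] += 1
        # Naive exact-match scoring after trivial normalization (illustrative).
        if predictions.get(qid, "").strip().lower() == gold.strip().lower():
            correct[category] += 1
    return {c: correct[c] / total[c] for c in total}

# Invented toy data: one question per sub-component.
key = {
    "q1": ("pd(oac)2", "text_extraction"),
    "q2": ("benzyl alcohol", "visual_recognition"),
    "q3": ("sn2 displacement", "reasoning"),
}
preds = {"q1": "Pd(OAc)2", "q2": "benzaldehyde", "q3": "SN2 displacement"}
scores = accuracy_by_category(preds, key)
# scores: q1 and q3 match the key, q2 does not
```

In practice a benchmark like this would likely use more forgiving matching (or an LLM judge) than exact string comparison, but the per-category breakdown is the same idea.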

Key Findings

Results indicate that while the models reliably extract explicit textual content, they fall short on deep chemical logic and precise structural recognition. Models equipped with inference‑time reasoning outperformed baseline architectures across both tasks, yet none surpassed a 50% accuracy threshold on the FD‑QA challenge.

Implications for AI‑Driven Chemistry

The study underscores a critical capability gap for autonomous AI chemists. According to the authors, advancing domain‑specific visual encoders and strengthening reasoning engines are essential steps toward reliable AI assistance in chemical research and development.

Future Directions

The authors propose expanding RxnBench to include kinetic modeling, stereochemical analysis, and broader reaction classes. They also call for collaborative efforts to develop open‑source multimodal models tailored to the nuances of chemical literature.

This report is based on the abstract of the research paper, an open-access academic preprint; the full text is available via arXiv.

End of Transmission

Original source
