NeoChainDaily
31.12.2025 • 20:01 Research & Innovation

New Benchmark Evaluates Spatial Reasoning in Multimodal LLMs

A team of researchers has introduced GamiBench, a benchmark designed to assess spatial reasoning and 2D-to-3D planning capabilities in multimodal large language models (MLLMs). The benchmark focuses on origami-inspired folding tasks and aims to fill a gap in existing evaluations that often overlook sequential, viewpoint‑dependent reasoning.

Benchmark Overview

GamiBench comprises 186 regular and 186 impossible 2D crease patterns, each paired with its corresponding 3D folded shape. The dataset presents each pattern from six distinct viewpoints, enabling a comprehensive analysis of how models process visual information across perspectives.
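The pairing described above (a 2D crease pattern, its 3D folded counterpart, and six rendered viewpoints) can be sketched as a simple record type. This is a minimal illustration only; the field names and file layout are hypothetical, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for one GamiBench item; field names are
# illustrative and do not reflect the released dataset's actual schema.
@dataclass
class GamiBenchItem:
    pattern_id: str                 # unique crease-pattern identifier
    is_impossible: bool             # True for the 186 infeasible patterns
    crease_pattern: str             # path to the 2D crease-pattern image
    folded_shape: str               # path to the corresponding 3D folded shape
    viewpoints: list = field(default_factory=list)  # six rendered views

item = GamiBenchItem(
    pattern_id="cp-001",
    is_impossible=False,
    crease_pattern="patterns/cp-001.png",
    folded_shape="shapes/cp-001.png",
    viewpoints=[f"views/cp-001_v{i}.png" for i in range(6)],
)
assert len(item.viewpoints) == 6
```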

Task Structure

Three visual question‑answering (VQA) tasks are built around the dataset: (1) predicting the final 3D fold configuration, (2) distinguishing valid from invalid viewpoints, and (3) detecting impossible crease patterns. By requiring models to answer across multiple steps, the benchmark evaluates both final outcomes and intermediate reasoning.
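The three tasks can be framed as distinct VQA queries over the same images. The prompt wording below is hypothetical, invented for illustration rather than taken from the benchmark release.

```python
# Illustrative prompt templates for the three VQA tasks; the wording is
# hypothetical, not the benchmark's actual phrasing.
TASKS = {
    "fold_prediction": "Given this crease pattern, which of the candidate "
                       "3D shapes is the final folded result?",
    "viewpoint_validity": "Does this rendered image show a valid viewpoint "
                          "of the folded shape?",
    "impossibility_detection": "Can this crease pattern be physically "
                               "folded into a valid 3D shape?",
}

def build_query(task: str, image_path: str) -> dict:
    # Pair a task prompt with an image reference for a multimodal model.
    return {"task": task, "image": image_path, "question": TASKS[task]}

q = build_query("impossibility_detection", "patterns/cp-042.png")
assert q["question"].startswith("Can this crease pattern")
```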

Evaluation Metrics

Beyond standard accuracy, GamiBench introduces two diagnostic metrics: viewpoint consistency (VC), which measures a model’s ability to maintain coherent predictions across different angles, and impossible‑fold selection rate (IFSR), which quantifies success in identifying physically infeasible patterns. These metrics capture cross‑view consistency and physical feasibility.
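A plausible reading of the two metrics can be sketched as follows: VC as the fraction of patterns whose prediction is identical across all six viewpoints, and IFSR as the share of impossible patterns the model correctly flags. These definitions are assumptions for illustration; the paper's exact formulas may differ.

```python
from collections import defaultdict

def viewpoint_consistency(predictions: dict) -> float:
    """Fraction of patterns whose answer is identical across all viewpoints.
    Keys are (pattern_id, viewpoint) pairs; an illustrative definition only."""
    by_pattern = defaultdict(list)
    for (pattern_id, _viewpoint), answer in predictions.items():
        by_pattern[pattern_id].append(answer)
    consistent = sum(1 for answers in by_pattern.values()
                     if len(set(answers)) == 1)
    return consistent / len(by_pattern)

def impossible_fold_selection_rate(flags: dict, labels: dict) -> float:
    """Share of impossible patterns (labels[pid] is True) that the model
    flags as impossible. Again an illustrative definition."""
    impossible = [pid for pid, is_imp in labels.items() if is_imp]
    hits = sum(1 for pid in impossible if flags.get(pid))
    return hits / len(impossible)

# One fully consistent pattern, one inconsistent across its six views.
preds = {("cp1", v): "box" for v in range(6)}
preds.update({("cp2", v): ("cup" if v < 5 else "hat") for v in range(6)})
print(viewpoint_consistency(preds))  # 0.5
```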

Model Performance

Preliminary experiments reported in the abstract indicate that leading models, including GPT‑5 and Gemini‑2.5‑Pro, struggle with single‑step spatial understanding tasks within GamiBench. Their performance lags behind human baselines, highlighting current limitations in geometric cognition.

Implications for Research

The introduction of GamiBench provides a standardized framework for evaluating geometric understanding in MLLMs, addressing the shortfall of benchmarks that focus solely on static images or final predictions. Researchers can use the suite to diagnose specific weaknesses and guide the development of more robust spatial reasoning capabilities.

Access and Resources

The full dataset and accompanying code are publicly available on GitHub at https://github.com/stvngo/GamiBench, encouraging community participation and further refinement.

This report is based on the abstract of the research paper, an open-access academic preprint; the full text is available via arXiv.
