New Benchmark Evaluates Language Models’ Auditory Reasoning Without Audio Input
A research team led by Hyunjong Ok has introduced a benchmark designed to assess how well text‑only language models can reason about auditory concepts such as pitch, loudness, and sound‑source associations. The work, first submitted on September 22, 2025, and revised on January 28, 2026, appears on the preprint server arXiv under the title *AuditoryBench++*.
Benchmark Overview
According to the authors, AuditoryBench++ comprises a suite of tasks ranging from simple auditory comparisons to more complex, context‑dependent reasoning scenarios. The benchmark aims to provide fine‑grained diagnostics of a model's ability to process and integrate auditory knowledge without direct exposure to sound.
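To make the task format concrete, the sketch below shows what text‑only auditory reasoning items of this kind might look like. The field names, task labels, and questions are invented here for illustration and are not taken from the paper.

```python
# Hypothetical examples of text-only auditory reasoning items; the schema
# below is an assumption, not the benchmark's actual format.
example_items = [
    {
        "task": "pitch_comparison",
        "question": "Which typically produces a higher-pitched sound: a piccolo or a tuba?",
        "options": ["piccolo", "tuba"],
        "answer": "piccolo",
    },
    {
        "task": "contextual_reasoning",
        "question": "A smoke alarm and a refrigerator hum are both audible in a kitchen. "
                    "Which sound is more likely to mask the other?",
        "options": ["the alarm masks the hum", "the hum masks the alarm"],
        "answer": "the alarm masks the hum",
    },
]

def accuracy(model_answers, items):
    """Exact-match accuracy over a list of answered items."""
    correct = sum(ans == item["answer"] for ans, item in zip(model_answers, items))
    return correct / len(items)
```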
Methodology: AIR‑CoT
The paper proposes a novel inference technique called Auditory Imagination Reasoning with Chain‑of‑Thought (AIR‑CoT). This method generates auditory information on the fly by detecting spans marked with special tokens and injecting relevant knowledge, thereby augmenting the model’s reasoning process.
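The following is a minimal sketch of how token‑delimited span detection and knowledge injection could be wired into a text‑only reasoning pipeline. The marker tokens, the toy knowledge lookup, and the string‑rewriting loop are assumptions made for illustration; they are not the authors' implementation of AIR‑CoT.

```python
import re

# Hypothetical marker tokens; the actual special tokens used by AIR-CoT are
# defined in the paper, not here.
SPAN_OPEN, SPAN_CLOSE = "<audio>", "</audio>"

def lookup_auditory_knowledge(span: str) -> str:
    """Stand-in for a knowledge source (retriever or generator) returning a
    short textual description of the auditory properties of the span."""
    toy_kb = {
        "piccolo": "a piccolo produces very high-pitched tones",
        "tuba": "a tuba produces low-pitched tones",
    }
    return toy_kb.get(span.lower(), f"(no auditory knowledge found for '{span}')")

def inject_auditory_knowledge(reasoning: str) -> str:
    """Detect marked spans in a chain-of-thought draft and splice auditory
    knowledge in right after each span, so later steps can build on it."""
    pattern = re.compile(re.escape(SPAN_OPEN) + r"(.+?)" + re.escape(SPAN_CLOSE))

    def replace(match: re.Match) -> str:
        span = match.group(1)
        return f"{span} [{lookup_auditory_knowledge(span)}]"

    return pattern.sub(replace, reasoning)

draft = (
    f"To compare pitch, consider the {SPAN_OPEN}piccolo{SPAN_CLOSE} and the "
    f"{SPAN_OPEN}tuba{SPAN_CLOSE}; the instrument with the higher typical "
    "frequency range is the higher-pitched one."
)
print(inject_auditory_knowledge(draft))
```

In this toy version the injected knowledge comes from a hard‑coded dictionary; in practice the knowledge could be retrieved or generated by the model itself, which is the "imagination" aspect the paper describes.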
Experimental Findings
In experiments reported by the authors, recent large language models (LLMs) and multimodal LLMs evaluated on AuditoryBench++ generally achieve higher accuracy when equipped with AIR‑CoT. The technique outperforms both off‑the‑shelf models and models augmented with static auditory knowledge.
Implications for Language Models
The results suggest that current text‑only models lack robust auditory commonsense, limiting their effectiveness in multimodal applications that require sound‑related understanding. By integrating dynamic auditory imagination, models can bridge part of this gap, according to the study.
Community Resources
The authors have made the benchmark, code, and data publicly available via a project page linked in the paper, encouraging further research and replication.
Broader Research Landscape
This work aligns with a growing research focus on evaluating non‑visual modalities—such as sound and haptics—in large language models, highlighting the need for comprehensive multimodal evaluation frameworks.
This report is based on the abstract of the research paper, an open‑access academic preprint hosted on arXiv; the full text is available via arXiv.