Study Identifies Over-Search Issues in Retrieval‑Enhanced Large Language Models
A new arXiv preprint released in January 2026 examines how search‑augmented large language models (LLMs) frequently invoke external retrieval even when it does not improve response quality, a behavior the authors label “over‑searching.” The research evaluates this phenomenon across query types, model categories, retrieval conditions, and multi‑turn conversations, highlighting both performance trade‑offs and computational costs.
Evaluation Scope and Methodology
The authors conduct a systematic assessment that spans answerable and unanswerable queries, a range of LLM architectures—from standard reasoning models to deep research systems—and varying levels of retrieval noise. Multi‑turn dialogue settings are also examined to determine how over‑searching compounds over successive interactions.
Impact on Answer Accuracy and Abstention
Results indicate that while search generally boosts answer accuracy on answerable queries, it degrades the model's ability to abstain on unanswerable questions, producing more hallucinations when irrelevant retrieved context is incorporated into the response.
Pronounced Effects in Complex and Multi‑Turn Scenarios
Over‑searching is found to be more pronounced in complex reasoning models and deep research systems. The issue intensifies when retrieval outputs are noisy and escalates across turns in multi‑turn conversations, amplifying both inefficiency and error propagation.
Significance of Retrieved Evidence Composition
The study highlights that the composition of retrieved evidence plays a critical role; the presence of negative evidence—information that contradicts a query—can improve a model’s capacity to correctly abstain from answering.
Introducing Tokens Per Correctness (TPC)
To quantify the performance‑cost trade‑off, the authors propose a new metric called Tokens Per Correctness (TPC), which measures the number of tokens consumed per correctly answered query, offering a standardized way to assess efficiency in search‑augmented LLMs.
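
The abstract does not spell out the exact formula, but the metric's name suggests a simple ratio: total tokens consumed across an evaluation set divided by the number of correctly answered queries. The following Python sketch illustrates that assumed reading; the function name, the token accounting, and the handling of the zero-correct case are illustrative choices, not details taken from the paper.

    def tokens_per_correctness(token_counts, correct_flags):
        """Sketch of Tokens Per Correctness (TPC), assuming
        TPC = total tokens consumed / number of correctly answered queries.

        token_counts: tokens spent per query (prompt + retrieval + generation)
        correct_flags: whether each query was answered correctly
        """
        total_tokens = sum(token_counts)
        num_correct = sum(1 for flag in correct_flags if flag)
        if num_correct == 0:
            # No correct answers: efficiency is effectively unbounded.
            return float("inf")
        return total_tokens / num_correct

    # Example: three queries costing 1200, 800, and 400 tokens, two answered correctly.
    # TPC = (1200 + 800 + 400) / 2 = 1200 tokens per correct answer.
    print(tokens_per_correctness([1200, 800, 400], [True, False, True]))

Under this reading, a lower TPC indicates a better performance‑cost trade‑off: fewer tokens spent for each correct answer.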
Mitigation Strategies and Dataset Release
Mitigation approaches are explored at both the query formulation and retrieval stages, including techniques to filter irrelevant results and adjust prompting strategies. Additionally, the authors release the OverSearchQA dataset to support ongoing research into more efficient search‑augmented language models.
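
As an illustration only, and not the specific mitigation evaluated in the study, the sketch below shows one common form of retrieval‑stage filtering: discarding retrieved passages whose relevance score falls below a threshold before they are added to the model's context, so that noisy evidence is less likely to trigger unnecessary or misleading grounding. The scoring scheme, threshold, and function names are assumptions.

    def filter_retrieved_passages(passages, scores, min_score=0.5, max_passages=3):
        """Hypothetical retrieval-stage filter.

        passages: candidate passages returned by a retriever
        scores: retriever relevance scores aligned with passages
        Keeps only passages scoring at least min_score, capped at
        max_passages, before they are injected into the prompt.
        """
        ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
        kept = [passage for score, passage in ranked if score >= min_score]
        return kept[:max_passages]

    # Example: keep at most 3 passages scoring at least 0.5.
    # filter_retrieved_passages(["p1", "p2", "p3"], [0.9, 0.2, 0.6]) -> ["p1", "p3"]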
This report is based on the abstract of the research paper, an open‑access preprint; the full text is available via arXiv.