PRISMA: A Reinforcement-Learning-Guided Framework for Open-Domain Multi-Hop QA

New RL Framework PRISMA Improves Multi-Hop Retrieval for Open-Domain QA

Researchers introduced PRISMA, a reinforcement‑learning‑guided framework designed to enhance open‑domain, multi‑hop question answering over large text collections. The approach was detailed in a preprint posted to arXiv (arXiv:2601.05465) in January 2026 and aims to mitigate two primary obstacles that have limited the reliability of Retrieval‑Augmented Generation systems.

Problem Overview

Current RAG pipelines often suffer from retrieval collapse: iterative searches fail to locate the intermediate evidence needed to bridge the components of an answer, so downstream reasoning modules stall. In addition, end‑to‑end training is unstable because credit assignment across long reasoning chains is weak, which leads models to overfit to benchmark‑specific patterns.

PRISMA Architecture

The proposed system adopts a Plan‑Retrieve‑Inspect‑Solve‑Memoize workflow. A Planner decomposes complex queries into sub‑tasks, the Retriever fetches candidate passages, the Inspector evaluates the relevance of retrieved evidence, the Solver generates answers grounded in verified context, and the Memoizer stores successful reasoning traces for future reuse.
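The workflow above can be pictured as a simple control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class name `PrismaLikePipeline`, the stage signatures, the toy corpus, and the keyword-matching stages are all hypothetical stand-ins for the learned components, and the Memoizer is reduced to a cache keyed by the question string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PrismaLikePipeline:
    """Hypothetical Plan-Retrieve-Inspect-Solve-Memoize loop (illustrative only)."""
    planner: Callable[[str], List[str]]               # question -> sub-tasks
    retriever: Callable[[str], List[str]]             # sub-task -> candidate passages
    inspector: Callable[[str, List[str]], List[str]]  # keeps only relevant passages
    solver: Callable[[str, List[str]], str]           # answers from verified evidence
    memo: Dict[str, str] = field(default_factory=dict)  # simplified Memoizer

    def answer(self, question: str) -> str:
        # Memoize: reuse a previously successful result if one exists.
        if question in self.memo:
            return self.memo[question]
        evidence: List[str] = []
        for sub_task in self.planner(question):
            candidates = self.retriever(sub_task)
            for passage in self.inspector(sub_task, candidates):
                if passage not in evidence:  # avoid duplicate evidence
                    evidence.append(passage)
        result = self.solver(question, evidence)
        if evidence:  # store only runs that found supporting evidence
            self.memo[question] = result
        return result


# Toy stand-ins for the learned components (assumptions for illustration).
CORPUS = [
    "Paris is the capital of France.",
    "The Seine flows through Paris.",
    "Berlin is the capital of Germany.",
]


def toy_plan(question: str) -> List[str]:
    return [part.strip() for part in question.split(" and ")]


def toy_retrieve(sub_task: str) -> List[str]:
    words = set(sub_task.lower().split())
    return [p for p in CORPUS if words & set(p.lower().rstrip(".").split())]


def toy_inspect(sub_task: str, passages: List[str]) -> List[str]:
    key = sub_task.lower().split()[-1]  # crude relevance check on the last word
    return [p for p in passages if key in p.lower()]


def toy_solve(question: str, evidence: List[str]) -> str:
    return " ".join(evidence) if evidence else "insufficient evidence"


pipeline = PrismaLikePipeline(toy_plan, toy_retrieve, toy_inspect, toy_solve)
print(pipeline.answer("capital of france and river in paris"))
```

Passing the stages in as plain callables mirrors the decoupling the article describes: any one stage can be swapped without touching the others.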

Training Strategy

To optimize each component, the authors employ a Two‑Stage Group Relative Policy Optimization (GRPO). Stage I calibrates the Planner and Solver as specialized experts, while Stage II introduces Observation‑Aware Residual Policy Optimization (OARPO) to improve the Inspector’s ability to verify context and trigger corrective actions when evidence is insufficient.
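For readers unfamiliar with GRPO, its core idea is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that standard group-relative advantage computation. It does not reproduce PRISMA's two-stage schedule or OARPO, whose details are not given in this report. The function name and the use of the population standard deviation are assumptions, and implementations differ on that last choice.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat the group average get a positive learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question, rewarded 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason sparse rewards over long reasoning chains make training difficult.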

Experimental Results

Across ten open‑domain multi‑hop QA benchmarks, the authors report that PRISMA achieves state‑of‑the‑art performance and outperforms prior RL‑based methods. They also report consistent gains in both accuracy and retrieval efficiency, which they take as evidence of generalization beyond the training corpora.

Deployment Implications

Because the framework decouples planning, retrieval, and verification, it can be integrated into existing RAG infrastructures with modest computational overhead. The memoization component further reduces latency in production settings by reusing proven reasoning paths.

Future Directions

The study suggests several avenues for extension, including scaling the Inspector to handle multimodal evidence, refining policy optimization for longer reasoning horizons, and evaluating PRISMA in domain‑specific applications such as legal or medical information retrieval.

This report is based on the abstract of the research paper, an open-access preprint on arXiv (arXiv:2601.05465). The full text is available via arXiv.
