NeoChainDaily
29.01.2026 • 05:25 • Research & Innovation

Test-Time Policy Evolution Boosts LLM Reasoning Performance


Researchers led by Zhengbo Jiao submitted a new study to arXiv on January 28, 2026, introducing a test‑time framework that dynamically adapts large language model (LLM) policies during inference. The work, titled “Policy of Thoughts: Scaling LLM Reasoning via Test‑time Policy Evolution,” proposes that real‑time learning from failed attempts can improve complex, long‑horizon reasoning tasks.

Background on LLM Reasoning Challenges

Current LLMs operate under a frozen policy assumption, which can cause instability when tackling multi‑step problems that require sustained logical coherence. Existing test‑time scaling techniques typically treat execution feedback as an external signal for post‑hoc filtering or rewriting, without integrating it into the model’s internal decision‑making process.
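
For context, that conventional pattern looks roughly like the following sketch, in which a frozen policy samples N candidates and feedback is used only to pick a winner. The `model.generate` and `run_tests` interfaces here are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of conventional test-time scaling (best-of-N):
# execution feedback acts only as an external, post-hoc filter.
def best_of_n(model, prompt, n=8):
    candidates = [model.generate(prompt) for _ in range(n)]  # frozen policy
    scores = [run_tests(c) for c in candidates]              # external signal
    return candidates[scores.index(max(scores))]             # select, don't learn
```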

Proposed Policy of Thoughts Framework

The authors present the Policy of Thoughts (PoT) framework, which reframes reasoning as a within‑instance online optimization problem. PoT first generates a diverse set of candidate solutions through an efficient exploration mechanism. It then applies Group Relative Policy Optimization (GRPO) to update a transient LoRA adapter based on execution feedback, creating a closed‑loop system that refines the model’s reasoning priors for each specific instance.
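
The abstract does not give implementation details, but the loop it describes could plausibly be sketched as follows. Every interface here (`init_lora`, `sample`, `execute`, `grpo_update`, `best_candidate`) is a hypothetical stand‑in, and the round and group sizes are arbitrary.

```python
# Hypothetical sketch of the PoT closed loop; all helper names are assumed.
def policy_of_thoughts(base_model, prompt, rounds=3, group_size=8):
    adapter = init_lora(base_model)              # transient, per-instance LoRA
    for _ in range(rounds):
        # Exploration: sample a diverse group of candidate solutions.
        group = [sample(base_model, adapter, prompt) for _ in range(group_size)]
        rewards = [execute(sol) for sol in group]  # execution feedback
        if max(rewards) >= 1.0:                    # a candidate fully succeeded
            return group[rewards.index(max(rewards))]
        grpo_update(adapter, group, rewards)       # evolve the transient policy
    return best_candidate(group, rewards)          # fall back to the best attempt
```

The contrast with the best‑of‑N pattern above is the `grpo_update` step: feedback changes the sampling distribution for the next round instead of merely ranking outputs, and the adapter is discarded after the instance, leaving the base weights untouched.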

Online Optimization via GRPO

GRPO leverages group‑wise comparisons of candidate trajectories to compute relative policy gradients, allowing the temporary adapter to evolve during a single inference session. This approach enables the model to internalize successes and failures, effectively “learning” a better policy on the fly without altering the underlying base weights.
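
Under the standard GRPO formulation, the group‑relative advantage is the group‑standardized reward, and the update weights each trajectory’s log‑probability by that advantage. The sketch below shows this in PyTorch, omitting the clipped importance‑ratio and KL‑regularization terms of the full method; shapes and names are illustrative.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: shape (group_size,), one scalar reward per candidate trajectory.
    # Each reward is standardized against its own group, replacing the
    # learned value baseline used by PPO-style methods.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def grpo_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # logprobs: summed token log-probabilities per trajectory, shape (group_size,).
    # Only the transient adapter's parameters would receive this gradient.
    advantages = group_relative_advantages(rewards).detach()
    return -(advantages * logprobs).mean()
```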

Experimental Results

In benchmark evaluations, a 4‑billion‑parameter model equipped with PoT achieved 49.71% accuracy on LiveCodeBench, surpassing the performance of GPT‑4o and DeepSeek‑V3 despite being more than 50 times smaller. The authors report that the framework delivers substantial gains across multiple reasoning‑intensive tasks, highlighting its scalability and efficiency.

Implications and Future Work

The study suggests that test‑time policy evolution could become a viable pathway for enhancing LLM capabilities without the computational overhead of larger models. The authors indicate plans to explore broader domains, integrate additional feedback modalities, and assess the approach’s robustness in real‑world applications.

Limitations

While the results are promising, the authors acknowledge that the transient adapter introduces additional inference latency and that the method’s effectiveness may vary with different model architectures and task distributions. Further research is needed to quantify these trade‑offs and to develop strategies for minimizing overhead.

This report is based on the abstract of the research paper, which is available as an open‑access academic preprint via arXiv.
