NeoChainDaily
02.02.2026 • 05:46 Research & Innovation

Active Learning Enhances RLVR Efficiency for LLM Mathematical Reasoning


In a preprint posted to arXiv, researchers have introduced an active learning framework into Reinforcement Learning with Verifiable Reward (RLVR) to address the high query costs of training large language models (LLMs) for mathematical reasoning. By selecting fewer but more informative queries, the new approach matches full‑dataset performance while using only 30% of the training data, thereby reducing annotation expense.

Background on RLVR and Query Costs

RLVR has emerged as a technique to improve LLM reasoning by rewarding correct answers verified against external tools. Existing implementations typically require extensive query budgets, which translate into substantial annotation labor and financial outlay.
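The "verifiable reward" in RLVR is typically a binary signal: the model's final answer either matches a reference checked by an external verifier or it does not. A minimal sketch, assuming the common convention of extracting the answer from a `\boxed{...}` span (the paper's exact verifier is not described in the abstract):

```python
import re

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the model's boxed final answer matches the
    reference answer exactly, else 0.0. The \\boxed{...} extraction
    is an illustrative assumption, not the paper's verifier."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

print(verifiable_reward(r"The answer is \boxed{42}.", "42"))  # 1.0
print(verifiable_reward(r"The answer is \boxed{41}.", "42"))  # 0.0
```

Because each reward requires a verified reference answer, every training query carries annotation cost, which is what the active learning framework aims to reduce.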

Limitations of Traditional Active Learning

Initial experiments applying classic active learning sampling strategies—such as uncertainty sampling based solely on model confidence—did not surpass random query selection. The researchers attribute this shortfall to those strategies' reliance on subjective model uncertainty alone, neglecting objective uncertainty, i.e., whether the model's answers are actually correct.
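The uncertainty sampling baseline referred to above can be sketched as follows: rank queries by the model's self-reported confidence and label only the least confident ones. The confidence values here are an illustrative proxy; the paper's exact confidence estimator is not given in the abstract.

```python
import numpy as np

def uncertainty_sampling(confidences: np.ndarray, budget: int) -> np.ndarray:
    """Classic subjective-uncertainty baseline: select the `budget`
    queries on which the model is least confident. Note this looks
    only at model confidence, never at actual answer correctness."""
    # argsort ascending puts the lowest-confidence queries first
    return np.argsort(confidences)[:budget]

conf = np.array([0.95, 0.40, 0.70, 0.55])
print(uncertainty_sampling(conf, 2))  # [1 3]: the two least confident queries
```

Because selection here never consults correctness, a model that is confidently wrong (or unconfidently right) misleads the sampler, which is consistent with the reported failure to beat random selection.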

Introducing Uncertainty Consistency

To bridge the gap, the team proposes an uncertainty consistency metric that measures the alignment between subjective uncertainty and objective uncertainty. In offline settings, this alignment is quantified using the Point‑Biserial Correlation Coefficient (PBC), providing a statistical gauge of how well model confidence reflects true answer correctness.
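The point-biserial correlation is numerically the Pearson correlation between a dichotomous variable (here, answer correctness) and a continuous one (here, model confidence). A minimal computation, with made-up example values:

```python
import numpy as np

def point_biserial(correct: np.ndarray, confidence: np.ndarray) -> float:
    """Point-biserial correlation between binary correctness labels
    (0/1) and continuous confidence scores. Equivalent to the Pearson
    correlation with one variable dichotomous."""
    # np.corrcoef returns the 2x2 correlation matrix; take off-diagonal
    return float(np.corrcoef(correct.astype(float), confidence)[0, 1])

correct = np.array([1, 1, 0, 0, 1])
confidence = np.array([0.9, 0.8, 0.3, 0.4, 0.7])
# High PBC: confidence tracks correctness well on this toy data
print(point_biserial(correct, confidence))
```

A PBC near 1 indicates that subjective uncertainty is well aligned with objective uncertainty; a PBC near 0 indicates that model confidence carries little information about actual correctness.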

Online Variant for Dynamic Training

Because online training involves limited sampling and shifting output distributions, estimating offline PBC directly is impractical. The authors therefore develop an online variant derived from normalized advantage scores and subjective uncertainty. Theoretical analysis demonstrates that this online metric is strictly negatively correlated with the offline PBC, indicating that higher online scores correspond to better sample selection.
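The abstract does not spell out the online metric's formula, so the following is a hypothetical sketch of one way to combine the two named ingredients: normalize advantage scores within a batch and correlate them with subjective uncertainty. Both the functional form and the variable names are assumptions for illustration only.

```python
import numpy as np

def online_consistency_score(advantages: np.ndarray,
                             subjective_uncertainty: np.ndarray) -> float:
    """Hypothetical online metric built from normalized advantage
    scores and subjective uncertainty. The paper's actual definition
    is not given in the abstract; this correlation-style form merely
    illustrates how the two quantities could be combined per batch."""
    # normalize advantages within the sampled batch
    norm_adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    # batch-level correlation with the model's subjective uncertainty
    return float(np.corrcoef(norm_adv, subjective_uncertainty)[0, 1])

advantages = np.array([1.2, -0.3, 0.8, -1.0])
uncertainty = np.array([0.2, 0.7, 0.4, 0.9])
score = online_consistency_score(advantages, uncertainty)
```

Whatever its exact form, the key property the authors prove is the strict (negative) relationship between the online metric and the offline PBC, which lets the online quantity stand in for PBC during training.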

Empirical Validation

Experimental results confirm that the uncertainty consistency‑driven active learning method consistently outperforms both random selection and established active learning baselines. Notably, the approach reaches the performance of training on the entire dataset after processing only 30% of the queries, confirming its cost‑efficiency for RLVR‑based reasoning tasks.

Implications for Future Research

The findings suggest that incorporating objective uncertainty considerations into active learning can substantially lower the resource demands of RLVR training. This may accelerate the deployment of more capable LLMs in domains where verification and mathematical reasoning are critical.

This report is based on the abstract of the research paper, available on arXiv as an open-access academic preprint. The full text is available via arXiv.
