Two-Stage Training Boosts Reasoning Capabilities of LLMs with Limited Data
In a new arXiv preprint, researchers introduce a two‑stage training strategy designed to improve large language models’ (LLMs) reasoning abilities when high‑quality training data are scarce. The approach first warms up a model by distilling long chains of thought from Knights & Knaves logic puzzles, then applies Reinforcement Learning with Verifiable Rewards (RLVR) using a small set of target‑domain examples. The work aims to address the data‑efficiency bottleneck that hampers the development of reasoning‑capable LLMs.
Warmup Phase with Knights & Knaves Puzzles
In the initial stage, the model is exposed to curated long chains of thought generated from a toy domain of Knights & Knaves logic puzzles. By distilling these detailed reasoning traces, the model acquires general problem‑solving skills without requiring large amounts of domain‑specific data. The authors report that this warmup alone yields measurable gains on a variety of downstream benchmarks.
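To make the warmup stage concrete, the sketch below shows one plausible way a Knights & Knaves puzzle could be turned into a supervised example with a long reasoning trace. The specific puzzle, the brute‑force trace format, and the field names are illustrative assumptions for this report, not the dataset or prompts used by the authors.

```python
from itertools import product

# Hedged sketch: turning a tiny Knights & Knaves puzzle into a supervised
# "long chain of thought" example for the warmup stage. The puzzle, trace
# format, and field names are assumptions made for illustration only.

# Puzzle: A says "B is a knave." B says "A and I are the same kind."
# Knights always tell the truth; knaves always lie.
def statement_a(a, b):   # truth value of A's claim
    return not b

def statement_b(a, b):   # truth value of B's claim
    return a == b

def solve_and_trace():
    trace, solutions = [], []
    for a, b in product([True, False], repeat=2):   # True = knight, False = knave
        ok_a = (a == statement_a(a, b))             # a knight's claim must be true, a knave's false
        ok_b = (b == statement_b(a, b))
        trace.append(
            f"Assume A is {'a knight' if a else 'a knave'} and B is {'a knight' if b else 'a knave'}: "
            f"A's statement is {'consistent' if ok_a else 'contradictory'}, "
            f"B's statement is {'consistent' if ok_b else 'contradictory'}."
        )
        if ok_a and ok_b:
            solutions.append((a, b))
    return trace, solutions

trace, solutions = solve_and_trace()
sft_example = {
    "prompt": 'A says "B is a knave." B says "A and I are the same kind." Who is what?',
    "reasoning": "\n".join(trace),                  # long chain of thought to distill
    "answer": "A is a knight, B is a knave." if solutions == [(True, False)] else str(solutions),
}
print(sft_example["reasoning"])
print(sft_example["answer"])
```

Distilling many such traces into the model is what the paper credits with instilling general problem‑solving behavior before any target‑domain data is used.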
Reinforcement Learning with Verifiable Rewards
The second stage employs RLVR, a reinforcement‑learning framework in which reward signals are computed by automatically verifying the correctness of the model’s outputs rather than by a learned reward model. Training proceeds on a limited dataset—no more than 100 examples from the target domain—allowing the model to fine‑tune its reasoning for specific tasks while retaining the general capabilities learned during warmup.
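The defining ingredient of RLVR is the programmatic reward check. The snippet below is a minimal sketch of such a check; the answer format (a final "Answer: ..." line) and the exact‑match comparison are assumptions for illustration, not the reward function used in the paper.

```python
import re

# Hedged sketch of a "verifiable reward" in the spirit of RLVR: the reward is
# computed by checking the final answer against a reference, not predicted by
# a learned reward model. Answer format and matching rule are assumptions.

def verifiable_reward(model_output: str, gold_answer: str) -> float:
    """Return 1.0 if the extracted final answer matches the reference, else 0.0."""
    match = re.search(r"Answer:\s*(.+)", model_output, flags=re.IGNORECASE)
    if match is None:
        return 0.0                                   # unparseable outputs earn no reward
    prediction = match.group(1).strip().rstrip(".")
    return 1.0 if prediction.lower() == gold_answer.strip().lower() else 0.0

# During RLVR fine-tuning, rewards like this would score sampled completions
# for each of the (at most 100) target-domain prompts before a policy update.
print(verifiable_reward("... therefore Answer: 42", "42"))      # 1.0
print(verifiable_reward("I believe it is 41. Answer: 41", "42"))  # 0.0
```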
Benchmark Performance Across Tasks
Experimental results show that the warmed‑up model outperforms a baseline model trained solely with RLVR on the same small dataset. Improvements are observed on established evaluation suites such as MATH, HumanEval+, and MMLU‑Pro, indicating that the warmup phase contributes to broader reasoning competence beyond the narrow training domain.
Enhanced Sample Efficiency
The authors also show that the warmup step improves sample efficiency during RLVR training: the warmed‑up model achieves comparable or superior accuracy with fewer training examples, suggesting that the two‑stage pipeline reduces the data requirements typically associated with reinforcement‑learning fine‑tuning.
Maintaining Cross‑Domain Generalizability
Even after RLVR fine‑tuning on a specific domain, the warmed‑up model retains its ability to perform well on unrelated tasks. This contrasts with models trained directly on the target data, which often exhibit reduced generalizability. The findings imply that the warmup phase helps preserve a more versatile reasoning foundation.
Implications for Data‑Scarce Environments
The study highlights the promise of combining a reasoning‑focused warmup with RLVR for building robust LLMs when high‑quality annotated data are limited. Such a strategy could be valuable for organizations seeking to deploy reasoning‑capable models in specialized domains without incurring the cost of large‑scale data collection.
This report is based on the abstract of the research paper, which is distributed via arXiv as an open‑access academic preprint; the full text is available on arXiv.
End of transmission.