Self-Distillation Policy Optimization Boosts Sample Efficiency in Language Model Reinforcement Learning
A team of researchers has introduced Self-Distillation Policy Optimization (SDPO), a new approach that transforms textual feedback from verifiable environments into a dense learning signal for large language models. The method, detailed in a recent arXiv preprint, aims to overcome the credit‑assignment bottleneck inherent in reinforcement learning with verifiable rewards (RLVR), which traditionally relies on a single scalar outcome per attempt.
Method Overview
SDPO treats the current model, conditioned on the feedback it receives, as a self‑teacher. By distilling the model’s own feedback‑informed next‑token predictions back into the policy, the technique enables the model to retrospectively identify and correct its mistakes without an external teacher or explicit reward model.
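The core idea can be illustrated with a minimal sketch. The distributions, logits, and four-token vocabulary below are all hypothetical, and the paper's exact objective may differ; the sketch only shows the general self-distillation pattern the summary describes, namely pulling the model's feedback-free predictions (the "student") toward its own feedback-conditioned predictions (the "teacher") by minimizing a KL divergence:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    # KL(p || q): how far the student distribution q is from the
    # feedback-informed teacher distribution p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical per-token logits over a tiny 4-token vocabulary.
# teacher_logits: the SAME model, conditioned on the textual feedback;
# student_logits: the model conditioned on the original prompt alone.
teacher_logits = [2.0, 0.5, -1.0, 0.0]  # feedback shifts mass to token 0
student_logits = [0.5, 1.5, 0.0, 0.0]

teacher = softmax(teacher_logits)
student = softmax(student_logits)

# The distillation loss is summed over response tokens in practice; in a
# training loop, gradients would flow into the student logits only, with
# the teacher treated as a fixed (stop-gradient) target.
loss = kl(teacher, student)
print(round(loss, 4))
```

Because teacher and student share the same weights, no external reward model or stronger teacher is needed; the feedback conditioning alone supplies the denser signal.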
Performance Gains
Across benchmarks spanning scientific reasoning, tool use, and competitive programming (LiveCodeBench v6), SDPO demonstrated improved sample efficiency and higher final accuracy compared with strong RLVR baselines. Notably, the approach also surpassed baselines in standard RLVR settings that provide only scalar feedback, by leveraging successful rollouts as implicit feedback for failed attempts.
Implications for Scalar‑Reward Environments
By extracting actionable information from rich textual cues such as runtime errors or judge evaluations, SDPO reduces reliance on sparse scalar rewards. This capability suggests that even environments traditionally limited to binary outcomes can benefit from richer internal signals, potentially reshaping how reinforcement learning is applied to code and mathematical tasks.
Test‑Time Acceleration
When applied at inference time to individual test questions, SDPO accelerated discovery on difficult binary-reward tasks, matching the discovery probability of best-of-k sampling or multi-turn conversations while using roughly a third as many attempts.
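The budget accounting behind that comparison can be made concrete. The per-attempt success rate and budget below are hypothetical numbers, not figures from the paper; the snippet only shows the standard best-of-k identity and what an effective per-attempt rate would have to be for a method to match best-of-k discovery with a third of the budget:

```python
def best_of_k_discovery(p, k):
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k

p = 0.05   # hypothetical per-attempt success rate on a hard task
k = 24     # hypothetical sampling budget for the best-of-k baseline
target = best_of_k_discovery(p, k)

# If a method reaches the same discovery probability with k/3 attempts
# (the saving reported for SDPO), its effective per-attempt rate p_eff
# solves 1 - (1 - p_eff) ** (k / 3) = target.
p_eff = 1.0 - (1.0 - target) ** (3.0 / k)
print(round(target, 3), round(p_eff, 3))
```

Under these illustrative numbers, matching a 24-sample baseline with 8 attempts requires roughly tripling the effective per-attempt success rate, which is what conditioning each retry on prior feedback is meant to buy.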
Future Directions
The authors propose extending SDPO to broader domains and investigating its integration with external reward models. Continued exploration may reveal additional efficiencies in settings where feedback is abundant but not explicitly quantified.
This report is based on the abstract of an open-access arXiv preprint; the full text is available via arXiv.