MO-GRPO Introduces Automatic Reward Normalization for Multi-Objective Reinforcement Learning
A research team has presented MO-GRPO, an extension of Group Relative Policy Optimization, to mitigate reward hacking in multi-objective reinforcement learning. The work, posted on arXiv in September 2025 (arXiv:2509.22047v2), proposes a variance-based normalization scheme that automatically reweights multiple reward signals, aiming to preserve balanced learning without manual tuning.
Background and Motivation
Group Relative Policy Optimization (GRPO) performs well when a single, reliable reward model is available, but many practical tasks require optimizing several objectives simultaneously. In such settings, prior studies have observed that GRPO can over‑emphasize one objective while neglecting others, a phenomenon commonly referred to as reward hacking.
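To make the starting point concrete, the following is a minimal sketch of the group-relative advantage computation at the heart of GRPO: rewards for a group of sampled responses are standardized against the group's own statistics. The function name and the zero-variance fallback are illustrative choices, not taken from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages (illustrative sketch).

    Standardizes each sampled response's reward against the
    mean and standard deviation of its own group.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # fallback avoids division by zero
    return [(r - mean) / std for r in rewards]
```

With a single, well-calibrated reward this standardization works well; the failure mode described above arises when several raw reward signals with very different scales are simply summed before standardization, letting the largest-scale objective dominate.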
Methodology: Automatic Reward Normalization
MO‑GRPO addresses the imbalance by scaling each reward component according to the variance of its observed values. This simple normalization ensures that all objectives contribute evenly to the overall loss function, preserving the original preference ordering among policies while eliminating the need for hand‑crafted scaling factors.
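The idea can be sketched as follows: each reward component is standardized separately, using its own observed variance, before the components are combined. This is a simplified illustration of the described scheme; the function name, data layout, and zero-variance fallback are assumptions for the example, not details from the paper.

```python
import statistics

def mo_grpo_advantages(reward_components):
    """Variance-normalized multi-objective advantages (illustrative sketch).

    reward_components: one list per objective, each holding that
    objective's rewards for the same group of sampled responses.
    Each component is standardized on its own statistics, so no
    single objective dominates the combined signal by raw scale.
    """
    group_size = len(reward_components[0])
    combined = [0.0] * group_size
    for component in reward_components:
        mean = statistics.mean(component)
        std = statistics.pstdev(component) or 1.0  # fallback avoids division by zero
        for i, r in enumerate(component):
            combined[i] += (r - mean) / std
    return combined
```

Because every component is rescaled to unit variance within the group, a reward measured in the hundreds and one measured in fractions contribute on equal footing, which is exactly the hand-tuned scaling factor this scheme replaces.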
Theoretical Guarantees
The authors provide an analytical proof that the variance‑based weighting guarantees equal contribution from each reward function to the gradient update. Consequently, the algorithm maintains the relative ranking of policies defined by the original multi‑objective criteria.
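The intuition behind the equal-contribution claim can be checked numerically: after per-component standardization, every objective has unit variance within the group regardless of its raw scale. The snippet below is a small demonstration of that property, with illustrative names; it is not the paper's proof.

```python
import statistics

def normalize(component):
    """Standardize one reward component against its own group statistics."""
    mean = statistics.mean(component)
    std = statistics.pstdev(component) or 1.0
    return [(r - mean) / std for r in component]

# Two objectives on wildly different raw scales.
small = [0.1, 0.2, 0.3, 0.4]
large = [100.0, 200.0, 300.0, 400.0]

# Both end up with unit variance, so neither dominates the sum.
print(round(statistics.pstdev(normalize(small)), 6))  # 1.0
print(round(statistics.pstdev(normalize(large)), 6))  # 1.0
```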
Experimental Evaluation
Four benchmark domains were used to assess MO-GRPO: (i) a multi-armed bandit problem, (ii) the simulated control suite MO-Gymnasium, (iii) machine-translation tasks on the WMT benchmark (English-Japanese and English-Chinese), and (iv) an instruction-following task. Each experiment compared MO-GRPO against the baseline GRPO.
Performance Outcomes
Across all domains, MO‑GRPO demonstrated more stable learning dynamics and higher aggregate performance metrics. In the translation benchmarks, the algorithm achieved modest gains in BLEU scores while maintaining balanced improvements across both language pairs. The results suggest that automatic reward normalization can enhance robustness in multi‑objective reinforcement learning scenarios.
Future Directions
The study highlights the potential for extending variance‑based normalization to other policy‑gradient methods and exploring adaptive schemes that respond to non‑stationary reward distributions. Further research may also investigate theoretical links between variance weighting and risk‑sensitive optimization.
This report is based on the abstract of the research paper, an open-access preprint; the full text is available via arXiv.