NeoChainDaily
29.12.2025 • 15:20 Research & Innovation

New Reinforcement Learning Framework Aims to Bridge Offline and Online Training Gaps


A team of researchers comprising Aoyang Qin, Deqian Kong, Wei Wang, Ying Nian Wu, Song‑Chun Zhu, and Sirui Xie introduced a novel reinforcement learning (RL) framework on 25 Dec 2025 via an arXiv preprint. The work, titled *Generative Actor Critic* (GAC), seeks to improve how offline‑pretrained models are refined once they encounter online experience, addressing a longstanding challenge in RL development.

Framework Overview

GAC reinterprets the traditional RL pipeline by separating policy evaluation and policy improvement. Evaluation is cast as learning a generative model of the joint distribution over trajectories and returns, denoted p(τ, y). Improvement then becomes a flexible inference problem on this learned model, allowing the same architecture to support both exploitation and exploration.
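The separation described above can be illustrated with a deliberately simplified sketch that is independent of the authors' actual architecture: "evaluation" fits a joint model over a trajectory feature and its return, and "improvement" becomes inference on that model, here conditioning a toy Gaussian on a high target return. All names and the Gaussian choice are illustrative assumptions, not the paper's method.

```python
# Toy sketch (not the paper's implementation): policy evaluation as
# fitting a joint model p(tau, y) over a trajectory feature tau and a
# return y; policy improvement as inference on that fitted model.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical offline dataset: a 1-D trajectory feature and a noisy
# return that grows with it.
tau = rng.normal(0.0, 1.0, size=5000)
y = 2.0 * tau + rng.normal(0.0, 0.5, size=5000)

# "Evaluation": fit a joint Gaussian over (tau, y).
data = np.stack([tau, y], axis=1)
mu = data.mean(axis=0)
cov = np.cov(data, rowvar=False)

# "Improvement" as inference: condition on a high target return y* and
# read off the most likely trajectory feature via the Gaussian
# conditional mean E[tau | y = y*].
y_target = 4.0
tau_given_y = mu[0] + cov[0, 1] / cov[1, 1] * (y_target - mu[1])
print(f"inferred trajectory feature for y*={y_target}: {tau_given_y:.2f}")
```

In the paper the joint model is a learned generative model over full trajectories rather than a Gaussian over scalar features, but the same pattern holds: improvement queries the evaluation model instead of being a separate optimization loop.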

Latent‑Plan Inference

The authors instantiate GAC with a latent‑variable model that incorporates continuous plan vectors. For exploitation, the method optimizes latent plans to maximize expected returns. For exploration, it samples latent plans conditioned on dynamically adjusted target returns, thereby generating diverse behaviors without relying on step‑wise reward signals.
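The two inference modes can be sketched as follows, using a stand‑in return model over a 2‑D latent plan; the return function, random‑search optimizer, and conditioning mechanism are all placeholders for the paper's learned components.

```python
# Hypothetical sketch of the two latent-plan inference modes: search a
# latent plan z for maximal predicted return (exploitation), or sample
# plans whose predicted return is near a target (exploration).
import numpy as np

rng = np.random.default_rng(1)

def predicted_return(z):
    # Stand-in for the learned return head of the generative model.
    return -np.sum((z - np.array([1.0, -0.5])) ** 2, axis=-1)

# Exploitation: pick the latent plan with the highest expected return
# (random search as a placeholder for gradient-based optimization).
candidates = rng.normal(0.0, 2.0, size=(4096, 2))
z_exploit = candidates[np.argmax(predicted_return(candidates))]

# Exploration: sample among latent plans whose predicted return lies
# near a dynamically adjusted target, yielding diverse behaviours
# rather than a single reward-maximizing plan.
y_target = -1.0
near_target = candidates[np.abs(predicted_return(candidates) - y_target) < 0.1]
z_explore = near_target[rng.integers(len(near_target))]
```

Note that neither mode consults a step‑wise reward signal: both operate entirely on the learned mapping from latent plans to returns, which is what allows the framework to generate diverse behaviour in reward‑sparse settings.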

Experimental Evaluation

Empirical tests were conducted on the Gym‑MuJoCo suite and the Maze2D benchmark. Results show that GAC achieves strong offline performance and delivers markedly higher offline‑to‑online improvement compared with several state‑of‑the‑art baselines, even when explicit step‑wise rewards are absent.

Performance Highlights

Across the evaluated tasks, GAC reduced the performance gap between offline pretraining and online fine‑tuning by up to 30 % relative to the next best method. The approach also demonstrated consistent policy stability during the transition from offline to online phases, a metric often cited as critical for real‑world deployment.

Implications for RL Research

By framing policy improvement as inference, GAC opens avenues for integrating probabilistic reasoning tools into RL pipelines. The authors suggest that this perspective could simplify the design of exploration strategies and enable more principled handling of uncertainty in reward‑sparse environments.

Future Directions

The paper proposes extending the latent‑plan architecture to hierarchical settings and exploring its compatibility with model‑based RL techniques. Further validation on higher‑dimensional tasks and real‑world robotics platforms is also identified as a priority.

This report is based on the abstract of the research paper, an open‑access academic preprint distributed via arXiv; the full text is available on arXiv.
