29 December 2025, 14:39 • Research & Innovation

Adversarial Training Framework Boosts LLM Math Reasoning Performance

Global: Generative Adversarial Reasoner Improves LLM Mathematical Reasoning

A team of AI researchers has introduced a training framework, described in a December 2025 arXiv preprint, that aims to reduce calculation errors and illogical steps in large language models (LLMs). The paper, titled “Generative Adversarial Reasoner,” describes an on‑policy approach that pairs an LLM reasoner with an LLM‑based discriminator to strengthen mathematical problem solving. Through adversarial reinforcement learning, the authors aim to elicit more reliable step‑by‑step reasoning on complex tasks.

Joint Training Architecture

The proposed system operates as a co‑evolutionary loop where the reasoner generates reasoning chains while the discriminator evaluates each segment for logical soundness. Both components are updated simultaneously, allowing the discriminator to learn to spot errors and the reasoner to adapt its outputs to avoid them. This adversarial dynamic is intended to produce dense, step‑level feedback beyond traditional end‑task rewards.
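To make the loop concrete, the Python sketch below mirrors the co‑evolutionary structure described above: the reasoner samples a chain on‑policy, the discriminator scores each step, and both are updated within the same round. All class names, the random scoring stub, and the update placeholders are illustrative assumptions; the paper's actual models and RL algorithm are not specified here.

```python
# Minimal sketch of the co-evolutionary update loop (assumptions marked).
import random

class Reasoner:
    """Stands in for the LLM that generates reasoning chains."""
    def generate_chain(self, problem):
        # Placeholder: a real reasoner would sample step-by-step text.
        return [f"step {i} for {problem}" for i in range(4)]

    def update(self, chain, step_rewards):
        # Placeholder for a policy-gradient update on the reasoner.
        pass

class Discriminator:
    """Stands in for the LLM-based judge of individual steps."""
    def score_step(self, step):
        # Placeholder: returns probability the step is logically sound.
        return random.uniform(0.0, 1.0)

    def update(self, chain, labels):
        # Placeholder for training the judge in the same round.
        pass

def train_round(reasoner, discriminator, problems):
    for problem in problems:
        chain = reasoner.generate_chain(problem)               # on-policy rollout
        step_rewards = [discriminator.score_step(s) for s in chain]
        reasoner.update(chain, step_rewards)                   # dense step-level signal
        # Both components are updated in the same loop, so the judge
        # and the reasoner co-evolve rather than training in phases.
        discriminator.update(chain, labels=step_rewards)

train_round(Reasoner(), Discriminator(), ["2 + 2 = ?"])
```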

Slice‑Based Review Schedule

To manage computational costs, the framework divides each reasoning chain into slices of comparable length that are deemed logically complete. The discriminator then provides concise, structured justifications for each slice, enabling targeted assessment without processing the entire chain at once. This partitioning strategy aims to balance thorough evaluation with efficiency.
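A minimal sketch of one plausible slicing rule follows. The abstract does not spell out how "logically complete" slices are detected, so a simple token‑length budget at step boundaries stands in for the paper's criterion; the function name and parameters are hypothetical.

```python
# Greedily pack consecutive reasoning steps into slices of comparable
# length, always closing a slice at a step boundary. The target length
# and the whitespace token count are illustrative assumptions.
def slice_chain(steps, target_tokens=120):
    slices, current, current_len = [], [], 0
    for step in steps:
        n_tokens = len(step.split())  # crude token count (assumption)
        if current and current_len + n_tokens > target_tokens:
            slices.append(current)    # close the slice at a step boundary
            current, current_len = [], 0
        current.append(step)
        current_len += n_tokens
    if current:
        slices.append(current)
    return slices

steps = [f"derive intermediate result {i}" for i in range(10)]
for i, sl in enumerate(slice_chain(steps, target_tokens=12)):
    print(f"slice {i}: {len(sl)} steps")
```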

Reward Structure and Sample Efficiency

Learning signals are coupled so that the reasoner receives positive reinforcement for steps that maintain logical consistency and lead to correct answers, while the discriminator is rewarded for accurately detecting flawed reasoning. Consequently, the system generates well‑calibrated on‑policy rewards that supplement sparse exact‑match signals, improving credit assignment and overall sample efficiency during training.
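The coupling might look like the sketch below. The weights, the mean over step scores, and the binary judge reward are assumptions made for illustration; the paper states only that dense step‑level judgments supplement the sparse exact‑match outcome reward.

```python
# Illustrative coupling of the two reward signals (weights assumed).
def reasoner_reward(step_scores, answer_correct, w_step=0.5, w_final=1.0):
    """Dense step-level signal plus the sparse exact-match bonus."""
    dense = sum(step_scores) / len(step_scores)   # mean judged soundness
    sparse = 1.0 if answer_correct else 0.0       # end-task outcome reward
    return w_step * dense + w_final * sparse

def discriminator_reward(predicted_flawed, actually_flawed):
    """The judge is rewarded for correctly flagging flawed steps."""
    return 1.0 if predicted_flawed == actually_flawed else 0.0

# Example: a mostly sound chain that reaches the correct answer.
print(reasoner_reward([0.9, 0.8, 0.95], answer_correct=True))            # ~1.44
print(discriminator_reward(predicted_flawed=True, actually_flawed=True))  # 1.0
```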

Benchmark Gains on Mathematical Tests

Experimental results reported on several mathematical benchmarks show consistent improvements over strong baselines. On the AIME24 dataset, the DeepSeek‑R1‑Distill‑Qwen‑7B model increased its score from 54.0 to 61.3, a rise of 7.3 points, and the DeepSeek‑R1‑Distill‑Llama‑8B model rose from 43.7 to 53.7, a gain of 10.0 points. These figures illustrate the potential of the adversarial framework to elevate LLM performance on rigorous problem‑solving tasks.

Flexible Reward Shaping

The modular nature of the discriminator permits additional reward‑shaping objectives, including teacher distillation, preference alignment, and proof‑based reasoning. By customizing the discriminator’s evaluation criteria, developers can tailor the training process to specific downstream goals without redesigning the entire system.
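One way such modularity could be exposed is sketched below: additional reward‑shaping objectives are registered as callables and folded into the step score. The class, method names, and stub objectives are hypothetical; only the idea of composable evaluation criteria comes from the article.

```python
# Sketch of a discriminator with pluggable reward-shaping objectives.
from typing import Callable, List

class ShapedDiscriminator:
    def __init__(self):
        self.objectives: List[Callable[[str], float]] = []

    def add_objective(self, fn: Callable[[str], float]) -> None:
        # Register an extra shaping term (distillation, preference, ...).
        self.objectives.append(fn)

    def logic_score(self, step: str) -> float:
        return 1.0  # stub: core logical-soundness judgment

    def score_step(self, step: str) -> float:
        # Base judgment plus the sum of all registered shaping terms.
        return self.logic_score(step) + sum(fn(step) for fn in self.objectives)

disc = ShapedDiscriminator()
disc.add_objective(lambda step: 0.2 if "teacher" in step else 0.0)  # distillation stub
disc.add_objective(lambda step: 0.1)                                # preference stub
print(disc.score_step("match teacher derivation"))  # 1.3
```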

Future Directions

The authors suggest that extending the adversarial approach to other domains, such as code generation or scientific reasoning, could further validate its versatility. Ongoing work may explore scaling the framework to larger models and integrating human‑in‑the‑loop feedback to refine the discriminator's judgments.

This report is based on the abstract of the research paper, published on arXiv as an open‑access preprint. The full text is available via arXiv.
