Boosting Mathematical Reasoning in Large Language Models with Data Synthesis

New Data Synthesis Paradigm Boosts Mathematical Reasoning in Large Language Models
Researchers have introduced a novel data synthesis paradigm designed to enhance the mathematical reasoning capabilities of large language models (LLMs) by providing problems with clearly defined and well‑graded difficulty levels.

Limitations of Existing Data Synthesis

Current approaches to generating training data for mathematical tasks often suffer from limited diversity and insufficient control over problem difficulty, which hampers the effectiveness of curriculum‑learning strategies.

MathMixup Methodology

The proposed framework, named MathMixup, generates problems through hybrid and decomposed strategies. Automated self‑checking and manual screening are then applied to the generated problems to ensure semantic clarity and to establish a structured difficulty gradient across them.
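
The two-stage quality control described above can be pictured as a simple filter. The following is a minimal sketch, not the authors' implementation: the `Problem` type, the integer `difficulty` grade, the toy `basic_self_check` rule, and the `manual_review` callback are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Problem:
    text: str
    difficulty: int  # hypothetical integer grade, 1 = easiest


def basic_self_check(problem: Problem) -> bool:
    """Toy stand-in for an automated self-check (assumption, not the paper's rule).

    Accepts a problem only if its text is non-empty and its grade is positive.
    """
    return bool(problem.text.strip()) and problem.difficulty >= 1


def screen_candidates(
    candidates: Iterable[Problem],
    auto_check: Callable[[Problem], bool],
    manual_review: Callable[[Problem], bool],
) -> List[Problem]:
    """Keep only problems that pass the automated check and then the manual review."""
    kept = []
    for problem in candidates:
        if not auto_check(problem):
            continue  # rejected automatically, never reaches a reviewer
        if manual_review(problem):
            kept.append(problem)
    return kept


# Usage: the lambda simulates a reviewer who rejects anything containing "???".
candidates = [
    Problem("Solve x + 2 = 5.", 1),
    Problem("   ", 2),
    Problem("Find ??? given y.", 3),
]
accepted = screen_candidates(
    candidates, basic_self_check, lambda p: "???" not in p.text
)
```

Running the cheap automated check first means that human reviewers only see candidates that are already well-formed, which is one plausible reason to order the stages this way.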

Dataset Construction

Using the MathMixup pipeline, the authors assembled the MathMixupQA dataset, a collection of difficulty‑controllable mathematical reasoning problems intended for seamless integration with existing training corpora.

Curriculum Learning Integration

A curriculum‑learning strategy built around the graded problems of MathMixupQA enables flexible sequencing of training examples, allowing models to progress from simpler to more complex tasks in a systematic manner.
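
As an illustration of easy-to-hard sequencing, the sketch below groups graded examples into training stages ordered by difficulty. It assumes each example carries an integer difficulty level (a hypothetical representation; the source does not specify how grades are encoded), and `curriculum_stages` is an illustrative name rather than anything from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def curriculum_stages(examples: List[Tuple[str, int]]) -> List[List[str]]:
    """Group (problem_text, difficulty) pairs into stages, easiest stage first.

    Within a stage, examples keep their original relative order.
    """
    buckets: Dict[int, List[str]] = defaultdict(list)
    for text, level in examples:
        buckets[level].append(text)
    return [buckets[level] for level in sorted(buckets)]


# Usage: three problems with hypothetical difficulty grades 1 and 2.
stages = curriculum_stages([
    ("Integrate x^2 from 0 to 1.", 2),
    ("Compute 3 + 4.", 1),
    ("Solve x^2 - 1 = 0.", 2),
])
for stage_index, stage in enumerate(stages, start=1):
    print(f"stage {stage_index}: {len(stage)} problem(s)")
```

A training loop could then iterate over the stages in order, or mix a share of earlier stages into later ones; which schedule works best is an empirical question that this sketch does not address.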

Performance Gains

In the reported experiments, a fine‑tuned Qwen2.5-7B model achieved an average score of 52.6% across seven mathematical benchmarks, which the authors report surpasses previous state‑of‑the‑art methods.

Broader Impact

The findings validate the effectiveness of difficulty‑controllable data synthesis and curriculum learning for improving LLM mathematical reasoning, suggesting broad applicability for future data‑centric training pipelines.

This report is based on the abstract of a research paper published on arXiv (Academic Preprint / Open Access). The full text is available via arXiv.
