NeoChainDaily
31.12.2025 • 20:00 Research & Innovation

Synthetic Chain-of-Thought Datasets Enhance Language Model Reasoning Even When Final Answers Are Wrong


A team of AI researchers has reported that language models can improve their reasoning performance by training on synthetic chain-of-thought (CoT) traces generated by more capable models, even when those traces culminate in incorrect answers. The study, posted on arXiv, evaluated models ranging from 1.5 billion to 9 billion parameters across several reasoning benchmarks, including MATH, GSM8K, Countdown, and MBPP.

Rethinking Human-Annotated Datasets

Traditional approaches to enhancing model reasoning rely on human‑curated CoT examples, which are labor‑intensive and may not align closely with the distributional characteristics of the target model. The new research challenges this paradigm by proposing synthetic datasets that mirror the model’s own linguistic patterns.

Generating Synthetic Traces

Researchers first employed larger, more capable language models to produce step‑by‑step reasoning sequences for a variety of tasks. Although many of these sequences concluded with wrong answers, the intermediate steps often contained valid logical operations. These synthetic traces were then used to fine‑tune smaller models.
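
A minimal sketch of this generate-then-fine-tune pipeline, assuming a Hugging Face `transformers` setup; the teacher model name, prompt template, and example problems below are illustrative placeholders, not details from the paper:

```python
from transformers import pipeline

# Hypothetical teacher; the paper's actual generator models are not named here.
teacher = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

PROMPT = "Solve step by step, then state the final answer.\n\nProblem: {problem}\n"

def generate_cot_trace(problem: str) -> str:
    """Ask the larger model for a step-by-step reasoning trace.
    Traces are kept even when their final answer is wrong."""
    out = teacher(PROMPT.format(problem=problem),
                  max_new_tokens=512, do_sample=True,
                  temperature=0.7, return_full_text=False)
    return out[0]["generated_text"]

# Each (problem, trace) pair becomes a fine-tuning example for the smaller
# model, with no filtering on final-answer correctness.
problems = ["What is 17 * 24?", "Reach 36 using 3, 4, 6, and 8."]
dataset = [{"prompt": p, "completion": generate_cot_trace(p)} for p in problems]
```

Fine-tuning the smaller model on such pairs would then proceed with standard supervised training.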

Empirical Gains Across Benchmarks

Fine‑tuned models demonstrated measurable improvements on all four evaluated datasets compared with counterparts trained solely on human‑annotated CoT data. For example, performance on the GSM8K math benchmark increased by up to 3.2 percentage points, while code generation accuracy on MBPP rose by roughly 2.8 percentage points.

Distributional Proximity as a Key Factor

To test the hypothesis that a closer distribution to the target model facilitates learning, the authors paraphrased human‑written traces using a language model, thereby shifting their statistical properties toward the model’s own output. This paraphrasing step alone yielded modest performance gains, supporting the notion that distributional alignment matters.
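
A sketch of how that paraphrasing step might look; the paraphraser model and rewrite prompt are assumptions, since the abstract does not specify them:

```python
from transformers import pipeline

# Any instruction-tuned LM can serve as the paraphraser; this choice is illustrative.
paraphraser = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

REWRITE = ("Rewrite the following reasoning trace in your own words, "
           "keeping every step and the final answer unchanged:\n\n{trace}\n")

def paraphrase_trace(human_trace: str) -> str:
    """Shift a human-written CoT trace toward the LM's own output
    distribution without (intentionally) changing its logical content."""
    out = paraphraser(REWRITE.format(trace=human_trace),
                      max_new_tokens=512, do_sample=True,
                      temperature=0.7, return_full_text=False)
    return out[0]["generated_text"]
```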

Tolerance for Flawed Reasoning

The study also introduced deliberately corrupted CoT traces, varying the degree of error in intermediate steps. Results indicated that models remained robust to a substantial amount of noise, extracting useful reasoning patterns even when portions of the trace were incorrect.
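
One simple way to implement such controlled corruption is to overwrite a fraction of intermediate steps; the scheme below is a plausible stand-in, not the paper's actual perturbation method:

```python
import random

def corrupt_trace(steps: list[str], error_rate: float,
                  rng: random.Random) -> list[str]:
    """Corrupt a fraction of intermediate reasoning steps.
    `error_rate` controls how much of the trace is perturbed."""
    corrupted = list(steps)
    # Only intermediate steps are candidates; the final-answer line is left
    # alone so the corruption targets the reasoning, not the label.
    candidates = range(len(steps) - 1)
    for i in rng.sample(candidates, int(len(candidates) * error_rate)):
        corrupted[i] = rng.choice(steps[:-1])  # swap in a step from elsewhere,
                                               # breaking the local logic
    return corrupted

rng = random.Random(0)
trace = ["17 * 20 = 340", "17 * 4 = 68", "340 + 68 = 408", "Answer: 408"]
print(corrupt_trace(trace, error_rate=0.5, rng=rng))
```

Sweeping `error_rate` from 0 toward 1 would then measure how much noise the fine-tuned model tolerates.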

Implications for Dataset Curation

The findings suggest that dataset curators might prioritize data that better reflects the linguistic and reasoning style of the models they aim to improve, rather than focusing exclusively on the correctness of final answers. This could streamline data collection and reduce reliance on extensive human annotation.

Future Directions and Limitations

While the experiments span multiple model families and tasks, the authors acknowledge that the approach has yet to be validated on larger, state‑of‑the‑art models or on domains requiring extensive world knowledge. Ongoing work will explore scaling effects and the interplay between synthetic and human‑generated data.

This report is based on the abstract of the research paper, an open-access preprint posted to arXiv; the full text is available there.

