NeoChainDaily
13.01.2026 • 05:05 Research & Innovation

Study Finds Post‑Training Methods Can Revive Dataset Contamination in Large Language Models

Researchers released a new arXiv preprint in January 2026 that investigates how dataset contamination interacts with standard post‑training stages in large language model (LLM) pipelines. The study examined clean checkpoints of Qwen2.5 (0.5 B and 1.5 B parameters) and Gemma3 (1 B and 4 B parameters), introduced duplicated test items from GSM8K and MBPP into the early portion of a 25 B‑token pre‑training corpus, and evaluated model performance before and after two widely used post‑training techniques.

Methodology

The authors injected five copies of each GSM8K and MBPP test item into the first 2 B tokens of an otherwise clean 25 B‑token dataset. Both contaminated and clean versions of the models were trained to the same token count. After pre‑training, the models underwent either supervised fine‑tuning (SFT) or reinforcement learning using group relative policy optimization (GRPO). No contamination was present in the post‑training data, allowing the researchers to isolate the effects of earlier leakage.
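The injection scheme described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the document-splitting, tokenization, and placement details here (whitespace "tokens", random positions, a fixed seed) are assumptions chosen to make the idea concrete.

```python
import random

def inject_contamination(corpus, benchmark_items, copies=5, early_budget=2_000):
    """Insert `copies` duplicates of each benchmark item at random positions
    within the leading slice of `corpus` whose cumulative length stays under
    `early_budget` tokens. Whitespace-separated words stand in for tokens;
    in the study the budget would be the first 2B tokens of a 25B-token run.
    Hypothetical sketch -- the paper's actual injection pipeline is not shown.
    """
    # Count how many leading documents fit inside the early-token budget.
    used, cutoff = 0, 0
    for doc in corpus:
        n = len(doc.split())
        if used + n > early_budget:
            break
        used += n
        cutoff += 1

    rng = random.Random(0)  # fixed seed so the contaminated corpus is reproducible
    contaminated = list(corpus)
    for item in benchmark_items:
        for _ in range(copies):
            pos = rng.randint(0, cutoff)   # only positions within the early slice
            contaminated.insert(pos, item)
            cutoff += 1                    # inserted items enlarge the early region
    return contaminated
```

The clean baseline is simply the same corpus without the inserted items, trained to the same token count, so any score gap can be attributed to the leaked test items.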

Pre‑training Contamination Effects

Initial evaluation showed that contamination produced noticeable performance spikes on the contaminated benchmarks. However, as pre‑training progressed, the inflation diminished; by the end of the 25 B‑token run, the apparent advantage had largely vanished, approaching zero in most cases.

Supervised Fine‑Tuning Outcomes

When the contaminated models were fine‑tuned with SFT, the leaked information resurfaced, leading to inflated scores exclusively on the tasks that had been contaminated. This effect was consistent across model sizes, indicating that SFT can amplify memorized artifacts without extending benefits to unrelated tasks.

Reinforcement Learning (GRPO) Outcomes

Applying GRPO also re‑exposed the leaked content, but the impact differed. In addition to improving performance on the contaminated benchmarks, GRPO produced modest gains on related, uncontaminated tasks such as GSMPlus and HumanEval, suggesting a broader generalization of the leaked knowledge.

Scale‑Related Trends

The study reported that larger models intensified these patterns. Bigger SFT models memorized more of the injected data, while larger GRPO models were more likely to translate the memorized information into capabilities that generalized beyond the original tasks.

Implications for Auditing

The authors conclude that contamination audits should be conducted after post‑training, not solely during pre‑training, to capture the resurgence of leaked data. They also note that RL‑based post‑training, while not immune to contamination effects, may mitigate over‑estimation of model performance compared with purely supervised approaches.
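The auditing recommendation amounts to tracking the clean-versus-contaminated score gap at every pipeline stage rather than only at the end of pre‑training. A minimal sketch, with stage names and numbers that are purely illustrative:

```python
def contamination_inflation(contaminated_scores, clean_scores):
    """Per-stage score gap between a contaminated run and a matched clean run.

    Both arguments map a pipeline stage name to a benchmark accuracy.
    A gap that shrinks toward zero by the end of pre-training but reappears
    after SFT or GRPO is the resurgence pattern the study describes.
    Illustrative sketch only; stage names and values are hypothetical.
    """
    return {stage: round(contaminated_scores[stage] - clean_scores[stage], 4)
            for stage in clean_scores}

# Hypothetical accuracies on a contaminated benchmark at three checkpoints.
contaminated = {"pretrain_early": 0.42, "pretrain_end": 0.31, "post_sft": 0.55}
clean        = {"pretrain_early": 0.30, "pretrain_end": 0.30, "post_sft": 0.40}
gaps = contamination_inflation(contaminated, clean)
```

An audit performed only at `pretrain_end` would see a near-zero gap and miss the inflation that `post_sft` reveals, which is precisely why the authors argue for post‑training audits.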

This report is based on the abstract of the research paper, an open‑access preprint; the full text is available via arXiv.
