NeoChainDaily
12.01.2026 • 05:25 Research & Innovation

Study Links Grokking to Norm Minimization on Zero-Loss Manifold

Researchers led by Tiberiu Musat released a preprint on arXiv that proposes a new theoretical explanation for the delayed generalization phenomenon known as grokking. The paper, first submitted on 2 Nov 2025 and revised on 8 Jan 2026, argues that after a model has memorized its training data, gradient descent continues to minimize the weight norm while remaining on the zero‑loss manifold, thereby driving eventual generalization.

Background on Grokking

Grokking describes a counter‑intuitive learning trajectory in which a neural network achieves perfect training accuracy early on, yet attains high test accuracy only after a substantial further period of training. This behavior has been observed across a range of synthetic and real‑world tasks, prompting investigations into the underlying mechanisms.
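
As an illustration only, the kind of setup in which grokking is commonly reported looks roughly like the sketch below: modular addition, a small two‑layer network, and full‑batch AdamW with weight decay. All task details and hyperparameters here are assumptions chosen for illustration, not values taken from the preprint.

```python
# Minimal grokking-style experiment sketch (illustrative; not the paper's setup).
import torch
import torch.nn.functional as F

P = 97  # modulus for the synthetic a + b (mod P) task
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
inputs = torch.cat([F.one_hot(pairs[:, 0], P), F.one_hot(pairs[:, 1], P)], dim=1).float()

# Split roughly in half so the network can memorize the training set quickly.
perm = torch.randperm(len(labels))
train_idx, test_idx = perm[: len(perm) // 2], perm[len(perm) // 2 :]

model = torch.nn.Sequential(
    torch.nn.Linear(2 * P, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, P),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(20000):  # grokking typically needs many optimization steps
    opt.zero_grad()
    loss = F.cross_entropy(model(inputs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (model(inputs[train_idx]).argmax(-1) == labels[train_idx]).float().mean()
            test_acc = (model(inputs[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
        # Grokking shows up as train accuracy reaching 1.0 long before test accuracy does.
        print(f"step {step}: train={train_acc:.2f} test={test_acc:.2f}")
```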

Prior Theoretical Explanations

Earlier work linked the delayed onset of generalization to representation learning facilitated by weight decay, suggesting that regularization nudges the model toward smoother solutions. However, those studies stopped short of providing a precise dynamical account of how the network transitions from memorization to generalization.

Norm Minimization Framework

The new study frames post‑memorization learning as a constrained optimization problem. By treating the zero‑loss condition as a manifold, the authors prove that, in the limit of infinitesimally small learning rates and weight‑decay coefficients, gradient descent effectively minimizes the Euclidean norm of the weights while staying on that manifold. This formal result establishes a direct connection between the training dynamics and norm reduction.
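
In standard notation (a paraphrase; the paper's exact formulation and scaling assumptions may differ), the constrained problem and the regularized gradient step it is related to can be written as

$$\min_{\theta}\ \tfrac{1}{2}\lVert\theta\rVert_2^2 \quad \text{subject to} \quad \mathcal{L}_{\mathrm{train}}(\theta) = 0,$$

$$\theta_{t+1} = \theta_t - \eta\bigl(\nabla_{\theta}\mathcal{L}_{\mathrm{train}}(\theta_t) + \lambda\,\theta_t\bigr),$$

with the claim being that as the learning rate $\eta$ and the weight‑decay coefficient $\lambda$ tend to zero, the iterates remain on the set $\{\theta : \mathcal{L}_{\mathrm{train}}(\theta) = 0\}$ and move within it in the direction that decreases $\lVert\theta\rVert_2$.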

Analytical Results

To make the framework tractable, the authors introduce an approximation that isolates the dynamics of a subset of parameters from the rest of the network. Applying this approximation to a two‑layer network, they derive a closed‑form expression describing how the first‑layer weights evolve after the network has reached zero training loss.
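
The article does not reproduce the derived expression; the following only fixes notation for the generic two‑layer setting it refers to, with the activation $\sigma$ and the parameterization assumed rather than taken from the paper:

$$f(x) = W_2\,\sigma(W_1 x),$$

where the first‑layer weights $W_1$ are the parameters whose post‑memorization evolution the closed‑form expression describes.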

Experimental Confirmation

Simulations based on the derived gradients reproduce the hallmark features of grokking, including the prolonged gap between memorization and test‑set performance and the emergence of more structured internal representations. The experiments validate the theoretical predictions across multiple network configurations.
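
The article does not describe the simulation code, but a norm‑tracking helper along the following lines (a sketch in PyTorch, complementing the training loop above) is enough to observe the predicted post‑memorization norm decrease alongside the accuracy curves:

```python
import torch

def weight_norm(model: torch.nn.Module) -> float:
    """Euclidean norm of all parameters, treated as a single flat vector."""
    with torch.no_grad():
        squared = sum(p.pow(2).sum() for p in model.parameters())
        return float(torch.sqrt(squared))

# Logged every few hundred steps inside the training loop, this value should
# keep falling after training accuracy has already reached 100%, which is the
# norm-minimization-on-the-zero-loss-manifold behavior the paper describes.
```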

Implications and Future Work

By casting grokking as norm minimization on a zero‑loss manifold, the paper offers a unifying perspective that could inform the design of training schedules and regularization strategies. The authors suggest that extending the analysis to deeper architectures and stochastic optimization settings may further clarify the role of norm dynamics in modern deep learning.

This report is based on the abstract of the research paper, published on arXiv as an open‑access academic preprint. The full text is available via arXiv.
