Gradient Heterogeneity Explains Optimizer Differences in Transformer Training
Researchers from an international team have presented a new analysis of why adaptive optimizers such as Adam often outperform stochastic gradient descent (SGD) when training Transformer models. The study, posted on arXiv in February 2025, introduces the concept of gradient heterogeneity—the variation in gradient norms across parameter blocks—and examines its impact on convergence for both SGD and sign‑based methods. According to the authors, understanding this phenomenon sheds light on Adam’s empirical success and offers guidance for learning‑rate scaling.
Understanding Gradient Heterogeneity
The authors define gradient heterogeneity as the disparity in the magnitude of gradients among different groups of parameters within a model. They argue that such disparity, together with Hessian heterogeneity, can slow the convergence of traditional gradient‑based algorithms because these methods treat all dimensions uniformly, regardless of their individual scales.
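To make the idea concrete, per-block gradient norms and a simple max/min disparity ratio can be sketched in a few lines of Python. This is an illustrative metric and toy data, not the paper's exact definition:

```python
import math

def block_grad_norms(blocks):
    """Euclidean norm of the gradient in each parameter block."""
    return [math.sqrt(sum(g * g for g in block)) for block in blocks]

def heterogeneity_ratio(blocks):
    """Max/min ratio of per-block gradient norms: values near 1 mean
    homogeneous gradients; large values mean strong heterogeneity."""
    norms = block_grad_norms(blocks)
    return max(norms) / max(min(norms), 1e-12)

# A block with large gradients vs. a block with tiny ones.
blocks = [[10.0, -10.0, 10.0], [0.01, 0.01, -0.01]]
print(round(heterogeneity_ratio(blocks)))  # 1000
```

A uniform step size that suits the first block is three orders of magnitude too large for the second, which is the difficulty the paper attributes to Euclidean steepest descent.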
Theoretical Implications for Optimizers
Through a formal analysis, the paper demonstrates that SGD’s steepest‑descent direction under the Euclidean norm becomes less effective when gradient heterogeneity is high. In contrast, sign‑based approaches like SignSGD, which follow the steepest‑descent direction under the ℓ∞ norm, are considerably less sensitive to these variations. The authors derive upper bounds on iteration complexity that highlight distinct learning‑rate scaling requirements for each optimizer.
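The contrast between the two steepest-descent directions can be seen in a toy one-step update (a minimal sketch; the learning rate and gradient values are illustrative, not from the paper):

```python
def sgd_step(params, grads, lr):
    """Steepest descent under the Euclidean norm:
    the step scales with each coordinate's gradient magnitude."""
    return [p - lr * g for p, g in zip(params, grads)]

def signsgd_step(params, grads, lr):
    """Steepest descent under the l-infinity norm:
    only the sign of each coordinate's gradient is used."""
    sign = lambda g: (g > 0) - (g < 0)
    return [p - lr * sign(g) for p, g in zip(params, grads)]

params = [1.0, 1.0]
grads = [100.0, 0.001]  # heterogeneous gradient magnitudes
print(sgd_step(params, grads, lr=0.01))      # [0.0, 0.99999]
print(signsgd_step(params, grads, lr=0.01))  # [0.99, 0.99]
```

With one shared learning rate, SGD moves the large-gradient coordinate a full unit while barely touching the other; SignSGD moves both by the same amount, insensitive to the scale disparity.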
Adam as a Sign‑Based Method
Building on the theoretical results, the study interprets Adam’s coordinate‑wise normalization as a mechanism that emphasizes gradient signs rather than magnitudes. This perspective positions Adam as a “soft” variant of SignSGD, explaining why it can retain the robustness of sign‑based updates while still benefiting from adaptive learning‑rate adjustments.
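This sign-like behavior is visible in a single bias-corrected Adam step on one coordinate: starting from zero moment estimates, the first update has magnitude approximately lr regardless of the gradient's size. The sketch below uses standard Adam default hyperparameters and is illustrative, not the paper's code:

```python
import math

def adam_step(p, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, t=1):
    """One bias-corrected Adam update on a single coordinate."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)  # first-moment bias correction
    v_hat = v / (1 - b2 ** t)  # second-moment bias correction
    return p - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# The first step is ~ -lr * sign(g) for gradients of very different sizes.
for g in (100.0, 0.001, -5.0):
    p_new, _, _ = adam_step(0.0, g, m=0.0, v=0.0)
    print(round(p_new, 6))
```

From a zero state, m_hat = g and v_hat = g^2, so the update reduces to -lr * g/|g|; as the moment estimates accumulate history over later steps, the update softens toward a magnitude-aware direction, hence the "soft SignSGD" reading.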
Impact of Layer Normalization
The investigation also traces the source of gradient heterogeneity to architectural choices, particularly the placement of layer‑normalization layers. Models that employ post‑layer‑normalization (Post‑LN) exhibit markedly higher heterogeneity compared with pre‑layer‑normalization (Pre‑LN) designs, suggesting that normalization strategy directly influences optimizer behavior.
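The two placements differ only in where normalization sits relative to the residual connection. A minimal functional sketch, with a toy layer norm and a stand-in sublayer (all names here are illustrative, not the paper's code):

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean, unit variance (no affine params)."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def post_ln_block(x, sublayer):
    """Post-LN (original Transformer): normalize after the residual sum."""
    return layer_norm([xi + si for xi, si in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    """Pre-LN: normalize the sublayer input; residual path is untouched."""
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]

double = lambda x: [2 * v for v in x]  # stand-in for attention/FFN
print(post_ln_block([1.0, 2.0, 3.0], double))  # output is re-normalized
print(pre_ln_block([1.0, 2.0, 3.0], double))   # residual keeps raw scale
```

In Post-LN every block's output passes through normalization, so gradients flowing back through the residual stream are repeatedly rescaled; in Pre-LN the residual path bypasses normalization, which is consistent with the lower heterogeneity the authors observe.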
Empirical Validation
Experimental results from fine‑tuning Transformers on natural‑language processing and computer‑vision benchmarks corroborate the theoretical predictions. Across multiple datasets, Adam consistently outperformed SGD, while SignSGD narrowed the performance gap, especially in Post‑LN configurations. The authors note that adjusting learning‑rate schedules in line with their complexity bounds further improves convergence.
Resources and Future Work
The full codebase supporting the analysis is publicly available on GitHub (https://github.com/tom4649/gradient-heterogeneity). The authors propose extending the framework to other model families and exploring additional normalization schemes to mitigate heterogeneity effects.
This report is based on the abstract of the research paper, an open-access preprint; the full text is available via arXiv.