Generalized Primal Averaging Optimizer Shows Speed Gains Over AdamW
A team of machine‑learning researchers announced a new optimizer, Generalized Primal Averaging (GPA), in a December 2025 arXiv preprint. The method extends Nesterov’s accelerated gradient technique, aiming to unify and improve upon recent averaging‑based optimizers such as DiLoCo and Schedule‑Free. GPA is designed for non‑distributed training environments and seeks to reduce memory usage while accelerating convergence.
Relation to Existing Optimizers
GPA builds on the theoretical foundation of Nesterov momentum and incorporates ideas from DiLoCo, which employs a two‑loop structure to aggregate pseudo‑gradients, and Schedule‑Free, which performs uniform averaging. By replacing uniform averaging with exponential moving averaging, GPA offers a smoother update rule that operates at every iteration.
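The contrast between the two averaging styles can be sketched in a few lines. This is an illustrative comparison, not the paper's algorithm: both running averages update in constant memory at every iteration, but the exponential moving average weights recent iterates more heavily, which is the smoother behavior the article attributes to GPA.

```python
import numpy as np

def uniform_average(avg, x, t):
    """Running uniform mean after seeing iterate x at step t (1-based),
    in the style of Schedule-Free's uniform iterate averaging."""
    return avg + (x - avg) / t

def ema_average(avg, x, beta=0.5):
    """Exponential moving average with decay beta, the style GPA adopts."""
    return beta * avg + (1.0 - beta) * x

# Feed both averagers the same stream of iterates 1, 2, ..., 5.
iterates = [np.array([float(t)]) for t in range(1, 6)]
u = iterates[0].copy()
e = iterates[0].copy()
for t, x in enumerate(iterates[1:], start=2):
    u = uniform_average(u, x, t)
    e = ema_average(e, x)

# The uniform mean settles at 3.0; the EMA sits nearer the recent
# iterates (here about 4.06), reacting faster to where training is now.
```

The practical point is that both rules cost O(1) memory per step, so swapping uniform averaging for an EMA changes the weighting of history without adding overhead.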
Algorithmic Design
The key innovation lies in decoupling Nesterov’s interpolation constants, allowing the optimizer to perform smooth iterate averaging without the memory‑intensive loops required by DiLoCo. This structural change simplifies implementation and lowers the memory footprint, making GPA attractive for large‑scale model training.
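The structure described above can be sketched as a single-loop update. This is a hedged reconstruction under assumptions, not the preprint's exact method: the constant names `gamma` (gradient-point interpolation) and `mu` (averaging rate) are hypothetical, and plain SGD stands in for the base optimizer. The point it illustrates is the decoupling: the point where the gradient is evaluated and the rate at which the averaged iterate tracks the fast iterate are controlled by separate constants, with no inner/outer loop.

```python
import numpy as np

def gpa_like_step(x, z, grad_fn, lr=0.1, gamma=0.9, mu=0.05):
    """One step of a GPA-style update (illustrative sketch).

    x: averaged iterate (the model you would actually deploy)
    z: fast iterate updated by the base optimizer
    gamma, mu: decoupled interpolation constants (names assumed here)
    """
    y = (1.0 - gamma) * x + gamma * z   # gradient evaluation point
    z = z - lr * grad_fn(y)             # base-optimizer (SGD) step on z
    x = (1.0 - mu) * x + mu * z         # EMA of iterates, every step
    return x, z

# Toy check: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
grad = lambda w: w
x = z = np.array([5.0])
for _ in range(200):
    x, z = gpa_like_step(x, z, grad)
# x converges toward the minimizer at 0.
```

Only two parameter-sized buffers (`x` and `z`) are kept, which is the memory saving the article contrasts with DiLoCo's two-loop pseudo-gradient aggregation.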
Empirical Performance on Language Models
Experimental results reported in the preprint indicate that GPA outperforms both single‑worker DiLoCo and the widely used AdamW optimizer. On Llama‑160M, Llama‑1B, and Llama‑8B models, GPA achieved speedups of 8.71%, 10.13%, and 9.58% respectively, measured as fewer training steps to reach a target validation loss.
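To make the metric concrete: under the reading that the reported percentage is the fraction of baseline training steps saved (one plausible interpretation; the preprint's exact definition may differ), the step counts relate as follows.

```python
def steps_needed(baseline_steps, speedup_pct):
    """Steps to reach the target loss, if speedup_pct is the
    percentage of baseline steps saved (an assumed definition)."""
    return baseline_steps * (1.0 - speedup_pct / 100.0)

# Hypothetical example: a 100,000-step AdamW run and the reported
# 9.58% Llama-8B speedup would correspond to roughly 90,420 GPA steps.
gpa_steps = steps_needed(100_000, 9.58)
```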
Results on Vision Transformers
When applied to ImageNet Vision Transformer workloads, GPA delivered a 7% speedup in the small‑batch regime and a 25.5% speedup in the large‑batch regime, again compared with AdamW under identical conditions.
Theoretical Guarantees
The authors prove that, for any base optimizer with an O(√T) regret bound, where T denotes the number of iterations, GPA either matches or improves on the base optimizer's convergence guarantees, depending on the chosen interpolation constants.
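For context, a regret bound of this form translates into a convergence rate for the averaged iterate via the standard online-to-batch argument for convex objectives (a general fact, not a result specific to this preprint):

```latex
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \bigl( f(x_t) - f(x^\star) \bigr) \;\le\; O(\sqrt{T})
\quad\Longrightarrow\quad
f(\bar{x}_T) - f(x^\star) \;\le\; \frac{\mathrm{Regret}_T}{T} \;=\; O\!\left(\frac{1}{\sqrt{T}}\right),
\qquad \bar{x}_T = \frac{1}{T}\sum_{t=1}^{T} x_t .
```

This is why averaging-based methods pair naturally with regret guarantees: the average iterate inherits an O(1/√T) suboptimality bound by Jensen's inequality.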
Potential Impact
By combining reduced memory overhead with consistent empirical gains, GPA could become a practical alternative for researchers and engineers training large language and vision models. Its compatibility with existing optimization frameworks may facilitate broader adoption across the AI community.
This report is based on the abstract of the research paper, an open-access preprint; the full text is available via arXiv.