Wider Transformers Show Superior Performance Compared to Deeper Designs at 7‑Billion‑Parameter Scale
On Jan. 28, 2026, researchers Md Muhtasim Munif Fahim and Md Rezaul Karim released a paper that challenges the conventional emphasis on depth in transformer architectures. Analyzing models ranging from 17 million to 7 billion parameters, they concluded that increasing width yields more efficient loss reduction than adding layers, especially at production‑scale sizes.
Key Findings
The authors derived architecture‑conditioned scaling laws indicating that optimal depth scales with total compute as D* ≈ C^0.12, while optimal width scales as W* ≈ C^0.34. Consequently, width should expand roughly 2.8 times faster than depth (0.34 / 0.12 ≈ 2.8). A critical depth threshold, D_crit ≈ W^0.44, marks the point where additional layers raise loss despite adding parameters—a phenomenon they label the “Depth Delusion.”
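The scaling relations above can be sketched in a few lines of Python. The exponents (0.12, 0.34, 0.44) come from the paper, but the proportionality constants are not given in the abstract, so this sketch normalizes against a hypothetical reference budget and only shows relative growth; the reference values are illustrative assumptions, not figures from the study.

```python
# Sketch of the architecture-conditioned scaling relations reported in the
# paper: D* ∝ C^0.12 and W* ∝ C^0.34. Proportionality constants are unknown,
# so we scale relative to an arbitrary reference point.

def optimal_depth_width(compute, ref_compute=1.0, ref_depth=1.0, ref_width=1.0):
    """Return (depth, width) scaled from a reference budget using the
    exponents D* ∝ C^0.12 and W* ∝ C^0.34."""
    ratio = compute / ref_compute
    depth = ref_depth * ratio ** 0.12
    width = ref_width * ratio ** 0.34
    return depth, width

def critical_depth(width):
    """Depth beyond which extra layers raise loss: D_crit ≈ W^0.44."""
    return width ** 0.44

# Doubling compute grows optimal depth by 2^0.12 ≈ 1.09x
# but optimal width by 2^0.34 ≈ 1.27x.
d, w = optimal_depth_width(2.0)
print(f"2x compute -> depth x{d:.2f}, width x{w:.2f}")

# Critical-depth threshold at a hypothetical hidden width of 4096:
print(f"D_crit at width 4096: {critical_depth(4096):.1f} layers")
```

Because the width exponent is nearly three times the depth exponent, compute-optimal configurations drift wider much faster than they drift deeper as budgets grow.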
Experimental Validation
Empirical tests covered 30 distinct transformer configurations across the parameter spectrum. Each model was trained on high‑compute sample sets representative of contemporary workloads. The resulting fit achieved an R² of 0.922, supporting the proposed scaling relationships.
Case Study at 7 B Parameters
At the 7‑billion‑parameter level, a 64‑layer model with 6.38 billion parameters performed 0.12 nats worse than a 32‑layer model containing 6.86 billion parameters. This outcome illustrates that, past the critical depth, adding layers can degrade performance even between models of comparable parameter counts.
Implications for Model Design
The study suggests that future large‑scale language model development may benefit from allocating compute resources toward wider architectures rather than deeper ones. Practitioners aiming for optimal loss reduction should consider the derived depth‑width trade‑off curves when planning model scaling strategies.
Limitations and Future Work
The analysis focuses on transformer models trained on representative high‑compute datasets and does not address domain‑specific fine‑tuning or alternative architectural families. The authors recommend extending the scaling framework to other model families and exploring the impact of training dynamics on the identified critical depth.
This report is based on the abstract of the research paper, distributed via arXiv as an open‑access academic preprint; the full text is available on arXiv.