Least-Loaded Expert Parallelism Cuts Memory Use and Boosts Speed for Mixture-of-Experts Models
Researchers have introduced a new algorithm called Least-Loaded Expert Parallelism (LLEP) to address token routing imbalances in large Mixture-of-Experts (MoE) models during post‑training and inference. The method dynamically redistributes excess tokens and associated parameters from overloaded devices to underutilized ones, aiming to keep all devices within their memory limits while minimizing overall latency.
Background on Mixture‑of‑Experts
MoE architectures rely on a collection of expert sub‑networks that are selectively activated for each input token. Conventional training pipelines enforce explicit load‑balancing constraints to keep the routing of tokens statistically even across experts.
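The selective activation described above is typically implemented as top‑k gating: a router scores every expert per token, and only the k highest‑scoring experts process that token. The sketch below is an illustrative minimal version (the function name and shapes are assumptions, not from the paper):

```python
import numpy as np

def top_k_route(logits, k=2):
    """Minimal top-k gating sketch for MoE routing.

    logits: (num_tokens, num_experts) router scores.
    Returns the chosen expert indices (num_tokens, k) and softmax
    weights normalized over just the k selected experts.
    """
    idx = np.argsort(-logits, axis=-1)[:, :k]            # top-k expert ids per token
    picked = np.take_along_axis(logits, idx, axis=-1)    # scores of the chosen experts
    w = np.exp(picked - picked.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                   # per-token gate weights sum to 1
    return idx, w

# Example: 4 tokens routed over 8 experts, 2 experts per token
rng = np.random.default_rng(0)
expert_ids, gate_weights = top_k_route(rng.normal(size=(4, 8)), k=2)
```

Training pipelines usually add an auxiliary load‑balancing loss on top of this gating so that, in expectation, each expert receives a similar share of tokens.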
Persistent Imbalance in Practice
Despite these constraints, the authors observed that well‑trained MoE models often exhibit significant routing skew, concentrating domain‑specific knowledge in a limited subset of experts. This natural imbalance can become problematic when models are deployed across multiple devices using standard Expert Parallelism (EP), which assumes balanced routing.
Limitations of Standard Expert Parallelism
Standard EP distributes experts uniformly across hardware but does not adapt to uneven token loads. Under extreme imbalance, a small number of devices can receive a disproportionate share of tokens, causing out‑of‑memory failures and compute stragglers during inference or post‑training, phases in which training‑time load‑balancing mechanisms are typically no longer in effect.
Introducing Least‑Loaded Expert Parallelism
LLEP addresses this gap by monitoring device workloads in real time and rerouting surplus tokens, along with the necessary expert parameters, to devices with available capacity. The algorithm respects each device’s memory constraints while striving to complete the collective workload in the shortest possible time.
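The core idea can be sketched as a greedy rebalancing pass: surplus tokens on over‑capacity devices are shifted, one at a time, to whichever device currently has the lightest load and spare capacity. This is an illustrative simplification under assumed inputs (per‑device token counts and memory‑bound token budgets), not the paper's exact algorithm, which also accounts for parameter transfer and latency:

```python
import heapq

def rebalance(load, capacity):
    """Greedy least-loaded rebalancing sketch (hypothetical simplification).

    load[i]: tokens currently routed to device i.
    capacity[i]: device i's memory-bound token budget.
    Moves one surplus token at a time from each over-capacity device
    to the least-loaded device with spare capacity, and returns the
    adjusted per-device loads.
    """
    load = list(load)
    # Min-heap of (current load, device id) over devices with spare room.
    heap = [(load[j], j) for j in range(len(load)) if load[j] < capacity[j]]
    heapq.heapify(heap)
    for i in range(len(load)):
        while load[i] > capacity[i] and heap:
            l, j = heapq.heappop(heap)
            if l != load[j]:
                continue  # stale heap entry; skip
            load[i] -= 1
            load[j] += 1
            if load[j] < capacity[j]:
                heapq.heappush(heap, (load[j], j))
    return load

# Example: device 0 holds 10 tokens against a budget of 6;
# its 4 surplus tokens spread evenly over the two lighter devices.
print(rebalance([10, 2, 2], [6, 6, 6]))  # -> [6, 4, 4]
```

Moving single tokens via a min‑heap keeps every transfer directed at the currently least‑loaded receiver; a production implementation would batch transfers and weigh the cost of shipping expert parameters alongside the tokens.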
Reported Performance Improvements
Experimental results across various model scales indicate that LLEP can achieve up to 5× speedup and a 4× reduction in peak memory usage compared with conventional EP. In one benchmark on the gpt‑oss‑120b model, LLEP processed tokens roughly 1.9× faster than the standard EP baseline.
Theoretical and Empirical Validation
The authors provide extensive theoretical analysis to justify the algorithm’s optimality under the stated constraints, complemented by comprehensive empirical evaluations and ablation studies that isolate the contributions of dynamic token rerouting.
Implications for Hardware‑Specific Tuning
These findings suggest that LLEP offers a principled framework for hardware‑aware hyper‑parameter tuning, enabling practitioners to maximize throughput and minimize memory footprints on heterogeneous device clusters.
This report is based on the abstract of an open-access preprint posted to arXiv; the full text is available via arXiv.