NeoChainDaily
14.01.2026 • 05:16 Research & Innovation

New Sliced Wasserstein Loss Boosts Ultra-Low-Bit Quantization for Large Language Models

In January 2026, researchers announced a novel loss function designed to improve the performance of ultra‑low‑bit post‑training quantization for large language models. The approach, described in an arXiv preprint, targets the hidden economic and environmental costs associated with deploying high‑precision models by aligning output distributions between full‑precision and quantized versions.

Background on Model Quantization

Quantization reduces the precision of model parameters, often lowering memory usage and energy consumption. While 4‑bit quantization is commonly adopted, pushing below this threshold can distort activation distributions, leading to noticeable drops in perplexity and downstream task accuracy.
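As a concrete illustration of what "reducing precision" means, the sketch below applies a generic asymmetric uniform quantization scheme to a weight tensor. The bit-width, rounding scheme, and tensor shapes are illustrative assumptions, not the specific method used by the paper or by OmniQuant/TesseraQ.

```python
import torch

def uniform_quantize(w: torch.Tensor, n_bits: int = 3) -> torch.Tensor:
    """Illustrative asymmetric uniform quantization of a weight tensor.

    A textbook scheme, shown only to make the precision/accuracy trade-off
    concrete: fewer bits means fewer representable levels and larger error.
    """
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = qmin - torch.round(w.min() / scale)
    # Map to integer grid, clamp to the representable range, then dequantize.
    q = torch.clamp(torch.round(w / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale  # "fake-quantized" weights

w = torch.randn(4, 8)
w_q = uniform_quantize(w, n_bits=2)   # 2-bit: only 4 distinct levels remain
print((w - w_q).abs().mean())         # error grows as the bit-width shrinks
```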

Proposed Sliced Wasserstein Loss

The authors introduce a sliced Wasserstein loss that operates on random linear projections of model outputs. By matching the projected distributions of the original and quantized models, the loss complements traditional mean‑squared error objectives without adding inference‑time overhead.
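The preprint's abstract describes the loss only at a high level, but the standard sliced Wasserstein construction it refers to can be sketched as follows: project both sets of outputs onto random directions, then compare the sorted one-dimensional projections. The function name, number of projections, and exponent below are assumptions; the paper's exact projection and weighting scheme may differ.

```python
import torch

def sliced_wasserstein_loss(x_fp: torch.Tensor, x_q: torch.Tensor,
                            n_projections: int = 128, p: int = 2) -> torch.Tensor:
    """Generic sliced Wasserstein distance between two activation batches.

    x_fp, x_q: (n_samples, d) outputs of the full-precision and quantized
    models on the same calibration inputs.
    """
    d = x_fp.shape[-1]
    # Random directions on the unit sphere.
    theta = torch.randn(d, n_projections, device=x_fp.device)
    theta = theta / theta.norm(dim=0, keepdim=True)

    # Project both sample sets onto each direction -> (n_samples, n_projections).
    proj_fp = x_fp @ theta
    proj_q = x_q @ theta

    # For equal-sized 1D samples, the Wasserstein distance reduces to
    # comparing the sorted projected values.
    sorted_fp, _ = torch.sort(proj_fp, dim=0)
    sorted_q, _ = torch.sort(proj_q, dim=0)
    return (sorted_fp - sorted_q).abs().pow(p).mean()
```

Because the projections and sorting are used only to compute the training signal during calibration, nothing extra is executed at inference time, which matches the stated absence of inference-time overhead.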

Integration with Existing Frameworks

The loss function is compatible with any post‑training quantization pipeline that includes a retraining step. To demonstrate versatility, the researchers incorporated it into two state‑of‑the‑art methods, OmniQuant and TesseraQ, both of which already target ultra‑low‑bit regimes.
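Assuming the retraining step minimizes a per-block reconstruction objective, as block-wise post-training quantization methods such as OmniQuant commonly do, adding the new term could look roughly like the sketch below, which reuses the sliced_wasserstein_loss helper from the previous sketch. The weighting factor and block-level granularity are hypothetical, not values taken from the paper.

```python
import torch

def calibration_loss(fp_out: torch.Tensor, q_out: torch.Tensor,
                     sw_weight: float = 0.1) -> torch.Tensor:
    """Hypothetical combined objective for a block-wise PTQ retraining step.

    fp_out / q_out: outputs of one transformer block in full precision and
    with quantized weights, computed on the same calibration batch.
    `sw_weight` is an illustrative assumption.
    """
    # Conventional reconstruction term.
    mse = torch.nn.functional.mse_loss(q_out, fp_out)
    # Distribution-matching term on flattened (tokens, hidden) activations,
    # using the sliced_wasserstein_loss helper sketched above.
    sw = sliced_wasserstein_loss(fp_out.flatten(0, -2), q_out.flatten(0, -2))
    return mse + sw_weight * sw
```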

Performance Gains Reported

Experimental results show consistent improvements across multiple models and bit‑width settings. For LLaMA‑2‑7B, the loss recovered 4.12 %–20.37 % of OmniQuant’s lost accuracy; for OPT‑6.7B, recovery ranged from 0.93 % to 7.65 %; and for LLaMA‑2‑13B, gains were between 2.26 % and 6.20 %. When paired with TesseraQ, relative accuracy degradation decreased by 3.63 %–7.63 %.

Implications and Availability

These findings suggest that distribution‑aware calibration can meaningfully narrow the performance gap introduced by aggressive quantization, potentially lowering the resource footprint of large language models in production environments. The authors have released their implementation on GitHub to encourage further research and adoption.

Future Directions

The study opens avenues for exploring additional distribution‑matching techniques and extending the approach to other model families and tasks. Continued evaluation on real‑world workloads will help assess the broader impact on energy efficiency and deployment costs.

This report is based on the abstract of the research paper, an open-access academic preprint; the full text is available via arXiv.
