New 1-bit Post-Training Quantization Method Reduces Memory and Compute Requirements for Large Language Models
Background
Researchers have introduced a post‑training quantization technique that compresses large language model (LLM) weights to a single bit, addressing the high memory and computational costs that limit practical deployment. The approach, described in a paper posted to arXiv in October 2024, targets the efficiency gap between full‑precision and binarized models.
Method Overview
The proposed method, named ARB‑LLM, employs an alternating refined binarization (ARB) algorithm to iteratively adjust binarization parameters. According to the authors, this iterative refinement narrows the distribution shift between binarized and full‑precision weights, thereby reducing quantization error.
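The alternating refinement described above can be sketched as a simple coordinate-descent loop. The snippet below is a minimal illustrative sketch, not the paper's implementation: the function name, the per-row shift/scale parameterization, and the least-squares update rules are assumptions chosen to show the general idea of alternately re-fitting binarization parameters.

```python
import numpy as np

def arb_binarize(W, iters=5):
    # Illustrative sketch (not the paper's code): model W ~ alpha * B + mu,
    # with B in {-1, +1}, and alternately re-estimate each quantity while
    # holding the others fixed. Each step is a closed-form least-squares
    # update, so the reconstruction error is non-increasing per iteration.
    mu = W.mean(axis=1, keepdims=True)          # per-row shift
    for _ in range(iters):
        B = np.sign(W - mu)                     # binary weights for fixed mu
        B[B == 0] = 1
        # optimal per-row scale for fixed B and mu
        alpha = ((W - mu) * B).mean(axis=1, keepdims=True)
        # re-estimate shift for fixed alpha and B
        mu = (W - alpha * B).mean(axis=1, keepdims=True)
    return alpha, B, mu

W = np.random.randn(4, 16)
alpha, B, mu = arb_binarize(W)
err = np.linalg.norm(W - (alpha * B + mu))
```

Because each update minimizes the reconstruction error over one parameter with the others held fixed, running more refinement iterations can only tighten (never worsen) the fit relative to a single-pass binarization.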
Algorithmic Enhancements
To further improve accuracy, the study extends ARB with two variants—ARB‑X and ARB‑RC—that incorporate calibration data and address column‑wise deviation in LLM weight distributions. A column‑group bitmap (CGB) strategy is also introduced to refine weight partitioning across columns.
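The idea of partitioning columns into groups can be illustrated with a toy example. The sketch below is a hypothetical simplification of a column-group strategy, not the paper's CGB algorithm: the fixed equal-width grouping, the function name, and the per-group mean-absolute scale are all assumptions used to show why finer column partitions can reduce column-wise deviation.

```python
import numpy as np

def binarize_with_column_groups(W, n_groups=4):
    # Hypothetical sketch: split columns into contiguous groups and fit a
    # separate scale per (row, group). A single per-row scale is the special
    # case n_groups=1, so grouped scales can only lower reconstruction error.
    rows, cols = W.shape
    bounds = np.linspace(0, cols, n_groups + 1, dtype=int)
    B = np.sign(W)
    B[B == 0] = 1
    W_hat = np.empty_like(W)
    for g in range(n_groups):
        s, e = bounds[g], bounds[g + 1]
        # least-squares optimal scale for B = sign(W) is the mean |W|
        alpha_g = np.abs(W[:, s:e]).mean(axis=1, keepdims=True)
        W_hat[:, s:e] = alpha_g * B[:, s:e]
    return W_hat

W = np.random.randn(8, 32)
W_hat = binarize_with_column_groups(W)
```

Since each group's scale is fitted independently, the grouped reconstruction is never worse than using one scale per row, at the cost of storing a few extra scale values per row.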
Performance Evaluation
Experimental results reported in the paper indicate that ARB‑LLM_X and ARB‑LLM_RC outperform existing state‑of‑the‑art binarization methods for LLMs. Notably, ARB‑LLM_RC is described as the first binary post‑training quantization technique to exceed the performance of FP16 models of comparable size.
Availability and Future Work
The authors state that source code and trained models will be released on GitHub at https://github.com/ZHITENGLI/ARB-LLM, enabling further validation and extension by the research community.
Implications
If the reported gains hold across broader benchmarks, the technique could facilitate the deployment of LLMs on resource‑constrained hardware, expanding access to advanced language capabilities without sacrificing accuracy.

Source: this report is based on the abstract of the research paper, an open‑access preprint; the full text is available via arXiv.