Sherry Framework Enhances Ternary Quantization for Edge Deployment of Large Language Models
Researchers have introduced Sherry, a hardware‑efficient ternary quantization framework designed to lower memory and computational demands of large language models (LLMs) on resource‑constrained edge devices. The approach targets the longstanding trade‑off between 2‑bit aligned packing, which wastes bits, and 1.67‑bit irregular packing, which slows inference.
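The trade-off can be made concrete with a little arithmetic. A ternary weight carries log2(3) ≈ 1.585 bits of information, so 2-bit packing wastes roughly 0.415 bits per weight, while the common 1.67-bit scheme packs three ternary weights into five bits (3³ = 27 ≤ 2⁵ = 32) at the cost of byte alignment. A quick sketch of both numbers:

```python
import math

# Information content of one ternary weight ({-1, 0, +1}).
info_bits = math.log2(3)                # ~1.585 bits

# 2-bit aligned packing: one weight per 2-bit field -> wasted capacity.
waste_2bit = 2 - info_bits              # ~0.415 bits wasted per weight

# 1.67-bit irregular packing: five bits encode three ternary weights,
# since 3**3 = 27 fits in 2**5 = 32 codes, but 5-bit fields do not
# align with byte or word boundaries.
bits_per_weight_167 = 5 / 3             # ~1.667 bits per weight

print(f"2-bit waste: {waste_2bit:.3f} bits/weight")
print(f"irregular packing: {bits_per_weight_167:.3f} bits/weight")
```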
Background and Challenge
Deploying LLMs on edge hardware has become increasingly difficult as model sizes grow, and existing ternary quantization methods do not pack weights in a way that aligns with commodity processors. Misaligned packing either inflates storage requirements or hampers execution speed, limiting practical adoption.
Sherry’s Design and Sparsity Scheme
Sherry introduces a 3:4 fine‑grained sparsity pattern that packs blocks of four weights into five bits, achieving a regularized effective width of 1.25 bits per weight and restoring power‑of‑two alignment. This packing reduces bit waste without sacrificing the regular structure required by typical CPUs and accelerators.
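The arithmetic behind the five-bit blocks is tidy: with 3:4 sparsity, a block of four weights has one zero and three ±1 values, giving 4 × 2³ = 32 = 2⁵ configurations, so five bits suffice exactly. Below is a minimal sketch of one such encoding (two bits for the zero position, three sign bits); it assumes exactly one zero per block, and the paper's actual bit layout may differ:

```python
def pack_block(block):
    """Pack four ternary weights (exactly one zero, three +/-1) into 5 bits."""
    zero_pos = block.index(0)                    # 2 bits: position of the zero
    signs = [w for w in block if w != 0]         # three +/-1 values
    code = zero_pos
    for s in signs:                              # append 3 sign bits
        code = (code << 1) | (1 if s > 0 else 0)
    return code                                  # integer in [0, 31]

def unpack_block(code):
    """Inverse of pack_block: recover the four-weight block from a 5-bit code."""
    signs = []
    for _ in range(3):                           # pop sign bits, LSB first
        signs.append(1 if (code & 1) else -1)
        code >>= 1
    signs.reverse()                              # restore original sign order
    block = signs[:]
    block.insert(code, 0)                        # remaining 2 bits: zero position
    return block
```

Blocks containing more than one zero are not representable in this toy scheme (the 32 codes are fully spent), so training would have to enforce the pattern strictly.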
Arenas Residual Mechanism
The authors also identify a weight‑trapping issue during sparse ternary training that can cause representational collapse. To mitigate this, Sherry incorporates Arenas, an annealing residual synapse mechanism that preserves diversity in weight representations throughout training.
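The abstract does not spell out the exact form of the Arenas mechanism, but one plausible reading of an "annealing residual synapse" is a residual connection from the full-precision weight back into its ternarized value, with the residual's strength annealed toward zero over training. A hypothetical sketch under that assumption (the `threshold`, cosine schedule, and function names are illustrative, not from the paper):

```python
import math

def ternarize(w, threshold=0.5):
    """Map a full-precision weight to {-1, 0, +1} (illustrative threshold)."""
    if w > threshold:
        return 1.0
    if w < -threshold:
        return -1.0
    return 0.0

def arenas_forward(w, step, total_steps):
    """Hypothetical annealing residual: blend the full-precision residual
    back into the ternary weight, decaying from full strength to zero."""
    alpha = 0.5 * (1 + math.cos(math.pi * step / total_steps))  # 1 -> 0
    q = ternarize(w)
    return q + alpha * (w - q)
```

Early in training (`alpha ≈ 1`) the effective weight stays close to its full-precision value, preserving representational diversity; by the end (`alpha ≈ 0`) the weight is purely ternary.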
Performance Evaluation
Empirical tests on LLaMA‑3.2 across five benchmarks show that Sherry matches state‑of‑the‑art ternary performance while delivering notable efficiency gains. On an Intel i7‑14700HX CPU, a 1B‑parameter model matches leading baselines in accuracy while using 25% fewer bits and running roughly 10% faster.
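The reported 25% bit reduction is consistent with moving from 1.67 to 1.25 bits per weight, and the resulting weight footprints are easy to estimate (weights only; activations, KV cache, and quantization scales excluded):

```python
# Back-of-envelope weight memory for a 1B-parameter model at
# different bit widths (weights only, no scales or activations).
params = 1_000_000_000
footprints = {}
for name, bits in [("fp16", 16), ("2-bit", 2),
                   ("1.67-bit", 5 / 3), ("1.25-bit", 1.25)]:
    footprints[name] = params * bits / 8 / 1e6   # megabytes
    print(f"{name:>9}: {footprints[name]:7.1f} MB")

# 1.25 vs 1.67 bits per weight is exactly a 25% reduction.
reduction = 1 - 1.25 / (5 / 3)
```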
Implications for Edge AI
By reconciling bit‑level efficiency with hardware alignment, Sherry enables more practical deployment of powerful LLMs on devices such as smartphones, IoT gateways, and embedded systems, potentially expanding the reach of advanced AI capabilities.
Availability
The implementation and training scripts for Sherry are publicly released on GitHub at https://github.com/Tencent/AngelSlim, allowing the research community to reproduce and extend the results.
This report is based on the abstract of the research paper, an open‑access preprint; the full text is available via arXiv.