- Tutorial slides: Introduction to LLM efficiency, https://www.ningxuefei.cc/talks/llm-efficiency-intro_tutorialonly.pdf
- https://arxiv.org/pdf/2401.15347.pdf
- QuIP: 2-Bit Quantization of Large Language Models With Guarantees
- Extreme Compression of Large Language Models via Additive Quantization, https://arxiv.org/pdf/2401.06118.pdf
- Enabling Fast 2-bit LLM on GPUs: Memory Alignment and Asynchronous Dequantization (see the group-wise quantization sketch after this list)
- PB-LLM: Partially Binarized Large Language Models, https://arxiv.org/pdf/2310.00034.pdf
- BitNet: Scaling 1-bit Transformers for Large Language Models
- blog: 1-bit Quantization: Run Models with Trillions of Parameters on Your Computer
- LLM-MQ: Mixed-precision Quantization for Efficient LLM Deployment
- Compressing LLMs: The Truth Is Rarely Pure and Never Simple, https://arxiv.org/pdf/2310.01382.pdf
- Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study
- FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
- ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
- ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
- SqueezeLLM: Dense-and-Sparse Quantization
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
- Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (UC Berkeley; see the KV-cache quantization sketch after this list)
- Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
- Z-FOLD: A Frustratingly Easy Post-Training Quantization Scheme for LLMs (Samsung Research)
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
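Most of the low-bit weight-quantization papers above (QuIP, the additive-quantization work, PB-LLM, SqueezeLLM, SpQR) build on one shared primitive: round-to-nearest quantization with per-group scales. A minimal NumPy sketch of that primitive, assuming a 2-bit asymmetric min/max scheme and a group size of 64 (illustrative choices, not any single paper's method):

```python
# Group-wise low-bit weight quantization: illustrative sketch only.
# Bit width, group size, and the asymmetric min/max grid are assumptions.
import numpy as np

def quantize_groupwise(w: np.ndarray, bits: int = 2, group_size: int = 64):
    """Asymmetric round-to-nearest quantization of a 2-D weight matrix.

    Each row is split into groups of `group_size` values; every group keeps
    its own scale and zero-point, so an outlier in one group does not
    stretch the quantization grid of its neighbors.
    """
    levels = 2 ** bits - 1                     # e.g. 2 bits -> codes 0..3
    rows, cols = w.shape
    assert cols % group_size == 0
    g = w.reshape(rows, cols // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)   # guard all-constant groups
    q = np.clip(np.round((g - w_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, w_min

def dequantize_groupwise(q, scale, zero, shape):
    """Reconstruct approximate float weights from codes, scales, zero-points."""
    return (q.astype(np.float32) * scale + zero).reshape(shape)

w = np.random.randn(8, 128).astype(np.float32)
q, s, z = quantize_groupwise(w, bits=2, group_size=64)
w_hat = dequantize_groupwise(q, s, z, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
```

The papers differ mainly in what replaces the uniform grid (lattice codebooks in QuIP, learned additive codebooks in the AQLM line, dense-plus-sparse outlier splits in SqueezeLLM and SpQR); the group-wise scale/zero bookkeeping above is the part they have in common.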
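KVQuant's headline observation is that keys quantize best per-channel while values quantize best per-token. A minimal NumPy sketch of that axis choice, assuming symmetric 4-bit round-to-nearest and no outlier handling (the paper's non-uniform codebooks and pre-RoPE key handling are omitted here):

```python
# KV-cache quantization sketch: per-channel keys, per-token values.
# Shapes, bit width, and the lack of outlier handling are assumptions.
import numpy as np

def quantize_axiswise(x: np.ndarray, bits: int = 4, axis: int = -1):
    """Symmetric round-to-nearest quantization with one scale per slice
    taken along `axis` (axis=0 -> per-channel, axis=1 -> per-token here)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero slices
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

seq_len, head_dim = 1024, 128
k = np.random.randn(seq_len, head_dim).astype(np.float32)
v = np.random.randn(seq_len, head_dim).astype(np.float32)

k_q, k_s = quantize_axiswise(k, axis=0)   # one scale per key channel
v_q, v_s = quantize_axiswise(v, axis=1)   # one scale per value token

k_hat, v_hat = k_q * k_s, v_q * v_s       # dequantize for attention
print("K err:", np.abs(k - k_hat).mean(), "V err:", np.abs(v - v_hat).mean())
```

The axis choice matters because key activations show strong per-channel outlier structure, so sharing a scale across the sequence dimension (rather than across channels) keeps those outliers from inflating every token's quantization step.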