
Summary/Tutorials

Extreme Low-Bit Quantization

  • QuIP: 2-Bit Quantization of Large Language Models With Guarantees
  • Extreme Compression of Large Language Models via Additive Quantization, https://arxiv.org/pdf/2401.06118.pdf
  • Enabling Fast 2-bit LLM on GPUs: Memory Alignment and Asynchronous Dequantization
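
The packing and unpacking cost these 2-bit papers optimize is easy to see in code. Below is a minimal sketch, not taken from any of the papers above: symmetric 2-bit quantization with a per-group scale and four codes packed per byte, in NumPy.

```python
import numpy as np

def quantize_2bit(w, group_size=64):
    """Map floats to 2-bit codes {0..3} with one scale per group."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 1.5   # codes decode to {-1.5,-0.5,0.5,1.5}*scale
    codes = np.clip(np.round(w / scale + 1.5), 0, 3).astype(np.uint8)
    # Pack four 2-bit codes into each byte (the layout fast kernels align in memory).
    return (codes[:, 0::4] | codes[:, 1::4] << 2
            | codes[:, 2::4] << 4 | codes[:, 3::4] << 6), scale

def dequantize_2bit(packed, scale, group_size=64):
    codes = np.empty((packed.shape[0], group_size), dtype=np.uint8)
    for i in range(4):                                   # unpack 2 bits at a time
        codes[:, i::4] = (packed >> (2 * i)) & 0b11
    return ((codes.astype(np.float32) - 1.5) * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
packed, scale = quantize_2bit(w)
print("bytes per weight:", packed.nbytes / w.size)       # 0.25 (plus scales)
print("max abs error:", np.abs(w - dequantize_2bit(packed, scale)).max())
```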

Binarized LLM
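
The idea behind this heading, in the XNOR-Net style, is to approximate a weight tensor as a single scale times its sign pattern, W ≈ α·sign(W) with α = mean|W| minimizing the L2 reconstruction error. A minimal sketch, not tied to any particular paper:

```python
import numpy as np

def binarize(w):
    alpha = np.abs(w).mean()      # closed-form optimal per-tensor scale for sign(W)
    return alpha, np.sign(w)      # 1 bit per weight plus one float scale

w = np.random.randn(4096, 4096).astype(np.float32)
alpha, b = binarize(w)
print("reconstruction MSE:", np.mean((w - alpha * b) ** 2))
```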

Mixed-Precision Quantization

  • LLM-MQ: Mixed-precision Quantization for Efficient LLM Deployment
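
The common thread in mixed-precision methods is allocating more bits to loss-sensitive layers under a total memory budget. A hedged sketch of that allocation loop, with made-up layer names and a simple sensitivity-per-byte greedy rule rather than LLM-MQ's actual algorithm:

```python
def allocate_bits(sensitivity, n_params, budget_bytes, choices=(2, 4, 8)):
    """Greedy bit allocation: promote the layer with the best gain per extra byte."""
    bits = {name: choices[0] for name in sensitivity}
    used = sum(n_params[n] * bits[n] // 8 for n in bits)
    while True:
        best, best_gain, best_extra = None, 0.0, 0
        for n in bits:
            i = choices.index(bits[n])
            if i + 1 == len(choices):
                continue                                  # already at max precision
            extra = n_params[n] * (choices[i + 1] - bits[n]) // 8
            gain = sensitivity[n] / extra                 # crude loss-per-byte proxy
            if used + extra <= budget_bytes and gain > best_gain:
                best, best_gain, best_extra = n, gain, extra
        if best is None:
            return bits
        bits[best] = choices[choices.index(bits[best]) + 1]
        used += best_extra

# Hypothetical layers: parameter counts and calibration-measured sensitivities.
params = {"attn.q": 1 << 22, "attn.k": 1 << 22, "mlp.up": 1 << 24}
sens = {"attn.q": 3.0, "attn.k": 1.0, "mlp.up": 2.0}
print(allocate_bits(sens, params, budget_bytes=12 * (1 << 20)))
```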

Compressed Model Evaluation

  • Compressing LLMs: The Truth Is Rarely Pure and Never Simple, https://arxiv.org/pdf/2310.01382.pdf
  • Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study
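
Both papers caution that aggregate perplexity hides damage from compression. A small evaluation sketch along those lines, reporting perplexity together with top-1 agreement between the full and compressed model; `model_fp16`, `model_quant`, and `input_ids` are assumed to exist (e.g., Hugging Face causal LMs and a tokenized batch):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_step(model, input_ids):
    logits = model(input_ids).logits[:, :-1]           # predict token t+1 from token t
    targets = input_ids[:, 1:]
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return nll.exp().item(), logits.argmax(-1)         # perplexity, greedy picks

ppl_fp, top_fp = eval_step(model_fp16, input_ids)      # model_fp16/model_quant: assumed
ppl_q, top_q = eval_step(model_quant, input_ids)
print(f"perplexity: {ppl_fp:.2f} -> {ppl_q:.2f}")
print(f"top-1 agreement: {(top_fp == top_q).float().mean().item():.1%}")
```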

Nonlinear Quantization/New Data Format

  • FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
  • ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
  • A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
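
What a 6-bit float format buys is easiest to see from its value grid. The sketch below enumerates all values representable in a sign/exponent/mantissa layout (e3m2 with bias 3, one plausible FP6 variant; the papers above differ in details) and rounds weights to the nearest one; real kernels do this with bit manipulation rather than a lookup:

```python
import numpy as np

def float_grid(exp_bits=3, man_bits=2, bias=3):
    """All values representable with the given exponent/mantissa split."""
    vals = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            if e == 0:                                   # subnormals (no implicit 1)
                mag = (m / 2 ** man_bits) * 2.0 ** (1 - bias)
            else:                                        # normals: implicit leading 1
                mag = (1 + m / 2 ** man_bits) * 2.0 ** (e - bias)
            vals.update((mag, -mag))
    return np.array(sorted(vals), dtype=np.float32)

def round_to_grid(x, grid):
    return grid[np.abs(x[..., None] - grid).argmin(axis=-1)]

grid = float_grid()                                      # 63 distinct FP6 values
w = np.random.randn(8).astype(np.float32)
print(round_to_grid(w, grid))
```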

Quantization with Compensation

  • SqueezeLLM: Dense-and-Sparse Quantization
  • SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
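
Both papers rest on a dense-and-sparse decomposition: pull the few largest-magnitude outlier weights into a sparse full-precision matrix and quantize only the well-behaved remainder, which shrinks the quantization range and hence the error. A generic NumPy sketch with per-row 4-bit asymmetric quantization, not either paper's exact algorithm:

```python
import numpy as np

def dense_sparse_split(w, outlier_frac=0.005):
    thresh = np.quantile(np.abs(w), 1 - outlier_frac)
    sparse = np.where(np.abs(w) > thresh, w, 0.0)        # outliers kept in full precision
    dense = w - sparse                                   # well-behaved remainder
    lo = dense.min(axis=1, keepdims=True)                # per-row 4-bit asymmetric quant
    scale = (dense.max(axis=1, keepdims=True) - lo) / 15.0
    q = np.round((dense - lo) / scale)                   # integer codes 0..15
    return q * scale + lo + sparse                       # dequantized reconstruction

w = np.random.randn(512, 512).astype(np.float32)
w[0, 0] = 40.0                                           # inject one large outlier
err = lambda x: np.abs(w - x).max()
print("with outlier extraction:   ", err(dense_sparse_split(w)))
print("without outlier extraction:", err(dense_sparse_split(w, outlier_frac=0.0)))
```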

System-Level Optimization

  • Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models
  • LLM in a flash: Efficient Large Language Model Inference with Limited Memory
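
The flash-offloading idea is that most of a weight matrix need never reach RAM when activations are sparse: keep it on disk and read only the rows the current token activates. A loose illustration with NumPy memory mapping; the file name, sizes, and selected rows are all made up:

```python
import numpy as np

rows, cols = 4096, 1024                                  # hypothetical FFN up-projection
np.lib.format.open_memmap("ffn_up.npy", mode="w+",
                          dtype=np.float16, shape=(rows, cols))[:] = 0.1

w = np.load("ffn_up.npy", mmap_mode="r")                 # header read; no weights in RAM yet
active = np.array([3, 17, 42, 900, 4000])                # rows a sparsity predictor selected
chunk = w[active]                                        # only these rows are read from disk
print(chunk.shape, chunk.dtype)                          # (5, 1024) float16
```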

KV Cache Compression/Activation Quantization

  • KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (UC Berkeley)
  • Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
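
A recurring observation in KV-cache work is that key outliers align with channels while values behave per token, so the two caches want different quantization axes. A hedged sketch with 4-bit asymmetric quantization along each axis; shapes are illustrative, not KVQuant's full method:

```python
import numpy as np

def quant_dequant(x, axis, bits=4):
    lo = x.min(axis=axis, keepdims=True)
    scale = (x.max(axis=axis, keepdims=True) - lo) / (2 ** bits - 1)
    return np.round((x - lo) / scale) * scale + lo

T, D = 2048, 128                                         # cached tokens, head dim
K = np.random.randn(T, D).astype(np.float32)
V = np.random.randn(T, D).astype(np.float32)
K[:, 7] *= 20                                            # channel-aligned key outlier

K_q = quant_dequant(K, axis=0)                           # per-channel (reduce over tokens)
V_q = quant_dequant(V, axis=1)                           # per-token (reduce over channels)
print("key err:", np.abs(K - K_q).max(), " value err:", np.abs(V - V_q).max())
```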

Others

  • Z-FOLD: A Frustratingly Easy Post-Training Quantization Scheme for LLMs (Samsung Research)
  • Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
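
For intuition on the norm-tweaking idea: adjusting a LayerNorm's affine parameters can pull a quantized layer's output statistics back toward the float model's. A simplified closed-form sketch on a single tensor, not the paper's calibration procedure:

```python
import numpy as np

act_fp = np.random.randn(1024, 768) * 1.3 + 0.2          # float activations (stand-in)
act_q = act_fp + 0.1 * np.random.randn(1024, 768)        # same layer after quantization

gamma, beta = np.ones(768), np.zeros(768)                # LayerNorm affine parameters
# Rescale/shift per channel so gamma_t*act_q + beta_t matches gamma*act_fp + beta:
gamma_t = gamma * act_fp.std(axis=0) / act_q.std(axis=0)
beta_t = beta + gamma * act_fp.mean(axis=0) - gamma_t * act_q.mean(axis=0)

tweaked = gamma_t * act_q + beta_t
print("channel mean gap:", np.abs(tweaked.mean(0) - (gamma * act_fp + beta).mean(0)).max())
print("channel std gap: ", np.abs(tweaked.std(0) - (gamma * act_fp + beta).std(0)).max())
```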