- Tutorial slides: Introduction to LLM efficiency, https://www.ningxuefei.cc/talks/llm-efficiency-intro_tutorialonly.pdf
- https://arxiv.org/pdf/2401.15347.pdf
- QuIP: 2-Bit Quantization of Large Language Models With Guarantees
- Extreme Compression of Large Language Models via Additive Quantization, https://arxiv.org/pdf/2401.06118.pdf
- Enabling Fast 2-bit LLM on GPUs: Memory Alignment and Asynchronous Dequantization (see the group-wise quantization sketch after this list)
- PB-LLM: Partially Binarized Large Language Models, https://arxiv.org/pdf/2310.00034.pdf
- BitNet: Scaling 1-bit Transformers for Large Language Models
- blog: 1-bit Quantization: Run Models with Trillions of Parameters on Your Computer
- LLM-MQ: Mixed-precision Quantization for Efficient LLM Deployment
- Compressing LLMs: The Truth Is Rarely Pure and Never Simple, https://arxiv.org/pdf/2310.01382.pdf
- Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study
- FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
- ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
- ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
- SqueezeLLM: Dense-and-Sparse Quantization
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
- Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (UC Berkeley; see the KV-cache quantization sketch after this list)
- Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
- Z-FOLD: A Frustratingly Easy Post-Training Quantization Scheme for LLMs (Samsung Research)
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
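Most of the low-bit weight-quantization papers above (QuIP, the additive-quantization work, PB-LLM, SqueezeLLM, SpQR) build on one shared primitive: round-to-nearest quantization with per-group scales. A minimal NumPy sketch of that primitive, assuming a 2-bit asymmetric min/max scheme and a group size of 64 (illustrative choices, not any single paper's method):

```python
# Group-wise low-bit weight quantization: illustrative sketch only.
# Bit width, group size, and the asymmetric min/max grid are assumptions.
import numpy as np

def quantize_groupwise(w: np.ndarray, bits: int = 2, group_size: int = 64):
    """Asymmetric round-to-nearest quantization of a 2-D weight matrix.

    Each row is split into groups of `group_size` values; every group keeps
    its own scale and zero-point, so an outlier in one group does not
    stretch the quantization grid of its neighbors.
    """
    levels = 2 ** bits - 1                     # e.g. 2 bits -> codes 0..3
    rows, cols = w.shape
    assert cols % group_size == 0
    g = w.reshape(rows, cols // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)   # guard all-constant groups
    q = np.clip(np.round((g - w_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, w_min

def dequantize_groupwise(q, scale, zero, shape):
    """Reconstruct approximate float weights from codes, scales, zero-points."""
    return (q.astype(np.float32) * scale + zero).reshape(shape)

w = np.random.randn(8, 128).astype(np.float32)
q, s, z = quantize_groupwise(w, bits=2, group_size=64)
w_hat = dequantize_groupwise(q, s, z, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
```

The papers differ mainly in what replaces the uniform grid (lattice codebooks in QuIP, learned additive codebooks in the AQLM line, dense-plus-sparse outlier splits in SqueezeLLM and SpQR); the group-wise scale/zero bookkeeping above is the part they have in common.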
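KVQuant's headline observation is that keys quantize best per-channel while values quantize best per-token. A minimal NumPy sketch of that axis choice, assuming symmetric 4-bit round-to-nearest and no outlier handling (the paper's non-uniform codebooks and pre-RoPE key handling are omitted here):

```python
# KV-cache quantization sketch: per-channel keys, per-token values.
# Shapes, bit width, and the lack of outlier handling are assumptions.
import numpy as np

def quantize_axiswise(x: np.ndarray, bits: int = 4, axis: int = -1):
    """Symmetric round-to-nearest quantization with one scale per slice
    taken along `axis` (axis=0 -> per-channel, axis=1 -> per-token here)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero slices
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

seq_len, head_dim = 1024, 128
k = np.random.randn(seq_len, head_dim).astype(np.float32)
v = np.random.randn(seq_len, head_dim).astype(np.float32)

k_q, k_s = quantize_axiswise(k, axis=0)   # one scale per key channel
v_q, v_s = quantize_axiswise(v, axis=1)   # one scale per value token

k_hat, v_hat = k_q * k_s, v_q * v_s       # dequantize for attention
print("K err:", np.abs(k - k_hat).mean(), "V err:", np.abs(v - v_hat).mean())
```

The axis choice matters because key activations show strong per-channel outlier structure, so sharing a scale across the sequence dimension (rather than across channels) keeps those outliers from inflating every token's quantization step.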