Awesome ML Model Compression

An awesome-style list that curates the best machine learning model compression and acceleration research papers, articles, tutorials, libraries, tools, and more. PRs are welcome!

Contents


Papers

General

Architecture

Quantization

Binarization

Pruning

Distillation

Low Rank Approximation

Offloading

Recent years have witnessed the emergence of systems specialized for LLM inference, such as FasterTransformer (NVIDIA, 2022), PaLM inference (Pope et al., 2022), DeepSpeed-Inference (Aminabadi et al., 2022), Accelerate (Hugging Face, 2022), LightSeq (Wang et al., 2021), and TurboTransformers (Fang et al., 2021).

To enable LLM inference on easily accessible hardware, offloading is an essential technique; to our knowledge, among current systems, only DeepSpeed-Inference and Hugging Face Accelerate include such functionality.
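As a concrete illustration, below is a minimal sketch of offloading with Hugging Face Accelerate through transformers' device_map interface. The checkpoint name, memory limits, and offload folder are placeholder assumptions, not recommendations.

```python
# Minimal sketch: offloading an LLM across GPU, CPU RAM, and disk with
# Hugging Face Accelerate (via transformers' device_map). The model name,
# memory caps, and folder below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",                          # Accelerate decides the GPU/CPU/disk split
    max_memory={0: "10GiB", "cpu": "30GiB"},    # assumed per-device memory caps
    offload_folder="offload",                   # layers that fit nowhere else go to disk
)

inputs = tokenizer("Model compression is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```

With device_map="auto", Accelerate places each layer on GPU, CPU RAM, or disk according to the memory caps and moves weights onto the GPU only when they are needed for computation.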

Parallelism

Papers on compression methods for accelerating model-parallel training:

  • Does compressing activations help model parallel training? (2023) - The first empirical study of how compression algorithms (pruning-based, learning-based, and quantization-based, evaluated on a Transformer architecture) affect the communication speed of model parallelism. Summary: 1) activation compression is not the same as gradient compression; 2) training setups matter a lot; 3) don't compress early layers' activations. A toy sketch of the activation-compression idea follows this list.
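To make the idea concrete, here is a minimal, hypothetical sketch of per-tensor int8 activation compression as it might be applied before an activation crosses a model-parallel boundary. The function names and shapes are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: quantize an activation tensor to int8 before it crosses
# a model-parallel boundary, then dequantize on the receiving side.
import torch

def compress_activation(x: torch.Tensor):
    """Symmetric per-tensor int8 quantization of an activation."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale  # ~4x smaller payload than fp32, plus one scale scalar

def decompress_activation(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate fp32 activation from the int8 payload."""
    return q.to(torch.float32) * scale

# A later layer's activation (per the paper's finding, avoid compressing
# early layers' activations).
act = torch.randn(8, 1024)
q, scale = compress_activation(act)
recovered = decompress_activation(q, scale)
print((act - recovered).abs().max())  # error introduced by compression
```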

Articles

Content published on the Web.

Howtos

Assorted

Reference

Blogs

Tools

Libraries

  • TensorFlow Model Optimization Toolkit. Accompanying blog post: TensorFlow Model Optimization Toolkit — Pruning API.
  • XNNPACK is a highly optimized library of floating-point neural network inference operators for ARM, WebAssembly, and x86 (SSE2 level) platforms. It is based on the QNNPACK library but, unlike QNNPACK, focuses entirely on floating-point operators.
  • bitsandbytes is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers and quantization functions (see the sketch after this list).
  • NNCP - An experiment to build a practical lossless data compressor with neural networks. The latest version uses a Transformer model (slower but best compression ratio); an LSTM model (faster) is also available.
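As a quick illustration of the bitsandbytes item above, here is a minimal sketch, assuming a standard PyTorch training loop, of swapping a full-precision optimizer for bitsandbytes' 8-bit Adam; the model and hyperparameters are placeholders.

```python
# Minimal sketch: use bitsandbytes' 8-bit Adam to cut optimizer-state memory.
# The model and learning rate are placeholders; only the optimizer line matters.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model (CUDA required)

# 8-bit Adam keeps optimizer statistics in 8-bit, reducing memory vs. fp32 Adam.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)

x = torch.randn(32, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```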

Frameworks

Paper Implementations

  • facebookresearch/kill-the-bits - Code and compressed models for the paper "And the Bit Goes Down: Revisiting the Quantization of Neural Networks" by Facebook AI Research.

Videos

Talks

Training & tutorials

License

I am providing code and resources in this repository to you under an open source license. Because this is my personal repository, the license you receive to my code and resources is from me and not my employer.
