BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
BitStack breaks down large language models into tiny little blocks, which can be sorted and stacked universally, achieving megabyte-level memory-performance tradeoffs while maintaining or surpassing the performance of practical compression methods like GPTQ and AWQ. Check out our paper for more details!
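Conceptually, each weight matrix is repeatedly approximated and the leftover residual is decomposed again, producing a stack of small blocks sorted by importance: loading more blocks costs more memory and recovers more accuracy. The sketch below illustrates that general idea with a plain rank-k residual SVD in PyTorch. It is a simplification for intuition only; the names (`decompose`, `niter`, `k`) merely mirror the CLI flags used later in this README and are not BitStack's internal API (see the paper for the actual decomposition).

```python
# Minimal sketch of iterative rank-k residual decomposition (illustration only,
# not BitStack's exact algorithm).
import torch

def decompose(W: torch.Tensor, niter: int = 16, k: int = 16):
    """Split a weight matrix into `niter` rank-k blocks of decreasing importance."""
    blocks, residual = [], W.clone()
    for _ in range(niter):
        # Keep the top-k singular vectors of the current residual.
        U, S, Vh = torch.linalg.svd(residual, full_matrices=False)
        block = (U[:, :k] * S[:k]) @ Vh[:k, :]
        blocks.append(block)           # one stackable block
        residual = residual - block    # decompose what is left in the next iteration
    return blocks

W = torch.randn(256, 256)
blocks = decompose(W)
# Under a tight memory budget, load only the first few blocks for a coarse approximation.
approx = torch.stack(blocks[:4]).sum(0)
```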
- [2025-01-08] 🎈 Add support for Mistral and Qwen models!
- [2024-11-06] 🚀 We've released Triton kernels optimized for fused inference with BitStack models! These kernels deliver an impressive 3x to 10x speedup over the original implementation. Just set the `--fused_level` flag to get started! For more details, check out the speedup visualization here.
- [2024-11-01] 🎈 Try out this Colab demo and play with BitStack models across various memory budgets using an intuitive slider built with Gradio!
- [2024-11-01] 📄 Check out our paper on arXiv!
- [2024-10-31] ✨ Pre-decomposed models are now available on HuggingFace🤗!
- [2024-10-31] 🚀 Code release! We have some awesome inference kernels for BitStack models coming soon, stay tuned!
To get started, create an environment and install BitStack from source:
conda create -yn bitstack python=3.10
conda activate bitstack
pip install -e .
To decompose a model, run this script or the following command:
# --niter: number of decomposition iterations (decrease for a shorter runtime)
# --k: number of singular vectors kept
# --nsamples: number of calibration samples
# --score_importance: run the sorting (importance scoring) process
# --generate_compression_configs: generate compression configs
python -m bitstack.main \
--model_name_or_path meta-llama/Meta-Llama-3.1-8B \
--niter 16 \
--k 16 \
--nsamples 256 \
--output_dir outputs \
--do_save \
--score_importance \
--generate_compression_configs
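If you prefer to drive the same entry point from Python (for example in a CI smoke test), a minimal wrapper with `subprocess` looks like the sketch below. The smaller model and reduced `--niter`/`--nsamples` values are illustrative choices for a quick run, not recommended settings.

```python
# Quick decomposition smoke run via the same CLI entry point (illustrative values).
import subprocess

subprocess.run([
    "python", "-m", "bitstack.main",
    "--model_name_or_path", "meta-llama/Llama-3.2-1B",  # small model for a fast run
    "--niter", "4",        # fewer iterations -> shorter runtime, coarser tradeoff
    "--k", "16",
    "--nsamples", "64",
    "--output_dir", "outputs",
    "--do_save",
    "--score_importance",
    "--generate_compression_configs",
], check=True)
```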
To evaluate the decomposed model, run this script or the following command:
# --max_memory_MB: maximum available memory for the model
# --load_bitstack: load the decomposed model
# --do_eval: perplexity evaluation
# --lm_eval: zero-shot evaluation
python -m bitstack.main \
--model_name_or_path /YOUR/CHECKPOINT/PATH \
--k 16 \
--max_memory_MB 5541 \
--load_bitstack \
--do_eval \
--lm_eval \
--output_dir outputs
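For reference, the sketch below shows how WikiText-2 perplexity (the quantity `--do_eval` reports) is typically computed. It loads a standard Hugging Face model as a stand-in; a BitStack checkpoint is instead loaded through the CLI with `--load_bitstack`, and BitStack's own evaluation code may differ in details.

```python
# Illustration only: standard WikiText-2 perplexity computation with transformers.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B"  # stand-in for a decomposed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

seqlen, nlls = 2048, []
for i in range(0, ids.shape[1] - seqlen, seqlen):
    batch = ids[:, i : i + seqlen].to(model.device)
    with torch.no_grad():
        loss = model(batch, labels=batch).loss  # mean NLL over this window
    nlls.append(loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"perplexity: {ppl.item():.2f}")
```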
We provide pre-decomposed models and compression configs. Currently, the following models are available, with more to come—stay tuned!
| Model | Download |
|---|---|
| Llama-2 | 🤗7B / 🤗13B / 🤗70B |
| Llama-3 | 🤗8B / 🤗70B |
| Llama-3.1 | 🤗8B / 🤗70B |
| Llama-3.1-Instruct | 🤗8B / 🤗70B |
| Llama-3.2 | 🤗1B / 🤗3B |
| Mistral-7B-v0.3 | 🤗7B |
| Qwen-2.5 | 🤗0.5B / 🤗1.5B / 🤗3B / 🤗7B / 🤗14B / 🤗32B / 🤗72B |
You can download them via the following commands:
# (Optional) enable hf_transfer for faster download
# pip install hf_transfer
# export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download BitStack/BitStack-Llama-3.1-8B --local-dir ./models/BitStack-Llama-3.1-8B
Or, if you have already decomposed the model yourself, download just the compression config:
huggingface-cli download BitStack/BitStack-Llama-3.1-8B --local-dir /YOUR/CHECKPOINT/PATH/ --include "compression_config.json"
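As an alternative to `huggingface-cli`, the same downloads can be done from Python with `huggingface_hub.snapshot_download` (the paths below reuse the placeholders from the commands above):

```python
from huggingface_hub import snapshot_download

# Download a full pre-decomposed model.
snapshot_download(
    repo_id="BitStack/BitStack-Llama-3.1-8B",
    local_dir="./models/BitStack-Llama-3.1-8B",
)

# Or fetch only the compression config into an existing checkpoint directory.
snapshot_download(
    repo_id="BitStack/BitStack-Llama-3.1-8B",
    local_dir="/YOUR/CHECKPOINT/PATH/",
    allow_patterns=["compression_config.json"],
)
```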
If you find BitStack useful, please consider citing our paper:
@misc{wang2024bitstackfinegrainedsizecontrol,
title={BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments},
author={Xinghao Wang and Pengyu Wang and Bo Wang and Dong Zhang and Yunhua Zhou and Xipeng Qiu},
year={2024},
eprint={2410.23918},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.23918},
}