BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
BitStack breaks down large language models into tiny little blocks, which can be sorted and stacked universally, achieving megabyte-level memory-performance tradeoffs while maintaining or surpassing the performance of practical compression methods like GPTQ and AWQ. Check out our paper for more details!
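Conceptually, each weight matrix is repeatedly approximated and the leftover residual is decomposed again, producing a stack of small blocks sorted by importance: loading more blocks costs more memory and recovers more accuracy. The sketch below illustrates that general idea with a plain rank-k residual SVD in PyTorch. It is a simplification for intuition only; the names (`decompose`, `niter`, `k`) merely mirror the CLI flags used later in this README and are not BitStack's internal API (see the paper for the actual decomposition).

```python
# Minimal sketch of iterative rank-k residual decomposition (illustration only,
# not BitStack's exact algorithm).
import torch

def decompose(W: torch.Tensor, niter: int = 16, k: int = 16):
    """Split a weight matrix into `niter` rank-k blocks of decreasing importance."""
    blocks, residual = [], W.clone()
    for _ in range(niter):
        # Keep the top-k singular vectors of the current residual.
        U, S, Vh = torch.linalg.svd(residual, full_matrices=False)
        block = (U[:, :k] * S[:k]) @ Vh[:k, :]
        blocks.append(block)           # one stackable block
        residual = residual - block    # decompose what is left in the next iteration
    return blocks

W = torch.randn(256, 256)
blocks = decompose(W)
# Under a tight memory budget, load only the first few blocks for a coarse approximation.
approx = torch.stack(blocks[:4]).sum(0)
```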
- [2025-01-08] 🎈 Add support for Mistral and Qwen models!
- [2024-11-06] 🚀 We've released Triton kernels optimized for fused inference with BitStack models! These kernels deliver an impressive 3x to 10x speedup over the original implementation. Just set the `--fused_level` flag to get started! For more details, check out the speedup visualization here.
- [2024-11-01] 🎈 Try out this Colab demo and play with BitStack models across various memory budgets using an intuitive slider built with Gradio!
- [2024-11-01] 📄 Check out our paper on arXiv!
- [2024-10-31] ✨ Pre-decomposed models are now available on HuggingFace🤗!
- [2024-10-31] 🚀 Code release! We have some awesome inference kernels for BitStack models coming soon, stay tuned!
To get started, create an environment and install BitStack from source:
conda create -yn bitstack python=3.10
conda activate bitstack
pip install -e .
To decompose a model, run this script or the following command:
# --niter: number of decomposition iterations (decrease for a shorter runtime)
# --k: number of singular vectors kept
# --nsamples: number of calibration samples
# --score_importance: run the sorting (importance scoring) process
# --generate_compression_configs: generate compression configs
python -m bitstack.main \
--model_name_or_path meta-llama/Meta-Llama-3.1-8B \
--niter 16 \
--k 16 \
--nsamples 256 \
--output_dir outputs \
--do_save \
--score_importance \
--generate_compression_configs
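If you prefer to drive the same entry point from Python (for example in a CI smoke test), a minimal wrapper with `subprocess` looks like the sketch below. The smaller model and reduced `--niter`/`--nsamples` values are illustrative choices for a quick run, not recommended settings.

```python
# Quick decomposition smoke run via the same CLI entry point (illustrative values).
import subprocess

subprocess.run([
    "python", "-m", "bitstack.main",
    "--model_name_or_path", "meta-llama/Llama-3.2-1B",  # small model for a fast run
    "--niter", "4",        # fewer iterations -> shorter runtime, coarser tradeoff
    "--k", "16",
    "--nsamples", "64",
    "--output_dir", "outputs",
    "--do_save",
    "--score_importance",
    "--generate_compression_configs",
], check=True)
```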
To evaluate the decomposed model, run this script or the following command:
# --max_memory_MB: maximum available memory for the model
# --load_bitstack: load the decomposed model
# --do_eval: perplexity evaluation
# --lm_eval: zero-shot evaluation
python -m bitstack.main \
--model_name_or_path /YOUR/CHECKPOINT/PATH \
--k 16 \
--max_memory_MB 5541 \
--load_bitstack \
--do_eval \
--lm_eval \
--output_dir outputs
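For reference, the sketch below shows how WikiText-2 perplexity (the quantity `--do_eval` reports) is typically computed. It loads a standard Hugging Face model as a stand-in; a BitStack checkpoint is instead loaded through the CLI with `--load_bitstack`, and BitStack's own evaluation code may differ in details.

```python
# Illustration only: standard WikiText-2 perplexity computation with transformers.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B"  # stand-in for a decomposed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

seqlen, nlls = 2048, []
for i in range(0, ids.shape[1] - seqlen, seqlen):
    batch = ids[:, i : i + seqlen].to(model.device)
    with torch.no_grad():
        loss = model(batch, labels=batch).loss  # mean NLL over this window
    nlls.append(loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"perplexity: {ppl.item():.2f}")
```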
We provide pre-decomposed models and compression configs. Currently, the following models are available, with more to come—stay tuned!
| Model | Download |
|---|---|
| Llama-2 | 🤗7B / 🤗13B / 🤗70B |
| Llama-3 | 🤗8B / 🤗70B |
| Llama-3.1 | 🤗8B / 🤗70B |
| Llama-3.1-Instruct | 🤗8B / 🤗70B |
| Llama-3.2 | 🤗1B / 🤗3B |
| Mistral-7B-v0.3 | 🤗7B |
| Qwen-2.5 | 🤗0.5B / 🤗1.5B / 🤗3B / 🤗7B / 🤗14B / 🤗32B / 🤗72B |
You can download them via the following commands:
# (Optional) enable hf_transfer for faster download
# pip install hf_transfer
# export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download BitStack/BitStack-Llama-3.1-8B --local-dir ./models/BitStack-Llama-3.1-8B
Or, if you have already decomposed the model yourself, download just the compression config:
huggingface-cli download BitStack/BitStack-Llama-3.1-8B --local-dir /YOUR/CHECKPOINT/PATH/ --include "compression_config.json"
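As an alternative to `huggingface-cli`, the same downloads can be done from Python with `huggingface_hub.snapshot_download` (the paths below reuse the placeholders from the commands above):

```python
from huggingface_hub import snapshot_download

# Download a full pre-decomposed model.
snapshot_download(
    repo_id="BitStack/BitStack-Llama-3.1-8B",
    local_dir="./models/BitStack-Llama-3.1-8B",
)

# Or fetch only the compression config into an existing checkpoint directory.
snapshot_download(
    repo_id="BitStack/BitStack-Llama-3.1-8B",
    local_dir="/YOUR/CHECKPOINT/PATH/",
    allow_patterns=["compression_config.json"],
)
```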
If you find BitStack useful, please consider citing our paper:
@misc{wang2024bitstackfinegrainedsizecontrol,
title={BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments},
author={Xinghao Wang and Pengyu Wang and Bo Wang and Dong Zhang and Yunhua Zhou and Xipeng Qiu},
year={2024},
eprint={2410.23918},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.23918},
}