Releases: bitsandbytes-foundation/bitsandbytes
0.45.0: LLM.int8() support for H100; faster 4-bit/8-bit inference
Highlights
H100 Support for LLM.int8()
PR #1401 brings full LLM.int8() support for NVIDIA Hopper GPUs such as the H100, H200, and H800!
As part of the compatibility enhancements, we've rebuilt much of the LLM.int8() code to simplify future maintenance and compatibility. The col32 and other architecture-specific tensor layout formats are no longer used, while backwards compatibility is maintained. We additionally bring performance improvements targeted at inference scenarios.
Performance Improvements
This release includes broad performance improvements for a wide variety of inference scenarios. See this X thread for a detailed explanation.
The improvements were measured using the 🤗optimum-benchmark tool.
For more benchmark results, see benchmarking/README.md.
LLM.int8()
- Turing/Ampere/Ada: The observed per-token throughput is improved by 60-85%, while latency is decreased by 40-45%.
- H100: With our benchmarking of Llama 3.1 70B, we observed the new LLM.int8() to consistently outperform NF4 at batch size >= 8.
Example throughput improvement for Qwen 2.5 14B Instruct on RTX 4090:
- Batch size = 1: 9.05 tokens/s => 15.44 tokens/s
- Batch size = 8: 66.62 tokens/s => 110.95 tokens/s
Example throughput improvement for Qwen 2.5 3B Instruct on T4:
- Batch size = 1: 3.34 tokens/s => 5.98 tokens/s
- Batch size = 8: 24.28 tokens/s => 44.15 tokens/s
NF4/FP4
- Turing/Ampere/Ada: With batch size of 1, per-token throughput is improved by 10-25% and per-token latency is decreased by 10-20%.
- H100: Across all batch sizes, per-token throughput is improved by up to 28% and per-token latency is decreased by up to 22%.
Example throughput improvement for Qwen 2.5 14B Instruct on RTX 4090:
- Batch size = 1: 31.46 tokens/s => 39.03 tokens/s
- Batch size = 8: 110.70 tokens/s => 111.29 tokens/s
Example throughput improvement for Qwen 2.5 3B Instruct on T4:
- Batch size = 1: 11.05 tokens/s => 13.58 tokens/s
- Batch size = 8: 69.8 tokens/s => 76.80 tokens/s
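These gains come from the library's kernels, so existing code that loads quantized models through the 🤗transformers integration benefits without changes. For orientation, a minimal loading sketch is shown below; the model id and generation settings are illustrative and are not the exact optimum-benchmark configuration used for the numbers above.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-3B-Instruct"  # illustrative example model

# LLM.int8(): 8-bit weights with outlier handling.
int8_config = BitsAndBytesConfig(load_in_8bit=True)

# NF4: 4-bit weights with fp16 compute.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=int8_config,  # or nf4_config
    device_map="auto",
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```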
Changes
Packaging Changes
The size of our wheel has been reduced by ~43.5% from 122.4 MB to 69.1 MB! This results in an on-disk size decrease from ~396MB to ~224MB.
CUDA Toolkit Versions
- Binaries built with CUDA Toolkit 12.6.2 are now included in the PyPI distribution.
- The CUDA 12.5.0 build has been updated to CUDA Toolkit 12.5.1.
Breaking
🤗PEFT users wishing to merge adapters with 8-bit weights will need to upgrade to `peft>=0.14.0`.
New
- A new public API for int8 dequantization has been added: `bitsandbytes.functional.int8_vectorwise_dequant()`. This functionality is being integrated into 🤗PEFT and 🤗transformers. A usage sketch follows below.
- We've continued to make documentation updates. The `bitsandbytes.functional` module now has an API documentation page.
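For reference, here is a minimal round-trip sketch of the new dequantization function paired with the existing `int8_vectorwise_quant()`. The exact return layout of the quantization helper (int8 data, row-wise absmax statistics, optional outlier columns) is assumed here; consult the `bitsandbytes.functional` API docs for the authoritative signatures.
```python
import torch

import bitsandbytes.functional as F

A = torch.randn(4, 16, dtype=torch.float16, device="cuda")

# Row-wise (vectorwise) int8 quantization: assumed to return the int8 tensor,
# the per-row absmax statistics, and outlier column indices (unused here).
A_int8, row_stats, _ = F.int8_vectorwise_quant(A)

# New public API: dequantize back to floating point using the same statistics.
A_dequant = F.int8_vectorwise_dequant(A_int8, row_stats)

print((A.float() - A_dequant).abs().max())  # small quantization error expected
```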
Deprecations
A number of public API functions have been marked for deprecation and will emit `FutureWarning` when used. These functions will become unavailable in future releases. This should have minimal impact on most end-users.
k-bit quantization
The k-bit quantization features are deprecated in favor of blockwise quantization. For all optimizers, using `block_wise=False` is not recommended and support will be removed in a future release.
LLM.int8() deprecations:
As part of the refactoring process, we've implemented many new 8bit operations. These operations no longer use specialized data layouts.
The following functions from `bitsandbytes.functional` are now deprecated (see the migration sketch after the list):
- dequant_min_max
- dequantize_no_absmax
- extract_outliers
- get_special_format_str
- get_transform_buffer
- get_transform_func
- mm_dequant (replacement: int8_mm_dequant)
- igemmlt (replacement: int8_linear_matmul)
- nvidia_transform
- transform
- quantize_no_absmax
- vectorwise_dequant
- vectorwise_quant (~replacement: int8_vectorwise_quant)
- vectorwise_mm_dequant (~replacement: int8_mm_dequant)
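For users calling the low-level matmul path directly, a rough migration sketch from the deprecated `igemmlt`/`mm_dequant` pair to the new layout-free operations is shown below. Argument order and return values are assumed from the replacements listed above; consult the `bitsandbytes.functional` API documentation for the authoritative signatures.
```python
import torch

import bitsandbytes.functional as F

x = torch.randn(8, 64, dtype=torch.float16, device="cuda")   # activations
w = torch.randn(32, 64, dtype=torch.float16, device="cuda")  # linear weight

# Quantize both operands row-wise to int8 (assumed 3-tuple return).
x_i8, x_stats, _ = F.int8_vectorwise_quant(x)
w_i8, w_stats, _ = F.int8_vectorwise_quant(w)

# Previously: transform(...) into col32/col_turing/col_ampere, then igemmlt(...).
# Now: a plain row-major int8 matmul with int32 accumulation.
out_i32 = F.int8_linear_matmul(x_i8, w_i8)

# Previously: mm_dequant(...). Now: int8_mm_dequant scales the int32 result
# back to fp16 using the row statistics of both operands.
out_fp16 = F.int8_mm_dequant(out_i32, x_stats, w_stats)
print(out_fp16.shape)  # torch.Size([8, 32])
```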
General Deprecations
Additionally, the following functions from `bitsandbytes.functional` are deprecated:
- _mul
- arange
- post_call
- pre_call
What's Changed
- refine docs for multi-backend alpha release by @Titus-von-Koeller in #1380
- README: Replace special Unicode text symbols with regular characters by @akx in #1385
- Update CI tools & fix typos by @akx in #1386
- Fix invalid escape sequence warning in Python 3.12 by @oshiteku in #1420
- [Build] Add CUDA 12.6.2 build; update 12.5.0 to 12.5.1 by @matthewdouglas in #1431
- LLM.int8() Refactoring: Part 1 by @matthewdouglas in #1401
New Contributors
Full Changelog: 0.44.1...0.45.0
0.44.1
What's Changed
- Fix optimizer support for Python <= 3.9 by @matthewdouglas in #1379
Full Changelog: 0.44.0...0.44.1
Multi-Backend Preview
To try this out, simply `pip install 'FULL_DOWNLOAD_LINK'`, using the download link of the correct wheel included below in this release's "Assets" section.
Note that Windows is not supported for the AMD ROCm backend.
Latest `main` wheel
To try this out, simply `pip install 'FULL_DOWNLOAD_LINK'`, using the download link of the correct wheel included below in this release's "Assets" section.
These wheels are built on every commit and become available as soon as the `python-package.yml` GH workflow finishes executing.
0.44.0: New AdEMAMix optimizer, Embeddings quantization, and more!
New optimizer: AdEMAMix
The AdEMAMix optimizer is a modification to AdamW which proposes tracking two EMAs to better leverage past gradients. This allows for faster convergence with less training data and improved resistance to forgetting.
We've implemented 8-bit and paged variations: `AdEMAMix`, `AdEMAMix8bit`, `PagedAdEMAMix`, and `PagedAdEMAMix8bit`. These can be used with a similar API to existing optimizers.
```python
import bitsandbytes as bnb

optimizer = bnb.optim.PagedAdEMAMix8bit(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999, 0.9999),
    alpha=5.0,
    eps=1e-8,
    weight_decay=1e-2,
)
```
8-bit Optimizers Update
The block size for all 8-bit optimizers has been reduced from 2048 to 256 in this release. This is a change from the original implementation proposed in the paper, and it improves accuracy.
CUDA Graphs support
A fix to enable CUDA Graphs capture of kernel functions was made in #1330. This allows for performance improvements with inference frameworks like vLLM. Thanks @jeejeelee!
Quantization for Embeddings
The trend of LLMs to use larger vocabularies continues. The embeddings can take up a significant portion of a quantized model's footprint. We now have implementations of `Embedding4bit` and `Embedding8bit`, thanks to @galqiwi!
Example usage:
```python
import torch
import torch.nn as nn

from bitsandbytes.nn import Embedding4bit

fp16_module = nn.Embedding(128, 64)
quantized_module = Embedding4bit(128, 64)

quantized_module.load_state_dict(fp16_module.state_dict())

quantized_module = quantized_module.to(0)
```
Continuous Builds
We are now building binary wheels for each change on `main`. These builds can be used to preview upcoming changes.
What's Changed
- Embedding4bit and Embedding8bit implementation by @galqiwi in #1292
- Bugfix: Load correct nocublaslt library variant when BNB_CUDA_VERSION override is set by @matthewdouglas in #1318
- Enable certain CUDA kernels to accept specified cuda stream by @jeejeelee in #1330
- Initial support for ppc64le by @mgiessing in #1316
- Cuda source cleanup , refactor and fixes by @abhilash1910 in #1328
- Update for VS2022 17.11 compatibility with CUDA < 12.4 by @matthewdouglas in #1341
- Bump the minor-patch group with 3 updates by @dependabot in #1362
- Update matplotlib requirement from ~=3.9.1 to ~=3.9.2 in the major group by @dependabot in #1361
- docs: add internal reference to multi-backend guide by @Titus-von-Koeller in #1352
- Add `move_to_device` kwarg to the optimizer's `load_state_dict` by @koute in #1344
- Add AdEMAMix optimizer by @matthewdouglas in #1360
- Change 8bit optimizer blocksize 2048->256; additional bf16 support by @matthewdouglas in #1365
New Contributors
- @jeejeelee made their first contribution in #1330
- @mgiessing made their first contribution in #1316
- @abhilash1910 made their first contribution in #1328
- @koute made their first contribution in #1344
Full Changelog: 0.43.3...v0.44.0
0.43.3: enabling Llama 405B with 8xH/A100 + 256GB RAM
Improvements:
- FSDP: Enable loading prequantized weights with bf16/fp16/fp32 `quant_storage`
  - Background: This update, linked to Transformers PR #32276, allows loading prequantized weights with alternative storage formats. Metadata is tracked similarly to `Params4bit.__new__` post PR #970. It supports models exported with non-default `quant_storage`, such as this NF4 model with BF16 storage. A loading sketch follows after this list.
- Special thanks to @winglian and @matthewdouglas for enabling FSDP+QLoRA finetuning of Llama 3.1 405B on a single 8xH100 or 8xA100 node with as little as 256GB system RAM.
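As an illustration of the alternative storage formats, here is a minimal loading sketch through the 🤗transformers integration. The `bnb_4bit_quant_storage` parameter belongs to `BitsAndBytesConfig` in recent transformers versions; the model id is purely illustrative.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Store the packed 4-bit weights in bf16 containers so that FSDP can shard
# them uniformly alongside the other (bf16) parameters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-405B",  # illustrative; any compatible checkpoint
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
```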
0.43.2: finetune Llama 405B on 4x GPUs with improved QLoRA+FSDP, CUDA 12.5 support
0.43.2
This release is quite significant as the QLoRA bug fix has big implications for higher `seqlen` and batch sizes.
For each sequence (i.e. batch size increase of one) we expect memory savings of:
- 405B: 39GB for `seqlen=1024`, and 4888GB for `seqlen=128,000`
- 70B: 10.1GB for `seqlen=1024`, and 1258GB for `seqlen=128,000`
This was due to activations being unnecessary for frozen parameters, yet the memory for them was still erroneously allocated because of the now-fixed bug.
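As a rough sanity check on those figures, assuming the avoided activation memory scales roughly linearly with sequence length:
```python
# Back-of-the-envelope check: per-sequence savings at seqlen=1024, scaled to 128k.
savings_405b_gb = 39  # reported savings per additional sequence at seqlen=1024
print(savings_405b_gb * 128_000 / 1_024)  # ~4875 GB, close to the reported 4888 GB
```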
Improvements:
- docs: FSDP+QLoRA and CPU install guide (#1211 #1227, thanks @stevhliu)
- Add CUDA 12.5 and update 12.4 builds (#1284)
Bug Fixes
- 4bit getstate and 8bit deepcopy (#1230 #1231, thanks @BenjaminBossan)
- missing optimizers in `str2optimizer32bit` (#1222, thanks @EtienneDosSantos)
- CUDA 12.5 build issue (#1273, thanks @HennerM)
- fix for min_8bit_size functionality in Optimizer base classes (#1286, thanks @Edenzzzz)
- QLoRA mem bug (#1270, thanks @Ther-nullptr)
- tests for cpu only platforms (#1259, thanks @galqiwi)
- restoration of quant_storage for CPU offloading (#1279)
- optim update error with non-contiguous grads/params (deepspeed) (#1187)
0.43.1: Improved CUDA setup/diagnostics + 8-bit serialization, CUDA 12.4 support, docs enhancements
Improvements:
- Improved the serialization format for 8-bit weights; this change is fully backwards compatible. (#1164, thanks to @younesbelkada for the contributions and @akx for the review).
- Added CUDA 12.4 support to the Linux x86-64 build workflow, expanding the library's compatibility with the latest CUDA versions. (#1171, kudos to @matthewdouglas for this addition).
- Docs enhancement: Improved the instructions for installing the library from source. (#1149, special thanks to @stevhliu for the enhancements).
Bug Fixes
- Fix 4bit quantization with blocksize = 4096, where an illegal memory access was encountered. (#1160, thanks @matthewdouglas for fixing and @YLGH for reporting)
Internal Improvements:
- Tests: improve memory usage (#1147, thanks @matthewdouglas)
- Add CUDA 12.4 to docs/install helper (#1136, thanks @matthewdouglas)
- Minor type/doc fixes (#1128, thanks @akx)
- Reformat Python code with Ruff (#1081, thanks @akx)
- Rework of CUDA/native-library setup and diagnostics (#1041, thanks @akx)
0.43.0: FSDP support, Official documentation, Cross-compilation on Linux and CI, Windows support
Improvements and New Features:
- QLoRA + FSDP official support is now live! #970 by @warner-benjamin and team - with FSDP you can train very large models (70b scale) on multiple 24GB consumer-type GPUs. See https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html for more details.
- Introduced improvements to the CI process for enhanced performance and efficiency during builds, specifically enabling more effective cross-compilation on Linux platforms. This was accomplished by deprecating Make and migrating to CMake, as well as implementing new corresponding workflows. Huge thanks go to @wkpark, @rickardp, @matthewdouglas and @younesbelkada; #1055, #1050, #1111.
- Windows should now be officially supported in bitsandbytes with `pip install bitsandbytes`
- Updated installation instructions to provide more comprehensive guidance for users. This includes clearer explanations and additional tips for various setup scenarios, making the library more accessible to a broader audience (@rickardp, #1047).
- Enhanced the library's compatibility and setup process, including fixes for CPU-only installations and improvements in CUDA setup error messaging. This effort aims to streamline the installation process and improve user experience across different platforms and setups (@wkpark, @akx, #1038, #996, #1012).
- Set up new documentation at https://huggingface.co/docs/bitsandbytes/main with extensive new sections and content to help users better understand and utilize the library. Especially notable are the new API docs (big thanks to @stevhliu and @mishig25 from Hugging Face, #1012). The API docs have also been addressed in #1075.
Bug Fixes:
- Addressed a race condition in kEstimateQuantiles, enhancing the reliability of quantile estimation in concurrent environments (@pnunna93, #1061).
- Fixed various minor issues, including typos in code comments and documentation, to improve code clarity and prevent potential confusion (@nairbv, #1063).
Backwards Compatibility
- After upgrading from `v0.42` to `v0.43`, when using 4-bit quantization, models may generate slightly different outputs (approximately up to the 2nd decimal place) due to a fix in the code. For anyone interested in the details, see this comment.
Internal and Build System Enhancements:
- Implemented several enhancements to the internal and build systems, including adjustments to the CI workflows, portability improvements, and build artifact management. These changes contribute to a more robust and flexible development process, ensuring the library's ongoing quality and maintainability (@rickardp, @akx, @wkpark, @matthewdouglas; #949, #1053, #1045, #1037).
Contributors:
This release is made possible thanks to the many active contributors that submitted PRs and many others who contributed to discussions, reviews, and testing. Your efforts greatly enhance the library's quality and user experience. It's truly inspiring to work with such a dedicated and competent group of volunteers and professionals!
We give a special thanks to @TimDettmers for managing to find a little bit of time for valuable consultations on critical topics, despite preparing for and touring the states applying for professor positions. We wish him the utmost success!
We also extend our gratitude to the broader community for your continued support, feedback, and engagement, which play a crucial role in driving the library's development forward.
4-bit serialization and bug fixes
This release added 4-bit serialization, implemented by @poedator, to bitsandbytes. With this, you can call `model.save()` and `model.load()` for models that contain 4-bit bitsandbytes layers, meaning you can save and load 4-bit models. All of this is integrated with the Hugging Face transformers stack. The 0.42.0 release also comes with many bug fixes. See below for detailed change logs.
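In the 🤗transformers integration this amounts to the usual save/load workflow. A minimal sketch, assuming the model id is illustrative and that the saved config carries the quantization settings so no extra arguments are needed on reload:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # illustrative example model
    quantization_config=config,
    device_map="auto",
)

# Serialize the 4-bit weights together with their quantization state...
model.save_pretrained("opt-350m-4bit")

# ...and load them back without re-quantizing from full precision.
reloaded = AutoModelForCausalLM.from_pretrained("opt-350m-4bit", device_map="auto")
```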
0.42.0
Features:
- 4-bit serialization now supported. This enables 4-bit load/store. Thank you @poedator #753
- The bitsandbytes library now has a version attribute: `bitsandbytes.__version__` @rasbt #710
Bug fixes:
- Fixed bugs in dynamic exponent data type creation. Thank you @RossM, @KohakuBlueleaf, @ArrowM #659 #227 #262 #152
- Fixed an issue where 4-bit serialization would fail for layers without double quantization #868. Thank you, @poedator
- Fixed an issue where calling .to() or .cuda() on a 4-bit layer twice would result in an error #867. Thank you, @jph00
- Fixed a bug where a missing access permission in a path searched for CUDA would lead to an error @osma #677
- Fixed a bug where the GOOGLE_VM_CONFIG_LOCK_FILE variable could cause errors in colab environments @akrentsel @xaptronic #715 #883 #622
- Fixed a bug where kgetColRowStats (LLM.int8()) would fail for certain dimensions @LucQueen #905
- Fixed a bug where the adjusted regular Embedding layer was not available via `bnb.nn.Embedding` @neel04 #563
- Added the missing scipy requirement @dulalbert #525