
[docs] Clarify FSDP-QLoRA #1211

Merged · 3 commits · May 19, 2024

Conversation

@stevhliu (Contributor)

Fix and clarify the docs around:

  1. FSDP only supports floating-point data types in `quant_storage`
  2. Show a more common way to access and configure this option with `BitsAndBytesConfig` in Transformers
  3. Explain `compute_dtype` and the importance of matching the `quant_storage` data type with the data types used everywhere else
  4. Link to the code example in the PEFT docs

cc @Titus-von-Koeller


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Titus-von-Koeller (Collaborator)

In the first [!TIP] box in the Training section, I would add a sentence (maybe in bold?) like "Note that FSDP is a distributed framework and therefore needs to be kicked off as a distributed training job, which is described in the following resources."

Otherwise it doesn't become clear to novices that these code examples alone (on this BNB doc page) don't suffice: FSDP is not run as a standard single-process Python script that they can execute in an interpreter or notebook, but needs to be kicked off as a distributed job through Accelerate (or torchrun or similar if we weren't using HF libs).
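To make that concrete, a hedged sketch of what such a distributed launch looks like with Accelerate (the file names `fsdp_config.yaml` and `train.py` are placeholders, not files from this repo):

```shell
# Placeholder file names; any FSDP-QLoRA training script is launched the same way.
# 1) Create an Accelerate config that enables FSDP (interactive prompts):
accelerate config --config_file fsdp_config.yaml

# 2) Launch the training script as a distributed job across all local GPUs:
accelerate launch --config_file fsdp_config.yaml train.py

# Roughly equivalent without Accelerate, using torchrun directly:
torchrun --nproc_per_node=8 train.py
```

The key point from the comment above: neither command runs the script as a single interpreter process; each spawns one process per GPU.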

@Titus-von-Koeller (Collaborator)

Hey @stevhliu, thanks a lot for taking the time to further polish this page, really appreciated! 🤗

I left a few suggestions. Please feel free to integrate my feedback at your discretion and then we'll merge.

@@ -9,25 +9,35 @@ This guide provides a brief guide on how bitsandbytes supports storing quantized

## Quantized data storage

FSDP only supports sharding float data types which can be problematic because quantized weights are typically stored as integer data types (uint8). bitsandbytes doesn't have this problem because it uses `StoreChar` to read and write quantized weights regardless of the data type storage. This makes it simple to add a `quant_storage` parameter to the [`~nn.Linear4bit`] and [`~nn.Params4bit`] classes and set it to `torch.uint8` to maintain backward compatibility with the codebase.
FSDP only supports sharding float data types which can be problematic because quantized weights are typically stored as integer data types (uint8). bitsandbytes doesn't have this problem because it uses `StoreChar` to read and write quantized weights regardless of the data type storage. This makes it simple to add a `quant_storage` parameter to the [`~nn.Linear4bit`] and [`~nn.Params4bit`] classes and set it to `torch.uint8` to maintain backward compatibility with the codebase. With the `quant_storage` parameter, you can select any of the FSDP supported data types to shard [`~nn.Linear4bit`] with such as bfloat16, float16 or float32.

Suggested change
FSDP only supports sharding float data types which can be problematic because quantized weights are typically stored as integer data types (uint8). bitsandbytes doesn't have this problem because it uses `StoreChar` to read and write quantized weights regardless of the data type storage. This makes it simple to add a `quant_storage` parameter to the [`~nn.Linear4bit`] and [`~nn.Params4bit`] classes and set it to `torch.uint8` to maintain backward compatibility with the codebase. With the `quant_storage` parameter, you can select any of the FSDP supported data types to shard [`~nn.Linear4bit`] with such as bfloat16, float16 or float32.
FSDP only supports sharding float data types which can be problematic because quantized weights are typically stored as integer data types (uint8). bitsandbytes isn't limited by this convention because it uses `StoreChar` to read and write quantized weights regardless of the underlying data type used for storage of the quantized bytes. This made it possible to add a `quant_storage` parameter to the [`~nn.Linear4bit`] and [`~nn.Params4bit`] classes and set it to the default `torch.uint8` to maintain backward compatibility with previous iterations of the codebase. With the `quant_storage` parameter, you can select any of the FSDP supported data types to shard [`~nn.Linear4bit`] with such as bfloat16, float16 or float32.


You'll typically access and configure this option from [`transformers.BitsAndBytesConfig`] by setting the `bnb_4bit_quant_storage` parameter. It is very **important** the `quant_storage` data type matches the data types used throughout the model because FSDP can only wrap layers and modules that have the *same floating data type*. Making sure the data types are aligned will ensure the model is correctly sharded.

Suggested change
You'll typically access and configure this option from [`transformers.BitsAndBytesConfig`] by setting the `bnb_4bit_quant_storage` parameter. It is very **important** the `quant_storage` data type matches the data types used throughout the model because FSDP can only wrap layers and modules that have the *same floating data type*. Making sure the data types are aligned will ensure the model is correctly sharded.
You'll typically access and configure this option from [`transformers.BitsAndBytesConfig`] by setting the `bnb_4bit_quant_storage` parameter. It is very **important** that the `quant_storage` data type matches the data types used throughout the model because FSDP can only wrap layers and modules that have the *same floating point data type*. Making sure the data types are aligned will ensure the model is correctly sharded.


> [!TIP]
> The `compute_dtype` is the data type used for computation inside the CUDA kernel, where the 4-bit quantized weights are unpacked from the data type in `quant_storage` and dequantized to `compute_dtype`. We recommend using torch.bfloat16 (if available on your hardware) for better numerical stability.

Suggested change
> The `compute_dtype` is the data type used for computation inside the CUDA kernel, where the 4-bit quantized weights are unpacked from the data type in `quant_storage` and dequantized to `compute_dtype`. We recommend using torch.bfloat16 (if available on your hardware) for better numerical stability.
> Another dtype you'll have to set correctly when using bitsandbytes, and one that can sound confusing in this context without further explanation, is the `compute_dtype`: it's unrelated to the storage format used for the quantized bytes, and is instead the data type used for computation inside the CUDA kernel. In the kernel, the 4-bit quantized weights are unpacked from the data type in `quant_storage` and dequantized to `compute_dtype` for the matrix multiplication operation. We recommend using torch.bfloat16 (if available on your hardware) for better numerical stability. What you set `compute_dtype` to has no effect on FSDP and sharding, so numerical stability, hardware support, and computational efficiency are the only relevant considerations for setting it.

@Titus-von-Koeller Titus-von-Koeller merged commit 25abf8d into bitsandbytes-foundation:main May 19, 2024
2 checks passed
@stevhliu stevhliu deleted the fix branch May 20, 2024 16:50