Add fp32 support for QLoRA #595

Merged: rohan-varma merged 16 commits into main from nf32 on Apr 2, 2024
Conversation

rohan-varma (Member) commented Mar 26, 2024

Context

  • QLoRA is currently coupled to bf16, but some older hardware doesn't support bf16, and we'd ideally like to offer at least one memory-efficient finetuning solution for those architectures. For example, T4s have 16 GB of memory and don't support bf16. This PR enables QLoRA to run compute and checkpointing in fp32 instead of bf16, removing the hard coupling between QLoRA training and bf16.

Changelog

  • Remove bf16 assumptions from nf4
  • Remove bf16 assumptions from LoRALinear and upstream
  • Generalize a few functions to be less coupled to bf16 (see the sketch after this list)
  • Fix tests
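For illustration only, here is a minimal sketch of the kind of change the changelog describes: instead of assuming bf16, a LoRA-style linear takes a dtype argument and computes in whatever dtype its inputs use. The SimpleLoRALinear class and its signature are hypothetical, not torchtune's actual LoRALinear.

```python
import torch
import torch.nn.functional as F
from torch import nn


class SimpleLoRALinear(nn.Module):
    """Hypothetical dtype-agnostic LoRA-style linear (sketch, not torchtune's LoRALinear)."""

    def __init__(self, in_dim: int, out_dim: int, rank: int, dtype: torch.dtype = torch.float32):
        super().__init__()
        # Frozen base weight plus trainable low-rank adapters, all created in the
        # caller-supplied dtype rather than a hard-coded torch.bfloat16.
        self.weight = nn.Parameter(torch.empty(out_dim, in_dim, dtype=dtype), requires_grad=False)
        self.lora_a = nn.Linear(in_dim, rank, bias=False, dtype=dtype)
        self.lora_b = nn.Linear(rank, out_dim, bias=False, dtype=dtype)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute follows the input dtype; fp32 activations stay fp32 end to end.
        return F.linear(x, self.weight) + self.lora_b(self.lora_a(x))
```

With dtype=torch.float32, the adapters and their gradients stay in fp32, which is what lets QLoRA run on hardware without bf16 support.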

NOTE

  • We've temporarily forked LinearNF4 from torchao while the corresponding changes land in ao. We'll revert to ao's implementation as soon as possible, but we need the changes in this forked version to enable fp32 support (i.e. to decouple NF4 compute from bf16). This is a temporary (~days) mitigation.
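To illustrate the NF4 decoupling at a high level, a hedged sketch is below: the dequantized weight follows the activation dtype instead of being cast to bf16. The dequantize argument is a stand-in for whatever routine the forked LinearNF4 actually uses; this is not torchao's or the fork's real code.

```python
import torch
import torch.nn.functional as F


def linear_nf4_any_dtype(x: torch.Tensor, nf4_weight, dequantize) -> torch.Tensor:
    """Sketch of an NF4 linear whose compute dtype follows the activation.

    `nf4_weight` and `dequantize` are hypothetical stand-ins for the quantized
    weight object and dequantization routine used by the forked LinearNF4.
    """
    # Before: the dequantized weight was cast to torch.bfloat16 unconditionally.
    # After: cast to x.dtype, so fp32 inputs get fp32 compute and fp32 gradients.
    w = dequantize(nf4_weight).to(x.dtype)
    return F.linear(x, w)
```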

Test plan

  • Modified unit tests: fp32 computation, checkpointing, and parity are covered; fp32 coverage for QLoRA matches the existing bf16 coverage.
  • Verified manually that gradients are computed in fp32 (by inspecting the .grad field; see the sketch after the PR description).
  • Checkpoints are saved in fp32.
  • Ran the recipe: tune lora_finetune_single_device --config llama2/7B_qlora_single_device dtype=fp32 epochs=1
  • Loss is comparable to QLoRA and LoRA-bf16 (see QLoRA #478 for those curves).
  • Memory: increase over QLoRA bf16 is +20% peak memory allocated, +3% reserved memory:
    Memory Stats:
     GPU peak memory allocation: 6.98 GB
     GPU peak memory reserved: 9.57 GB
     GPU peak memory active: 6.98 GB
  • Eval result:

ghstack-source-id: aa906a002fccbc9e80acfe3c4848febe23d5071f
Pull Request resolved: #590
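The exact commands used for the manual gradient and checkpoint checks aren't shown in the PR; a hedged sketch of what such a check could look like (the helper name and its arguments are hypothetical) is:

```python
import torch


def assert_fp32_training_state(model: torch.nn.Module, checkpoint_path: str) -> None:
    """Hypothetical helper: verify fp32 gradients and fp32 checkpoint tensors."""
    # After a backward pass, every trainable parameter's gradient should be fp32.
    for name, param in model.named_parameters():
        if param.requires_grad and param.grad is not None:
            assert param.grad.dtype == torch.float32, f"{name}.grad is {param.grad.dtype}"

    # The saved checkpoint should contain fp32 floating-point tensors as well.
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    for name, tensor in state_dict.items():
        if isinstance(tensor, torch.Tensor) and tensor.is_floating_point():
            assert tensor.dtype == torch.float32, f"{name} is {tensor.dtype}"
```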

pytorch-bot bot commented Mar 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/595

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8fdb0af with merge base 2940941:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 26, 2024
@rohan-varma rohan-varma marked this pull request as draft March 26, 2024 18:22
@rohan-varma rohan-varma marked this pull request as ready for review March 26, 2024 20:59
@rohan-varma rohan-varma changed the title Nf32 Add fp32 support for QLoRA Mar 26, 2024
Review thread on the state-dict post hook (diff excerpt, old vs. new):

partial(reparametrize_as_bf16_state_dict_post_hook, offload_to_cpu=True)
partial(
    reparametrize_as_dtype_state_dict_post_hook,
    # TODO this is clowny, figure out a better way to get what precision the rest

Contributor commented:
Honestly I don't really see a better way to do this
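For context on the snippet above, a hedged sketch of how the generalized hook might be wired up is below. The import path and the dtype keyword are assumptions; only reparametrize_as_dtype_state_dict_post_hook and offload_to_cpu appear in the diff itself.

```python
from functools import partial

import torch

# Assumption: import location; the actual module path in torchtune may differ.
from torchtune.modules.peft import reparametrize_as_dtype_state_dict_post_hook

# Assumption: the generalized hook accepts a `dtype` kwarg alongside
# `offload_to_cpu`; the exact signature may differ from this sketch.
state_dict_hook = partial(
    reparametrize_as_dtype_state_dict_post_hook,
    offload_to_cpu=True,
    dtype=torch.float32,  # follow the training dtype instead of hard-coding bf16
)

# Hypothetical usage: register the hook on the model so exported state dicts
# come out in the configured precision.
# model._register_state_dict_hook(state_dict_hook)
```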

Review thread on an import change (diff excerpt):

@@ -9,6 +9,8 @@
import torch.nn.functional as F

from torch import nn, Tensor

# from torchtune.modules.low_precision.nf4_linear import _linear_nf4

Contributor commented:
remove

ebsmothers (Contributor) left a review:

Looks great!

@rohan-varma rohan-varma merged commit 2ac4258 into main Apr 2, 2024
20 checks passed
tcapelle pushed a commit to tcapelle/torchtune that referenced this pull request Apr 5, 2024
@joecummings joecummings deleted the nf32 branch April 11, 2024 15:40