
[WIP] Addition of Dora #936

Closed
wants to merge 13 commits into from

Conversation

Prakyathkantharaju

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?
I have added DoRA for the LoRALinear module. This was requested in issue #893.

Test plan

Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help.)

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
    • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)


pytorch-bot bot commented May 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/936

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 5, 2024
@Prakyathkantharaju Prakyathkantharaju mentioned this pull request May 5, 2024
Contributor

@ebsmothers ebsmothers left a comment

Thanks for opening the PR! I am still not sure about the correctness of the implementation though. Can you run forward on the same input tensor and confirm you get the same results with a known correct implementation (e.g. the one from PEFT referenced on L138 of your lora.py)?

There are some other forward-looking considerations as well: specifically how we expose DoRA in our higher-level model builders (could potentially be similar to what we do for QLoRA), how we will merge weights when DoRA is applied, determining to what extent we want to support enabling and disabling DoRA adapters (this functionality is used in e.g. our DPO recipe). But for now the main thing is to make sure the linear component itself is correct and well-tested.
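For concreteness, a minimal sketch of the kind of parity check being requested, comparing the module's forward pass against a hand-written reference of the DoRA math from the paper (the dora_linear instance, dims, and tolerances are placeholders, and dropout is assumed to be 0; a full check would compare against the PEFT module itself):

```python
import torch

def dora_forward_reference(x, weight, lora_a, lora_b, m, alpha, rank):
    # DoRA forward (dropout disabled): y = (m / ||W + s*B@A||) * (x @ (W + s*B@A)^T)
    scaling = alpha / rank
    combined = weight + scaling * (lora_b @ lora_a)      # (out_dim, in_dim)
    weight_norm = torch.linalg.norm(combined, dim=1)     # (out_dim,)
    return (m / weight_norm).view(1, 1, -1) * (x @ combined.T)

torch.manual_seed(0)
bsz, seq_len, in_dim, out_dim, rank, alpha = 2, 4, 16, 32, 4, 8.0
x = torch.randn(bsz, seq_len, in_dim)

# dora_linear is assumed to be the LoRALinear(..., use_dora=True) instance from
# this PR, with dropout=0.0 and its magnitude vector m already initialized.
expected = dora_forward_reference(
    x, dora_linear.weight, dora_linear.lora_a.weight,
    dora_linear.lora_b.weight, dora_linear.m, alpha, rank,
)
torch.testing.assert_close(dora_linear(x), expected, atol=1e-5, rtol=1e-5)
```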

+ (self.alpha / self.rank)
* (self.lora_a.weight.T @ self.lora_b.weight.T).T,
dim=1,
).to(self.weight.dtype)
Contributor

What about the detach used in the PEFT implementation referenced in your comment?
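For reference, a sketch of the pattern the reviewer is pointing to, following PEFT's DoRA layer where the norm of the combined weight is detached so gradients do not flow through the norm computation (tensor shapes follow the nn.Linear weights in this diff; the function name is illustrative):

```python
import torch
from torch import Tensor

def dora_mag_norm_scale(
    weight: Tensor,   # (out_dim, in_dim) frozen base weight
    lora_a: Tensor,   # (rank, in_dim)
    lora_b: Tensor,   # (out_dim, rank)
    m: Tensor,        # (1, out_dim) learnable magnitude vector
    scaling: float,   # alpha / rank
) -> Tensor:
    combined = weight + scaling * (lora_b @ lora_a)
    # Detach the norm so it is treated as a constant during backprop; gradients
    # then reach m directly rather than flowing through the norm computation.
    weight_norm = torch.linalg.norm(combined, dim=1).detach()
    return (m / weight_norm).view(1, -1)
```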

dim=1,
).to(self.weight.dtype)
mag_norm_scale = (self.m / weight_norm - 1).view(1, -1)
return mag_norm_scale * out + mag_norm_scale * lora_out
Contributor

Sorry, maybe I am still missing the point, but I don't think this actually matches the version that was ultimately added to PEFT. Ref

weight_norm = torch.linalg.norm(
self.weight
+ (self.alpha / self.rank)
* (self.lora_a.weight.T @ self.lora_b.weight.T).T,
Contributor

Do we need all the transposes here? Isn't this just self.lora_b.weight @ self.lora_a.weight?
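A quick check of the identity behind this suggestion (shapes follow the nn.Linear weights above):

```python
import torch

out_dim, in_dim, rank = 32, 16, 8
lora_a = torch.randn(rank, in_dim)   # lora_a.weight: (rank, in_dim)
lora_b = torch.randn(out_dim, rank)  # lora_b.weight: (out_dim, rank)

# (A^T @ B^T)^T == B @ A, so the chained transposes reduce to a single matmul.
torch.testing.assert_close((lora_a.T @ lora_b.T).T, lora_b @ lora_a)
```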

@@ -67,6 +72,7 @@ def __init__(
self.dropout = nn.Dropout(p=dropout)
self.lora_a = nn.Linear(in_features=in_dim, out_features=rank, bias=False)
self.lora_b = nn.Linear(in_features=rank, out_features=out_dim, bias=False)
self.m = nn.Parameter(torch.ones(1, out_dim)) if self.use_dora else None
Contributor

This may necessitate some extra logic for checkpoint loading.
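To illustrate the concern (a hypothetical helper, not code from this PR): a checkpoint saved without DoRA has no "m" key, so a strict load into a DoRA-enabled module fails, and m has to be initialized after the base weights arrive.

```python
import torch
from torch import nn

def load_into_dora_module(module: nn.Module, state_dict: dict) -> None:
    # Allow the DoRA magnitude vectors to be absent from the checkpoint.
    missing, unexpected = module.load_state_dict(state_dict, strict=False)
    assert not unexpected, f"unexpected keys: {unexpected}"
    assert all(key.split(".")[-1] == "m" for key in missing), f"missing keys: {missing}"
    # After loading, (re)initialize each magnitude vector from the loaded
    # weights, e.g. via the dora_init() hook added in this PR.
```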

Member

@rohan-varma rohan-varma left a comment

In addition to @ebsmothers' comments, let's have comprehensive testing to ensure implementation parity with a well-known implementation, such as the one offered by HF PEFT. This will give us the appropriate correctness guarantees.

Thanks so much for working on this!

@@ -97,6 +110,12 @@ def test_forward(self, inputs, lora_linear, out_dim) -> None:
assert actual.shape == (BSZ, SEQ_LEN, out_dim)
torch.testing.assert_close(actual.mean(), expected, atol=1e-4, rtol=1e-6)

def test_dora_forward(self, inputs, dora_linear, out_dim) -> None:
expected = torch.tensor(EXPECTED_VAL)
Member

Does this mean the expected value for DoRA is the same as for LoRA? Why is this? Intuitively, if the results are exactly the same, I don't understand how DoRA leads to different training than LoRA. Pretty sure I'm missing something basic here, but it would be good to clarify.

Author

Here is the explanation:
$$\mathrm{DoRA}(x) = Wx + \left(\frac{m}{\lVert W' \rVert} - 1\right)Wx + \frac{m}{\lVert W' \rVert}\cdot \mathrm{scaling}\cdot \mathrm{lora}_b(\mathrm{lora}_a(x)), \qquad W' = W + \mathrm{scaling}\cdot(\mathrm{lora}_b\,\mathrm{lora}_a)$$

The m vector is initialized to the weight norm (self.m == weight_norm), so the ratio $\frac{m}{\lVert W' \rVert}$ is 1 on the first forward pass,
and therefore LoRA == DoRA for that first pass.

Better explanation from the author: huggingface/peft#1474 (comment)
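A small standalone check of this argument (plain tensors rather than the module, with scaling = alpha / rank):

```python
import torch

torch.manual_seed(0)
in_dim, out_dim, rank, scaling = 16, 32, 4, 2.0
weight = torch.randn(out_dim, in_dim)
lora_a, lora_b = torch.randn(rank, in_dim), torch.randn(out_dim, rank)
x = torch.randn(8, in_dim)

lora_out = x @ weight.T + scaling * (x @ lora_a.T @ lora_b.T)

combined = weight + scaling * (lora_b @ lora_a)
weight_norm = torch.linalg.norm(combined, dim=1)
m = weight_norm.clone().view(1, -1)      # DoRA init: m starts equal to the weight norm
mag_norm_scale = m / weight_norm         # exactly 1.0 everywhere at step 0
dora_out = (
    mag_norm_scale * (x @ weight.T)
    + mag_norm_scale * scaling * (x @ lora_a.T @ lora_b.T)
)

torch.testing.assert_close(dora_out, lora_out)  # DoRA == LoRA on the first pass
```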

torchtune/modules/peft/lora.py
@@ -32,6 +33,8 @@ class LoRALinear(nn.Module, AdapterModule):
rank (int): rank of the low-rank approximation
alpha (float): scaling factor for the low-rank approximation
dropout (float): dropout probability. Default: 0.0
use_dora (bool): whether to use DORA (weight-Decomposed Low-Rank Adaptation).
Default: False
Member

How do we want to expose DoRA? I see us following the pattern of how we enabled QLoRA and just passing in a use_dora flag. Curious about the tradeoffs compared to a DoraLinear layer, though. @ebsmothers any thoughts?

Contributor

Yeah this is a good question. I do like the analogy with QLoRA and using a flag as in this PR matches what's done in PEFT too. But one place the analogy with QLoRA breaks down is that DoRA is actually introducing a new parameter, which QLoRA does not do. This can potentially make stuff like checkpointing a bit trickier. So we may need to think about this a bit more.

@@ -111,6 +118,19 @@ def adapter_params(self) -> List[str]:
adapter_params = ["lora_a.weight", "lora_b.weight"]
return adapter_params

def dora_init(self) -> None:
Author

Hello @rohan-varma and @ebsmothers, I am wondering if there is a more efficient approach to the problem I am facing. I have followed the PEFT (Parameter-Efficient Fine-Tuning) approach, but I am running into an issue initializing self.m (the magnitude vector) in torchtune because the initialization requires the base weights. Currently I am using a flag (self.dora_initialized) to work around this, but I believe this may not be the best solution. I would appreciate any suggestions you may have.
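One possible alternative to the flag, sketched below with illustrative names (DoRAMagnitude and initialize_from are not from this PR): keep m as an empty parameter and fill it once, explicitly, after the checkpoint weights have been loaded (e.g. from the recipe's setup), rather than checking dora_initialized inside forward().

```python
import torch
from torch import nn

class DoRAMagnitude(nn.Module):
    def __init__(self, out_dim: int):
        super().__init__()
        self.m = nn.Parameter(torch.empty(1, out_dim))

    @torch.no_grad()
    def initialize_from(self, weight, lora_a, lora_b, scaling: float) -> None:
        # Called once after the base and LoRA weights are loaded.
        combined = weight + scaling * (lora_b @ lora_a)
        self.m.copy_(torch.linalg.norm(combined, dim=1).view(1, -1))
```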

@Prakyathkantharaju
Author

Thanks for opening the PR! I am still not sure about the correctness of the implementation though. Can you run forward on the same input tensor and confirm you get the same results with a known correct implementation (e.g. the one from PEFT referenced in your L138 of lora.py).

There are some other forward-looking considerations as well: specifically how we expose DoRA in our higher-level model builders (could potentially be similar to what we do for QLoRA), how we will merge weights when DoRA is applied, determining to what extent we want to support enabling and disabling DoRA adapters (this functionality is used in e.g. our DPO recipe). But for now the main thing is to make sure the linear component itself is correct and well-tested.

Hello @ebsmothers, there were some bugs in my initial commit; I have fixed them (the missing detach, too many transposes, and the -1 in the mag_norm_scale calculation). I also added a dora_init function, which initializes the self.m vector, similar to PEFT. However, I think this method might not be the best approach; if you have a better idea for the initialization, please let me know.

@ebsmothers
Contributor

Hi @Prakyathkantharaju, sorry for the delay in responding here. I think beyond just exposing the DoRA logic in LoRALinear, we probably want to think about the overall design, interaction with other parts of the library, and thorough testing (basically some of the points in my comment here and @rohan-varma's comment here). Since it's a fair amount of effort to do all of this, I am gonna tag in @calvinpelletier to help out here. Let me know if this works for you; we'd love to have your collaboration on design and code reviews.

@Prakyathkantharaju
Author

Prakyathkantharaju commented May 25, 2024

Hello, and thank you for your response. I apologize for not updating this issue for a while. I am currently comparing the performance of this DoRA implementation against the PEFT one, which was the missing piece from the requests raised by @ebsmothers and @rohan-varma. As you suggested, I welcome input from @calvinpelletier and anyone else who is willing to help move this forward. Let me know if you have any other requests.

Here are the scripts I am using to generate the PEFT model: https://gist.github.com/Prakyathkantharaju/53777b5997b9fc14ba6f40c9b5788b6a

Here is the comparison between the PEFT Dora and the Torchtune Dora loss: https://api.wandb.ai/links/continuous-optimization/991283uj.

Comment on lines 20 to 25
_component_: torchtune.models.llama3.lora_llama3_8b
lora_attn_modules: ['q_proj', 'v_proj', 'k_proj']
apply_lora_to_mlp: True
apply_lora_to_output: False
lora_rank: 8
lora_alpha: 16
Contributor

This isn't using DoRA?

Author

@Prakyathkantharaju Prakyathkantharaju Jun 2, 2024

I have updated the configs and recipe. There is no longer a DoRA-specific recipe; everything goes through the LoRA recipe, which checks whether use_dora is set and builds the corresponding class.
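For context, the QLoRA-style pattern being mirrored here would look roughly like the following, assuming the use_dora kwarg added in this PR is threaded through the existing LoRA builder:

```python
from functools import partial

from torchtune.models.llama3 import lora_llama3_8b

# Analogous to how QLoRA is exposed via partial(lora_..., quantize_base=True):
# use_dora is forwarded down to the LoRALinear layers built by the LoRA builder,
# so the LoRA recipe and configs can be reused unchanged.
dora_llama3_8b = partial(lora_llama3_8b, use_dora=True)
```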

log = utils.get_logger("DEBUG")


class LoRAFinetuneRecipeSingleDevice(FTRecipeInterface):
Contributor

Why do we need an entirely new recipe for DoRA? In my mind this is analogous to QLoRA: an extension of LoRA that shouldn't require fundamental changes to the training loop. Am I missing something here?

Author

Update: everything is now done through the LoRA recipe.

Comment on lines 190 to 192
Initialize DORA m to ones.
"""
nn.init.zeros_(x)
Contributor

This isn't accurate?

Author

Updated to initialize with ones. These weights are then overwritten later in the dora_init code. I wanted to keep the initialization of the new weights very similar to LoRA, so I followed this format.

def _dora_weight_norm(self) -> Tensor:
if self._quantize_base:
# Convert NF4Tensor to regular Tensor for computation TODO(prakyath): Fix this.
weight = to_regular_tensor(self.weight)
Contributor

This isn't defined?

Author

Updated: the weight norm is now calculated by dequantizing the weights. Please let me know if you have a better/faster way to do this?

@Prakyathkantharaju
Author

Hello everyone,

I apologize for the delayed response, and I appreciate your review of my changes.

I have addressed the comments made by @ebsmothers and updated the structure of how DoRA is initialized. Here are the details (I have kept the initialization as similar to QLoRA as possible):

  1. I added a partial DoRA class, similar to QLoRA, where DoRA is enabled via the use_dora option. You can find the link to the change here.
  2. I updated the LoRA recipe with DoRA initialization. You can find the link to the changes here.
  3. I updated the llama-3 LoRA initialization with the DoRA-specific option. The links to the changed files are here.

Additionally, I would like feedback from @rohan-varma, @kostmo, and @ebsmothers. I also added a clamp to avoid a zero denominator (link to the code here). This is not done in PEFT or other DoRA implementations, so if you feel it is not necessary, I can remove it.

Please let me know if you need any changes. I am willing to work on them. Moreover, if you have any additional feedback, I am happy to incorporate it as well.

Contributor

@calvinpelletier calvinpelletier left a comment

Hi @Prakyathkantharaju, thanks for contributing! I left some comments.

One thing that you are currently missing is updating the merging logic at torchtune/modules/peft/peft_utils.py::get_merged_lora_ckpt.

I'll work on setting up a direct comparison to the huggingface implementation to verify that the loss graph, memory usage, and training speed are similar.
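For reference, the extra math that get_merged_lora_ckpt would need for a DoRA layer looks roughly like this (merge_dora_weight is an illustrative name); unlike plain LoRA, the base weight is required because of the normalization, which is the point the author raises below:

```python
import torch

def merge_dora_weight(
    weight: torch.Tensor,  # (out_dim, in_dim) frozen base weight
    lora_a: torch.Tensor,  # (rank, in_dim)
    lora_b: torch.Tensor,  # (out_dim, rank)
    m: torch.Tensor,       # (1, out_dim) learned magnitude vector
    scaling: float,        # alpha / rank
) -> torch.Tensor:
    # W_merged = m * (W + s*B@A) / ||W + s*B@A||, with the norm per output row.
    combined = weight + scaling * (lora_b @ lora_a)
    weight_norm = torch.linalg.norm(combined, dim=1, keepdim=True)  # (out_dim, 1)
    return (m.view(-1, 1) / weight_norm) * combined
```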

torchtune/_recipe_registry.py (outdated)

dora_llama3_8b.__doc__ = """
Builder for creating a Llama3 model with DORA enabled. Base model weights in linear layers
that DORA is applied to are quantized per the Dora paper: https://arxiv.org/abs/2402.09353.
Contributor

nit: proper capitalization of DoRA in comments and docstrings

Author

I have fixed these; I will check again before the merge to ensure that all the instances of dora/Dora are changed to DoRA.

recipes/configs/llama3/8B_dora_single_device.yaml (outdated)
# Tokenizer
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
path: /teamspace/studios/this_studio/models/Meta-Llama-3-8b-Instruct/original/tokenizer.model
Contributor

no rush on this, but before merging make sure to change the paths to /tmp/, the metric logger to DiskLogger, etc.

Author

I have updated the logging to DiskLogger; I will update this path to /tmp/ before the merge.

Comment on lines 132 to 133
@property
def _dora_weight_norm(self) -> Tensor:
Contributor

this should be a regular function instead of a property IMO

Author

I updated this to be a function.

torchtune/modules/peft/peft_utils.py (outdated)
norm = torch.linalg.norm(result, dim=1)

# Clamp the norm to avoid division by zero
# TODO(Prakyath): Check with torchtune team whether this should be a parameter ?
Contributor

what is this question referring to?

Author

The comment was referring to the clamp line here. I have set the minimum of the clamp to 1e-6. I was thinking this could be a parameter configured in the YAML file. FYI: I added this clamp to ensure that the denominator on this line is not 0; this is not present in PEFT.
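Concretely, the guard under discussion amounts to something like the sketch below (safe_mag_norm_scale and the eps argument are illustrative names; 1e-6 is the floor mentioned above):

```python
import torch

def safe_mag_norm_scale(m: torch.Tensor, combined_weight: torch.Tensor,
                        eps: float = 1e-6) -> torch.Tensor:
    norm = torch.linalg.norm(combined_weight, dim=1)
    # Floor the norm so a degenerate all-zero row cannot cause a divide-by-zero.
    return m / torch.clamp(norm, min=eps)
```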

@Prakyathkantharaju
Author

@calvinpelletier Thank you for reviewing. You are right, I have to update torchtune/modules/peft/peft_utils.py::get_merged_lora_ckpt. However, DoRA merging involves a weight normalization (link), so I need the linear layer weight in addition to the lora_a and lora_b weights. For this I need to change the recipe and write a new merge function. I want to clarify this before I start editing the recipe since, in the previous conversation, there was hesitation about updating the recipe. If you have a different approach in mind, I am open to that as well.
Regards,
PK

@ebsmothers
Contributor

Hi @Prakyathkantharaju, apologies for the delay here. There were a fair number of changes needed to get this enabled across the repo, so we had @calvinpelletier take a stab at it. You can take a look at the changes in #1115; we would love to get any feedback you have on it. Thanks for bearing with us here!

@RdoubleA
Contributor

covered by #1115

@RdoubleA RdoubleA closed this Aug 21, 2024
@RdoubleA RdoubleA mentioned this pull request Aug 21, 2024