Implement DoRA #1474
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@BenjaminBossan Hi Benjamin, may I have your email? I am the author of DoRA and would like to help with integrating DoRA into PEFT.
Thanks @nbasyl. You can contact me at
Might make sense to refactor this in the future to avoid a significant amount of code duplication.
@nbasyl Did you reach out yet? I didn't get any mail so far.
@BenjaminBossan Sorry for the late reply, I was consulting with my manager regarding this. I just sent out the email, please check!
Great that this is being added. Is the PR at a point where I can test it out? It seems like I just add use_dora=True in the LoRA config? (By the way, does LoRA alpha become redundant if use_dora is True?)
@RonanKMcGovern There are still a few smaller kinks to work out, so don't test unless you're ready to test again once the next commits roll in. LoRA alpha should still be relevant when DoRA is being used.
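To make the use_dora / lora_alpha point concrete, here is a minimal, hedged sketch of a DoRA-enabled LoRA config; the model name and target_modules are illustrative placeholders, not settings taken from this thread.

```python
# Minimal sketch: enabling DoRA on top of a standard LoRA config in PEFT.
# Model name and target_modules are illustrative placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-0.5B")

config = LoraConfig(
    r=16,
    lora_alpha=32,          # alpha still scales the low-rank update, even with DoRA
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,          # adds the learnable per-channel magnitude vector
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
```

The resulting peft_model can then be trained with the Hugging Face Trainer as usual.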
Just gave this a quick spin and it seems to do better on ppl than LoRA with the same r. Thanks for the nice work.
Thanks, sounds fantastic. If you can share anything like scripts, that would be great.
Hi @nbasyl, do you think the concept of differential LR from LoRA+ (https://arxiv.org/abs/2402.12354) can be integrated into DoRA? Both papers came out at almost exactly the same time and they don't seem mutually exclusive.
@aliencaocao I skimmed the paper and code and I think there is no technical limitation to combining the two. I don't know if the gains from the two approaches would be additive, though.
Very nice, is there a PR for this for HF yet?
I'm using the Hugging Face Trainer and this LoRA config for Qwen models:
The code is published; there is a function here that you can simply call on your LoRA model and optimizer class. So far, the PEFT repo does not contain any code related to training directly, as we want to keep it agnostic with regard to the training method. If there is big interest though, we could think about adding this function as a helper function for convenience. But that's a different topic than this PR ;)
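For context on the LoRA+ idea referenced here: it amounts to giving the lora_B matrices a larger learning rate than the lora_A matrices. Below is a hedged sketch using plain PyTorch parameter groups; it is not the helper function mentioned above, and the learning rate and ratio values are placeholders.

```python
# Hedged sketch of LoRA+-style differential learning rates via PyTorch
# parameter groups. Not the referenced helper function; values are placeholders.
import torch

def build_loraplus_optimizer(model, base_lr=1e-4, lr_ratio=16.0):
    lora_a, lora_b, other = [], [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "lora_B" in name:
            lora_b.append(param)
        elif "lora_A" in name:
            lora_a.append(param)
        else:
            other.append(param)  # e.g. DoRA magnitude vectors, modules_to_save
    return torch.optim.AdamW([
        {"params": lora_a, "lr": base_lr},
        {"params": lora_b, "lr": base_lr * lr_ratio},  # higher LR for B, per LoRA+
        {"params": other, "lr": base_lr},
    ])
```

Such an optimizer can be handed to the Trainer through its optimizers argument.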
Hi, one way to further reduce the GPU cost (number of trainable parameters) as well as the training time is to finetune only the magnitude component of certain modules while finetuning both the magnitude and directional components for the remaining modules. You can refer to Sec. 5.6 of the paper for more details. This feature could be added in another PR to give users more flexibility.
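As a rough illustration of that suggestion (not code from the paper or from PEFT), selectively training only the magnitude component could look like toggling requires_grad on a DoRA model's parameters. The module substrings are a hypothetical choice, and it assumes PEFT's lora_A / lora_B / lora_magnitude_vector parameter naming and the peft_model from the earlier sketch.

```python
# Rough sketch: train only the DoRA magnitude for selected modules, freezing
# their directional (lora_A / lora_B) parameters; other modules keep both.
# The module substrings below are a hypothetical choice for illustration.
magnitude_only = ("q_proj", "k_proj")

for name, param in peft_model.named_parameters():
    if not param.requires_grad:
        continue
    if any(m in name for m in magnitude_only):
        if "lora_A" in name or "lora_B" in name:
            param.requires_grad = False  # keep only lora_magnitude_vector trainable
```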
So is double the cost to be expected here because there are double the trainable params?
@BenjaminBossan, I have already implemented this feature. Once our code is released (it should be released soon), you can take a look at it and see if you can integrate this functionality in another pull request (PR). Or we can discuss via email to start working on this new PR earlier.
Thanks for the feedback. On my puny 2060 with bloomz-560m and fp16, I got a 15-20% slowdown during training with DoRA enabled, same rank. So there seem to be pretty big differences based on the exact model or settings. @nbasyl do you have some numbers that you could share?
It would make the whole thing more complex. Are the gains big enough to be worth it? I'll definitely take a closer look and check what else we can do to improve runtime performance while keeping the code flexible and maintainable.
BTW, this merge is causing some breaking changes in packages that use PEFT's code, such as Unsloth: unslothai/unsloth#201
I have tried finetuning LLaMA-7B/13B on 3090/4090/V100/A100 and also only got around a 20% slowdown. I am assuming that such a drastic slowdown for @aliencaocao is probably caused by an optimization problem in DeepSpeed, which is used by the LLaVA code base.
The gain here is not an accuracy improvement, but a reduction in trainable parameters. You can refer to this table.
According to my experiment, DoRA is ~2x slower than LoRA, and I am able to achieve a lower loss with DoRA. But it is not quite worth it compared to training with QLoRA on a larger model, because that gives a much lower loss while the GPU memory usage is only slightly larger and it is only 20-30% slower.
@nbasyl Okay, this suggestion sounds good for the purpose of saving on the number of trainable parameters (even if runtime would probably not change much). I'm not sure yet how to configure this so that it would work with most or all model architectures; we'd probably have to add a new config argument for that. If your code is to be released soon, we can wait for that and add it to PEFT afterwards. When it comes to VeRA, would DVoRA help with reducing the overhead of calculating the weight norm (since W0 and BA are fixed) or is it the same cost as DoRA?
@peterjc123 Thanks for providing your settings. I tried to replicate (using Qwen1.5-0.5B for memory reasons) and got ~30% (2060) to ~40% (T4) slower training with DoRA activated.
I'll investigate whether we can make DoRA work with bnb. It could be a bit tricky when it comes to calculating the weight norm; let's see.
I created a PR to support DoRA with bnb (QDoRA): #1518. I tested it on a small use case and it worked. If someone wants to give it a spin and report the results, that would be fantastic.
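For anyone who wants to give that PR a spin, here is a hedged sketch of what a QDoRA test setup might look like; the model name, hyperparameters, and target_modules are placeholders, not taken from the PR.

```python
# Hedged sketch of a QDoRA (4-bit base weights + DoRA) test setup.
# Model name, hyperparameters, and target_modules are illustrative only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloomz-560m", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],  # BLOOM-style attention projection
    use_dora=True,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```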
When launching LLaVA training with a ZeRO-2 config, I get a deadlock; with a ZeRO-3 config, I mysteriously do not (??) get a deadlock.
With bf16=True, your implementation OOMs with batch size 2 and the configuration I posted above. It also throws the following warning.
And then I debugged your code and found that the dtype of weight_norm returned by self._get_weight_norm(weight, lora_weight, scaling) is float32, even though the inputs are bfloat16:

-> weight_norm = self._get_weight_norm(weight, lora_weight, scaling)
(Pdb) s
--Call--
> peft/src/peft/tuners/lora/layer.py(172)_get_weight_norm()
-> def _get_weight_norm(self, weight, lora_weight, scaling) -> torch.Tensor:
(Pdb) n
> peft/src/peft/tuners/lora/layer.py(174)_get_weight_norm()
-> weight = weight + scaling * lora_weight
(Pdb) p weight.dtype
torch.bfloat16
(Pdb) p scaling
0.25
(Pdb) p lora_weight.dtype
torch.bfloat16
(Pdb) n
> peft/src/peft/tuners/lora/layer.py(175)_get_weight_norm()
-> weight_norm = torch.linalg.norm(weight, dim=1)
(Pdb) p weight.dtype
torch.bfloat16
(Pdb) n
> peft/src/peft/tuners/lora/layer.py(176)_get_weight_norm()
-> return weight_norm
(Pdb) p weight_norm.dtype
torch.float32

As you can see, torch.linalg.norm upcasts the result to float32. The fix is:

weight_norm = torch.linalg.norm(weight, dim=1).to(weight.dtype)

After fixing this, it is still super slow, yields about
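Putting the reported fix in context, here is a standalone sketch of the norm computation with the dtype cast applied; the actual PEFT method takes self and may differ in details.

```python
# Standalone sketch of the fix discussed above: keep weight_norm in the same
# dtype as the (possibly bf16) weights, since torch.linalg.norm upcasts.
import torch

def get_weight_norm(weight: torch.Tensor, lora_weight: torch.Tensor, scaling: float) -> torch.Tensor:
    # W = W0 + scaling * BA, then per-output-channel L2 norm
    weight = weight + scaling * lora_weight
    weight_norm = torch.linalg.norm(weight, dim=1).to(weight.dtype)
    return weight_norm
```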
Very strange, typically PEFT has issues with ZeRO3, not ZeRO2. I don't know enough about DS to tell what could be the cause of the deadlock here.
Thanks for investigating, I pushed this change to the PR.
That's unfortunate, but not unexpected. The issue is that we need an additional dequantization step for DoRA, as we have to calculate the weight norm of the quantized weight + LoRA. I couldn't come up with a way to avoid this or somehow cache the results. If you or someone else has an idea, please let me know.
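To illustrate why the extra dequantization step is hard to avoid (a hedged sketch, not the PR's code): the weight norm has to be taken over the dense equivalent of the quantized base weight plus the LoRA update, so the 4-bit weight must be dequantized on the fly. The bitsandbytes dequantization call is real; the surrounding function and argument handling are illustrative.

```python
# Hedged sketch: computing the DoRA weight norm for a 4-bit (bnb) base layer
# requires dequantizing the base weight first. Illustrative, not the PR's code.
import torch
import bitsandbytes as bnb

def qdora_weight_norm(quant_weight, quant_state, lora_A, lora_B, scaling):
    base = bnb.functional.dequantize_4bit(quant_weight, quant_state)  # dense fp16/bf16 view
    lora_weight = lora_B @ lora_A                                     # low-rank update BA
    merged = base + scaling * lora_weight
    return torch.linalg.norm(merged, dim=1).to(merged.dtype)
```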
Add DoRA (Weight-Decomposed Low-Rank Adaptation): https://arxiv.org/abs/2402.09353. To use this with LoRA, add use_dora=True to the LoraConfig. Currently this only supports nn.Linear layers, not other layer types or quantized linear layers like bnb.
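For readers new to the paper, a conceptual sketch of the DoRA reparameterization (following the paper, not the exact PEFT implementation): the merged weight is split into a learnable per-output-channel magnitude and a direction formed by the frozen base weight plus the LoRA update.

```python
# Conceptual sketch of DoRA's weight decomposition in PyTorch weight layout:
# W' = m * (W0 + scaling * B @ A) / ||W0 + scaling * B @ A||, norm per output channel.
# Not the exact PEFT code.
import torch

def dora_merged_weight(base_weight, lora_A, lora_B, magnitude, scaling):
    directed = base_weight + scaling * (lora_B @ lora_A)     # direction component
    norm = torch.linalg.norm(directed, dim=1, keepdim=True)  # per-output-channel norm
    return magnitude.view(-1, 1) * directed / norm           # rescale by learned magnitude
```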
lora_weight = lora_B.weight @ lora_A.weight
magnitude = self.lora_magnitude_vector[active_adapter]
weight = self.get_base_layer().weight
weight_norm = self._get_weight_norm(weight, lora_weight, scaling)
Should this be the following? There are a few other places where we need to do the same, otherwise we get mismatching dimensions.
weight_norm = self._get_weight_norm(transpose(weight, self.fan_in_fan_out), lora_weight, scaling)
Thanks for pointing this out. Indeed, models that use Conv1D, like GPT-2, wouldn't work right now. I created a PR to fix this: #1588.
https://arxiv.org/abs/2402.09353
State: Linear layer supported, not conv or embedding