Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated LoKr implementation with lycoris support #2133

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

yaswanth19
Copy link
Contributor

@yaswanth19 yaswanth19 commented Oct 6, 2024

For context refer this #1935

@BenjaminBossan Here is an early draft PR. Is this how you have envisioned when you meant by a separate implementation of Lokr.

@yaswanth19 yaswanth19 changed the title LOKR latest implementation with lycoris support Updated LoKr implementation with lycoris support Oct 6, 2024
@BenjaminBossan
Copy link
Member

BenjaminBossan commented Oct 7, 2024

Thanks a lot for this PR. I haven't done an in-depth review yet, but from a first skim, this looks good already. A few points:

  1. Let's ensure that we don't add any new dependencies. The dependency should be optional but it can be added as a dev dependency.
  2. About the name, I wonder if we should go with LycorisLoKr rather than LoKrv2. It's more typing, true, but "v2" is just not very informative. If we stick with "v2", I'd go with LoKrV2Config rather than LoKrConfigv2, etc. I'm also open to other suggestions. WDYT?
  3. Did you already have time to test how the two implementations compare to one another?
  4. Please fix the merge conflict.

@yaswanth19
Copy link
Contributor Author

About the name, I wonder if we should go with LycorisLoKr rather than LoKrv2. It's more typing, true, but "v2" is just not very informative. If we stick with "v2", I'd go with LoKrV2Config rather than LoKrConfigv2, etc. I'm also open to other suggestions. WDYT?

@BenjaminBossan I feel we should go with LoKrV2 as this is the latest version and we providing support to use lycoris. If we rename it LycorisLoKr it kind of means we are implementing a API for lycoris even though it is a major functionality.

Did you already have time to test how the two implementations compare to one another?

I am confused on how to test it as the problem is lycoris version has too many flag for weights initialization, and I tried but couldn't find the combination for similar initialization of weight for both PEFT LoKr and lycoris Lokr. So tried training it on MNIST with a simple MLP but then the weights are differing as we are getting difference peformance .

Since I haven't implemented all the features in Lokrv2 so right now I haven't performed the sanity check of comparing results between upstream and LoKrV2.

@BenjaminBossan
Copy link
Member

I feel we should go with LoKrV2 as this is the latest version and we providing support to use lycoris. If we rename it LycorisLoKr it kind of means we are implementing a API for lycoris even though it is a major functionality.

Well, we can decide later on the naming, but I think LycorisLokr would still be a more descriptive name.

I am confused on how to test it as the problem is lycoris version has too many flag for weights initialization, and I tried but couldn't find the combination for similar initialization of weight for both PEFT LoKr and lycoris Lokr. So tried training it on MNIST with a simple MLP but then the weights are differing as we are getting difference peformance .

Okay, probably we won't manage to get 100% of the same results. It would probably still be worth it to have a script that compares the performance between the two and checks if they roughly match. It could be one of the existing examples in PEFT or the example you came up with.

@yaswanth19
Copy link
Contributor Author

@BenjaminBossan Please do a first review of this PR. I have trained a simple MLP on MNIST dataset under same config and attached the loss curve plots below to compare different implementations.
image
image

Few things from my side:

  • I haven't implemented the weight decompose and can be done in a separate PR if majority of community want it.
  • Things done in this PR from the list (Issue 1935) - fixed rank dropout implementation ,full matrix tuning and fixed maths (not multiplying against the vector, but only the scalar) is WiP.
  • Training loss curve is a bit concerning and I need to investigate it.
  • Added LycorisLokr to existing testcases and will be writing some new test cases too.

@BenjaminBossan
Copy link
Member

I have trained a simple MLP on MNIST dataset under same config

  • Training loss curve is a bit concerning and I need to investigate it.

Let's try to figure this out first. Did you make some progress on investigating this? If you could share your script, I can also take a look.

@yaswanth19
Copy link
Contributor Author

yaswanth19 commented Oct 21, 2024

Let's try to figure this out first. Did you make some progress on investigating this? If you could share your script, I can also take a look.

This is the notebook I am using for testing: https://colab.research.google.com/drive/12T9CZvSAPcVPi5usXtkbY5G1pvNgDjH7?usp=sharing

EDIT: @BenjaminBossan The loss curves are looking much better now. Please have a look in the notebook.

Query: In the PEFT LoKr implementation we have init_weights which is used to initialize weights of lokr w1 matrix with zeros. In LYCORIS implementation we are initializing all w1 matrices randomly. The reason me asking this is that when I make the code changes for use_scalar parameter, I am getting the delta as 0 because according to PEFT implementation in this case both w1 and w2 would be 0 matrices.

Copy link
Member

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the progress on this PR. I added some comments but haven't run an in-depth review yet.

fixed rank dropout implementation ,full matrix tuning and fixed maths (not multiplying against the vector, but only the scalar) is WiP.

Could you extend a little bit, just so I'm clear what is changed where? Thanks.

Regarding replication of the results for the different implementations: I could make good progress based on your notebook. Some of the issues where simply caused by not always correctly setting the seed right before initializing the model.

After doing some fixes, I could get PEFT v1 and v2 to return the identical results. There is still a difference to LyCORIS but I think I've found the cause. In PEFT, at the start, we initialize w1 to zeros and w2_a and w2_b randomly. LyCORIS, however, initializes w2_b to zeros and w1 and w2_a randomly. Let's change PEFT v2 to use the same approach and check if this will result in the same outputs.

image

I've uploaded the notebook here.

import torch
import torch.nn as nn
import torch.nn.functional as F
from lycoris.functional import lokr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not assume that users have lycoris installed. Since lokr is not used very often, let's just import it locally where needed, but let's avoid importing it at the module level.

Also, something I wonder: Right now, only lokr.weight_gen is being used. Can we also use lokr.make_kron and lokr.make_kron? Regarding bypass_forward_diff, that could be difficult, but I'm not quite sure how it's used in LyCORIS so I'm not sure.

src/peft/tuners/lokrv2/__init__.py Show resolved Hide resolved
@yaswanth19
Copy link
Contributor Author

Could you extend a little bit, just so I'm clear what is changed where?

@BenjaminBossan
Copy link
Member

Thanks for explaining. So this is really in the PEFT code and not so much resolved because we're using lycoris, right? So we could make the same fixes to the existing LoKr implementation too (of course in a separate PR).

Did you check my notebook about reproducibility? I think we should adjust the initialization of parameters to be in line with lycoris and then hopefully see the same results.

@yaswanth19
Copy link
Contributor Author

yaswanth19 commented Oct 23, 2024

So this is really in the PEFT code and not so much resolved because we're using lycoris, right? So we could make the same fixes to the existing LoKr implementation too (of course in a separate PR).

Yes @BenjaminBossan , This could be done. But now I am a bit hesitant on a separate implementation; We mainly have three functional APIs weight_gen which would be used to initialize weights which can be optimized in lycoris in future , make_kron and diff_weight which will not be changed mostly as they are doing fundamental things. I couldn't understand the bypass_forwad_diff method accurately but I think it is mostly calculating w*delta.

Most of the optimizations/changes are done at module level like weight_decompose, rs_lora, unbalanced_factorization. Even though some of these are minor ones we can't utilize them using the current approach of funcitonal API. WDYT?

In this case wouldn't it be better to rewrite the original LoKr and periodically(year or two) update the component. For us to effectively use use_upstream feature; author needs to have full feature parity between Functional and Module which is not the case currently.

Did you check my notebook about reproducibility? I think we should adjust the initialization of parameters to be in line with lycoris and then hopefully see the same results.

Yes, I have made the required changes and now we have similar initilalization to lycoris but not same random weights 😿 and I am having hard time reproducing same weights both locally and on colab even after setting seed extensicely.

Colab link: https://colab.research.google.com/drive/1YxlaT9G_jotkoTrEMQIQwNX87lehcMHu?usp=sharing

@BenjaminBossan
Copy link
Member

Thanks a lot @yaswanth19 for sharing your thoughts on this. When reviewing the PR, I was a bit astonished how little of the lycoris package we're actually using.

As to the question of whether it's worth it to have this separate LoKr implementation or if we should rather fix the existing one: Depending on lycoris has the advantage that each time there is a fix or improvement there, we automatically benefit from it, which is nice. However, checking the lycoris/functional/lokr.py file, there was no change in the past 4 months (besides the docs), so I'm not sure how it plays out in practice. And as mentioned, we're not using a lot of it in the first place.

The disadvantage in my eyes is that we create more confusion for users, have a higher complexity in our code base, and to avoid breaking backwards compatibility, we would need to maintain the existing LoKr implementation anyway.

If your work on this PR has helped you identify the issues with the existing LoKr implementation, I would be super happy if you could work on fixing those. If you do that, there is even less necessity for the alternative implementation. But it could be that I'm missing some arguments, so if @bghira or @sayakpaul have different opinions, please let me know.

In case we decide to go with fixing the existing implementation instead of adding the new one, I would propose that we elaborate the test notebook, e.g. to be a standalone script. Then we can use it to compare the results from the lycoris implementation to PEFT and ensure that they're close. This could be run daily or weekly on CI. Of course, we would need to extend the script to cover a few more cases, like conv layers and some of the edge cases discussed in the initial issue.

Yes, I have made the required changes and now we have similar initilalization to lycoris but not same random weights 😿 and I am having hard time reproducing same weights both locally and on colab even after setting seed extensicely.

Don't worry too much about that. It would be exceedingly hard to get the exact same outcome using PEFT and lycoris, because we would need to ensure that each call involving randomness is executed exactly the same and in the same order.

@yaswanth19
Copy link
Contributor Author

@bghira @sayakpaul A gentle ping for your thoughts on the above discussion.

@sayakpaul
Copy link
Member

If your work on this PR has helped you identify the issues with the existing LoKr implementation, I would be super happy if you could work on fixing those. If you do that, there is even less necessity for the alternative implementation.

I am not opposed to the idea, but I guess it depends on the amount of work and the effectiveness of the implementation. From what I understand we're already having to touch many lines of code in the existing implementation, so might as well just fix them at this level instead of a separate implementation?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants