Why is the adapter layer merged in gpt-neox-20b_peft? #250
I think the main reason is to save VRAM, and also a LoRA adapter model may be enough for a single task.
Hi @ohashi56225
Thank you for your response! I understand the reason for merging the LoRA layers more clearly now, but two new questions have arisen:
I apologize if my questions are out of line, since I am not familiar with Hugging Face's PEFT and LoRA implementations.
Hi @ohashi56225
In fact, when we "merge" the LoRA layers directly into the base model, it means that we are transferring the knowledge learned by the LoRA layers directly into the base model's weights. Precisely, this also means that once you have merged the LoRA layers, you can push these new "merged" weights to the Hub (or save them locally), load them directly, and benefit from the SFT fine-tuned performance. An example can be found here: https://huggingface.co/edbeeching/gpt-neo-125M-imdb-lora-adapter-merged ; as you can see, there are no adapters left in this repository.
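Concretely, a minimal sketch of such a merge step with PEFT could look like this (the base model, adapter path, and target repo name below are placeholders, not the exact ones from the example):

```python
# Minimal sketch of merging a trained LoRA adapter into the base weights.
# Model, adapter, and repo names are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
# Attach the trained SFT LoRA adapter on top of the frozen base model.
model = PeftModel.from_pretrained(base, "path/to/imdb-lora-adapter")

# Fold the adapter weights into the base weights; the result is a plain
# transformers model with no adapter modules left.
merged = model.merge_and_unload()

# The merged weights can then be pushed (or saved locally) and loaded later
# as a standalone model, as in the linked repository.
merged.push_to_hub("my-user/gpt-neo-125M-imdb-lora-adapter-merged")
```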
It is indeed possible, but the model will converge quite slowly, as the model needs to:

Let me know if anything else is unclear!
Thank you for your answer! However, I realized that my question may have been unclear, leading to an answer that didn't quite address my intended meaning. First of all, is my understanding correct that the current implementation follows the steps shown in the image below? If so, my first question is whether the addition of new LoRA layers in step 3 affects the review generation ability of the merged model. Specifically, since the parameters of these LoRA layers are initialized randomly (which I believe corresponds to line 112 of gpt-neo-20b_sentiment_peft.py), they need to be trained from scratch during RL.
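A rough sketch of step 3 as described above (the merged-checkpoint name comes from the example linked earlier; the LoRA hyperparameters are illustrative, not the script's exact values):

```python
# Rough sketch of step 3: new LoRA layers are injected into the merged SFT
# model; only these freshly initialized adapter weights are trained during PPO.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

merged = AutoModelForCausalLM.from_pretrained("edbeeching/gpt-neo-125M-imdb-lora-adapter-merged")
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

model = get_peft_model(merged, lora_config)
# The merged SFT weights stay frozen; only the new LoRA matrices are trainable.
model.print_trainable_parameters()
```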
@ohashi56225 I had exactly the same questions as you. I also think your understanding of the three-step procedure is correct. However, I think I understand why simply fine-tuning the adapter weights without merging them works less well. It is related to the KL divergence in the PPO training. In fact, in https://github.com/lvwerra/trl/blob/b5cce0d13e95c8e21eb9d57177930253d9092a02/examples/sentiment/scripts/gpt-neox-20b_peft/gpt-neo-20b_sentiment_peft.py#L209, ref_model=None. In this case, and since the model is a PEFT model, the ref_model simply becomes the PEFT model with the adapters removed: https://github.com/lvwerra/trl/blob/b5cce0d13e95c8e21eb9d57177930253d9092a02/trl/trainer/ppo_trainer.py#L559. In our case, if there is no merging step, this corresponds to taking as the ref_model the gpt-neo-125m model without any adapters and thus not adapted to IMDB. With the merging step, this amounts to taking as ref_model the gpt-neo-125M-imdb-lora-adapter-merged model, with better-quality ref_logprobs. A bug fix that would make the merging obsolete would be to explicitly choose the adapted ref_model rather than leaving ref_model=None. That's my current understanding from reading the code, but experimental validation is certainly desirable.
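To illustrate the ref_model=None behaviour described above, here is a minimal sketch (model name and LoRA settings are placeholders, and this is a conceptual illustration rather than the exact TRL code path): for a PEFT policy, the reference log-probs are computed with the adapters disabled, which without the merge step is just the raw base model.

```python
# Sketch of what ref_model=None amounts to for a PEFT policy: the reference
# forward pass uses the same weights with the adapters switched off. If the
# base was never merged with the SFT adapter, that reference has no knowledge
# of the IMDB domain, so the KL penalty anchors the policy to the wrong model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
policy = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
inputs = tokenizer("This movie was", return_tensors="pt")

with torch.no_grad():
    # Policy forward pass: base weights + (trainable) LoRA adapters.
    policy_logits = policy(**inputs).logits
    # Reference forward pass as TRL does it when ref_model=None: adapters off.
    with policy.disable_adapter():
        ref_logits = policy(**inputs).logits
```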
Hey @Sanster @alexrame Actually, when the LoRA weights get merged, the LoRA layers 'disappear', and this is what I meant by saying you can use the merged model as a standalone model!
I think that this is totally valid and this is my empirical understanding of things!
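As a quick sanity check of the point that the LoRA layers disappear after merging, one could verify that the merged checkpoint (the repo linked above) has no adapter parameters left:

```python
# After merging, the checkpoint is a plain transformers model: no parameter
# names contain "lora" anymore.
from transformers import AutoModelForCausalLM

merged = AutoModelForCausalLM.from_pretrained("edbeeching/gpt-neo-125M-imdb-lora-adapter-merged")
print(any("lora" in name for name, _ in merged.named_parameters()))  # expected: False
```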
@younesbelkada @alexrame Therefore, I think it's appropriate to close this issue. Thank you once again!
Thanks for the nice explanations. So should we set ref_model = gpt-neo-125M-imdb-lora-adapter-merged instead of leaving it as None?
If your base model is "gpt-neo-125M-imdb-lora-adapter-merged", then you can either set ref_model = gpt-neo-125M-imdb-lora-adapter-merged or leave it as None; the two are equivalent. That said, the None option is preferred because it is more efficient (you don't need to load the same model twice). In contrast, if you have not applied the merge procedure, the ref model should be passed explicitly.
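A sketch of the two setups for the merged case (the checkpoint name comes from the example linked earlier, the LoRA settings are placeholders, and the exact PPOTrainer/peft_config arguments may differ between TRL versions):

```python
# Two equivalent ways to anchor the KL penalty when the base checkpoint is
# already merged. Option B is preferred since it avoids loading a second copy.
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
from peft import LoraConfig

merged_name = "edbeeching/gpt-neo-125M-imdb-lora-adapter-merged"
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

# Policy: merged SFT weights + fresh trainable LoRA layers.
model = AutoModelForCausalLMWithValueHead.from_pretrained(merged_name, peft_config=lora_config)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

# Option A: pass the merged checkpoint explicitly as the reference model.
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(merged_name)
trainer_a = PPOTrainer(PPOConfig(), model, ref_model, tokenizer)

# Option B (preferred): ref_model=None; since `model` is a PEFT model, TRL
# uses it with the adapters disabled, i.e. exactly the merged base model.
trainer_b = PPOTrainer(PPOConfig(), model, None, tokenizer)
```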
In the gpt-neox-20b_peft example, step 1 is the SFT with LoRA, step 2 is merging the trained LoRA adapters' weights into the base model, and step 3 is loading the merged model for PPO training with LoRA. I am unclear on why the models are merged in step 2.
Is it necessary to merge the adapter layers, or can we simply load both the base model and the LoRA weights in step 3 (gpt-neo-20b_sentiment_peft.py) instead of loading the merged model? The code to load both models in step 3 would look like this:
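A minimal sketch of what such a loading step could look like (the adapter path is a placeholder, and any quantization or device-placement arguments are omitted for brevity):

```python
# Load the base model and apply the trained SFT LoRA adapter on top of it,
# instead of loading the merged checkpoint. The adapter path is a placeholder.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")
model = PeftModel.from_pretrained(base, "path/to/sft-lora-adapter")
```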