model.save_pretrained() produced a corrupted adapter_model.bin (only 443 B) with alpaca-lora #286
The issue is with these lines of code in alpaca-lora. They mess with the model's state_dict, so the second time it is called from the save_pretrained() method it returns None. As I understand it, you no longer have to touch state_dict outside of the library internals. Try removing them and see if the model is saved normally:

```python
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(
        self, old_state_dict()
    )
).__get__(model, type(model))
```
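For reference, a minimal sketch of the save path once that override is removed (standard peft usage; `base_model` and `output_dir` are placeholders, and the LoraConfig values are illustrative):

```python
from peft import LoraConfig, get_peft_model

# Wrap the base model; recent peft versions build the adapter-only
# state_dict internally, so no state_dict monkey-patching is needed.
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base_model, lora_config)

# ... train ...

# Writes adapter_config.json and adapter_model.bin (LoRA weights only)
model.save_pretrained(output_dir)
```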
I confirmed removing those lines fixes the issue in alpaca-lora. It is probably safe to close this issue, as the cause seems to be in alpaca-lora, not here?
Thanks @s4rduk4r, for suggesting removing the lines related to the state_dict override.
Hi, I commented them out and model.save_pretrained() successfully saved adapter_model.bin. But at each eval, the code saved the complete model (including the frozen part, ~6.58 GB). Before commenting them out, the code only saved the LoRA part. Same issue as this comment: https://github.com/tloen/alpaca-lora/issues/319#issuecomment-1505313341
Hello, the correct way to save the intermediate checkpoints for PEFT when using Trainer would be to use Callbacks. An example is shown here: https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb
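Along the lines of that notebook, here is a sketch of such a callback (the class name and the adapter subdirectory are illustrative, not prescribed by peft):

```python
import os
from transformers import TrainerCallback, TrainingArguments, TrainerState, TrainerControl

class SavePeftModelCallback(TrainerCallback):
    """Save only the PEFT adapter weights at each Trainer checkpoint."""

    def on_save(self, args: TrainingArguments, state: TrainerState,
                control: TrainerControl, **kwargs):
        checkpoint_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        # PeftModel.save_pretrained writes adapter_config.json + adapter_model.bin
        kwargs["model"].save_pretrained(os.path.join(checkpoint_dir, "adapter_model"))
        # Optionally drop the full-model file the Trainer wrote alongside it
        full_model = os.path.join(checkpoint_dir, "pytorch_model.bin")
        if os.path.isfile(full_model):
            os.remove(full_model)
        return control

# Usage: trainer = Trainer(..., callbacks=[SavePeftModelCallback()])
```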
@pacman100 I used the same approach to save peft models, and it works well for the LLaMA-7B model. However, with Bloomz-7b-mt, the adapter_model.bin is corrupted (only 18 KB). The only difference between Bloom and LLaMA is that I observed a WARNING when training Bloom with LoRA, as below:
I used torch==1.13.1 and DeepSpeed ZeRO-3. Not sure if this is the reason. BTW, I modified the built-in
With r=8 and alpha=16, I can save the LoRA weights of LLaMA-7B successfully. However, when increasing r to 32 and alpha to 64, I get an empty adapter_model.bin. This is really weird.
@wxjiao, have you been able to solve it? It looks like a large adapter leads to an empty adapter_model.bin when saved. I ran into this when using LoRA + ZeRO-3 for 30B and 65B; the same code works fine for 7B.
@justinphan3110 No, I just gave up. It took too much time to debug. At the beginning, I thought there was something wrong with
This looks good. Will try soon. Thanks!

> On Tue, Apr 25, 2023, Long Phan wrote: @wxjiao, this may help: https://github.com/lm-sys/FastChat/blob/ceeaaa40adb20790e6b08209250d35eb42cc8451/fastchat/train/train_lora.py#L64
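For the DeepSpeed ZeRO-3 cases in this thread, the linked FastChat helper gathers the partitioned LoRA parameters before building a state dict. A hedged sketch of the same idea (not the exact FastChat code; it assumes ZeRO-3 is active and that LoRA parameter names contain "lora_"):

```python
import deepspeed

def gather_lora_state_dict(model):
    # Under ZeRO-3 each parameter is partitioned across ranks, so reading it
    # directly yields a placeholder tensor and the saved adapter comes out empty.
    lora_params = {n: p for n, p in model.named_parameters() if "lora_" in n}
    # GatheredParameters temporarily materializes the full tensors on each rank
    with deepspeed.zero.GatheredParameters(list(lora_params.values())):
        return {n: p.detach().cpu().clone() for n, p in lora_params.items()}

# Illustrative usage: pass the gathered weights explicitly when saving
# model.save_pretrained(output_dir, state_dict=gather_lora_state_dict(model))
```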
I have just tried this on LLaMA-30B + LoRA but still get an empty saved adapter_model. Let me know if it works for you or if you gain any new insight from it.
Have you been able to figure it out @wxjiao? I can try opening an issue.
I have the same problem (the content of adapter_model.bin is just `[ ]`) when using PeftModel with ZeRO-3 after deepspeed.initialize.
I don't know if this is the RIGHT way, but this simple modification at L275 gives me a functional adapter_model.bin:

```diff
- model.save_pretrained(output_dir)
+ model.save_pretrained(output_dir, state_dict=old_state_dict())
```
Thanks man @s4rduk4r !!! You saved my day.
I recently found that when fine-tuning using alpaca-lora, model.save_pretrained() will save an adapter_model.bin that is only 443 B. This seems to be happening after peft@75808eb2a6e7b4c3ed8aec003b6eeb30a2db1495. Normally adapter_model.bin should be > 16 MB. And when the 443 B adapter_model.bin is loaded, the model behaves as if it were not fine-tuned at all. In contrast, loading other checkpoints from the same training works as expected. I'm not sure if this is an issue with peft or not, or whether it duplicates other issues, but I'm leaving this here for reference.

I've been testing with multiple versions of peft:

- 072da6d9d62: works
- 382b178911edff38c1ff619bbac2ba556bd2276b: works
- 75808eb2a6e7b4c3ed8aec003b6eeb30a2db1495: not working
- 445940fb7b5d38390ffb6707e2a989e89fff03b5: not working
- 1a6151b91fcdcc25326b9807d7dbf54e091d506c: not working
- 1117d4772109a098787ce7fc297cb6cd641de6eb: not working

Steps to reproduce:
```
$ ls -alh lora-alpaca
total 16K
drwxrwxr-x 2 ubuntu ubuntu 4.0K Apr 9 12:55 .
drwxrwxr-x 7 ubuntu ubuntu 4.0K Apr 9 12:54 ..
-rw-rw-r-- 1 ubuntu ubuntu  350 Apr 9 12:55 adapter_config.json
-rw-rw-r-- 1 ubuntu ubuntu  443 Apr 9 12:55 adapter_model.bin
```
(adapter_model.bin should normally be around 16 MB.)

Running on a Lambda Cloud A10 instance.
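As a quick sanity check (my own hypothetical snippet, not part of the original report), the saved file can be inspected to see whether any LoRA tensors were actually written:

```python
import torch

state = torch.load("lora-alpaca/adapter_model.bin", map_location="cpu")
# A healthy LoRA checkpoint is a dict of lora_A / lora_B tensors; the
# corrupted 443 B file typically loads as an empty dict or empty list.
print(type(state), len(state))
if isinstance(state, dict):
    for name, tensor in state.items():
        print(name, tuple(tensor.shape))
```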