
Bug with saving LoRA (adapter_model.bin) on latest peft from git #317

Closed
mcmonkey4eva opened this issue Apr 15, 2023 · 7 comments


mcmonkey4eva commented Apr 15, 2023

Setup

Using get_peft_model with task type CAUSAL_LM and transformers.Trainer(...) to train a LoRA on LLaMA (int8), then lora_model.save_pretrained(lora_file_path) to save the adapter.
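
Roughly, the setup looks like this (a sketch, not the exact webui code; base_model, training_args, train_data, and the LoRA hyperparameters are placeholders):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import Trainer

# base_model: a LLaMA model loaded in int8 (load_in_8bit=True);
# training_args / train_data: ordinary Trainer inputs.
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
)
lora_model = get_peft_model(base_model, config)

trainer = Trainer(model=lora_model, args=training_args, train_dataset=train_data)
trainer.train()

# On the broken build, this writes a ~443-byte adapter_model.bin.
lora_model.save_pretrained(lora_file_path)
```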

The Issue

When saving at the end, adapter_model.bin is an empty pickle (443 bytes, containing a single 6-byte data entry).

This is, of course, wrong; there should be data in there. Prior versions of peft saved complete, valid files with actual content in them.

Checkpoints saved partway through by save_steps in the transformers Trainer do seem to contain full, valid data (though in a different format).
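
A quick way to confirm the symptom (a sketch, assuming the lora_file_path from the setup above):

```python
import os
import torch

path = f"{lora_file_path}/adapter_model.bin"
print(os.path.getsize(path))          # ~443 bytes on the broken build
state = torch.load(path, map_location="cpu")
print(state)                          # essentially empty: no LoRA weight tensors
```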

See oobabooga/text-generation-webui#1098 (comment) for more discussion of the issue externally.

Relevant technical details

pip install git+https://github.com/huggingface/peft exhibits the error (at time of writing, that is commit b21559e), but pip install peft==0.2.0 does not, which suggests the bug stems from a recent change.

Relevant source code replicating this issue: https://github.com/mcmonkey4eva/text-generation-webui/blob/lora-trainer-improvements-3/modules/training.py

Side Note

Likely a separate topic, but users have reported that on peft==0.2.0 they see huge VRAM spikes when save_pretrained is run. I haven't seen this myself yet, but it may indicate a broader need for more thorough validation of the saving code.
EDIT: Yes, it is indeed a separate topic; another user on GitHub reported that the VRAM spike is actually caused by bitsandbytes rather than peft.

(Also, it's a bit strange that pickles are being used at all - a separate topic entirely, but those should definitely be replaced with safetensors files.)
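
For illustration only, a sketch of what a safetensors-based save might look like (save_file wants a flat dict of named tensors; shared or non-contiguous tensors would need handling first):

```python
from safetensors.torch import save_file

# Hypothetical replacement for the pickle-based save. Assumes every value
# in the state dict is a plain torch.Tensor.
tensors = {k: v.contiguous() for k, v in lora_model.state_dict().items()}
save_file(tensors, f"{lora_file_path}/adapter_model.safetensors")
```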

@mcmonkey4eva (Author)

Update: after some testing, torch.save(trainer.model.state_dict(), f"{lora_file_path}/adapter_model.bin") is able to save a valid file.
But lora_model.save_pretrained(lora_file_path, state_dict=trainer.model.state_dict()) is not.

That hopefully both (A) helps narrow down where the issue lies, and (B) provides a functional workaround until it's properly fixed.
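
In full, the workaround looks like this (a sketch, using the trainer and lora_file_path from the setup above):

```python
import torch

# Works on the affected build: dump the state dict directly,
# bypassing save_pretrained's internal handling of the state dict.
torch.save(trainer.model.state_dict(), f"{lora_file_path}/adapter_model.bin")

# Still broken on the affected build: writes a near-empty pickle.
# lora_model.save_pretrained(lora_file_path, state_dict=trainer.model.state_dict())
```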

@pacman100 (Contributor)

Hello @mcmonkey4eva, I am using the latest main branch and running the following example: https://github.com/huggingface/peft/blob/main/examples/int8_training/Finetune_opt_bnb_peft.ipynb

I am unable to reproduce the above issue:
[Screenshot showing a correctly populated adapter_model.bin after saving]

@pacman100 (Contributor)

Could you please share minimal code that we can run to reproduce the above issue?

@pacman100 (Contributor)

As per tloen/alpaca-lora#293, it seems that uninstalling and reinstalling fixed this.

@pacman100 (Contributor)

Please also see #286.

@pacman100 (Contributor)

The above PR should have the fixes. Note that there were no issues with PEFT itself; they were related to alpaca-lora.

@mcmonkey4eva (Author)

That's perfect; that fixed it. Thank you so much for taking the time to investigate and get it fixed for everyone!
