model.save_pretrained() produced a corrupted adapter_model.bin (only 443 B) with alpaca-lora #286

Closed
zetavg opened this issue Apr 9, 2023 · 15 comments


@zetavg

zetavg commented Apr 9, 2023

I recently found that when fine-tuning with alpaca-lora, model.save_pretrained() saves an adapter_model.bin that is only 443 B.

This seems to be happening after peft@75808eb2a6e7b4c3ed8aec003b6eeb30a2db1495.

Normally adapter_model.bin should be > 16 MB. When the 443 B adapter_model.bin is loaded, the model behaves as if it had not been fine-tuned at all. In contrast, loading other checkpoints from the same training run works as expected.

drwxrwxr-x 2 ubuntu ubuntu 4.0K Apr  9 12:55 .
drwxrwxr-x 7 ubuntu ubuntu 4.0K Apr  9 12:54 ..
-rw-rw-r-- 1 ubuntu ubuntu  350 Apr  9 12:55 adapter_config.json
-rw-rw-r-- 1 ubuntu ubuntu  443 Apr  9 12:55 adapter_model.bin
drwxr-xr-x 2 ubuntu ubuntu 4.0K Apr  9 12:06 checkpoint-400
drwxr-xr-x 2 ubuntu ubuntu 4.0K Apr  9 12:06 checkpoint-600
drwxr-xr-x 2 ubuntu ubuntu 4.0K Apr  9 12:07 checkpoint-800

I'm not sure whether this is an issue in peft, or whether it duplicates another issue, but I'm leaving this here for reference.

I've been testing with multiple versions of peft:

  • 072da6d9d62 works
  • 382b178911edff38c1ff619bbac2ba556bd2276b works
  • 75808eb2a6e7b4c3ed8aec003b6eeb30a2db1495 not working
  • 445940fb7b5d38390ffb6707e2a989e89fff03b5 not working
  • 1a6151b91fcdcc25326b9807d7dbf54e091d506c not working
  • 1117d4772109a098787ce7fc297cb6cd641de6eb not working

Steps to reproduce:

conda create python=3.8 -n test
conda activate test
git clone https://github.com/tloen/alpaca-lora.git
cd alpaca-lora
pip install -r requirements.txt

# work around AttributeError: bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats
cd /home/ubuntu/miniconda3/envs/test/lib/python3.8/site-packages/bitsandbytes/
mv libbitsandbytes_cpu.so libbitsandbytes_cpu.so.bak
cp libbitsandbytes_cuda121.so libbitsandbytes_cpu.so
cd -
conda install cudatoolkit

# alpaca_data_cleaned_first_100.json is alpaca_data_cleaned.json with only the first 100 items; --val_set_size is set to 0 because there is not enough data to build a test set
python finetune.py --base_model 'decapoda-research/llama-7b-hf' --data_path '/data/datasets/alpaca_data_cleaned_first_100.json' --output_dir './lora-alpaca' --val_set_size 0
$ ls -alh lora-alpaca
total 16K
drwxrwxr-x 2 ubuntu ubuntu 4.0K Apr  9 12:55 .
drwxrwxr-x 7 ubuntu ubuntu 4.0K Apr  9 12:54 ..
-rw-rw-r-- 1 ubuntu ubuntu  350 Apr  9 12:55 adapter_config.json
-rw-rw-r-- 1 ubuntu ubuntu  443 Apr  9 12:55 adapter_model.bin

(adapter_model.bin should normally be around 16 MB)
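
A quick size check after training can catch this before the adapter is loaded for inference. The following is only a minimal sketch: the helper name `adapter_looks_truncated` and the 1 MB threshold are assumptions made for this example (the report above only says that a healthy file is around 16 MB and the broken one is 443 B), and nothing here is part of peft.

```python
import os


def adapter_looks_truncated(output_dir, filename="adapter_model.bin", min_bytes=1_000_000):
    """Return True if the saved adapter file is missing or smaller than min_bytes.

    min_bytes is an illustrative threshold chosen for this sketch,
    not a value defined by peft or alpaca-lora.
    """
    path = os.path.join(output_dir, filename)
    if not os.path.isfile(path):
        return True
    return os.path.getsize(path) < min_bytes
```

For the run above, `adapter_looks_truncated("./lora-alpaca")` would flag the 443 B file.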

Running on Lambda Cloud A10 instance.

@s4rduk4r
Contributor

The issue is with these lines of code. They replace the model's state_dict method, so the second time it is called, from the save_pretrained() method, it returns None. As I understand it, there is no longer any need to touch state_dict outside of the library internals. Try removing them and see if the model is saved normally.

    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(
            self, old_state_dict()
        )
    ).__get__(model, type(model))
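
One way such a patch could end up producing an empty adapter file can be sketched without torch or peft. Everything below is a toy: `ToyModel` and `toy_adapter_filter` are hypothetical stand-ins, and the assumption that the filter keeps only keys containing the adapter name and then strips that name is a simplification for illustration, not a copy of the library code. The toy yields an empty dict rather than None, so it may not match the exact failure described above.

```python
def toy_adapter_filter(state_dict, adapter_name="default"):
    """Hypothetical filter: keep LoRA keys for one adapter, then strip the adapter name."""
    kept = {k: v for k, v in state_dict.items() if "lora_" in k and adapter_name in k}
    return {k.replace(f".{adapter_name}", ""): v for k, v in kept.items()}


class ToyModel:
    def state_dict(self):
        # One frozen base weight and two LoRA weights for the "default" adapter.
        return {
            "layer.weight": [1.0],
            "layer.lora_A.default.weight": [0.1],
            "layer.lora_B.default.weight": [0.2],
        }


model = ToyModel()

# Filtering the untouched state dict once keeps both LoRA entries.
filtered_once = toy_adapter_filter(model.state_dict())

# Same patching pattern as in the snippet above, using the toy filter.
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: toy_adapter_filter(old_state_dict())
).__get__(model, type(model))

# A save routine that filters model.state_dict() again now filters twice:
# the adapter name is already stripped, so no key matches any more.
filtered_twice = toy_adapter_filter(model.state_dict())

print(len(filtered_once), len(filtered_twice))
```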

@richardklafter

I confirmed that removing those lines fixes the issue in alpaca-lora. It is probably safe to close this issue, since the cause seems to be in alpaca-lora, not here?

@zetavg
Author

zetavg commented Apr 11, 2023

Thanks, @s4rduk4r, for suggesting removing the lines related to model.state_dict. I haven't confirmed it myself, but given @richardklafter's confirmation, and since I found that the author of alpaca-lora had also suggested removing those lines to fix another issue, I agree that we can close this and move the discussion to why those lines of code were added.

@zetavg zetavg closed this as completed Apr 11, 2023
@dawnranger

> I confirmed that removing those lines fixes the issue in alpaca-lora. It is probably safe to close this issue, since the cause seems to be in alpaca-lora, not here?

Hi, I commented them out and model.save_pretrained() successfully saved adapter_model.bin. However, at each eval the code now saves the complete model (including the frozen part, about 6.58 GB). Before I commented them out, the code saved only the LoRA part.

This is the same issue as in this comment: https://github.com/tloen/alpaca-lora/issues/319#issuecomment-1505313341

@pacman100
Contributor

Hello, the correct way to save the intermediate checkpoints for PEFT when using Trainer would be to use Callbacks. An example is shown here: https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb

import os

from transformers import Seq2SeqTrainer, TrainerCallback, TrainingArguments, TrainerState, TrainerControl
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR


class SavePeftModelCallback(TrainerCallback):
    """Save only the PEFT adapter in each checkpoint folder."""

    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        # Folder the Trainer uses for this step: <output_dir>/checkpoint-<global_step>
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        # Write the adapter weights and config into an "adapter_model" subfolder
        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        # Remove the full-model weights written by the Trainer, if present
        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control


trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    # compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
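
The file handling in on_save can also be tried in isolation, without a Trainer or a model. The sketch below is an illustration only: `simulate_on_save` and `fake_save` are hypothetical names, `fake_save` stands in for `model.save_pretrained`, and the literal "checkpoint" is the value of PREFIX_CHECKPOINT_DIR in transformers.

```python
import os
import tempfile

CHECKPOINT_PREFIX = "checkpoint"  # same value as transformers' PREFIX_CHECKPOINT_DIR


def simulate_on_save(output_dir, global_step, save_adapter):
    """Mimic the callback's file operations for one checkpoint (hypothetical helper)."""
    checkpoint_folder = os.path.join(output_dir, f"{CHECKPOINT_PREFIX}-{global_step}")

    # The adapter goes into an "adapter_model" subfolder of the checkpoint.
    peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
    os.makedirs(peft_model_path, exist_ok=True)
    save_adapter(peft_model_path)

    # Drop the full-model weights if something already wrote them.
    full_weights = os.path.join(checkpoint_folder, "pytorch_model.bin")
    if os.path.exists(full_weights):
        os.remove(full_weights)
    return checkpoint_folder


def fake_save(path):
    # Stand-in for model.save_pretrained(path); writes a tiny placeholder file.
    with open(os.path.join(path, "adapter_model.bin"), "wb") as f:
        f.write(b"adapter")


output_dir = tempfile.mkdtemp()
step_folder = os.path.join(output_dir, "checkpoint-200")
os.makedirs(step_folder)
# Pretend the Trainer already wrote the full model for step 200.
with open(os.path.join(step_folder, "pytorch_model.bin"), "wb") as f:
    f.write(b"full model")

result_folder = simulate_on_save(output_dir, 200, fake_save)
print(sorted(os.listdir(result_folder)))
```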

@wxjiao

wxjiao commented Apr 18, 2023

@pacman100 I used the same approach to save PEFT models, and it works well for the LLaMA-7b model. However, with Bloomz-7b-mt, the adapter_model.bin is corrupted (only 18k). The only difference I observed between Bloom and LLaMA is a warning when training Bloom with LoRA, shown below:

2023-04-18 18:38:26.377 /usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py:1432: 
UserWarning: Positional args are being deprecated, use kwargs instead. 
Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.

I used torch==1.13.1 and DeepSpeed ZeRO3. Not sure if this is the reason.

BTW, I modified the built-in run_clm.py to support LoRA rather than using alpaca-lora. So it should not be the consequence of the following lines:

    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(
            self, old_state_dict()
        )
    ).__get__(model, type(model))

@wxjiao

wxjiao commented Apr 20, 2023

With r=8 and alpha=16, I can save the LoRA weights of LLaMA-7b successfully. However, after increasing r to 32 and alpha to 64, I got an empty adapter_model.bin. This is really weird.

@justinphan3110

justinphan3110 commented Apr 23, 2023

@wxjiao, have you been able to solve it? It looks like a large adapter leads to an empty adapter_model.bin when saved. I ran into this when using LoRA + ZeRO-3 for 30B and 65B; the same code works fine for 7B.

@wxjiao

wxjiao commented Apr 24, 2023

@justinphan3110 No, I gave up on it; debugging was taking too much time. At first I thought something was wrong with get_peft_model_state_dict, but that cannot explain why saving the LLaMA-7b LoRA succeeds. I printed the first elements of state_dict for both the base model and the LoRA weights in my training script and found that the keys were there but the values were missing (i.e., only [ ]). I guess there is some incompatibility between PEFT and ZeRO-3. I'll just wait.
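
The "keys present, values empty" symptom can be checked for before anything is written to disk. ZeRO stage 3 partitions parameters across processes, which may be why a state dict read on one rank shows empty values, though that is not confirmed in this thread. The helper below is a minimal sketch with a hypothetical name, `find_empty_entries`; it relies only on `numel()` (available on torch tensors) or `len()`, so it also runs on plain lists.

```python
def find_empty_entries(state_dict):
    """Return the keys whose values hold zero elements (hypothetical helper)."""
    empty = []
    for key, value in state_dict.items():
        numel = getattr(value, "numel", None)
        # torch tensors expose numel(); fall back to len() for plain sequences.
        count = numel() if callable(numel) else len(value)
        if count == 0:
            empty.append(key)
    return empty


# Toy state dict using lists in place of tensors.
toy_state = {
    "layer.lora_A.weight": [],          # looks like the reported symptom
    "layer.lora_B.weight": [0.1, 0.2],  # has real values
}
print(find_empty_entries(toy_state))
```

If the returned list is non-empty right before `save_pretrained()`, the saved file will likely contain those empty values as well.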

@wxjiao

wxjiao commented Apr 25, 2023 via email

@justinphan3110

justinphan3110 commented Apr 25, 2023

I have just tried with Llama30B+LoRA but still get an empty saved adapter model. Let me know if it works for you or if you have any new insight.

@justinphan3110

Have you been able to figure it out, @wxjiao? I can try opening an issue.

@microbenh

microbenh commented May 12, 2023

I have the same problem (the value in adapter_model.bin is [ ]) when using PeftModel with ZeRO-3 after deepspeed.initialize.

@Maxwell-Lyu

Maxwell-Lyu commented May 18, 2023

I don't know if this is the RIGHT way, but this simple modification at L275 gives me a functional adapter_model.bin with the correct size:

- model.save_pretrained(output_dir)
+ model.save_pretrained(output_dir, state_dict=old_state_dict())
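
Why passing the original state dict might help can be illustrated with a toy that needs neither torch nor peft. `ToyModel`, its `save_pretrained`, and `toy_adapter_filter` are hypothetical stand-ins. The sketch assumes that saving falls back to `model.state_dict()` only when no `state_dict` argument is given, and that the filter drops keys lacking the adapter name; both are simplifications, not the library's actual code.

```python
def toy_adapter_filter(state_dict, adapter_name="default"):
    """Hypothetical filter: keep LoRA keys for one adapter, then strip the adapter name."""
    kept = {k: v for k, v in state_dict.items() if "lora_" in k and adapter_name in k}
    return {k.replace(f".{adapter_name}", ""): v for k, v in kept.items()}


class ToyModel:
    def state_dict(self):
        return {
            "layer.weight": [1.0],
            "layer.lora_A.default.weight": [0.1],
            "layer.lora_B.default.weight": [0.2],
        }

    def save_pretrained(self, state_dict=None):
        # Toy version: returns what would be written instead of touching disk.
        if state_dict is None:
            state_dict = self.state_dict()  # picks up the patched method, if any
        return toy_adapter_filter(state_dict)


model = ToyModel()

# Patch state_dict in the same style as the training script.
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: toy_adapter_filter(old_state_dict())
).__get__(model, type(model))

# Without an explicit state_dict the weights are filtered twice and vanish.
broken = model.save_pretrained()
# Passing the unpatched state dict means the filter runs only once.
fixed = model.save_pretrained(state_dict=old_state_dict())

print(len(broken), len(fixed))
```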

@adeepak7

Thanks man @s4rduk4r !!! You saved my day.

asbiaidw5 added a commit to asbiaidw5/axolotl that referenced this issue Aug 6, 2024
udukisile9k added a commit to udukisile9k/axolotl that referenced this issue Aug 29, 2024