model.save_pretrained() produced a corrupted adapter_model.bin (only 443 B) with alpaca-lora #286

zetavg · 2023-04-09T13:16:08Z

I recently found that when fine-tuning using alpaca-lora, model.save_pretrained() will save a adapter_model.bin that is only 443 B.

This seems to be happening after peft@75808eb2a6e7b4c3ed8aec003b6eeb30a2db1495.

Normally adapter_model.bin should be > 16 MB. And while the 443 B adapter_model.bin is loaded, the model behaves like not fine-tuned at all. In contrast, loading other checkpoints from the same training works as expected.

drwxrwxr-x 2 ubuntu ubuntu 4.0K Apr  9 12:55 .
drwxrwxr-x 7 ubuntu ubuntu 4.0K Apr  9 12:54 ..
-rw-rw-r-- 1 ubuntu ubuntu  350 Apr  9 12:55 adapter_config.json
-rw-rw-r-- 1 ubuntu ubuntu  443 Apr  9 12:55 adapter_model.bin
drwxr-xr-x 2 ubuntu ubuntu 4.0K Apr  9 12:06 checkpoint-400
drwxr-xr-x 2 ubuntu ubuntu 4.0K Apr  9 12:06 checkpoint-600
drwxr-xr-x 2 ubuntu ubuntu 4.0K Apr  9 12:07 checkpoint-800

I'm not sure if this is an issue to peft or not, or is this a duplication of other issues, but just leaving this for reference.

I've been testing with multiple versions of peft:

072da6d9d62 works
382b178911edff38c1ff619bbac2ba556bd2276b works
75808eb2a6e7b4c3ed8aec003b6eeb30a2db1495 not working
445940fb7b5d38390ffb6707e2a989e89fff03b5 not working
1a6151b91fcdcc25326b9807d7dbf54e091d506c not working
1117d4772109a098787ce7fc297cb6cd641de6eb not working

Steps to reproduce:

conda create python=3.8 -n test
conda activate test
git clone https://github.com/tloen/alpaca-lora.git
cd alpaca-lora
pip install -r requirements.txt

# to workaround AttributeError: bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats
cd /home/ubuntu/miniconda3/envs/test/lib/python3.8/site-packages/bitsandbytes/
mv libbitsandbytes_cpu.so libbitsandbytes_cpu.so.bak
cp libbitsandbytes_cuda121.so libbitsandbytes_cpu.so
cd -
conda install cudatoolkit

# alpaca_data_cleaned_first_100.json is alpaca_data_cleaned.json with only the first 100 items, setting --val_set_size 0 because there're not enough data to build the test set
python finetune.py --base_model 'decapoda-research/llama-7b-hf' --data_path '/data/datasets/alpaca_data_cleaned_first_100.json' --output_dir './lora-alpaca' --val_set_size 0

$ ls -alh lora-alpaca
total 16K
drwxrwxr-x 2 ubuntu ubuntu 4.0K Apr  9 12:55 .
drwxrwxr-x 7 ubuntu ubuntu 4.0K Apr  9 12:54 ..
-rw-rw-r-- 1 ubuntu ubuntu  350 Apr  9 12:55 adapter_config.json
-rw-rw-r-- 1 ubuntu ubuntu  443 Apr  9 12:55 adapter_model.bin

(adapter_model.bin should normally be around 16 MB)

Running on Lambda Cloud A10 instance.

The text was updated successfully, but these errors were encountered:

s4rduk4r · 2023-04-10T09:46:52Z

The issue is with these lines of code. It messes with the model state_dict, so the second time it's called from the save_pretrained() method it returns None. As I understand it, now one doesn't have to touch them outside of the library internals. Try to remove them and see if the model is saved as normal

    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(
            self, old_state_dict()
        )
    ).__get__(model, type(model))

richardklafter · 2023-04-11T17:47:19Z

I confirmed removing those line fixes the issue in alpaca-lora. It is probably safe to close this issue as cause seems to be in alpaca-lora not here?

zetavg · 2023-04-11T18:35:22Z

Thanks @s4rduk4r, for suggesting removing the lines related to model.state_dict. I haven't confirmed it by myself, but as @richardklafter's confirmation and I found the author of alpaca-lora had also suggested removing those lines of code to fix another issue, I agree that we can close this and move the discussion to why those lines codes are added.

dawnranger · 2023-04-13T12:05:05Z

I confirmed removing those line fixes the issue in alpaca-lora. It is probably safe to close this issue as cause seems to be in alpaca-lora not here?

Hi, I comment them and model.save_pretrained() successfully saved adapter_model.bin. But, in each eval, the code saved the complete model (including the frozen part, e.g.,~6.58G). Before commenting, the code only saved LoRA part.

same issue with this comment [https://github.com/tloen/alpaca-lora/issues/319#issuecomment-1505313341]

see huggingface/peft#286 (comment)

pacman100 · 2023-04-18T07:46:04Z

Hi, I comment them and model.save_pretrained() successfully saved adapter_model.bin. But, in each eval, the code saved the complete model (including the frozen part, e.g.,~6.58G). Before commenting, the code only saved LoRA part.

same issue with this comment [https://github.com/https://github.com/tloen/alpaca-lora/issues/319#issuecomment-1505313341]

Hello, the correct way to save the intermediate checkpoints for PEFT when using Trainer would be to use Callbacks. An example is shown here: https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb

from transformers import Seq2SeqTrainer, TrainerCallback, TrainingArguments, TrainerState, TrainerControl
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR


class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control


trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    # compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

wxjiao · 2023-04-18T11:59:47Z

@pacman100 Used the same way to save peft models which works well for LLaMA-7b model. However, when it comes to Bloomz-7b-mt, the adapter_model.bin is corrupted (only 18k). The only difference between Bloom and LLaMA is that I observed a WARNING when training Bloom with LoRA, as below:

2023-04-18 18:38:26.377 /usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py:1432: 
UserWarning: Positional args are being deprecated, use kwargs instead. 
Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.

I used torch==1.13.1 and DeepSpeed ZeRO3. Not sure if this is the reason.

BTW, I modified the built-in run_clm.py to support LoRA rather than using alpaca-lora. So it should not be the consequence of the following lines:

    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(
            self, old_state_dict()
        )
    ).__get__(model, type(model))

wxjiao · 2023-04-20T13:17:33Z

@pacman100 Used the same way to save peft models which works well for LLaMA-7b model. However, when it comes to Bloomz-7b-mt, the adapter_model.bin is corrupted (only 18k). The only difference between Bloom and LLaMA is that I observed a WARNING when training Bloom with LoRA, as below:
2023-04-18 18:38:26.377 /usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py:1432: 
UserWarning: Positional args are being deprecated, use kwargs instead. 
Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
I used torch==1.13.1 and DeepSpeed ZeRO3. Not sure if this is the reason.

BTW, I modified the built-in run_clm.py to support LoRA rather than using alpaca-lora. So it should not be the consequence of the following lines:
    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(
            self, old_state_dict()
        )
    ).__get__(model, type(model))

With r=8 and alpha=16, I can save the LoRA weights of LLaMA-7b successfully. However, when increasing r to 32 and alpha to 64, we obtained the empty adapter_model.bin. This is really weird.

justinphan3110 · 2023-04-23T20:41:04Z

@wxjiao, have you able to solved it? looks like large adapter lead to empty adapter_model.bin when saved. I got into this when using LoRA+Zero3 for 30B, 65B , same code works fine for 7B.

wxjiao · 2023-04-24T01:18:11Z

@justinphan3110 No, just gave it up. It took me too much time to debug. At the beginning, I thought there should be something wrong with get_peft_model_state_dict, but it cannot explain the success of saving llama-7b lora. I printed the first elements in state_dict for both base model and lora in my training script, and found the keys were there but missing the values (i.e., only [ ]). I guess there is some incompatibility between PEFT and Zero3. I'll just wait.

wxjiao · 2023-04-25T01:04:53Z

This looks good. Will try soon. Thanks!

…

On Tue, Apr 25, 2023 at 3:28 AM Long Phan ***@***.***> wrote: @wxjiao <https://github.com/wxjiao>, this may helps: https://github.com/lm-sys/FastChat/blob/ceeaaa40adb20790e6b08209250d35eb42cc8451/fastchat/train/train_lora.py#L64 — Reply to this email directly, view it on GitHub <#286 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AHMYL7IM3GDTAIYZGWGGVXLXC3H5FANCNFSM6AAAAAAWYDRDDA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

justinphan3110 · 2023-04-25T01:07:08Z

I have just tried on a Llama30B+LoRA but still have an empty saved adapter_model model. Lmk if it works for you or any new insight you have from it.

justinphan3110 · 2023-04-30T19:30:46Z

have you able to figured it out @wxjiao w? I can try open an issue

microbenh · 2023-05-12T05:55:26Z

I have the same problem(the value of adpater_model.bin is [ ] ) when use peftModel with zero3 after deepspeed.initialize.

Maxwell-Lyu · 2023-05-18T06:33:26Z

I dont know if this is the RIGHT way, but this simple modification at L275 gives me a fucntional adapter_model.bin with the correct size:

- model.save_pretrained(output_dir)
+ model.save_pretrained(output_dir, state_dict=old_state_dict())

adeepak7 · 2023-11-18T19:29:06Z

The issue is with these lines of code. It messes with the model state_dict, so the second time it's called from the save_pretrained() method it returns None. As I understand it, now one doesn't have to touch them outside of the library internals. Try to remove them and see if the model is saved as normal
    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(
            self, old_state_dict()
        )
    ).__get__(model, type(model))

Thanks man @s4rduk4r !!! You saved my day.

see huggingface/peft#286 (comment)

s1530129650 mentioned this issue Apr 11, 2023

Why add the following code snippet since it would bring some errors? tloen/alpaca-lora#319

Open

zetavg closed this as completed Apr 11, 2023

winglian added a commit to axolotl-ai-cloud/axolotl that referenced this issue Apr 15, 2023

fix issue with completed model being empty

902dd0a

see huggingface/peft#286 (comment)

pacman100 mentioned this issue Apr 18, 2023

Bug with saving LoRA (adapter_model.bin) on latest peft from git #317

Closed

pacman100 mentioned this issue Apr 18, 2023

fix issues to be compatible with latest peft tloen/alpaca-lora#359

Closed

SiriusWy mentioned this issue May 8, 2023

为什么我训练的lora的adapter_model.bin最后只有443字节 PhoebusSi/Alpaca-CoT#111

Closed

NanoCode012 mentioned this issue May 8, 2023

Checkpoints are the full base_model and not just the lora model #353

Closed

dkqkxx mentioned this issue May 21, 2023

fix: load_best_model_at_end error when load_in_8bit is True huggingface/transformers#23443

Merged

5 tasks

abdoelsayed2016 mentioned this issue May 21, 2023

After fine-tuning, the model repeat the answer and not stop tloen/alpaca-lora#467

Open

huangxd- mentioned this issue May 22, 2023

7B的模型在运行run_pt.py之后生成的pytorch_model.bin有13G ymcui/Chinese-LLaMA-Alpaca#379

Closed

ma1112 mentioned this issue Jun 20, 2023

Issues with 4-bit LORA training h2oai/h2ogpt#307

Closed

irajmoradi mentioned this issue Jun 21, 2023

Think I found issue mikeizbicki/modulus-magnus-linguae#40

Open

nepeee mentioned this issue Jul 16, 2023

Checkpoint saving broken with the latest version of huggingface johnsmith0031/alpaca_lora_4bit#135

Closed

Running123 mentioned this issue Jul 22, 2023

lora训练保存的adapter_model.bin很小，只有443字节 ymcui/Chinese-LLaMA-Alpaca#711

Closed

5 tasks

Zhang-Each mentioned this issue Jan 10, 2024

微调后的结果无法达到预期 zjukg/KoPA#34

Closed

asbiaidw5 added a commit to asbiaidw5/axolotl that referenced this issue Aug 6, 2024

fix issue with completed model being empty

00e11f9

see huggingface/peft#286 (comment)

udukisile9k added a commit to udukisile9k/axolotl that referenced this issue Aug 29, 2024

fix issue with completed model being empty

35c5048

see huggingface/peft#286 (comment)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

model.save_pretrained() produced a corrupted adapter_model.bin (only 443 B) with alpaca-lora #286

model.save_pretrained() produced a corrupted adapter_model.bin (only 443 B) with alpaca-lora #286

zetavg commented Apr 9, 2023 •

edited

Loading

s4rduk4r commented Apr 10, 2023

richardklafter commented Apr 11, 2023

zetavg commented Apr 11, 2023

dawnranger commented Apr 13, 2023

pacman100 commented Apr 18, 2023

wxjiao commented Apr 18, 2023 •

edited

Loading

wxjiao commented Apr 20, 2023

justinphan3110 commented Apr 23, 2023 •

edited

Loading

wxjiao commented Apr 24, 2023

wxjiao commented Apr 25, 2023 via email

justinphan3110 commented Apr 25, 2023 •

edited

Loading

justinphan3110 commented Apr 30, 2023

microbenh commented May 12, 2023 •

edited

Loading

Maxwell-Lyu commented May 18, 2023 •

edited

Loading

adeepak7 commented Nov 18, 2023

model.save_pretrained() produced a corrupted adapter_model.bin (only 443 B) with alpaca-lora #286

model.save_pretrained() produced a corrupted adapter_model.bin (only 443 B) with alpaca-lora #286

Comments

zetavg commented Apr 9, 2023 • edited Loading

s4rduk4r commented Apr 10, 2023

richardklafter commented Apr 11, 2023

zetavg commented Apr 11, 2023

dawnranger commented Apr 13, 2023

pacman100 commented Apr 18, 2023

wxjiao commented Apr 18, 2023 • edited Loading

wxjiao commented Apr 20, 2023

justinphan3110 commented Apr 23, 2023 • edited Loading

wxjiao commented Apr 24, 2023

wxjiao commented Apr 25, 2023 via email

justinphan3110 commented Apr 25, 2023 • edited Loading

justinphan3110 commented Apr 30, 2023

microbenh commented May 12, 2023 • edited Loading

Maxwell-Lyu commented May 18, 2023 • edited Loading

adeepak7 commented Nov 18, 2023

zetavg commented Apr 9, 2023 •

edited

Loading

wxjiao commented Apr 18, 2023 •

edited

Loading

justinphan3110 commented Apr 23, 2023 •

edited

Loading

justinphan3110 commented Apr 25, 2023 •

edited

Loading

microbenh commented May 12, 2023 •

edited

Loading

Maxwell-Lyu commented May 18, 2023 •

edited

Loading