[Trainer] Correct behavior of _load_best_model for PEFT models #24103
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thank you @younesbelkada for simplifying trainer usage with PEFT in terms of saving/loading, as this has been the reason for numerous issues 🚀. Left a few comments.
src/transformers/trainer.py
Outdated
@@ -2177,11 +2177,18 @@ def _load_best_model(self):
        logger.info(f"Loading best model from {self.state.best_model_checkpoint} (score: {self.state.best_metric}).")
        best_model_path = os.path.join(self.state.best_model_checkpoint, WEIGHTS_NAME)
        best_safe_model_path = os.path.join(self.state.best_model_checkpoint, SAFE_WEIGHTS_NAME)
        adapter_model_path = os.path.join(self.state.best_model_checkpoint, "adapter_model.bin")
it can also be a safetensors ckpt, right?
Maybe adding `best_safe_adapter_model_path` should serve the purpose?
perfect, will refactor that a bit
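A minimal sketch of the direction discussed here, with a hypothetical helper name (`find_best_adapter_file` is illustrative, not the PR's code): probe the checkpoint folder for both adapter serializations and prefer safetensors when both exist.

```python
import os

# Hypothetical helper, for illustration only: locate the adapter weights in a
# checkpoint folder, preferring the safetensors file when both formats exist.
def find_best_adapter_file(checkpoint_dir: str):
    best_adapter_model_path = os.path.join(checkpoint_dir, "adapter_model.bin")
    best_safe_adapter_model_path = os.path.join(checkpoint_dir, "adapter_model.safetensors")
    if os.path.isfile(best_safe_adapter_model_path):
        return best_safe_adapter_model_path  # safetensors takes priority
    if os.path.isfile(best_adapter_model_path):
        return best_adapter_model_path
    return None  # no adapter checkpoint in this folder
```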
src/transformers/trainer.py
Outdated
        else:
            # We can't do pure 8bit training using transformers.
            logger.warning("Could not load a quantized checkpoint.")
            has_been_loaded = False
should this be removed now?
I think this is needed so that it can be used in the block below for the check, otherwise it will throw an error similar to the one in #24096
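A simplified, hypothetical illustration of the guard being discussed (not the actual trainer code): the flag records whether weights were actually restored, so the follow-up check on the load result does not run against state that was never set, which is the failure mode reported in #24096.

```python
import logging

logger = logging.getLogger(__name__)

# Simplified stand-in for the real logic: `load_result` only exists when a
# load actually happened, so `has_been_loaded` gates the validation step.
def restore_best_checkpoint(can_load_quantized: bool) -> bool:
    load_result = None
    has_been_loaded = True
    if can_load_quantized:
        load_result = {"missing_keys": [], "unexpected_keys": []}  # stand-in result
    else:
        # We can't do pure 8-bit training using transformers.
        logger.warning("Could not load a quantized checkpoint.")
        has_been_loaded = False

    if has_been_loaded and load_result is not None:
        # Only inspect the load result when a load actually happened.
        assert not load_result["missing_keys"], "missing keys after load"
    return has_been_loaded
```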
AH sorry I see what you meant, yes will remove it
proposed something in bf31c5e
src/transformers/trainer.py
Outdated
        best_adapter_model_path = os.path.join(self.state.best_model_checkpoint, "adapter_model.bin")
        best_safe_adapter_model_path = os.path.join(self.state.best_model_checkpoint, "adapter_model.safetensors")
Those two should be constants (like `WEIGHTS_NAME`) as they are now used several times across the file.
makes sense, just added it!
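For reference, a sketch of the constants as named in the commit message below; the values mirror the adapter file names in the diff above (their exact location in the codebase is not shown here).

```python
# Named constants for the adapter checkpoint files, replacing the hardcoded
# strings previously used in trainer.py (values mirror the diff above).
ADAPTER_WEIGHTS_NAME = "adapter_model.bin"
ADAPTER_SAFE_WEIGHTS_NAME = "adapter_model.safetensors"

# Usage in _load_best_model then becomes, e.g.:
# best_adapter_model_path = os.path.join(self.state.best_model_checkpoint, ADAPTER_WEIGHTS_NAME)
```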
…24103)
* v1
* some refactor - add ST format as well
* fix
* add `ADAPTER_WEIGHTS_NAME` & `ADAPTER_SAFE_WEIGHTS_NAME`
What does this PR do?
Fixes #24096
This PR fixes the bugs related to PEFT models and `load_best_model_at_end`. It also refactors the current logic a bit to extend it to all LoRA models, not only 8-bit base models + LoRA.

Repro script
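The original repro script was collapsed in the PR page; below is a minimal sketch of the kind of setup that exercises the bug, assuming placeholder model and dataset choices (`bert-base-uncased`, a small IMDB slice) that are not taken from the PR.

```python
# Hedged repro sketch: train a PEFT-wrapped model with load_best_model_at_end,
# which triggers Trainer._load_best_model at the end of training.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model = get_peft_model(model, LoraConfig(task_type="SEQ_CLS"))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("imdb", split="train[:256]")
dataset = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="steps",
    eval_steps=10,
    save_steps=10,
    max_steps=20,
    load_best_model_at_end=True,  # restoring the best PEFT checkpoint used to fail here
)

trainer = Trainer(model=model, args=args, train_dataset=dataset, eval_dataset=dataset)
trainer.train()
```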
cc @sgugger @pacman100