Training Resumes with Increased Loss Despite Checkpoint Loading #33336
Hello @waldnebel, sorry, I couldn't reproduce the issue. When I train the model for the first time, I get the following output on the TinyStories dataset:
When I load from the checkpoint, the loss starts from 0.17:
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I recently ran into this myself, and the awesome @nbroad1881 helped me figure out what was wrong. In my case, I was calling:
But, I should have said:
tl;dr: the right way to resume from a checkpoint is to load the base model weights (not the model state from the checkpoint) and instead pass the checkpoint path to trainer.train(resume_from_checkpoint=...).
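For readers hitting the same thing, here is a minimal sketch of the pattern described above. The model name, checkpoint path, `training_args`, and `train_dataset` are placeholders, not the original poster's code:

```python
from transformers import BertForMaskedLM, Trainer

# What NOT to do: load the checkpoint weights as if they were a fresh model
# and start a brand-new training run from them.
# model = BertForMaskedLM.from_pretrained("output/checkpoint-440")
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()

# What to do instead: load the *base* model weights and let the Trainer restore
# the model, optimizer, scheduler, and TrainerState from the checkpoint folder.
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
trainer = Trainer(
    model=model,
    args=training_args,          # your TrainingArguments, defined elsewhere
    train_dataset=train_dataset, # your tokenized dataset, prepared elsewhere
)
trainer.train(resume_from_checkpoint="output/checkpoint-440")
# resume_from_checkpoint=True also works and picks the latest checkpoint in output_dir.
```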
Problem: When resuming the training of a BERT model with the Hugging Face Trainer from a checkpoint, the loss value increases again in the second run, even though the checkpoint is loaded correctly and the global_step, optimizer state, and scheduler state are restored.
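The original training script is not reproduced in this excerpt; the troubleshooting steps below assume a minimal setup along these lines (model name, hyperparameters, and the `train_dataset` variable are illustrative placeholders, not the actual script):

```python
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

training_args = TrainingArguments(
    output_dir="output",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    save_steps=100,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # a tokenized text dataset, prepared elsewhere
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)

trainer.train()                               # first run
# trainer.train(resume_from_checkpoint=True)  # second run, resumed from the last checkpoint
```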
Troubleshooting Steps Taken:
Manually Setting global_step:
Set the global_step manually in the Trainer after loading the checkpoint.
Result: Problem not resolved.
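(Sketch of this attempt, assuming the hypothetical trainer from the setup above; the step count is a placeholder.)

```python
# Force the step counter by hand before calling trainer.train().
trainer.state.global_step = 440  # placeholder: step count read from the checkpoint name
```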
Overriding the train() method:
Created a new class MyTrainer inheriting from Trainer and overrode the train() method to set the global_step when loading a checkpoint.
Result: Problem not resolved.
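(Sketch of the subclass that was tried, assuming the setup above; the override logic is illustrative.)

```python
import os
from transformers import Trainer, TrainerState

class MyTrainer(Trainer):
    """Restore global_step from the checkpoint before delegating to Trainer.train()."""

    def train(self, resume_from_checkpoint=None, **kwargs):
        if isinstance(resume_from_checkpoint, str):
            state = TrainerState.load_from_json(
                os.path.join(resume_from_checkpoint, "trainer_state.json")
            )
            self.state.global_step = state.global_step
        return super().train(resume_from_checkpoint=resume_from_checkpoint, **kwargs)
```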
Removing resume_from_checkpoint:
Removed the resume_from_checkpoint argument from the trainer.train() call and manually loaded the global_step, optimizer state, and scheduler state.
Result: Problem not resolved.
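(Sketch of the manual restore, assuming the setup above; the checkpoint path and total step count are placeholders.)

```python
import os
import torch
from transformers import TrainerState

checkpoint_dir = "output/checkpoint-440"  # placeholder path

# Build the optimizer/scheduler first, then overwrite their states from the checkpoint files.
trainer.create_optimizer_and_scheduler(num_training_steps=1260)  # placeholder total steps
trainer.optimizer.load_state_dict(
    torch.load(os.path.join(checkpoint_dir, "optimizer.pt"), map_location="cpu")
)
trainer.lr_scheduler.load_state_dict(torch.load(os.path.join(checkpoint_dir, "scheduler.pt")))
trainer.state = TrainerState.load_from_json(os.path.join(checkpoint_dir, "trainer_state.json"))

trainer.train()  # note: no resume_from_checkpoint argument
```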
Resetting/Deactivating the Learning Rate Scheduler:
Reinitialized the scheduler after loading the checkpoint, skipped the step() call, or completely deactivated the scheduler.
Result: Problem not resolved. The learning rate is still set to 0.0.
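(Sketch of the scheduler reset that was tried, assuming the setup above and that trainer.optimizer has already been created.)

```python
from transformers import get_linear_schedule_with_warmup

# Re-create the scheduler from scratch instead of loading scheduler.pt.
trainer.lr_scheduler = get_linear_schedule_with_warmup(
    trainer.optimizer,
    num_warmup_steps=0,
    num_training_steps=1260,  # placeholder total steps
)
# "Deactivating" the scheduler amounted to never calling trainer.lr_scheduler.step().
```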
Manually Setting the Learning Rate:
Manually set the learning rate of the parameter groups in the optimizer after loading the checkpoint.
Result: Problem not resolved. The scheduler resets the learning rate back to 0.0.
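(Sketch of the manual learning-rate override, assuming the setup above; the target value is a placeholder.)

```python
# Overwrite the learning rate in every parameter group after loading the checkpoint.
for param_group in trainer.optimizer.param_groups:
    param_group["lr"] = 5e-5  # placeholder target learning rate
```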
Explicitly Calculating num_training_steps:
Explicitly calculated the number of training steps and stored it in the num_training_steps variable, using it when initializing the scheduler.
Result: Problem not resolved.
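(Sketch of the explicit step calculation, assuming the setup above and single-device training.)

```python
import math

steps_per_epoch = math.ceil(
    len(train_dataset)
    / (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps)
)
num_training_steps = steps_per_epoch * int(training_args.num_train_epochs)
```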
Manually Moving trainer_state.json into Checkpoint Subfolder:
Moved the trainer_state.json file into the checkpoint-XXXX subfolder with shutil.move() after calling trainer.save_state().
Result: Problem not resolved.
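(Sketch of the file move, assuming the setup above; the checkpoint folder name is a placeholder.)

```python
import os
import shutil

trainer.save_state()  # writes trainer_state.json into training_args.output_dir
shutil.move(
    os.path.join(training_args.output_dir, "trainer_state.json"),
    os.path.join(training_args.output_dir, "checkpoint-440", "trainer_state.json"),
)
```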
Manually Setting TrainerState Attributes:
Set the global_step and epoch attributes individually after loading the checkpoint instead of overwriting the entire trainer.state object.
Result: Problem not resolved.
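(Sketch of the attribute-level restore, assuming the setup above and the placeholder checkpoint path.)

```python
import os
from transformers import TrainerState

loaded_state = TrainerState.load_from_json(
    os.path.join("output/checkpoint-440", "trainer_state.json")
)
trainer.state.global_step = loaded_state.global_step
trainer.state.epoch = loaded_state.epoch
```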
Key Observations:
The error only occurs when resuming training from a checkpoint. The first run works flawlessly.
The training data, environment, and hardware remain identical between runs.
The global_step is loaded correctly and set in the Trainer.
The optimizer and scheduler states are loaded correctly.
The TrainerState is loaded correctly.
Manually setting the learning rate has no effect; the scheduler resets it.
Suspicions:
The Trainer might internally reset the global_step or the optimizer state after the checkpoint is loaded.
There might be a bug in the Trainer class that prevents the correct restoration of the training state.
Question for Hugging Face Transformers:
Why does the loss value increase when resuming training from a checkpoint in the second run, even though the checkpoint is loaded correctly? Are there any known issues with the Trainer regarding the restoration of the training state, especially the learning rate and scheduler? We have tried numerous troubleshooting steps (listed above) without success. We suspect a potential bug within the Trainer itself. Could you provide guidance or insights on how to resolve this issue?
Additional Information:
Model: BertForMaskedLM
Trainer: transformers.Trainer
Scheduler: The Trainer's default scheduler (linear warmup followed by linear decay)
Optimizer: AdamW
Dataset: A custom text dataset
Code: The relevant code is provided above.
Logs: Detailed logs of the training process can be provided if needed.
Goal:
We want to be able to resume training from checkpoints without the loss increasing again and without effectively having to start training from scratch. We hope the Hugging Face Transformers team can help us resolve this problem.
File:
Terminal Output:
First Run:
{'loss': 2.8581, 'grad_norm': 2.897771120071411, 'learning_rate': 4.682539682539683e-05, 'epoch': 0.19}
{'loss': 2.1545, 'grad_norm': 2.8640339374542236, 'learning_rate': 4.3650793650793655e-05, 'epoch': 0.38}
{'loss': 2.0396, 'grad_norm': 2.7819628715515137, 'learning_rate': 4.047619047619048e-05, 'epoch': 0.57}
{'loss': 1.9786, 'grad_norm': 2.644606828689575, 'learning_rate': 3.730158730158731e-05, 'epoch': 0.76}
{'loss': 1.9553, 'grad_norm': 2.7417070865631104, 'learning_rate': 3.412698412698413e-05, 'epoch': 0.95}
{'loss': 1.8961, 'grad_norm': 2.6237854957580566, 'learning_rate': 3.095238095238095e-05, 'epoch': 1.14}
{'loss': 1.8793, 'grad_norm': 2.5830185413360596, 'learning_rate': 2.777777777777778e-05, 'epoch': 1.33}
{'loss': 1.8715, 'grad_norm': 2.652275800704956, 'learning_rate': 2.4603174603174602e-05, 'epoch': 1.52}
{'loss': 1.8362, 'grad_norm': 2.6065754890441895, 'learning_rate': 2.1428571428571428e-05, 'epoch': 1.71}
{'loss': 1.8474, 'grad_norm': 2.6352243423461914, 'learning_rate': 1.8253968253968254e-05, 'epoch': 1.9}
{'loss': 1.8197, 'grad_norm': 2.56719708442688, 'learning_rate': 1.5079365079365079e-05, 'epoch': 2.09}
{'loss': 1.826, 'grad_norm': 2.5195322036743164, 'learning_rate': 1.1904761904761905e-05, 'epoch': 2.27}
{'loss': 1.8074, 'grad_norm': 2.614032506942749, 'learning_rate': 8.73015873015873e-06, 'epoch': 2.46}
{'loss': 1.8029, 'grad_norm': 2.600111246109009, 'learning_rate': 5.555555555555556e-06, 'epoch': 2.65}
{'loss': 1.7879, 'grad_norm': 2.4874589443206787, 'learning_rate': 2.3809523809523808e-06, 'epoch': 2.84}
{'train_runtime': 142.9413, 'train_samples_per_second': 846.431, 'train_steps_per_second': 2.204, 'train_loss': 1.9500562516469804, 'epoch': 2.99}
Second Run (resumed from checkpoint):
{'loss': 1.453, 'grad_norm': 2.474428415298462, 'learning_rate': 4.682539682539683e-05, 'epoch': 0.19}
{'loss': 1.406, 'grad_norm': 2.5064773559570312, 'learning_rate': 4.3650793650793655e-05, 'epoch': 0.38}
{'loss': 1.4152, 'grad_norm': 2.5167486667633057, 'learning_rate': 4.047619047619048e-05, 'epoch': 0.57}
{'loss': 1.426, 'grad_norm': 2.4449574947357178, 'learning_rate': 3.730158730158731e-05, 'epoch': 0.76}
{'loss': 1.4592, 'grad_norm': 2.5427122116088867, 'learning_rate': 3.412698412698413e-05, 'epoch': 0.95}
{'loss': 1.4357, 'grad_norm': 2.4496681690216064, 'learning_rate': 3.095238095238095e-05, 'epoch': 1.14}
{'loss': 1.4684, 'grad_norm': 2.4780757427215576, 'learning_rate': 2.777777777777778e-05, 'epoch': 1.33}
{'loss': 1.5027, 'grad_norm': 2.5224385261535645, 'learning_rate': 2.4603174603174602e-05, 'epoch': 1.52}
{'loss': 1.5133, 'grad_norm': 2.5421390533447266, 'learning_rate': 2.1428571428571428e-05, 'epoch': 1.71}
{'loss': 1.5651, 'grad_norm': 2.5934836864471436, 'learning_rate': 1.8253968253968254e-05, 'epoch': 1.9}
{'loss': 1.562, 'grad_norm': 2.5455050468444824, 'learning_rate': 1.5079365079365079e-05, 'epoch': 2.09}
{'loss': 1.6139, 'grad_norm': 2.580508232116699, 'learning_rate': 1.1904761904761905e-05, 'epoch': 2.27}
{'loss': 1.631, 'grad_norm': 2.7025833129882812, 'learning_rate': 8.73015873015873e-06, 'epoch': 2.46}
{'loss': 1.6631, 'grad_norm': 2.669140338897705, 'learning_rate': 5.555555555555556e-06, 'epoch': 2.65}
{'loss': 1.6425, 'grad_norm': 2.4610960483551025, 'learning_rate': 2.3809523809523808e-06, 'epoch': 2.84}
{'train_runtime': 157.4714, 'train_samples_per_second': 768.33, 'train_steps_per_second': 2.0, 'train_loss': 1.5153770507328095, 'epoch': 2.99}
Logfile Output: