Training Resumes with Increased Loss Despite Checkpoint Loading #33336
Hello @waldnebel, sorry, I couldn't reproduce the issue. When I train the model for the first time, I get the following output on the TinyStories dataset:
When I load from the checkpoint, the loss starts from 0.17:
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I recently ran into this myself, and the awesome @nbroad1881 helped me figure out what was wrong. In my case, I was calling:
But, I should have said:
tl;dr: the right way to resume from a checkpoint is to load the base model weights (not the model state from the checkpoint) and instead pass the checkpoint path to trainer.train(resume_from_checkpoint=...).
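For readers hitting the same thing, here is a minimal sketch of the pattern described above. The model name, checkpoint path, `training_args`, and `train_dataset` are placeholders, not the original poster's code:

```python
from transformers import BertForMaskedLM, Trainer

# What NOT to do: load the checkpoint weights as if they were a fresh model
# and start a brand-new training run from them.
# model = BertForMaskedLM.from_pretrained("output/checkpoint-440")
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()

# What to do instead: load the *base* model weights and let the Trainer restore
# the model, optimizer, scheduler, and TrainerState from the checkpoint folder.
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
trainer = Trainer(
    model=model,
    args=training_args,          # your TrainingArguments, defined elsewhere
    train_dataset=train_dataset, # your tokenized dataset, prepared elsewhere
)
trainer.train(resume_from_checkpoint="output/checkpoint-440")
# resume_from_checkpoint=True also works and picks the latest checkpoint in output_dir.
```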
Problem: When resuming the training of a BERT model with the Hugging Face Trainer from a checkpoint, the loss value increases again in the second run, even though the checkpoint is loaded correctly and the global_step, optimizer state, and scheduler state are restored.
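The original training script is not reproduced in this excerpt; the troubleshooting steps below assume a minimal setup along these lines (model name, hyperparameters, and the `train_dataset` variable are illustrative placeholders, not the actual script):

```python
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

training_args = TrainingArguments(
    output_dir="output",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    save_steps=100,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # a tokenized text dataset, prepared elsewhere
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)

trainer.train()                               # first run
# trainer.train(resume_from_checkpoint=True)  # second run, resumed from the last checkpoint
```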
Troubleshooting Steps Taken:
Manually Setting global_step:
Set the global_step manually in the Trainer after loading the checkpoint.
Result: Problem not resolved.
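(Sketch of this attempt, assuming the hypothetical trainer from the setup above; the step count is a placeholder.)

```python
# Force the step counter by hand before calling trainer.train().
trainer.state.global_step = 440  # placeholder: step count read from the checkpoint name
```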
Overriding the train() method:
Created a new class MyTrainer inheriting from Trainer and overrode the train() method to set the global_step when loading a checkpoint.
Result: Problem not resolved.
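(Sketch of the subclass that was tried, assuming the setup above; the override logic is illustrative.)

```python
import os
from transformers import Trainer, TrainerState

class MyTrainer(Trainer):
    """Restore global_step from the checkpoint before delegating to Trainer.train()."""

    def train(self, resume_from_checkpoint=None, **kwargs):
        if isinstance(resume_from_checkpoint, str):
            state = TrainerState.load_from_json(
                os.path.join(resume_from_checkpoint, "trainer_state.json")
            )
            self.state.global_step = state.global_step
        return super().train(resume_from_checkpoint=resume_from_checkpoint, **kwargs)
```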
Removing resume_from_checkpoint:
Removed the resume_from_checkpoint argument from the trainer.train() call and manually loaded the global_step, optimizer state, and scheduler state.
Result: Problem not resolved.
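(Sketch of the manual restore, assuming the setup above; the checkpoint path and total step count are placeholders.)

```python
import os
import torch
from transformers import TrainerState

checkpoint_dir = "output/checkpoint-440"  # placeholder path

# Build the optimizer/scheduler first, then overwrite their states from the checkpoint files.
trainer.create_optimizer_and_scheduler(num_training_steps=1260)  # placeholder total steps
trainer.optimizer.load_state_dict(
    torch.load(os.path.join(checkpoint_dir, "optimizer.pt"), map_location="cpu")
)
trainer.lr_scheduler.load_state_dict(torch.load(os.path.join(checkpoint_dir, "scheduler.pt")))
trainer.state = TrainerState.load_from_json(os.path.join(checkpoint_dir, "trainer_state.json"))

trainer.train()  # note: no resume_from_checkpoint argument
```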
Resetting/Deactivating the Learning Rate Scheduler:
Reinitialized the scheduler after loading the checkpoint, skipped the step() call, or completely deactivated the scheduler.
Result: Problem not resolved. The learning rate is still set to 0.0.
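(Sketch of the scheduler reset that was tried, assuming the setup above and that trainer.optimizer has already been created.)

```python
from transformers import get_linear_schedule_with_warmup

# Re-create the scheduler from scratch instead of loading scheduler.pt.
trainer.lr_scheduler = get_linear_schedule_with_warmup(
    trainer.optimizer,
    num_warmup_steps=0,
    num_training_steps=1260,  # placeholder total steps
)
# "Deactivating" the scheduler amounted to never calling trainer.lr_scheduler.step().
```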
Manually Setting the Learning Rate:
Manually set the learning rate of the parameter groups in the optimizer after loading the checkpoint.
Result: Problem not resolved. The scheduler resets the learning rate back to 0.0.
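(Sketch of the manual learning-rate override, assuming the setup above; the target value is a placeholder.)

```python
# Overwrite the learning rate in every parameter group after loading the checkpoint.
for param_group in trainer.optimizer.param_groups:
    param_group["lr"] = 5e-5  # placeholder target learning rate
```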
Explicitly Calculating num_training_steps:
Explicitly calculated the number of training steps and stored it in the num_training_steps variable, using it when initializing the scheduler.
Result: Problem not resolved.
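(Sketch of the explicit step calculation, assuming the setup above and single-device training.)

```python
import math

steps_per_epoch = math.ceil(
    len(train_dataset)
    / (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps)
)
num_training_steps = steps_per_epoch * int(training_args.num_train_epochs)
```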
Manually Moving trainer_state.json into Checkpoint Subfolder:
Moved the trainer_state.json file into the checkpoint-XXXX subfolder with shutil.move() after calling trainer.save_state().
Result: Problem not resolved.
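(Sketch of the file move, assuming the setup above; the checkpoint folder name is a placeholder.)

```python
import os
import shutil

trainer.save_state()  # writes trainer_state.json into training_args.output_dir
shutil.move(
    os.path.join(training_args.output_dir, "trainer_state.json"),
    os.path.join(training_args.output_dir, "checkpoint-440", "trainer_state.json"),
)
```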
Manually Setting TrainerState Attributes:
Set the global_step and epoch attributes individually after loading the checkpoint instead of overwriting the entire trainer.state object.
Result: Problem not resolved.
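(Sketch of the attribute-level restore, assuming the setup above and the placeholder checkpoint path.)

```python
import os
from transformers import TrainerState

loaded_state = TrainerState.load_from_json(
    os.path.join("output/checkpoint-440", "trainer_state.json")
)
trainer.state.global_step = loaded_state.global_step
trainer.state.epoch = loaded_state.epoch
```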
Key Observations:
The error only occurs when resuming training from a checkpoint. The first run works flawlessly.
The training data, environment, and hardware remain identical between runs.
The global_step is loaded correctly and set in the Trainer.
The optimizer and scheduler states are loaded correctly.
The TrainerState is loaded correctly.
Manually setting the learning rate has no effect; the scheduler resets it.
Suspicions:
The Trainer might internally reset the global_step or the optimizer state after the checkpoint is loaded.
There might be a bug in the Trainer class that prevents the correct restoration of the training state.
Question for Hugging Face Transformers:
Why does the loss value increase when resuming training from a checkpoint in the second run, even though the checkpoint is loaded correctly? Are there any known issues with the Trainer regarding the restoration of the training state, especially the learning rate and scheduler? We have tried numerous troubleshooting steps (listed above) without success. We suspect a potential bug within the Trainer itself. Could you provide guidance or insights on how to resolve this issue?
Additional Information:
Model: BertForMaskedLM
Trainer: transformers.Trainer
Scheduler: The Trainer's default scheduler (linear warmup followed by linear decay)
Optimizer: AdamW
Dataset: A custom text dataset
Code: The relevant code is provided above.
Logs: Detailed logs of the training process can be provided if needed.
Goal:
We want to be able to resume training from checkpoints without the loss increasing again and without effectively having to start training from scratch. We hope the Hugging Face Transformers team can help us resolve this problem.
File:
Terminal Output:
First Run:
{'loss': 2.8581, 'grad_norm': 2.897771120071411, 'learning_rate': 4.682539682539683e-05, 'epoch': 0.19}
{'loss': 2.1545, 'grad_norm': 2.8640339374542236, 'learning_rate': 4.3650793650793655e-05, 'epoch': 0.38}
{'loss': 2.0396, 'grad_norm': 2.7819628715515137, 'learning_rate': 4.047619047619048e-05, 'epoch': 0.57}
{'loss': 1.9786, 'grad_norm': 2.644606828689575, 'learning_rate': 3.730158730158731e-05, 'epoch': 0.76}
{'loss': 1.9553, 'grad_norm': 2.7417070865631104, 'learning_rate': 3.412698412698413e-05, 'epoch': 0.95}
{'loss': 1.8961, 'grad_norm': 2.6237854957580566, 'learning_rate': 3.095238095238095e-05, 'epoch': 1.14}
{'loss': 1.8793, 'grad_norm': 2.5830185413360596, 'learning_rate': 2.777777777777778e-05, 'epoch': 1.33}
{'loss': 1.8715, 'grad_norm': 2.652275800704956, 'learning_rate': 2.4603174603174602e-05, 'epoch': 1.52}
{'loss': 1.8362, 'grad_norm': 2.6065754890441895, 'learning_rate': 2.1428571428571428e-05, 'epoch': 1.71}
{'loss': 1.8474, 'grad_norm': 2.6352243423461914, 'learning_rate': 1.8253968253968254e-05, 'epoch': 1.9}
{'loss': 1.8197, 'grad_norm': 2.56719708442688, 'learning_rate': 1.5079365079365079e-05, 'epoch': 2.09}
{'loss': 1.826, 'grad_norm': 2.5195322036743164, 'learning_rate': 1.1904761904761905e-05, 'epoch': 2.27}
{'loss': 1.8074, 'grad_norm': 2.614032506942749, 'learning_rate': 8.73015873015873e-06, 'epoch': 2.46}
{'loss': 1.8029, 'grad_norm': 2.600111246109009, 'learning_rate': 5.555555555555556e-06, 'epoch': 2.65}
{'loss': 1.7879, 'grad_norm': 2.4874589443206787, 'learning_rate': 2.3809523809523808e-06, 'epoch': 2.84}
{'train_runtime': 142.9413, 'train_samples_per_second': 846.431, 'train_steps_per_second': 2.204, 'train_loss': 1.9500562516469804, 'epoch': 2.99}
Second Run (resumed from checkpoint):
{'loss': 1.453, 'grad_norm': 2.474428415298462, 'learning_rate': 4.682539682539683e-05, 'epoch': 0.19}
{'loss': 1.406, 'grad_norm': 2.5064773559570312, 'learning_rate': 4.3650793650793655e-05, 'epoch': 0.38}
{'loss': 1.4152, 'grad_norm': 2.5167486667633057, 'learning_rate': 4.047619047619048e-05, 'epoch': 0.57}
{'loss': 1.426, 'grad_norm': 2.4449574947357178, 'learning_rate': 3.730158730158731e-05, 'epoch': 0.76}
{'loss': 1.4592, 'grad_norm': 2.5427122116088867, 'learning_rate': 3.412698412698413e-05, 'epoch': 0.95}
{'loss': 1.4357, 'grad_norm': 2.4496681690216064, 'learning_rate': 3.095238095238095e-05, 'epoch': 1.14}
{'loss': 1.4684, 'grad_norm': 2.4780757427215576, 'learning_rate': 2.777777777777778e-05, 'epoch': 1.33}
{'loss': 1.5027, 'grad_norm': 2.5224385261535645, 'learning_rate': 2.4603174603174602e-05, 'epoch': 1.52}
{'loss': 1.5133, 'grad_norm': 2.5421390533447266, 'learning_rate': 2.1428571428571428e-05, 'epoch': 1.71}
{'loss': 1.5651, 'grad_norm': 2.5934836864471436, 'learning_rate': 1.8253968253968254e-05, 'epoch': 1.9}
{'loss': 1.562, 'grad_norm': 2.5455050468444824, 'learning_rate': 1.5079365079365079e-05, 'epoch': 2.09}
{'loss': 1.6139, 'grad_norm': 2.580508232116699, 'learning_rate': 1.1904761904761905e-05, 'epoch': 2.27}
{'loss': 1.631, 'grad_norm': 2.7025833129882812, 'learning_rate': 8.73015873015873e-06, 'epoch': 2.46}
{'loss': 1.6631, 'grad_norm': 2.669140338897705, 'learning_rate': 5.555555555555556e-06, 'epoch': 2.65}
{'loss': 1.6425, 'grad_norm': 2.4610960483551025, 'learning_rate': 2.3809523809523808e-06, 'epoch': 2.84}
{'train_runtime': 157.4714, 'train_samples_per_second': 768.33, 'train_steps_per_second': 2.0, 'train_loss': 1.5153770507328095, 'epoch': 2.99}
Logfile Output: