Hi all,
during the first 1M training steps I kept getting the warning below, but it seemed harmless. Now that I'm in the second phase, I'm wondering whether resuming from the best checkpoint really is the best option, or whether I should follow the advice in the warning instead. As the screenshot of the distance log shows, pausing and resuming training causes a big disturbance.
I would be grateful for any advice worded so that a beginning Python student could implement it :)
Epoch 8499: 0% 0/163 [00:00<00:00, -1157802.67it/s]
/content/miniconda/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py:135: UserWarning: You're resuming from a checkpoint that ended before the epoch ended. This can cause unreliable results if further training is done. Consider using an end-of-epoch checkpoint or enabling fault-tolerant training: https://pytorch-lightning.readthedocs.io/en/stable/advanced/fault_tolerant_training.html
  rank_zero_warn(
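For concreteness, here is roughly what I think the warning is suggesting, as a minimal sketch (assuming PyTorch Lightning >= 1.5; `MyModel`, `MyDataModule`, and the checkpoint paths are placeholders for my own code, not part of the real setup):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Save a checkpoint at the end of every training epoch instead of mid-epoch,
# so a later resume starts cleanly at an epoch boundary.
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",        # hypothetical output directory
    save_on_train_epoch_end=True,  # write the file only when the epoch finishes
    save_top_k=-1,                 # keep every epoch-end checkpoint
)

trainer = pl.Trainer(
    max_epochs=10000,
    callbacks=[checkpoint_callback],
)

# Resume from one of those end-of-epoch checkpoints; because it was written at
# an epoch boundary, the mid-epoch warning should not appear.
trainer.fit(
    MyModel(),                               # placeholder LightningModule
    datamodule=MyDataModule(),               # placeholder LightningDataModule
    ckpt_path="checkpoints/some-epoch-end.ckpt",  # placeholder path
)
```

Is something along these lines what the warning means by an "end-of-epoch checkpoint", or should I be looking at fault-tolerant training instead?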