Hi all,
during the first 1M training steps I kept getting the warning below, but it seemed harmless. Now that I'm in the second phase, I'm wondering whether resuming from the best checkpoint really is the best option, or whether I should follow the advice in the warning instead. As the screenshot of the distance log shows, pausing and resuming training causes a big disturbance.
I would be grateful for any advice worded so that a beginning Python student could implement it :)
Epoch 8499: 0% 0/163 [00:00<00:00, -1157802.67it/s]
/content/miniconda/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py:135: UserWarning: You're resuming from a checkpoint that ended before the epoch ended. This can cause unreliable results if further training is done. Consider using an end-of-epoch checkpoint or enabling fault-tolerant training: https://pytorch-lightning.readthedocs.io/en/stable/advanced/fault_tolerant_training.html
  rank_zero_warn(
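For concreteness, here is roughly what I think the warning is suggesting, as a minimal sketch (assuming PyTorch Lightning >= 1.5; `MyModel`, `MyDataModule`, and the checkpoint paths are placeholders for my own code, not part of the real setup):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Save a checkpoint at the end of every training epoch instead of mid-epoch,
# so a later resume starts cleanly at an epoch boundary.
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",        # hypothetical output directory
    save_on_train_epoch_end=True,  # write the file only when the epoch finishes
    save_top_k=-1,                 # keep every epoch-end checkpoint
)

trainer = pl.Trainer(
    max_epochs=10000,
    callbacks=[checkpoint_callback],
)

# Resume from one of those end-of-epoch checkpoints; because it was written at
# an epoch boundary, the mid-epoch warning should not appear.
trainer.fit(
    MyModel(),                               # placeholder LightningModule
    datamodule=MyDataModule(),               # placeholder LightningDataModule
    ckpt_path="checkpoints/some-epoch-end.ckpt",  # placeholder path
)
```

Is something along these lines what the warning means by an "end-of-epoch checkpoint", or should I be looking at fault-tolerant training instead?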