Resuming training #230
Replies: 3 comments 1 reply
-
I think this may have happened to me before, and it could be that the checkpoint didn't save properly on the previous version. I think I deleted the final version (5 in your case) and tried again. Have a look inside the version_5 folder to check that it has the same checkpoint files as the earlier version folders.
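If it helps, here is a rough sketch for comparing the version folders at a glance. It assumes the run directory contains folders named `version_<N>` with `.ckpt` files somewhere below them; that layout, the `.ckpt` extension, and the function name `checkpoint_report` are my assumptions, so adjust them to whatever your folders actually look like.

```python
from pathlib import Path


def checkpoint_report(log_dir):
    """Map each version_* folder name to a list of (file name, size in bytes)."""
    report = {}
    for version_dir in sorted(Path(log_dir).glob("version_*")):
        if not version_dir.is_dir():
            continue
        # Assumed layout: checkpoint files sit somewhere below the version folder.
        files = sorted(version_dir.rglob("*.ckpt"))
        report[version_dir.name] = [(f.name, f.stat().st_size) for f in files]
    return report
```

Calling `checkpoint_report("path/to/your/logs")` (a placeholder path) and printing the result should make a missing or 0-byte checkpoint in version_5 stand out against the earlier versions.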
-
Hey Tim! Thanks - yeah, it looks like version 5 saved a blank checkpoint file, though removing version 5 doesn't seem to fix the problem. I've tried removing each of the folders and resuming the training, but it always wants to start from epoch 0. I'm guessing there was some catastrophic failure that may have corrupted something, though TensorBoard reads the checkpoint files just fine.
-
I think you can also try passing the folder of the most recent version that has checkpoint files directly to --ckpt, rather than letting it search by itself.
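To find which folder that is without clicking through each one, something like the sketch below might work. It assumes the same `version_<N>` folder naming mentioned above and treats any non-empty `.ckpt` file as usable; both are assumptions on my part, and `latest_usable_version` is just an illustrative name, not part of the tool.

```python
import re
from pathlib import Path


def latest_usable_version(log_dir):
    """Return the highest-numbered version_N folder with a non-empty .ckpt file, or None."""
    candidates = []
    for version_dir in Path(log_dir).iterdir():
        match = re.fullmatch(r"version_(\d+)", version_dir.name)
        if not (match and version_dir.is_dir()):
            continue
        # Skip folders whose checkpoints are missing or blank (0 bytes).
        if any(f.stat().st_size > 0 for f in version_dir.rglob("*.ckpt")):
            candidates.append((int(match.group(1)), version_dir))
    if not candidates:
        return None
    # Compare by the numeric suffix so version_10 ranks above version_9.
    return max(candidates, key=lambda c: c[0])[1]
```

The returned path is the one to try passing to --ckpt. A non-empty file can still be corrupted, so this only rules out the blank-checkpoint case.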
-
Hello!
I have been working in Colab these past few days and have been resuming the training the same way each time, via the Resume Training block.
When I resumed this morning, it seems to have started back at epoch 0, though it made a new version number correctly. Am I correct to assume that this means it has restarted the training completely?
I'm a bit of a Colab n00b, please be patient with me!
Thanks!
K