Resuming training #230

kaseypocius · 2023-05-25T12:39:30Z

Hello!

I have been working in Colab these past few days, and have been resuming the training the same way each time, via the Resume Training block.

When I resumed this morning, it seems to have started back at epoch 0, though it made a new version number correctly. Am I correct to assume that this means it has restarted the training completely?

I'm a bit of a Colab n00b, please be patient with me!

Thanks!
K

timmb · 2023-05-25T15:33:51Z

I think this may have happened to me before, and it could be that the checkpoint didn't save properly in the previous version. I think I deleted the final version folder (version_5 in your case) and tried again. Have a look inside the version_5 folder to check that it has the same checkpoint files as the earlier version folders.
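A quick way to do that comparison is to list the checkpoint files in every version folder along with their sizes. This is only a minimal sketch: it assumes the run directory contains folders named `version_N` with `.ckpt` files somewhere inside them, which the thread does not confirm, and the function name `checkpoint_report` is made up for illustration.

```python
from pathlib import Path


def checkpoint_report(run_dir):
    """Map each version_* folder name to a list of (relative path, size in bytes).

    Assumption: checkpoints are files ending in .ckpt somewhere under
    each version_* folder. Adjust the glob pattern if your layout differs.
    """
    report = {}
    for version_dir in sorted(Path(run_dir).glob("version_*")):
        if not version_dir.is_dir():
            continue
        files = sorted(version_dir.rglob("*.ckpt"))
        report[version_dir.name] = [
            (f.relative_to(version_dir).as_posix(), f.stat().st_size)
            for f in files
        ]
    return report
```

A version folder with an empty list, or with a file of size 0, is a likely candidate for the checkpoint that failed to save.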

kaseypocius · 2023-05-25T17:31:10Z

Hey Tim! Thanks. Yeah, it looks like version 5 saved a blank checkpoint file, though removing version 5 doesn't seem to fix the problem. I've tried removing each of the folders and resuming the training, but it always wants to start from epoch 0. I'm guessing there was some catastrophic failure that may have corrupted some things, though TensorBoard reads the checkpoint files just fine.

timmb · 2023-05-25T17:45:35Z

I think you can also try passing the folder of the most recent version with the checkpoint files directly to --ckpt, rather than letting it search by itself.
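If it helps, picking that folder can be scripted instead of done by eye. The sketch below assumes the same `version_N` folder layout with `.ckpt` files inside, and `latest_usable_version` is a hypothetical helper, not part of the training tool. It only checks that a checkpoint file is non-empty, which does not prove the file is loadable.

```python
import re
from pathlib import Path


def latest_usable_version(run_dir):
    """Return the highest-numbered version_N folder that holds at least one
    non-empty .ckpt file, or None if no such folder exists.

    Assumption: folders are named version_<number>; anything else is ignored.
    """
    candidates = []
    for version_dir in Path(run_dir).glob("version_*"):
        match = re.fullmatch(r"version_(\d+)", version_dir.name)
        if match is None or not version_dir.is_dir():
            continue
        # Skip folders whose checkpoints are all missing or zero bytes.
        if any(f.stat().st_size > 0 for f in version_dir.rglob("*.ckpt")):
            candidates.append((int(match.group(1)), version_dir))
    if not candidates:
        return None
    # Compare numerically so that version_10 ranks above version_9.
    return max(candidates, key=lambda item: item[0])[1]
```

The returned path is what you would then hand to --ckpt by hand, as suggested above.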

1 reply

kaseypocius May 25, 2023
Author

hey! yeah, that seems to have worked. Looks like the auto-search is just wonky. Thanks for the help!
