
resume_from_checkpoint should not start from scratch if ckpt is not found #7072

Closed
devashishd12 opened this issue Apr 17, 2021 · 5 comments · Fixed by #7075 or #9952
Labels
bug Something isn't working good first issue Good for newcomers help wanted Open to be worked on priority: 2 Low priority task working as intended Working as intended

Comments

@devashishd12

devashishd12 commented Apr 17, 2021

🐛 Bug

If the checkpoint file is not found at the location passed via the resume_from_checkpoint argument of pl.Trainer, training starts from scratch after displaying a UserWarning that is easy to miss.

To Reproduce

Use the following BoringModel.

Expected behavior

Should raise a FileNotFoundError and not start training from scratch.
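A minimal sketch of the check being requested (the function name is hypothetical; Lightning's actual restore logic lives in the trainer internals). `strict=True` raises the requested FileNotFoundError, while `strict=False` mimics the current easy-to-miss warning:

```python
import warnings
from pathlib import Path


def resolve_checkpoint(path, strict=True):
    """Return the checkpoint path, or fail loudly when it is missing.

    strict=True raises FileNotFoundError (the behavior requested in this
    issue); strict=False mimics the current easy-to-miss UserWarning.
    """
    ckpt = Path(path)
    if ckpt.is_file():
        return ckpt
    if strict:
        raise FileNotFoundError(f"Checkpoint file not found: {ckpt}")
    warnings.warn(
        f"Checkpoint {ckpt} not found; training will start from scratch.",
        UserWarning,
    )
    return None
```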

@devashishd12 devashishd12 added bug Something isn't working help wanted Open to be worked on labels Apr 17, 2021
@awaelchli
Contributor

Not sure if there is a real reason why we have a warning instead of an error. Want to give this a try? Contributions are welcome.

@awaelchli awaelchli added the good first issue Good for newcomers label Apr 17, 2021
@awaelchli awaelchli added priority: 2 Low priority task priority: 1 Medium priority task and removed priority: 2 Low priority task labels Apr 18, 2021
@carmocca carmocca added priority: 2 Low priority task working as intended Working as intended and removed priority: 1 Medium priority task labels Apr 23, 2021
@ia-davidpichler

I actually preferred the old semantics of this parameter, as they make my logic for an interruptible training job simpler. Also, the docs weren't updated to reflect this change: https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#init
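The interruptible-job pattern mentioned above can be sketched like this (a sketch only; `pick_resume_path` is a hypothetical helper, not part of the Lightning API). With the old warn-on-missing semantics the expected path could be passed unconditionally; once a missing file becomes an error, the existence check has to move into user code:

```python
import os


def pick_resume_path(ckpt_dir, filename="last.ckpt"):
    """Return the checkpoint path to resume from, or None to start fresh.

    Intended for interruptible jobs (e.g. spot instances): on the first
    run no checkpoint exists yet, so None means "train from scratch";
    after an interruption the saved file is found and training resumes.
    """
    path = os.path.join(ckpt_dir, filename)
    return path if os.path.isfile(path) else None


# Hypothetical usage with a Lightning trainer:
# trainer.fit(model, ckpt_path=pick_resume_path("/checkpoints"))
```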

@carmocca
Contributor

carmocca commented Aug 5, 2021

Also the docs weren't updated to reflect this change: https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#init

@ia-davidpichler Would you be interested in opening a PR with the doc fix? 😄

@russellbrooks

russellbrooks commented Oct 12, 2021

Agreed that the prior behavior is better for accommodating interruptible training jobs (e.g. AWS Spot instances), and this feels like a regression in capability.

Whether to error or warn should be exposed as an additional parameter, defaulting to something like error_on_checkpoint_missing=False for consistency with the docs:

If there is no checkpoint file at the path, start from scratch. If resuming from mid-epoch checkpoint, training will start from the beginning of the next epoch.
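The proposal above could look roughly like this (a sketch; `error_on_checkpoint_missing` is the name suggested in this comment and `restore_or_start` is a hypothetical stand-in, not actual Trainer API). The default reproduces the documented warn-and-start-fresh behavior, while the flag opts into a hard failure:

```python
import os
import warnings


def restore_or_start(ckpt_path, error_on_checkpoint_missing=False):
    """Return the path to restore from, or None to start from scratch.

    Default (False) matches the documented behavior: warn and train
    from scratch when the file is missing. True raises instead, which
    suits users who consider a missing checkpoint a hard error.
    """
    if ckpt_path is None or os.path.isfile(ckpt_path):
        return ckpt_path
    if error_on_checkpoint_missing:
        raise FileNotFoundError(
            f"resume_from_checkpoint={ckpt_path!r} does not exist"
        )
    warnings.warn(
        f"Checkpoint {ckpt_path!r} not found; starting from scratch.",
        UserWarning,
    )
    return None
```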

@carmocca
Contributor

The discussion for the change is in the PR that closed this: #7075

I'll update the docs.
