
resume_from_checkpoint should not start from scratch if ckpt is not found #7072

Closed
devashishd12 opened this issue Apr 17, 2021 · 5 comments · Fixed by #7075 or #9952
Labels
bug Something isn't working good first issue Good for newcomers help wanted Open to be worked on priority: 2 Low priority task working as intended Working as intended

Comments

@devashishd12

devashishd12 commented Apr 17, 2021

🐛 Bug

If the checkpoint file is not found at the location passed via the resume_from_checkpoint argument of pl.Trainer, training starts from scratch after displaying a UserWarning that is easy to miss.

To Reproduce

Use the following BoringModel.

Expected behavior

Should raise a FileNotFoundError and not start training from scratch.
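A minimal sketch of the check being requested (the function name is hypothetical; Lightning's actual restore logic lives in the trainer internals). `strict=True` raises the requested FileNotFoundError, while `strict=False` mimics the current easy-to-miss warning:

```python
import warnings
from pathlib import Path


def resolve_checkpoint(path, strict=True):
    """Return the checkpoint path, or fail loudly when it is missing.

    strict=True raises FileNotFoundError (the behavior requested in this
    issue); strict=False mimics the current easy-to-miss UserWarning.
    """
    ckpt = Path(path)
    if ckpt.is_file():
        return ckpt
    if strict:
        raise FileNotFoundError(f"Checkpoint file not found: {ckpt}")
    warnings.warn(
        f"Checkpoint {ckpt} not found; training will start from scratch.",
        UserWarning,
    )
    return None
```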

@devashishd12 devashishd12 added bug Something isn't working help wanted Open to be worked on labels Apr 17, 2021
@awaelchli
Contributor

Not sure if there is a real reason why we have a warning instead of an error. Want to give this a try? Contributions are welcome.

@awaelchli awaelchli added the good first issue Good for newcomers label Apr 17, 2021
@awaelchli awaelchli added priority: 2 Low priority task priority: 1 Medium priority task and removed priority: 2 Low priority task labels Apr 18, 2021
@carmocca carmocca added priority: 2 Low priority task working as intended Working as intended and removed priority: 1 Medium priority task labels Apr 23, 2021
@ia-davidpichler

I actually preferred the old semantics of this parameter, as they make my logic for an interruptible training job simpler. Also, the docs weren't updated to reflect this change: https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#init
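The interruptible-job pattern mentioned above can be sketched like this (a sketch only; `pick_resume_path` is a hypothetical helper, not part of the Lightning API). With the old warn-on-missing semantics the expected path could be passed unconditionally; once a missing file becomes an error, the existence check has to move into user code:

```python
import os


def pick_resume_path(ckpt_dir, filename="last.ckpt"):
    """Return the checkpoint path to resume from, or None to start fresh.

    Intended for interruptible jobs (e.g. spot instances): on the first
    run no checkpoint exists yet, so None means "train from scratch";
    after an interruption the saved file is found and training resumes.
    """
    path = os.path.join(ckpt_dir, filename)
    return path if os.path.isfile(path) else None


# Hypothetical usage with a Lightning trainer:
# trainer.fit(model, ckpt_path=pick_resume_path("/checkpoints"))
```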

@carmocca
Contributor

carmocca commented Aug 5, 2021

Also the docs weren't updated to reflect this change: https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#init

@ia-davidpichler Would you be interested in opening a PR with the doc fix? 😄

@russellbrooks

russellbrooks commented Oct 12, 2021

Agreed that the prior behavior is better for accommodating interruptible training jobs (e.g. AWS Spot instances), and this feels like a regression in capability.

Whether to error or warn should be exposed as an additional parameter, defaulting to something like error_on_checkpoint_missing=False for consistency with the docs:

If there is no checkpoint file at the path, start from scratch. If resuming from mid-epoch checkpoint, training will start from the beginning of the next epoch.
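The proposal above could look roughly like this (a sketch; `error_on_checkpoint_missing` is the name suggested in this comment and `restore_or_start` is a hypothetical stand-in, not actual Trainer API). The default reproduces the documented warn-and-start-fresh behavior, while the flag opts into a hard failure:

```python
import os
import warnings


def restore_or_start(ckpt_path, error_on_checkpoint_missing=False):
    """Return the path to restore from, or None to start from scratch.

    Default (False) matches the documented behavior: warn and train
    from scratch when the file is missing. True raises instead, which
    suits users who consider a missing checkpoint a hard error.
    """
    if ckpt_path is None or os.path.isfile(ckpt_path):
        return ckpt_path
    if error_on_checkpoint_missing:
        raise FileNotFoundError(
            f"resume_from_checkpoint={ckpt_path!r} does not exist"
        )
    warnings.warn(
        f"Checkpoint {ckpt_path!r} not found; starting from scratch.",
        UserWarning,
    )
    return None
```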

@carmocca
Contributor

The discussion for the change is in the PR that closed this: #7075

I'll update the docs.
