[see #10061 instead] Unify checkpoint load paths #9693
Conversation
this is looking really promising @jjenniferdai ! unifying these paths will make the API consistent and help us simplify the trainer internals
-    # restore optimizers, etc.
-    self.checkpoint_connector.restore_training_state()
+    if self.state.fn == TrainerFn.FITTING:
+        self.checkpoint_connector.restore_training_state()
restore training state includes things which can be resumed even if not fitting, such as the loop state.
imo we shouldn't add the check for fitting here, but rather inside the relevant parts of restore_training_state
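For context, a rough sketch of that alternative; the restore_loops / restore_optimizers_and_schedulers split below is assumed for illustration only, not taken from this diff.

```python
from pytorch_lightning.trainer.states import TrainerFn


class CheckpointConnector:
    # Sketch only: the real connector holds a trainer reference, the loaded
    # checkpoint, and more helpers than shown here.
    def restore_training_state(self) -> None:
        # Loop/progress state can be restored for validate/test/predict runs too,
        # so it stays unconditional.
        self.restore_loops()

        # Optimizer and LR-scheduler state is only meaningful when fitting,
        # so the TrainerFn check moves in here instead of the call site.
        if self.trainer.state.fn == TrainerFn.FITTING:
            self.restore_optimizers_and_schedulers()
```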
(above comment?)
yeah, we do restore loops there and now some other attributes as well. I'd suggest waiting for this one to get merged: #9413
def _fit_impl(
    self,
    model: "pl.LightningModule",
    train_dataloaders: Optional[Union[TRAIN_DATALOADERS, LightningDataModule]] = None,
    val_dataloaders: Optional[EVAL_DATALOADERS] = None,
    datamodule: Optional[LightningDataModule] = None,
    ckpt_path: Optional[str] = None,
should this also be typed as _PATH?
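If so, the signature would look roughly like the sketch below; it assumes the _PATH alias from pytorch_lightning.utilities.types (a str-or-pathlib.Path union) and omits the dataloader/datamodule arguments for brevity.

```python
from typing import Optional

from pytorch_lightning.utilities.types import _PATH


def _fit_impl(self, model: "pl.LightningModule", ckpt_path: Optional[_PATH] = None) -> None:
    # Only the annotation changes: ckpt_path accepts str or Path instead of str only.
    ...
```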
if self.state.fn == TrainerFn.FITTING:
    self.checkpoint_connector.restore_training_state()

self.checkpoint_connector.resume_end()
n00b question: why is this bumped up to here vs in _pre_training_routine?
now that this calls resume_start for all trainer functions (not only fitting), resume_end is similarly called for all as well
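In other words, the sequence being discussed looks roughly like this sketch; the _restore_checkpoint name and its exact placement inside the trainer are illustrative only.

```python
from pytorch_lightning.trainer.states import TrainerFn


class Trainer:
    # Sketch of the unified load path; in the actual PR this logic sits inside
    # the trainer's run/dispatch code rather than a dedicated method.
    def _restore_checkpoint(self) -> None:
        # open/load the checkpoint for every trainer function (fit/validate/test/predict)
        self.checkpoint_connector.resume_start()

        # only fit-specific state (optimizers, schedulers, ...) is gated on FITTING
        if self.state.fn == TrainerFn.FITTING:
            self.checkpoint_connector.restore_training_state()

        # release the loaded checkpoint for every trainer function as well
        self.checkpoint_connector.resume_end()
```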
Hey @jjenniferdai, would you mind resolving the conflicts? Best,
Dear @jjenniferdai, any updates? Best,
sorry all for the git issues :( please see #10061 instead
What does this PR do?
Fixes #9405
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃