Not-yet-existing resume_from_checkpoint for auto-resubmit #4366
Labels: checkpointing (Related to checkpointing), feature (Is an improvement or enhancement), help wanted (Open to be worked on)
Comments
tarepan added the feature (Is an improvement or enhancement) and help wanted (Open to be worked on) labels on Oct 26, 2020
tarepan added commits to tarepan/pytorch-lightning that referenced this issue on Oct 27, 2020
Add draft pull request for discussion.
tarepan added a commit to tarepan/pytorch-lightning that referenced this issue on Oct 30, 2020
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
wait for merge.
Borda added a commit that referenced this issue on Jan 5, 2021
…4402)
* Add empty resume_from_checkpoint acceptance #4366
* Fix general error catch with focused file check
* Add fsspec HTTP extras
  Add fsspec's HTTPFileSystem support through http extras. pl has supported remote http file (e.g. #2925), so this commit do not add new functionality.
* Fix potential too much logging in DDP
* Add PR changelog
* Add well-written argument explanation
  Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Fix DDP-compatible restore logging
  Notify from where the states are restored. This feature temporally deleted as a result of PR review. With succeeding review, added with DDP compatibility.
* Fix utility import pathes
* Refactor load step commentaries
* Refactor hpc ckpt suffix acquisition
* Refactor restore/hpc_load match
* Refactor hpc load trial
* Refactor checkpoint dir check
* Refactor unneeded function nest
* Refactor nested If
* Refactor duplicated cache clear
* Refactor attempt flow with if/elif
* Fix pip8
* Refactor hook commentary
  Co-authored-by: chaton <thomas@grid.ai>
* Fix pep8
* Refactor hpc load checkpoint path acquisition
* Fix pip8
* Fix typo
  Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Fix typo
  Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Fix doc
  Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Refactor None Union type with Optional
* Fix build-doc CI failure debuged in #5329
* Fix fsspec import during build-doc #5329
* Fix test epoch
  Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Fix test with latest test models
* .
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
The PR is merged.
Borda pushed a commit that referenced this issue on Jan 6, 2021 (cherry picked from commit b0051e8; same squash-commit message as above).
🚀 Feature
Accept a not-yet-existing resume_from_checkpoint in Trainer for automatic training resume / auto-resubmit.
Motivation
Cloud ML training services (e.g. Google AI Platform Training, AWS SageMaker, AWS Batch) provide a job auto-retry feature.
If we could specify a checkpoint path up front, job auto-retry could be used for training resume / resubmit.
Unfortunately, PyTorch Lightning cannot accept a not-yet-existing file as the resume_from_checkpoint argument of Trainer; it simply raises an error. The motivation of this feature request is to enable training resume through a not-yet-existing resume_from_checkpoint.
(This feature looks similar to the auto-resubmit of pl's SLURM support, but I am a total newbie about it, so this could be off the mark.)
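For concreteness, this is the kind of configuration the request is about. The bucket path is a made-up example, and today this setup fails on the very first attempt because the file does not exist yet:

```python
from pytorch_lightning import Trainer

# Made-up example path on cloud storage; on the very first attempt of a job,
# nothing has been written here yet.
CKPT_PATH = "s3://my-bucket/job-123/last.ckpt"

# Today this errors out on the first attempt because the file does not exist;
# the request is that Trainer should instead start training from scratch.
trainer = Trainer(resume_from_checkpoint=CKPT_PATH)
```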
Pitch
The current checkpoint restore process:
https://github.com/PyTorchLightning/pytorch-lightning/blob/3abfec896212ea85e45d6ac3ccb323ef242d16de/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L57-L60
It uses an (already existing) resume_from_checkpoint. If the file at the specified path does not exist, it raises an error.
What I hope for is this: if the checkpoint_path (resume_from_checkpoint) file does not exist, simply ignore it and start training from scratch. In that case, training starts normally from scratch, and pl then saves checkpoints as usual.
If we set the save path equal to resume_from_checkpoint, the latest checkpoint file will exist at the resume_from_checkpoint path.
When job auto-retry is triggered, a checkpoint file now exists at resume_from_checkpoint, so the retried job loads the checkpoint from resume_from_checkpoint and training resumes properly. A minimal sketch of this behavior is shown below.
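A rough sketch of the proposed skip-if-missing restore, to make the pitch concrete. This is illustrative only: the function name and the wiring are assumptions, not actual PyTorch Lightning code, and restore() stands in for the call in the linked checkpoint_connector.py lines.

```python
import os


def restore_weights_if_available(trainer):
    """Illustrative sketch of the proposed behavior (not actual PL code):
    only restore when the checkpoint file already exists."""
    ckpt_path = trainer.resume_from_checkpoint
    if ckpt_path is not None and os.path.isfile(ckpt_path):
        # Retried job: the previous attempt already wrote a checkpoint here,
        # so resume from it (restore() stands in for the call linked above).
        trainer.checkpoint_connector.restore(ckpt_path, on_gpu=trainer.on_gpu)
    else:
        # First attempt: the file does not exist yet, so instead of raising,
        # silently fall through and start training from scratch.
        pass
```

On the user side, the idea is simply to point the checkpoint save location and resume_from_checkpoint at the same path, so the first attempt trains from scratch and every auto-retried attempt resumes from the last saved checkpoint.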
Alternatives
Use hpc_save & hpc_load's resume system for normal training.
As far as I have read the code, "HPC weights load" (for SLURM...?) enables auto-resubmit based on a directory (not a file) plus a file-name rule (hpc_ckpt_{ckpt_number}.ckpt). If we accepted a checkpoint directory (e.g. resume_from_checkpoint_dir), the same mechanism could be used for resume/resubmit; a sketch of that rule follows below.
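For illustration, the directory + file-name rule described above could be resolved roughly like this. The helper name and the exact matching logic are assumptions based on the hpc_ckpt_{ckpt_number}.ckpt pattern, not actual pl code:

```python
import os
import re


def find_latest_hpc_ckpt(ckpt_dir):
    """Hypothetical helper: return the newest hpc_ckpt_{N}.ckpt in ckpt_dir,
    or None when the directory or checkpoints do not exist yet (first run)."""
    if not os.path.isdir(ckpt_dir):
        return None
    numbered = []
    for name in os.listdir(ckpt_dir):
        match = re.fullmatch(r"hpc_ckpt_(\d+)\.ckpt", name)
        if match:
            numbered.append((int(match.group(1)), name))
    if not numbered:
        return None
    _, latest_name = max(numbered)  # highest checkpoint number wins
    return os.path.join(ckpt_dir, latest_name)
```

A hypothetical resume_from_checkpoint_dir argument could call such a helper: resume when it returns a path, otherwise start from scratch, mirroring the behavior pitched above.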