Not-yet-existing resume_from_checkpoint for auto-resubmit #4366
Labels: checkpointing (Related to checkpointing), feature (Is an improvement or enhancement), help wanted (Open to be worked on)
Comments
tarepan added the feature (Is an improvement or enhancement) and help wanted (Open to be worked on) labels on Oct 26, 2020
tarepan added commits to tarepan/pytorch-lightning that referenced this issue on Oct 27, 2020
Add draft pull request for discussion.
tarepan added a commit to tarepan/pytorch-lightning that referenced this issue on Oct 30, 2020
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
wait for merge.
Borda added a commit that referenced this issue on Jan 5, 2021
…4402)
* Add empty resume_from_checkpoint acceptance #4366
* Fix general error catch with focused file check
* Add fsspec HTTP extras
  Add fsspec's HTTPFileSystem support through http extras. pl has supported remote http file (e.g. #2925), so this commit do not add new functionality.
* Fix potential too much logging in DDP
* Add PR changelog
* Add well-written argument explanation
  Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Fix DDP-compatible restore logging
  Notify from where the states are restored. This feature temporally deleted as a result of PR review. With succeeding review, added with DDP compatibility.
* Fix utility import pathes
* Refactor load step commentaries
* Refactor hpc ckpt suffix acquisition
* Refactor restore/hpc_load match
* Refactor hpc load trial
* Refactor checkpoint dir check
* Refactor unneeded function nest
* Refactor nested If
* Refactor duplicated cache clear
* Refactor attempt flow with if/elif
* Fix pip8
* Refactor hook commentary
  Co-authored-by: chaton <thomas@grid.ai>
* Fix pep8
* Refactor hpc load checkpoint path acquisition
* Fix pip8
* Fix typo
  Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Fix typo
  Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Fix doc
  Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Refactor None Union type with Optional
* Fix build-doc CI failure debuged in #5329
* Fix fsspec import during build-doc #5329
* Fix test epoch
  Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* Fix test with latest test models
* .
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
The PR is merged.
Borda pushed a commit that referenced this issue on Jan 6, 2021 (cherry picked from commit b0051e8; same squash-commit message as above).
🚀 Feature
Accept a not-yet-existing resume_from_checkpoint in Trainer for automatic training resume / auto-resubmit.
Motivation
Cloud ML training services (e.g. Google AI Platform Training, AWS SageMaker, AWS Batch) provide a job auto-retry feature.
If we could specify a checkpoint path up front, job auto-retry could be used for training resume / resubmit.
Unfortunately, PyTorch Lightning cannot accept a not-yet-existing file as the resume_from_checkpoint argument of Trainer; it simply raises an error. The motivation of this feature request is to enable training resume through a not-yet-existing resume_from_checkpoint.
(This feature looks similar to the auto-resubmit of pl's SLURM support, but I am a total newbie about it, so this could be off the mark.)
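For concreteness, this is the kind of configuration the request is about. The bucket path is a made-up example, and today this setup fails on the very first attempt because the file does not exist yet:

```python
from pytorch_lightning import Trainer

# Made-up example path on cloud storage; on the very first attempt of a job,
# nothing has been written here yet.
CKPT_PATH = "s3://my-bucket/job-123/last.ckpt"

# Today this errors out on the first attempt because the file does not exist;
# the request is that Trainer should instead start training from scratch.
trainer = Trainer(resume_from_checkpoint=CKPT_PATH)
```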
Pitch
The current checkpoint restore process:
https://github.com/PyTorchLightning/pytorch-lightning/blob/3abfec896212ea85e45d6ac3ccb323ef242d16de/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L57-L60
It uses an (already existing) resume_from_checkpoint. If the file at the specified path does not exist, it raises an error.
What I hope for is this: if the checkpoint_path (resume_from_checkpoint) file does not exist, simply ignore it and start training from scratch. In that case, training starts normally from scratch, and pl then saves checkpoints as usual.
If we set the save path equal to resume_from_checkpoint, the latest checkpoint file will exist at the resume_from_checkpoint path.
When job auto-retry is triggered, a checkpoint file now exists at resume_from_checkpoint, so the retried job loads the checkpoint from resume_from_checkpoint and training resumes properly. A minimal sketch of this behavior is shown below.
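A rough sketch of the proposed skip-if-missing restore, to make the pitch concrete. This is illustrative only: the function name and the wiring are assumptions, not actual PyTorch Lightning code, and restore() stands in for the call in the linked checkpoint_connector.py lines.

```python
import os


def restore_weights_if_available(trainer):
    """Illustrative sketch of the proposed behavior (not actual PL code):
    only restore when the checkpoint file already exists."""
    ckpt_path = trainer.resume_from_checkpoint
    if ckpt_path is not None and os.path.isfile(ckpt_path):
        # Retried job: the previous attempt already wrote a checkpoint here,
        # so resume from it (restore() stands in for the call linked above).
        trainer.checkpoint_connector.restore(ckpt_path, on_gpu=trainer.on_gpu)
    else:
        # First attempt: the file does not exist yet, so instead of raising,
        # silently fall through and start training from scratch.
        pass
```

On the user side, the idea is simply to point the checkpoint save location and resume_from_checkpoint at the same path, so the first attempt trains from scratch and every auto-retried attempt resumes from the last saved checkpoint.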
Alternatives
Use hpc_save & hpc_load's resume system for normal training.
As far as I have read the code, "HPC weights load" (for SLURM...?) enables auto-resubmit based on a directory (not a file) plus a file-name rule (hpc_ckpt_{ckpt_number}.ckpt). If we accepted a checkpoint directory (e.g. resume_from_checkpoint_dir), the same mechanism could be used for resume/resubmit; a sketch of that rule follows below.
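For illustration, the directory + file-name rule described above could be resolved roughly like this. The helper name and the exact matching logic are assumptions based on the hpc_ckpt_{ckpt_number}.ckpt pattern, not actual pl code:

```python
import os
import re


def find_latest_hpc_ckpt(ckpt_dir):
    """Hypothetical helper: return the newest hpc_ckpt_{N}.ckpt in ckpt_dir,
    or None when the directory or checkpoints do not exist yet (first run)."""
    if not os.path.isdir(ckpt_dir):
        return None
    numbered = []
    for name in os.listdir(ckpt_dir):
        match = re.fullmatch(r"hpc_ckpt_(\d+)\.ckpt", name)
        if match:
            numbered.append((int(match.group(1)), name))
    if not numbered:
        return None
    _, latest_name = max(numbered)  # highest checkpoint number wins
    return os.path.join(ckpt_dir, latest_name)
```

A hypothetical resume_from_checkpoint_dir argument could call such a helper: resume when it returns a path, otherwise start from scratch, mirroring the behavior pitched above.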