progressive restoring of trainer state #7652

awaelchli · 2021-05-22T01:26:00Z

What does this PR do?

Includes the changes from #7955. The target branch will be changed to master.

The restore flow before:

call to Trainer.fit()

model.setup("fit")  # trainer calls setup
call_configure_sharded_model(model)
accelerator.setup(model) 
...

dispatch() #  _run_train, etc.
restore() # restore everything at once (model, callbacks, optimizers, loop state etc.)

The restore flow now:

call to Trainer.fit()

# check if/what we need to resume, load the checkpoint file
resume_start() 

model.setup("fit")  # trainer calls setup

# model weights get restored as soon as the model is set up
restore_datamodule() # calls datamodule.on_load_checkpoint()
restore_model() # calls model.on_load_checkpoint()
restore_callbacks() # calls callback.on_load_checkpoint()

call_configure_sharded_model(model)
accelerator.setup(model) 

pre_dispatch()

# restore training state as soon as everything is set up
restore_training_state()  # restore optimizer, precision, loop progress

dispatch() #  _run_train, etc.

resume_end() # delete checkpoint file in memory
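
The new ordering can be sketched as a small, self-contained Python simulation. This is not the Lightning implementation: `RestoreFlowSketch`, its `_step`/`_restore` helpers, and the in-memory `checkpoint` dict are hypothetical stand-ins, and each method only records that it ran. Skipping the restore steps when no checkpoint is given is also an assumption made for illustration.

```python
from typing import Any, Dict, List, Optional


class RestoreFlowSketch:
    """Hypothetical stand-in for the trainer; it only records the call order."""

    def __init__(self, checkpoint: Optional[Dict[str, Any]] = None) -> None:
        # Assumption: a plain dict replaces the checkpoint file on disk.
        self._source = checkpoint
        self._loaded: Optional[Dict[str, Any]] = None
        self.calls: List[str] = []

    def _step(self, name: str) -> None:
        self.calls.append(name)

    def _restore(self, name: str) -> None:
        # Assumption: restore steps are no-ops when there is nothing to resume.
        if self._loaded is not None:
            self._step(name)

    def resume_start(self) -> None:
        # "Load" the checkpoint once so every later restore step can read it.
        self._loaded = dict(self._source) if self._source is not None else None
        self._step("resume_start")

    def resume_end(self) -> None:
        # Drop the in-memory checkpoint once all state has been restored.
        self._loaded = None
        self._step("resume_end")

    def fit(self) -> List[str]:
        self.resume_start()
        self._step("model.setup")
        # Datamodule, model, and callbacks are restored right after setup.
        for name in ("restore_datamodule", "restore_model", "restore_callbacks"):
            self._restore(name)
        self._step("call_configure_sharded_model")
        self._step("accelerator.setup")
        self._step("pre_dispatch")
        # Optimizer, precision, and loop progress are restored last.
        self._restore("restore_training_state")
        self._step("dispatch")
        self.resume_end()
        return self.calls


order = RestoreFlowSketch(checkpoint={"state_dict": {}}).fit()
print(order)
```

The point of the sketch is only the ordering: the checkpoint is loaded once up front, consumed in stages as each component becomes ready, and released at the end.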

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

Follow up to this PR

Let the plugin decide whether to restore model weights before or after the call to the configure_sharded_model hook, as per #7535 (comment).

if TrainingTypePlugin.restore_before_configure_sharded_model:
    restore_model() # also calls model.on_load_checkpoint()

call_configure_sharded_model(model)

if not TrainingTypePlugin.restore_before_configure_sharded_model:
    restore_model() # also calls model.on_load_checkpoint()
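
As a runnable illustration of the proposed follow-up, the branching above can be written as a pure function that returns the resulting step order. The function name `restore_order` is hypothetical, and the flag is the one proposed in this follow-up, not a confirmed attribute of any released plugin.

```python
from typing import List


def restore_order(restore_before_configure_sharded_model: bool) -> List[str]:
    """Return the order of the two steps for a given (proposed) plugin flag."""
    steps: List[str] = []
    if restore_before_configure_sharded_model:
        steps.append("restore_model")
    steps.append("call_configure_sharded_model")
    if not restore_before_configure_sharded_model:
        steps.append("restore_model")
    return steps


print(restore_order(True))
print(restore_order(False))
```

Either way `restore_model` runs exactly once; the flag only moves it to the other side of the hook.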

pep8speaks · 2021-05-22T01:26:06Z

Hello @awaelchli! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-06-17 07:49:49 UTC

codecov · 2021-05-22T01:27:17Z

Codecov Report

Merging #7652 (02c3d54) into master (bc2c2db) will decrease coverage by 0%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #7652    +/-   ##
=======================================
- Coverage      92%     91%    -0%     
=======================================
  Files         207     207            
  Lines       13374   13485   +111     
=======================================
+ Hits        12246   12317    +71     
- Misses       1128    1168    +40
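
The "decrease coverage by 0%" line and the 92% to 91% change in the table can be reconciled with a quick calculation. The sketch below assumes coverage is hits divided by lines and that the report shows whole percentages; both are assumptions about how the figures were produced, and `coverage_percent` is a hypothetical helper.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage, assuming coverage = hits / lines."""
    return 100.0 * hits / lines


# Figures taken from the report above.
before = coverage_percent(12246, 13374)
after = coverage_percent(12317, 13485)
delta = after - before

print(f"before={before:.2f}% after={after:.2f}% delta={delta:+.2f} points")
```

Under these assumptions the two values fall on either side of the 91.5% rounding boundary, so the displayed figure drops by a full point although the underlying change is only about 0.2 percentage points.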


tchaton

LGTM !


awaelchli mentioned this pull request May 24, 2021

[bug]Resuming From Checkpoint for FP16 failure (Single GPU) #7535

Closed

awaelchli added the checkpointing, distributed, and feature labels May 24, 2021

awaelchli changed the base branch from master to feature/resume-6-2 June 9, 2021 14:35

awaelchli changed the base branch from feature/resume-6-2 to master June 9, 2021 14:36

awaelchli mentioned this pull request Jun 9, 2021

split restore_training_state into logical parts [1 / 2] #7901

Merged

11 tasks

ananthsub reviewed Jun 10, 2021

pytorch_lightning/trainer/connectors/checkpoint_connector.py (outdated, resolved)

pytorch_lightning/plugins/training_type/training_type_plugin.py (outdated, resolved)

awaelchli mentioned this pull request Jun 10, 2021

refactor checkpoint loading for training type plugins #7928

Merged

11 tasks

awaelchli added 4 commits June 12, 2021 14:36

deprecate

a2560f1

test

88f2015

tests

b0c0b07

ypf

0f17119

awaelchli mentioned this pull request Jun 12, 2021

deprecate hpc_load() and integrate it with restore() #7955

Merged

11 tasks

awaelchli changed the base branch from master to feature/resume-8 June 12, 2021 12:55

all

3aef4e4

awaelchli force-pushed the feature/resume branch from 95a7cb4 to 3aef4e4 Compare June 12, 2021 13:03

clean up

3cc54b8

awaelchli force-pushed the feature/resume branch from 6d453f6 to 3cc54b8 Compare June 12, 2021 13:09

awaelchli added 4 commits June 12, 2021 15:14

clean up

0fa9807

test hook calls

f62cd51

space

09dd67d

Merge branch 'feature/resume-8' into feature/resume-9

ce2887e

Merge branch 'master' into feature/resume-9

c94fd78

mergify bot removed the has conflicts label Jun 16, 2021

add guard to restore_datamodule

a987f1d

awaelchli commented Jun 16, 2021

pytorch_lightning/trainer/connectors/checkpoint_connector.py (outdated, resolved)

awaelchli added 3 commits June 16, 2021 09:47

rm duplicate comment

c8ef693

Merge branch 'master' into feature/resume-9

6abf23a

add hook test

6a38c9b

awaelchli commented Jun 16, 2021

tests/models/test_hooks.py (outdated, resolved)

awaelchli and others added 4 commits June 16, 2021 14:22

comment

d208b5c

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

25918e7

blank

3d9539d

Merge remote-tracking branch 'origin/feature/resume' into feature/resume-9

f018bf8

awaelchli requested a review from carmocca June 16, 2021 12:37

mergify bot added the has conflicts label Jun 16, 2021

awaelchli added 2 commits June 16, 2021 15:43

merge tests

b45c335

Merge branch 'master' into feature/resume-9

0b4f7a7

mergify bot removed the has conflicts label Jun 16, 2021

clarify how many batches need to run

7e38deb

carmocca approved these changes Jun 16, 2021


carmocca added this to the v1.4 milestone Jun 16, 2021

tchaton approved these changes Jun 16, 2021


tchaton added the ready label Jun 16, 2021

awaelchli enabled auto-merge (squash) June 16, 2021 21:18

carmocca reviewed Jun 17, 2021

tests/models/test_hooks.py (outdated, resolved)

justusschock approved these changes Jun 17, 2021


Update tests/models/test_hooks.py

02c3d54

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

awaelchli merged commit eebdc91 into master Jun 17, 2021

awaelchli deleted the feature/resume branch June 17, 2021 08:13

awaelchli mentioned this pull request Jul 21, 2021

fix restoring finetune callbacks after accelerator setup on training resume #8501

Merged

12 tasks

carmocca mentioned this pull request Nov 6, 2023

convert_module in BitsandbytesPrecision is called before configure_model #18936

Closed
