
Add required states for resumed ModelCheckpoint GC #10995

Merged: 15 commits into master on Dec 20, 2021
Conversation

@ORippler (Contributor) commented Dec 8, 2021

What does this PR do?

Fixes #4911
Related: #5090

Currently, when resuming training, the internal states required for continued ModelCheckpoint garbage collection are neither saved nor restored. As a result, k new checkpoints are always generated due to this check. The new checkpoints are properly garbage-collected and compared against, but the old ones are not.

Note that this PR does not handle overrides of monitor, dirpath, or mode, as also discussed in #4911.
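
For context, here is a minimal sketch of the state handling this PR introduces (hedged: the four GC-related keys match the diff discussed later in this thread, the remaining keys follow what ModelCheckpoint already persisted, and the exact merged code may differ):

# Sketch only, not the merged implementation.
def on_save_checkpoint(self, trainer, pl_module, checkpoint):
    return {
        "monitor": self.monitor,
        "best_model_score": self.best_model_score,
        "best_model_path": self.best_model_path,
        "current_score": self.current_score,
        # new in this PR: the bookkeeping needed for continued GC
        "best_k_models": self.best_k_models,
        "kth_best_model_path": self.kth_best_model_path,
        "kth_value": self.kth_value,
        "last_model_path": self.last_model_path,
    }

def on_load_checkpoint(self, trainer, pl_module, callback_state):
    self.best_model_score = callback_state["best_model_score"]
    self.best_model_path = callback_state["best_model_path"]
    # new in this PR: restore the GC bookkeeping so old checkpoints
    # participate in top-k comparison and deletion after resuming
    self.best_k_models = callback_state["best_k_models"]
    self.kth_best_model_path = callback_state["kth_best_model_path"]
    self.kth_value = callback_state["kth_value"]
    self.last_model_path = callback_state["last_model_path"]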

Does your PR introduce any breaking changes? If yes, please list them.

Resuming training might now fail where it previously did not, if the paths were changed in the meantime (see also #4911). I did not test for this, but I confirmed that resumption of GC now works properly.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃
cc @carmocca @awaelchli @ninginthecloud @jjenniferdai

@justusschock (Member) left a comment

@ORippler thanks for this fix. To avoid regression again, we need a test for this.

Do you think the following reflects this issue sufficiently? (If so, feel free to take it and commit it directly to your branch.)

# Imports needed to run this test (LogInTwoMethods is a helper
# LightningModule defined in PL's checkpointing test suite):
import os

import torch

from pytorch_lightning import Trainer, seed_everything
from pytorch_lightning.callbacks import ModelCheckpoint


def test_model_checkpoint_attributes(tmpdir):
    seed_everything()
    model = LogInTwoMethods()

    epochs = 2
    checkpoint_callback = ModelCheckpoint(monitor=None, dirpath=tmpdir, save_top_k=-1, save_last=True)
    trainer = Trainer(
        default_root_dir=tmpdir,
        callbacks=[checkpoint_callback],
        limit_train_batches=10,
        limit_val_batches=10,
        max_epochs=epochs,
        logger=False,
    )

    trainer.fit(model)

    # compare the callback state persisted in the checkpoint against the live callback
    checkpoint = torch.load(os.path.join(tmpdir, "last.ckpt"))["callbacks"][checkpoint_callback.state_key]
    for k in ("best_k_models", "kth_best_model_path", "kth_value", "last_model_path"):
        assert checkpoint[k] == getattr(checkpoint_callback, k)

@ORippler (Contributor, Author) commented Dec 8, 2021

(Quoting @justusschock's review comment and proposed test above.)

How would this integrate with the existing functionality tests for ModelCheckpoint in the overall testing framework? I see many tests checking whether attributes are written to the ckpt properly, but not whether they are loaded back. For example, here we check that current_score is written to the ckpt on disk, but the current on_load_checkpoint never sets this attribute again.

Isn't this something we want to test as well? I am a bit confused here.
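
For illustration of the asymmetry described above (a hypothetical snippet continuing from the test earlier in this thread; fresh_cb is a made-up name, and the behavior shown reflects on_load_checkpoint as of this PR):

# Hypothetical: inspect the persisted callback state directly.
ckpt = torch.load(os.path.join(tmpdir, "last.ckpt"))
cb_state = ckpt["callbacks"][checkpoint_callback.state_key]
assert "current_score" in cb_state  # current_score *is* written to the ckpt ...

# ... but feeding this state to a fresh callback does not bring it back,
# because on_load_checkpoint never restores current_score.
fresh_cb = ModelCheckpoint(dirpath=tmpdir)
fresh_cb.on_load_checkpoint(trainer, model, cb_state)
assert fresh_cb.current_score is None  # still the __init__ default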

@justusschock (Member) commented

@ORippler that's true. I suppose for end-to-end testing you would have to extend that test by resuming with a freshly created trainer and examining the properties there.

@carmocca do we want to test for different parametrizations of the callback here?

Note that we do not yet check for proper loading/reinstantiation of
ModelCheckpoint based on the ckpt written to disk
@ORippler (Contributor, Author) commented Dec 9, 2021

(Quoting @justusschock's reply above.)

I added your test and expanded it to check whether a freshly instantiated ModelCheckpoint also loads the properties.
This is still not a fully functional end-to-end test, though.
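
A rough sketch of that expansion (hedged: this is not the exact test that was committed; it reuses the names from the test proposed above and assumes resuming via trainer.fit(..., ckpt_path=...)):

# Fresh callback and trainer; max_epochs is already reached, so fit()
# restores state from the checkpoint and stops without further training.
checkpoint_callback2 = ModelCheckpoint(monitor=None, dirpath=tmpdir, save_top_k=-1, save_last=True)
trainer2 = Trainer(
    default_root_dir=tmpdir,
    callbacks=[checkpoint_callback2],
    limit_train_batches=10,
    limit_val_batches=10,
    max_epochs=epochs,
    logger=False,
)
trainer2.fit(model, ckpt_path=os.path.join(tmpdir, "last.ckpt"))

# the freshly instantiated callback should have picked up the persisted state
for k in ("best_k_models", "kth_best_model_path", "kth_value", "last_model_path"):
    assert getattr(checkpoint_callback2, k) == getattr(checkpoint_callback, k)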

Side note: do we have a test that compares the results of one continuous training run against an interrupted run that is resumed by passing the checkpoint to trainer.fit? That would be nice to have, imo.

@justusschock (Member) commented Dec 9, 2021

@ORippler we have https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/trainer/test_trainer.py#L399, which doesn't check the results. However, this is on purpose: the checkpoint does not include any random state, so continuing from the checkpoint is not guaranteed to yield exactly the same results (for example, different random states when using the global RNG). Support for this is currently in development.

cc @tchaton to add a similar test once fault tolerance is ready

@tchaton (Contributor) commented Dec 9, 2021

Hey @ORippler,

Yes, we have multiple tests checking that the weights are the same before and after for Fault Tolerance. Here they are: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/utilities/test_auto_restart.py

@justusschock added the bug (Something isn't working) label Dec 10, 2021
ORippler and others added 2 commits December 15, 2021 18:46
`ModelCheckpoint` is configured to save after every epoch,
but `trainer.fit` is called with `max_steps = 1`

Note there may be a better way of doing this, where `ModelCheckpoint`
is called after `training_step`
@codecov (bot) commented Dec 16, 2021

Codecov Report

Merging #10995 (3d7994a) into master (e19d93f) will decrease coverage by 4%.
The diff coverage is 100%.

@@           Coverage Diff            @@
##           master   #10995    +/-   ##
========================================
- Coverage      92%      88%    -4%     
========================================
  Files         177      177            
  Lines       16502    16560    +58     
========================================
- Hits        15173    14604   -569     
- Misses       1329     1956   +627     

@mergify (bot) added the ready (PRs ready to be merged) label Dec 17, 2021
* First save, then load ckpt.
* Instantiate ModelCheckpoint twice.
@awaelchli (Contributor) left a comment:

LGTM

tests/checkpointing/test_model_checkpoint.py
@justusschock justusschock merged commit 86a3c5e into Lightning-AI:master Dec 20, 2021
awaelchli added a commit that referenced this pull request Dec 21, 2021
lexierule pushed a commit that referenced this pull request Dec 21, 2021
@rohitgr7 mentioned this pull request Feb 7, 2022
Comment on lines +347 to +350
"best_k_models": self.best_k_models,
"kth_best_model_path": self.kth_best_model_path,
"kth_value": self.kth_value,
"last_model_path": self.last_model_path,
A Contributor left a review comment:

I just noticed an issue with doing this.

Since we save each ModelCheckpoint mode sequentially, these attributes will not be correct, depending on the order, if more than one mode triggers a save for the same global step:

https://github.com/PyTorchLightning/pytorch-lightning/blob/fe940e195dceb18eb9f3bd512cea56ae3405d464/pytorch_lightning/callbacks/model_checkpoint.py#L366-L373

Currently, a "top-k" checkpoint will not include the last_model_path even if the "last" checkpoint is saved right after it for this global step.

I'm not sure what the best solution here would be. I think we should start recommending multiple ModelCheckpoint instances as a best practice, because these interactions between flags can be unintuitive.

cc @awaelchli @ananthsub @jjenniferdai
Related to #4335 and #11805 (comment)
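
For illustration only (not part of this PR; the monitor key and dirpaths are made up), the multiple-instances pattern suggested above could look like this:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# One callback per concern, so their saved states never interleave
# within the same global step.
top_k_ckpt = ModelCheckpoint(monitor="val_loss", save_top_k=3, dirpath="ckpts/top_k")
last_ckpt = ModelCheckpoint(save_last=True, save_top_k=0, dirpath="ckpts/last")

trainer = Trainer(callbacks=[top_k_ckpt, last_ckpt])

Each instance then saves, tracks, and garbage-collects its own files independently.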

Labels
bug (Something isn't working) · callback: model checkpoint · ready (PRs ready to be merged)
Projects
None yet
Development
Successfully merging this pull request may close these issues: ModelCheckpoint Callback save and restore extension
6 participants