
[BugFix] Resolve DDP hanging bug within ModelCheckpoint and val_loss #6004

Closed
tchaton wants to merge 7 commits into master from resolve_hangs

Conversation

tchaton
Contributor

@tchaton tchaton commented Feb 16, 2021

What does this PR do?

Currently, training can hang under the following conditions (a sketch of such a setup follows the list):

  • No monitor is being provided to ModelCheckpoint
  • val_loss is being logged
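A minimal repro-style sketch of such a setup (the model and data below are throwaway placeholders, and it assumes two GPUs with the Lightning 1.2-era accelerator="ddp" flag):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint


    class BoringModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            return self.layer(batch[0]).sum()

        def validation_step(self, batch, batch_idx):
            loss = self.layer(batch[0]).sum()
            # no monitor is passed to ModelCheckpoint, so logging `val_loss`
            # triggers the fallback shown below
            self.log("val_loss", loss)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)


    def dataloader():
        return DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=2)


    if __name__ == "__main__":
        trainer = pl.Trainer(
            max_epochs=1,
            accelerator="ddp",
            gpus=2,
            callbacks=[ModelCheckpoint()],  # note: no `monitor` argument
        )
        trainer.fit(BoringModel(), dataloader(), dataloader())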

Internally in ModelCheckpoint, self.monitor will become val_loss.

    def _add_backward_monitor_support(self, trainer):
        metrics = trainer.logger_connector.callback_metrics
        # backward compatibility... need to deprecate
        if self.monitor is None and 'val_loss' in metrics:
            self.monitor = 'val_loss'
        if self.monitor is None and 'checkpoint_on' in metrics:
            self.monitor = 'checkpoint_on'
        if self.save_top_k is None and self.monitor is not None:
            self.save_top_k = 1

The code then diverges slightly later, when check_monitor_top_k returns a different value on each process:

self.check_monitor_top_k(metrics.get(self.monitor))
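A minimal sketch of the failure mode (the numbers are made up and the logic is heavily simplified): because each rank evaluates its own shard of the validation set, the un-reduced val_loss differs per rank, so the top-k decision can be True on one rank and False on another. The rank that enters the save path runs collective ops (for example, broadcasting the checkpoint path) that the other rank never joins, and DDP blocks forever.

    import torch

    best_so_far = torch.tensor(0.60)  # assume both ranks currently agree on the best score

    # un-reduced, per-rank callback metrics
    per_rank_metrics = {
        0: {"val_loss": torch.tensor(0.55)},  # rank 0
        1: {"val_loss": torch.tensor(0.65)},  # rank 1
    }

    for rank, metrics in per_rank_metrics.items():
        # simplified stand-in for check_monitor_top_k
        should_save = bool(metrics["val_loss"] < best_so_far)
        print(f"rank {rank}: should_save={should_save}")

    # rank 0: should_save=True  -> enters checkpoint saving and its collective calls
    # rank 1: should_save=False -> skips them, so rank 0 waits indefinitely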

Solution:

  • val_loss and checkpoint_on should be reduced across processes before ModelCheckpoint compares them (see the sketch below).
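A minimal sketch of the kind of reduction this points at (not the PR's exact code, which performs the reduction inside ModelCheckpoint through Lightning's internals; the helper below is a hypothetical stand-alone illustration using raw torch.distributed with the gloo backend):

    import os

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp


    def mean_reduce(value: torch.Tensor) -> torch.Tensor:
        """Average a scalar tensor across all ranks so every process sees the same number."""
        reduced = value.clone()
        dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
        reduced /= dist.get_world_size()
        return reduced


    def run(rank: int, world_size: int) -> None:
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        # each rank logs a different val_loss (e.g. from its own shard of the val set)
        val_loss = torch.tensor(0.5 if rank == 0 else 0.7)
        monitor = mean_reduce(val_loss)  # both ranks now see 0.6

        # with an identical monitor value, check_monitor_top_k decides the same way on
        # every rank, so the collective ops inside checkpoint saving stay in sync
        print(f"rank {rank}: monitor={monitor.item():.2f}")
        dist.destroy_process_group()


    if __name__ == "__main__":
        mp.spawn(run, args=(2,), nprocs=2)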

Fixes #5865

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Check that target branch and milestone match!

Did you have fun?

Make sure you had fun coding 🙃

@tchaton tchaton added the bug (Something isn't working) and distributed (Generic distributed-related topic) labels Feb 16, 2021
@tchaton tchaton added this to the 1.2 milestone Feb 16, 2021
@tchaton tchaton self-assigned this Feb 16, 2021
@pep8speaks

pep8speaks commented Feb 16, 2021

Hello @tchaton! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-02-17 13:12:13 UTC

@codecov

codecov bot commented Feb 16, 2021

Codecov Report

Merging #6004 (374466d) into master (5157ba5) will decrease coverage by 3%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #6004    +/-   ##
=======================================
- Coverage      93%     90%    -3%     
=======================================
  Files         160     170    +10     
  Lines       11340   11786   +446     
=======================================
+ Hits        10550   10637    +87     
- Misses        790    1149   +359     

@tchaton tchaton enabled auto-merge (squash) February 16, 2021 10:06
# when `val_loss` is being logged and no monitor is being provided to ModelCheckpoint,
# `val_loss` or `checkpoint_on` will be selected as the monitor and needs to be reduced
# to prevent process divergence
if self.monitor in ("val_loss", "checkpoint_on"):
Member

Shouldn't we always reduce it?

Contributor Author
@tchaton tchaton Feb 16, 2021

Metrics already perform reduction during compute or self.log.
I think the issue only happens with legacy metrics.
@awaelchli Should it always be done?

Contributor

I'm not sure about this. I thought the metrics that the checkpoint gets from the trainer/logger connector are already reduced? We shouldn't give the checkpoint the responsibility to reduce metrics or to assume how; the mean is not always correct.

Contributor
@SeanNaren SeanNaren Feb 16, 2021

Can we get a test case for this legacy metric? It would help us debug and figure out why this reduction is necessary.
I think the loss should always be reduced, so I'm a bit puzzled how this would fix the issue.

@carmocca
Contributor

Note that we should deprecate and remove _add_backward_monitor_support (we never got around to doing it). I can do it in another PR.

@tchaton
Contributor Author

tchaton commented Feb 16, 2021

Note that we should deprecate and remove _add_backward_monitor_support (we never got around to doing it). I can do it in another PR.

Sounds good!

@SeanNaren
Contributor

It seems like this didn't fix the issue in #5865. We're really missing a test case here for us to debug, so it might be a good idea to get the user to help out.

@carmocca carmocca mentioned this pull request Feb 17, 2021
@tchaton tchaton closed this Feb 17, 2021
auto-merge was automatically disabled February 17, 2021 13:12

Pull request was closed

@tchaton tchaton deleted the resolve_hangs branch February 17, 2021 13:12
Labels
bug (Something isn't working), distributed (Generic distributed-related topic), has conflicts
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Training stuck at 0% after few epochs while training with DDP
6 participants