
[BugFix] Resolve DDP hanging bug within ModelCheckpoint and val_loss #6004

Closed
tchaton wants to merge 7 commits into master from resolve_hangs

Conversation

tchaton
Contributor

@tchaton tchaton commented Feb 16, 2021

What does this PR do?

Currently, training can hang under the following conditions (a sketch of such a setup follows the list):

  • No monitor is being provided to ModelCheckpoint
  • val_loss is being logged
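A minimal repro-style sketch of such a setup (the model and data below are throwaway placeholders, and it assumes two GPUs with the Lightning 1.2-era accelerator="ddp" flag):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint


    class BoringModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            return self.layer(batch[0]).sum()

        def validation_step(self, batch, batch_idx):
            loss = self.layer(batch[0]).sum()
            # no monitor is passed to ModelCheckpoint, so logging `val_loss`
            # triggers the fallback shown below
            self.log("val_loss", loss)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)


    def dataloader():
        return DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=2)


    if __name__ == "__main__":
        trainer = pl.Trainer(
            max_epochs=1,
            accelerator="ddp",
            gpus=2,
            callbacks=[ModelCheckpoint()],  # note: no `monitor` argument
        )
        trainer.fit(BoringModel(), dataloader(), dataloader())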

Internally in ModelCheckpoint, self.monitor will become val_loss.

    def _add_backward_monitor_support(self, trainer):
        metrics = trainer.logger_connector.callback_metrics
        # backward compatibility... need to deprecate
        if self.monitor is None and 'val_loss' in metrics:
            self.monitor = 'val_loss'
        if self.monitor is None and 'checkpoint_on' in metrics:
            self.monitor = 'checkpoint_on'
        if self.save_top_k is None and self.monitor is not None:
            self.save_top_k = 1

The code then diverges slightly later, when check_monitor_top_k returns a different value on each process:

self.check_monitor_top_k(metrics.get(self.monitor))
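A minimal sketch of the failure mode (the numbers are made up and the logic is heavily simplified): because each rank evaluates its own shard of the validation set, the un-reduced val_loss differs per rank, so the top-k decision can be True on one rank and False on another. The rank that enters the save path runs collective ops (for example, broadcasting the checkpoint path) that the other rank never joins, and DDP blocks forever.

    import torch

    best_so_far = torch.tensor(0.60)  # assume both ranks currently agree on the best score

    # un-reduced, per-rank callback metrics
    per_rank_metrics = {
        0: {"val_loss": torch.tensor(0.55)},  # rank 0
        1: {"val_loss": torch.tensor(0.65)},  # rank 1
    }

    for rank, metrics in per_rank_metrics.items():
        # simplified stand-in for check_monitor_top_k
        should_save = bool(metrics["val_loss"] < best_so_far)
        print(f"rank {rank}: should_save={should_save}")

    # rank 0: should_save=True  -> enters checkpoint saving and its collective calls
    # rank 1: should_save=False -> skips them, so rank 0 waits indefinitely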

Solution:

  • val_loss and checkpoint_on should be reduced across processes before ModelCheckpoint compares them (see the sketch below).
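A minimal sketch of the kind of reduction this points at (not the PR's exact code, which performs the reduction inside ModelCheckpoint through Lightning's internals; the helper below is a hypothetical stand-alone illustration using raw torch.distributed with the gloo backend):

    import os

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp


    def mean_reduce(value: torch.Tensor) -> torch.Tensor:
        """Average a scalar tensor across all ranks so every process sees the same number."""
        reduced = value.clone()
        dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
        reduced /= dist.get_world_size()
        return reduced


    def run(rank: int, world_size: int) -> None:
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        # each rank logs a different val_loss (e.g. from its own shard of the val set)
        val_loss = torch.tensor(0.5 if rank == 0 else 0.7)
        monitor = mean_reduce(val_loss)  # both ranks now see 0.6

        # with an identical monitor value, check_monitor_top_k decides the same way on
        # every rank, so the collective ops inside checkpoint saving stay in sync
        print(f"rank {rank}: monitor={monitor.item():.2f}")
        dist.destroy_process_group()


    if __name__ == "__main__":
        mp.spawn(run, args=(2,), nprocs=2)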

Fixes #5865

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Check that target branch and milestone match!

Did you have fun?

Make sure you had fun coding 🙃

@tchaton tchaton added the bug (Something isn't working) and distributed (Generic distributed-related topic) labels Feb 16, 2021
@tchaton tchaton added this to the 1.2 milestone Feb 16, 2021
@tchaton tchaton self-assigned this Feb 16, 2021
@pep8speaks

pep8speaks commented Feb 16, 2021

Hello @tchaton! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-02-17 13:12:13 UTC

@codecov

codecov bot commented Feb 16, 2021

Codecov Report

Merging #6004 (374466d) into master (5157ba5) will decrease coverage by 3%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #6004    +/-   ##
=======================================
- Coverage      93%     90%    -3%     
=======================================
  Files         160     170    +10     
  Lines       11340   11786   +446     
=======================================
+ Hits        10550   10637    +87     
- Misses        790    1149   +359     

@tchaton tchaton enabled auto-merge (squash) February 16, 2021 10:06
# when `val_loss` is being logged and no monitor is being provided to ModelCheckpoint,
# `val_loss` or `checkpoint_on` will be selected as the monitor and needs to be reduced
# to prevent process divergence
if self.monitor in ("val_loss", "checkpoint_on"):
Member

Shouldn't we always reduce it?

Contributor Author
@tchaton tchaton Feb 16, 2021

Metrics already perform reduction during compute or self.log.
I think the issue only happens with legacy metrics.
@awaelchli Should it always be done?

Contributor

I'm not sure about this. I thought the metrics that the checkpoint gets from the trainer/logger connector are already reduced? We shouldn't give the checkpoint the responsibility to reduce metrics or to assume how; the mean is not always correct.

Contributor
@SeanNaren SeanNaren Feb 16, 2021

Can we get a test case for this legacy metric? It would help us debug and figure out why this reduction is necessary.
I think the loss should always be reduced, so I'm a bit puzzled how this would fix the issue.

@carmocca
Contributor

Note that we should deprecate and remove _add_backward_monitor_support (we never got around to doing it). I can do it in another PR.

@tchaton
Contributor Author

tchaton commented Feb 16, 2021

Note that we should deprecate and remove _add_backward_monitor_support (we never got around to doing it). I can do it in another PR.

Sounds good!

@SeanNaren
Contributor

It seems like this didn't fix the issue in #5865. We're really missing a test case here for us to debug, so it might be a good idea to get the user to help out.

@carmocca carmocca mentioned this pull request Feb 17, 2021
@tchaton tchaton closed this Feb 17, 2021
auto-merge was automatically disabled February 17, 2021 13:12

Pull request was closed

@tchaton tchaton deleted the resolve_hangs branch February 17, 2021 13:12
Labels
bug (Something isn't working), distributed (Generic distributed-related topic), has conflicts
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Training stuck at 0% after few epochs while training with DDP
6 participants