
Prevent crash if sync_dist=True on CPU #4626

Merged · 3 commits · Nov 11, 2020
Conversation

SeanNaren (Contributor) commented Nov 11, 2020

What does this PR do?

As discussed in the PL Slack, the code currently crashes if sync_dist=True on CPU. This is a regression introduced by the latest Horovod changes (#3775) and breaks a lot of functionality: sync_dist=True needs to cover all accelerator cases, including CPU.
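The fix, per the commit messages below, reverts the base-class change and instead requires every accelerator to implement tensor syncing. A minimal, hypothetical Python sketch of that pattern (class and function names are illustrative, not PyTorch Lightning's actual API):

```python
# Hypothetical sketch of the accelerator pattern this PR fixes.
# Names are illustrative, not PyTorch Lightning's real API.

class Accelerator:
    """Base class: before the fix, syncing had no CPU-safe default."""
    def sync_tensor(self, tensor, group=None, reduce_op=None):
        # Pre-fix behaviour: hitting the base implementation crashed
        # any self.log(..., sync_dist=True) call on CPU.
        raise NotImplementedError

class CPUAccelerator(Accelerator):
    """Post-fix: each accelerator implements syncing; a single-process
    CPU run has nothing to reduce, so it is the identity."""
    def sync_tensor(self, tensor, group=None, reduce_op=None):
        return tensor

def log_metric(accelerator, value, sync_dist=False):
    """Stand-in for the logging path guarded by sync_dist."""
    if sync_dist:
        value = accelerator.sync_tensor(value)
    return value
```

With this shape, sync_dist=True is safe on every accelerator, and a forgotten implementation fails loudly at the accelerator level rather than deep inside logging.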

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified; bug fixes should be included in bug-fix release milestones (m.f.X) and features in (m.X.b) releases.

Did you have fun?

Make sure you had fun coding 🙃

pep8speaks commented Nov 11, 2020

Hello @SeanNaren! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-11 20:29:39 UTC

SeanNaren (Contributor, Author) commented

cc @ananthsub @carmocca

@SeanNaren SeanNaren added this to the 1.0.7 milestone Nov 11, 2020
tchaton (Contributor) left a comment

Looks great! Great catch!

ananthsub (Contributor) left a comment

LGTM!

codecov bot commented Nov 11, 2020

Codecov Report

Merging #4626 (4252c32) into master (3d202f9) will increase coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #4626   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         116     116           
  Lines        8873    8879    +6     
======================================
+ Hits         8254    8261    +7     
+ Misses        619     618    -1     

awaelchli (Contributor) left a comment

Have not tested on master, but the changes look OK.
Thanks @tchaton
EDIT: thanks @sean

SeanNaren (Contributor, Author) commented

I've decided to take @awaelchli's advice and enforce the check across the accelerators, keeping the base class as-is: less liability of running into issues in the future, IMO. I'll give people some time to re-review :) cc @tchaton @ananthsub @rohitgr7

@SeanNaren SeanNaren merged commit 33470ba into master Nov 11, 2020
@SeanNaren SeanNaren deleted the bug/sync_dist_default branch November 11, 2020 22:04
@Borda Borda modified the milestones: 1.0.7, 1.0.x Nov 11, 2020
@@ -682,3 +682,69 @@ def get_expected_output(func_attr, original_values):
assert func_name in trainer.logger_connector.progress_bar_metrics
else:
assert func_name not in trainer.logger_connector.progress_bar_metrics


def test_logging_sync_dist_true_cpu(tmpdir):
A Member left a comment

Can we parametrize this test to cover both True/False?
Also, it seems the test below uses the very same class; can we define it just once?
cc: @SeanNaren
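A sketch of the parametrization being suggested. The reduction helper here is a hypothetical stand-in for the logged metric, not the actual Lightning test; only the pytest pattern is the point:

```python
import pytest

def reduce_metric(values, sync_dist):
    # Stand-in for the logging path under test: with sync_dist the values
    # would be averaged across processes; without it only the local
    # (last) value is kept.
    return sum(values) / len(values) if sync_dist else values[-1]

# One test body, both sync_dist cases, no duplicated class definitions.
@pytest.mark.parametrize("sync_dist", [True, False])
def test_logging_sync_dist(sync_dist):
    result = reduce_metric([1.0, 3.0], sync_dist)
    assert result == (2.0 if sync_dist else 3.0)
```

Defining the shared model class once at module scope and toggling only the flag keeps the True/False cases from drifting apart.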

Borda pushed a commit that referenced this pull request Nov 12, 2020
* Added test/fix for sync_dist raising NotImplementedError

* Fixed comments/formatting

* Revert base class change, enforce sync tensors across accelerators, added GPU test

(cherry picked from commit 33470ba)
Borda added a commit that referenced this pull request Nov 12, 2020
* Added test/fix for sync_dist raising NotImplementedError

* Fixed comments/formatting

* Revert base class change, enforce sync tensors across accelerators, added GPU test

(cherry picked from commit 33470ba)
rohitgr7 pushed a commit that referenced this pull request Nov 21, 2020
* Added test/fix for sync_dist raising NotImplementedError

* Fixed comments/formatting

* Revert base class change, enforce sync tensors across accelerators, added GPU test
Labels: bug (Something isn't working), priority: 0 (High priority task)
Projects: none yet
7 participants