Prevent crash if sync_dist=True on CPU #4626
Conversation
Hello @SeanNaren! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-11-11 20:29:39 UTC
Looks great! Great catch!
LGTM!
Codecov Report
```diff
@@           Coverage Diff           @@
##           master   #4626    +/-  ##
=======================================
  Coverage      93%     93%
=======================================
  Files         116     116
  Lines        8873    8879     +6
=======================================
+ Hits         8254    8261     +7
- Misses        619     618     -1
```
I've decided to take @awaelchli's advice and enforce the check across the accelerators, keeping the base class as is. Less liability of running into issues in the future, imo. I'll give some time for people to re-review :) cc @tchaton @ananthsub @rohitgr7
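A rough sketch of the approach described above: the base class stays unimplemented, and each concrete accelerator is forced to provide its own sync behaviour. The class and method names here are illustrative, not the exact Lightning internals:

```python
import torch
import torch.distributed


class Accelerator:
    """Base class: deliberately left unimplemented, as in the comment above."""

    def sync_tensor(self, tensor: torch.Tensor, group=None, reduce_op=None) -> torch.Tensor:
        raise NotImplementedError()


class CPUAccelerator(Accelerator):
    def sync_tensor(self, tensor: torch.Tensor, group=None, reduce_op=None) -> torch.Tensor:
        # Single-process CPU training: nothing to sync, return unchanged.
        return tensor


class DDPAccelerator(Accelerator):
    def sync_tensor(self, tensor: torch.Tensor, group=None, reduce_op=None) -> torch.Tensor:
        # Distributed case: all-reduce across processes (reduce_op handling omitted).
        torch.distributed.all_reduce(tensor, group=group)
        return tensor
```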
```diff
@@ -682,3 +682,69 @@ def get_expected_output(func_attr, original_values):
         assert func_name in trainer.logger_connector.progress_bar_metrics
     else:
         assert func_name not in trainer.logger_connector.progress_bar_metrics
+
+
+def test_logging_sync_dist_true_cpu(tmpdir):
```
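The diff above only shows the new test's signature. A sketch of what such a test might contain; the model, the logged value, and the import path are assumptions, not the exact test body:

```python
import torch
from pytorch_lightning import Trainer
from tests.base import BoringModel  # import path is an assumption


def test_logging_sync_dist_true_cpu(tmpdir):
    """self.log(..., sync_dist=True) should not crash on a CPU-only Trainer."""

    class TestModel(BoringModel):
        def training_step(self, batch, batch_idx):
            out = super().training_step(batch, batch_idx)
            # Before the fix, this call raised NotImplementedError on CPU.
            self.log('foo', torch.tensor(1.0), sync_dist=True)
            return out

    trainer = Trainer(default_root_dir=tmpdir, fast_dev_run=True)
    trainer.fit(TestModel())
```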
Can we parametrize this test to cover both True and False?
Also, it seems the code below uses the very same class; can we define it just once?
cc: @SeanNaren
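One way the suggestion above could be implemented, sketched with `pytest.mark.parametrize` so a single shared model class covers both settings (names and import path are illustrative):

```python
import pytest
import torch
from pytorch_lightning import Trainer
from tests.base import BoringModel  # import path is an assumption


class SyncDistModel(BoringModel):
    """Defined once and reused for every parametrized case."""

    sync_dist = True  # overridden per test case below

    def training_step(self, batch, batch_idx):
        out = super().training_step(batch, batch_idx)
        self.log('foo', torch.tensor(1.0), sync_dist=self.sync_dist)
        return out


@pytest.mark.parametrize("sync_dist", [True, False])
def test_logging_sync_dist_cpu(tmpdir, sync_dist):
    model = SyncDistModel()
    model.sync_dist = sync_dist
    trainer = Trainer(default_root_dir=tmpdir, fast_dev_run=True)
    trainer.fit(model)
```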
* Added test/fix for sync_dist raising NotImplementedError
* Fixed comments/formatting
* Revert base class change, enforce sync tensors across accelerators, added GPU test

(cherry picked from commit 33470ba)
What does this PR do?
As discussed in the PL Slack, the code currently crashes if `sync_dist=True` is set on CPU. This regresses the behaviour from before the latest Horovod changes (the bug was introduced in #3775) and breaks a lot of functionality, since `sync_dist=True` needs to cover all accelerator cases, including CPU. A minimal repro sketch is shown below.
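A minimal repro sketch of the failure mode described above. The model and import path are illustrative; per the commit messages, before this PR the `self.log` call raised NotImplementedError on a CPU-only Trainer:

```python
import torch
from pytorch_lightning import Trainer
from tests.base import BoringModel  # import path is an assumption


class ReproModel(BoringModel):
    def training_step(self, batch, batch_idx):
        out = super().training_step(batch, batch_idx)
        # Crashed on CPU before this PR: tensor sync was not implemented
        # for the CPU accelerator path.
        self.log('loss_synced', out['loss'], sync_dist=True)
        return out


trainer = Trainer(fast_dev_run=True)  # no GPUs/TPUs: runs on the CPU path
trainer.fit(ReproModel())
```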
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the review guidelines.
Did you have fun?
Make sure you had fun coding 🙃