[WIP] ddp testing #2856
Hello @awaelchli! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-08-09 05:17:30 UTC
Codecov Report
@@           Coverage Diff           @@
##           master   #2856    +/-   ##
=======================================
- Coverage      90%     89%     -1%
=======================================
  Files          79      79
  Lines        7192    7302    +110
=======================================
+ Hits         6496    6515     +19
- Misses        696     787     +91
Success! The DDP test is running and freezing as expected (see the PR description for details). Now I can keep working on the fix with a reproducible test here as proof :)
Amazing, this will be so helpful
std = std.decode('utf-8').strip()
err = err.decode('utf-8').strip()
assert std | ||
if p.returncode: |
What about if p.returncode is falsey? It should probably raise something, right?
I think a return code of 0 means success and anything > 0 means failure. For example, sys.exit(0) signals success.
Ahh, makes sense. Maybe p.returncode != 0? That also covers the case where it's None, etc.
I will incorporate all your feedback comments in the follow-up PR #2997
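For reference, the return-code check discussed in this thread could look roughly like the sketch below. It is not the code from this PR: the helper name run_and_check and the RuntimeError are assumptions made for illustration. One detail worth noting is that Popen.returncode is only None while the process has not terminated (or has not been waited on), so after communicate() it is always an integer and != 0 is a sufficient check.

```python
import subprocess
import sys


def run_and_check(cmd):
    """Hypothetical helper: run cmd, return (stdout, stderr), raise on failure."""
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # communicate() waits for the process to exit, so returncode is set afterwards
    std, err = p.communicate()
    std = std.decode('utf-8').strip()
    err = err.decode('utf-8').strip()
    # 0 means success; any other value is treated as a failure
    if p.returncode != 0:
        raise RuntimeError(f'command exited with code {p.returncode}: {err}')
    return std, err


if __name__ == '__main__':
    out, _ = run_and_check([sys.executable, '-c', 'print("ok")'])
    print(out)
```

Raising instead of asserting keeps the child's stderr in the error message, which tends to make CI failures easier to read.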
I came to the realization that it makes no sense to implement tests like the ones here. Multiple trainer.fit() or trainer.test() calls are not possible because the program is launched several times in this mode. See also my other answer in PR #2790
I'm adding tests for (single node) distributed_backend=ddp. So far we only had multi-GPU tests with the ddp_spawn backend.
This test just demonstrates that DDP is not working on master. The issue is being fixed in #2790.
Proof that the process hangs:
http://35.192.60.23/PyTorchLightning/pytorch-lightning/7892/1/2
It's stuck at the ddp test (the last test that passed was the DP test right before it).
I manually killed the job after several minutes.
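A hang like this can also be caught automatically instead of killing the job by hand. The sketch below is an assumption for illustration, not part of this PR: the helper name run_with_timeout is hypothetical, and it relies only on the standard library's subprocess.run, which kills the direct child and raises TimeoutExpired once the timeout elapses.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Hypothetical helper: return the exit code, or None if cmd timed out."""
    try:
        completed = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the direct child at this point
        return None
    return completed.returncode


if __name__ == '__main__':
    # A child that sleeps longer than the timeout stands in for a frozen test run
    hanging = [sys.executable, '-c', 'import time; time.sleep(30)']
    print(run_with_timeout(hanging, timeout_s=1))
```

Only the direct child is killed here; any processes that child launched itself (as a DDP script does) may need separate cleanup.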