Fix ddp for test #2866
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master   #2866     +/-   ##
=========================================
- Coverage      90%     88%      -2%
=========================================
  Files          79      79
  Lines        7472    7469      -3
=========================================
- Hits         6756    6568    -188
- Misses        716     901    +185
How does the fix work, given that the previous tests were also passing? Mind adding a test for the failure case?
@Borda This only affects ddp. Since we don't have ddp testing, things like this won't show up. I have a PR that adds some basic ddp tests: #2856
@tullie @williamFalcon explained to me that choosing random ports is necessary to avoid an error message saying "tried to init ddp connection multiple times" (paraphrasing) when you call, for example, .fit and then .test, which ultimately calls model.init_ddp_connection twice.
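For context, here is a hedged sketch of an alternative to per-process random ports: tear down the existing process group before re-initializing. This is illustrative plain torch.distributed, not the code from either PR, and the function name is hypothetical:

```python
import torch.distributed as dist


def init_ddp_connection_safely(rank: int, world_size: int):
    # Hypothetical guard: if .fit already created a process group, destroy it
    # before .test re-initializes, avoiding the "initialized twice" error
    # without giving each rank a different random port.
    if dist.is_initialized():
        dist.destroy_process_group()
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
```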
Going to close this because it seems like #2790 is the best WIP diff to work on this. @awaelchli will try to look into more of these issues this weekend.
Thank you @tullie. #2790 has lots of dirty debug code, so apologies; I ran into some weird things that made me desperate :))
What does this PR do?
DDP for Trainer.test is broken. The problem is that when each distributed process is spawned, a random port is set at the start of the test function. When execution reaches init_process_group, each process has a different port set, so the ranks cannot rendezvous and the program hangs.
These lines were introduced in https://github.com/PyTorchLightning/pytorch-lightning/pull/2512/files. I'm wondering why they were added in the first place.
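To make the failure mode above concrete, here is a minimal repro sketch in plain torch.distributed (illustrative only, not Lightning's actual code; the backend choice and port range are assumptions). Each spawned worker picks its own random MASTER_PORT, so the ranks try to rendezvous on different ports and init_process_group never returns:

```python
import os
import random

import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int):
    # BUG (mirrors the description above): the port is chosen *inside* each
    # spawned process, so rank 0 and rank 1 almost always disagree on it.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(random.randint(10000, 19000))
    # With mismatched ports the ranks wait for each other forever.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)  # hangs with the buggy port logic
```

Choosing the port once in the parent process, before mp.spawn, gives every rank the same value through the inherited environment, and the rendezvous succeeds.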
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃