
Fix ddp for test #2866

Closed

wants to merge 1 commit into from

Conversation

tullie
Contributor

@tullie tullie commented Aug 7, 2020

What does this PR do?

DDP for Trainer.test is broken. The problem is that when each distributed process is spawned, a random port is set at the start of the test function. When we get to the init_process_group call, the program hangs because each process has set a different port.

These lines were introduced in https://github.com/PyTorchLightning/pytorch-lightning/pull/2512/files. I'm wondering why they were added in the first place?
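For illustration, here is a minimal standalone repro of the failure mode (a sketch, not Lightning's actual code; the worker function here is made up):

```python
import os
import random

import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # Bug pattern: each spawned process picks its own random port, so the
    # ranks disagree on the rendezvous address. Rank 0's store listens on
    # a port that the other ranks never connect to.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(random.randint(10000, 19000))
    dist.init_process_group("gloo", rank=rank, world_size=world_size)  # hangs


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

Removing the per-process port selection, as this PR does, means all ranks agree on the port before init_process_group is called.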

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team August 7, 2020 15:52
@Borda Borda added bug Something isn't working ci Continuous Integration distributed Generic distributed-related topic labels Aug 7, 2020
@codecov

codecov bot commented Aug 7, 2020

Codecov Report

Merging #2866 into master will decrease coverage by 2%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #2866    +/-   ##
=======================================
- Coverage      90%     88%    -2%     
=======================================
  Files          79      79            
  Lines        7472    7469     -3     
=======================================
- Hits         6756    6568   -188     
- Misses        716     901   +185     


Member

@Borda Borda left a comment

How does the fix work, given that the previous tests were also passing? Mind adding a test for the failure case?

@mergify mergify bot requested a review from a team August 7, 2020 17:16
@awaelchli
Contributor

@Borda This only affects ddp. Since we don't have ddp testing, things like this won't show up. I have a PR that adds some basic ddp tests: #2856

These lines were introduced in https://github.com/PyTorchLightning/pytorch-lightning/pull/2512/files. I'm wondering why they were added in the first place?

@tullie @williamFalcon explained to me that choosing random ports is necessary to avoid an error message saying "tried to init ddp connection multiple times" (paraphrasing) when you, for example, call .fit and then .test, which ultimately calls model.init_ddp_connection twice.
However, I think this is just a workaround, and in PR #2790 I tried to make it reuse the same connection without running init multiple times.
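To make that concrete, here is a minimal sketch of the direction #2790 takes (a hypothetical helper, not the actual implementation): guard the init so that a second call, e.g. from .test() after .fit(), reuses the existing process group instead of re-initializing.

```python
import torch.distributed as dist


def init_ddp_connection(rank, world_size, backend="nccl"):
    # Hypothetical guard: initialize the default process group only once.
    # A later call (e.g. from .test() after .fit()) then becomes a no-op,
    # so there is no "initialized twice" error and no need for random ports.
    if not dist.is_initialized():
        dist.init_process_group(backend, rank=rank, world_size=world_size)
```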

@tullie
Contributor Author

tullie commented Aug 7, 2020

Going to close this because it seems like #2790 is the best WIP diff for working on this. @awaelchli will try to look into more of these issues this weekend.

@tullie tullie closed this Aug 7, 2020
@awaelchli
Contributor

Thank you @tullie. #2790 has lots of dirty debug code, so apologies; I ran into some weird things that made me desperate :))

@Borda Borda modified the milestones: 1.0.0, 0.9.0 Aug 20, 2020