Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] ddp testing #2856

Closed
wants to merge 18 commits into from
Closed

[WIP] ddp testing #2856

wants to merge 18 commits into from

Conversation

awaelchli
Copy link
Contributor

@awaelchli awaelchli commented Aug 6, 2020

I'm adding tests for (single node) distributed_backend=ddp. So far we only had multi gpu tests with ddp_spawn backend.

This test here is just demonstrating that on master DDP is not working. Fixing the issue in #2790

Proof here that the process hangs up:
http://35.192.60.23/PyTorchLightning/pytorch-lightning/7892/1/2
It's stuck at the ddp test (last test that passed was the DP test right before that).
I manually killed the job after several minutes.

@pep8speaks
Copy link

pep8speaks commented Aug 6, 2020

Hello @awaelchli! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-09 05:17:30 UTC

@mergify mergify bot requested a review from a team August 6, 2020 21:45
@Borda Borda added the ci Continuous Integration label Aug 6, 2020
@codecov
Copy link

codecov bot commented Aug 7, 2020

Codecov Report

Merging #2856 into master will decrease coverage by 1%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #2856    +/-   ##
=======================================
- Coverage      90%     89%    -1%     
=======================================
  Files          79      79            
  Lines        7192    7302   +110     
=======================================
+ Hits         6496    6515    +19     
- Misses        696     787    +91     

tests/models/test_gpu.py Outdated Show resolved Hide resolved
@awaelchli awaelchli mentioned this pull request Aug 7, 2020
7 tasks
@mergify mergify bot requested a review from a team August 7, 2020 20:56
@mergify mergify bot requested a review from a team August 7, 2020 21:03
@awaelchli awaelchli added this to the 1.0.0 milestone Aug 8, 2020
@awaelchli
Copy link
Contributor Author

Success! DDP test is running and freezing as expected (see PR description for detail). Now I can continue trying to fix while having the proof / reproducible test here :)

Copy link
Contributor

@tullie tullie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, this will be so helpful

tests/models/test_gpu.py Show resolved Hide resolved
tests/models/test_gpu.py Show resolved Hide resolved
tests/models/test_gpu.py Show resolved Hide resolved
std = std.decode('utf-8').strip()
err = err.decode('utf-8').strip()
assert std
if p.returncode:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about if p.returncode is falsey? It should probably raise something, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think return code 0 means success, and > 0 fail. It's when you do sys.exit(0) it's success

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh makes sense. Maybe p.returncode != 0? Covers the case that it's None etc..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will incorporate all your feedback comments in the follow up PR #2997

@awaelchli
Copy link
Contributor Author

@tullie most recent fixes + test is in #2790, this one here is just to show current failing on master.

I totally confused you with Ethan and pinged him on the PR instead of you, sorry!!

@awaelchli
Copy link
Contributor Author

Came to the realization that it makes no sense to implement tests like the ones here. Multiple trainer.fit() or trainer.test are not possible because the program is launched several times in this mode.
The tests that I wrote here cannot pass, and it is not a bug in Lightning, it is simply a limitation by design that I, blind as I am, was not aware of.

See also my other answer in this PR #2790

@awaelchli awaelchli closed this Aug 15, 2020
@awaelchli awaelchli deleted the ddp_testing branch August 16, 2020 16:05
@Borda Borda modified the milestones: 1.0.0, 0.9.0 Aug 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci Continuous Integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants