Support for uneven inputs in LightningDDP #5141
Conversation
Codecov Report
```diff
@@           Coverage Diff           @@
##           master   #5141    +/-  ##
=======================================
  Coverage      93%      93%
  Files         134      134
  Lines        9907     9907
=======================================
  Hits         9206     9206
  Misses        701      701
```
LGTM. One minor thing with the version comparison.
Can you maybe also add a TODO for changing this with PT 1.8?
The CI error looks unrelated: https://github.com/PyTorchLightning/pytorch-lightning/pull/5141/checks?check_run_id=1558105867
Failing doc checks... Will investigate!
The base branch was changed.
This is great! A few questions:
- How does this fit into the LightningModule lifecycle? Where should users set the context manager? (A plain-PyTorch sketch follows this list.)
- Does this work with automatic optimization?
- Can we use the testing script you provided as a test?
- Is there any disadvantage to using this context manager when inputs are not uneven?
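For reference, this is roughly how the `join()` context manager is used with plain PyTorch DDP (torch >= 1.7); the training-loop details here are illustrative, not taken from this PR:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int, model: torch.nn.Module, dataloader, optimizer) -> None:
    ddp_model = DDP(model.to(rank), device_ids=[rank])
    # join() keeps ranks that run out of batches participating in the
    # collective communication, so uneven per-rank datasets don't hang.
    with ddp_model.join():
        for batch in dataloader:  # per-rank loaders may have uneven lengths
            optimizer.zero_grad()
            loss = ddp_model(batch).sum()  # placeholder loss
            loss.backward()
            optimizer.step()
```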
```python
# during forward computation.
# This should be called only once during the whole training period.
if DDP_JOIN_AND_REBUILD_BUCKETS_AVAILABLE and self.reducer._rebuild_buckets():
    logging.info("Reducer buckets have been rebuilt in this iteration.")
```
This should probably be a debug or rank_zero_debug message.
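A minimal sketch of the suggested change, assuming `rank_zero_debug` is importable from `pytorch_lightning.utilities.distributed` (as in Lightning of this era), applied to the snippet quoted above:

```python
from pytorch_lightning.utilities.distributed import rank_zero_debug

if DDP_JOIN_AND_REBUILD_BUCKETS_AVAILABLE and self.reducer._rebuild_buckets():
    # Debug-level and rank-zero-only, so large multi-process jobs
    # don't emit one info line per rank.
    rank_zero_debug("Reducer buckets have been rebuilt in this iteration.")
```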
Force-pushed from f2a21ad to 7b38ba7.
I don't have full context, but I took a stab at answering the questions below
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.
@rohan-varma @carmocca @tchaton how is it going here? 🐰
We recently refactored DDP in #5185 and now rely directly on the PyTorch implementation. I can see that these code lines are already in PyTorch master.
So how can we use join()?
should we close this then? @awaelchli
Just as you would with plain PyTorch, but 1.2 needs to be released first.
Not sure, I think we still need to insert that join context manager somewhere in Lightning, right @rohan-varma?
Glad to see that the PyTorch Lightning implementation now directly relies on the PyTorch DDP implementation! @awaelchli Yes, we would need to insert the context manager somewhere. In this script: https://gist.github.com/rohan-varma/3906e7f07669f0177801a9f753848550 I did it by directly accessing the Lightning DDP override and calling join() on it.
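A rough sketch of that approach; the accessor for the wrapped module and the per-batch call are hypothetical stand-ins, the real version is in the gist above:

```python
# Hypothetical: reach into the trainer for Lightning's DDP override
# (the DDP-wrapped module) and enter its join() context around a
# manually driven loop, so ranks with fewer batches shadow the
# collectives of ranks that are still training.
ddp_override = trainer.model  # assumed accessor to the DDP-wrapped module
with ddp_override.join():
    for batch in train_loader:  # per-rank loaders may be uneven
        run_training_step(batch)  # placeholder for the actual per-batch work
```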
Your example script is very useful! The DDP no_sync is used here, and this context manager is then later used in the training loop when gradient accumulation happens. @justusschock What do you think, should we allow the plugin to return a training loop context, similar to what is done with the training_step_context in the new plugins?
Is it a good idea to make this a plugin hook?
@awaelchli That's probably a good idea, we could also move the DDP sync stuff there as well as the join. The training loop would be a bit cleaner then, since nothing explicitly related to DDP would be there.
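For concreteness, a hypothetical shape for such a hook; the method name `training_loop_context` and the `model` attribute are made up for illustration, not an actual Lightning API:

```python
from contextlib import contextmanager

class TrainingTypePlugin:
    @contextmanager
    def training_loop_context(self):
        yield  # default: no-op for plugins without a special loop context

class DDPPlugin(TrainingTypePlugin):
    def __init__(self, model):
        self.model = model  # the DDP-wrapped module (assumed attribute)

    def training_loop_context(self):
        # For DDP, wrap the whole loop in join() so ranks with uneven
        # inputs don't hang the collectives.
        return self.model.join()
```

The training loop itself would then run under `with plugin.training_loop_context():` regardless of the backend, keeping DDP specifics out of the loop.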
When will uneven inputs be supported, or is there at least a simple way to work around this? Thx
@ananthsub mentioned to me offline that he'd be taking over this PR, so cc'ing him here.
Hey @ananthsub, any updates on this PR? Best,
@rohan-varma thx for your patience, I'll try to push it a bit...
In the linked issue (#3325) we discussed and concluded that at the moment we can't integrate it.
What does this PR do?
Adds support for the DDP join() API to PyTorch Lightning, enabling training with uneven inputs. See pytorch/pytorch#38174 for the PyTorch RFC. Fixes #3325.

It is implemented in a backwards-compatible way by gating the code behind a version check requiring torch >= 1.7.0, where this API is available. Open to discussion on better ways to ensure backwards compatibility. Since PyTorch Lightning overrides the PyTorch DDP implementation, we will likely have to change this code again for PT 1.8, where these APIs have changed somewhat.
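As an illustration of the gating described above: the constant name `DDP_JOIN_AND_REBUILD_BUCKETS_AVAILABLE` appears in the reviewed diff, but the exact comparison below is an assumption:

```python
from distutils.version import LooseVersion

import torch

# join() and the Python-side _rebuild_buckets() are only available
# from torch 1.7.0 onwards, so gate the new code paths on the version.
DDP_JOIN_AND_REBUILD_BUCKETS_AVAILABLE = (
    LooseVersion(torch.__version__) >= LooseVersion("1.7.0")
)
```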
We also introduce the rebuild_buckets feature back to PyTorch Lightning (the logic to enable it was moved to Python in PT 1.7, but was not updated on the Lightning side). This is necessary since the DDP join() API assumes that buckets are rebuilt, which is always true in PT DDP. It also offers a potential performance improvement by ensuring our allreduce order corresponds with gradient-ready order.

Tested using the script at https://gist.github.com/rohan-varma/3906e7f07669f0177801a9f753848550. The join() API is also extensively tested in the PyTorch codebase.

Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:
Did you have fun?
Make sure you had fun coding 🙃