Support for uneven inputs in LightningDDP #5141
Conversation
Codecov Report
```diff
@@           Coverage Diff           @@
##           master   #5141    +/-  ##
=======================================
  Coverage      93%      93%
  Files         134      134
  Lines        9907     9907
=======================================
  Hits         9206     9206
  Misses        701      701
```
LGTM. One minor thing with the version comparison.
Can you maybe also add a TODO for changing this with PT 1.8?
The CI error looks unrelated: https://github.com/PyTorchLightning/pytorch-lightning/pull/5141/checks?check_run_id=1558105867
Failing doc checks... Will investigate!
The base branch was changed.
This is great! A few questions:
- How does this fit into the LightningModule lifecycle? Where should users set the context manager? (A plain-PyTorch sketch follows this list.)
- Does this work with automatic optimization?
- Can we use the testing script you provided as a test?
- Is there any disadvantage to using this context manager when inputs are not uneven?
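For reference, this is roughly how the `join()` context manager is used with plain PyTorch DDP (torch >= 1.7); the training-loop details here are illustrative, not taken from this PR:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int, model: torch.nn.Module, dataloader, optimizer) -> None:
    ddp_model = DDP(model.to(rank), device_ids=[rank])
    # join() keeps ranks that run out of batches participating in the
    # collective communication, so uneven per-rank datasets don't hang.
    with ddp_model.join():
        for batch in dataloader:  # per-rank loaders may have uneven lengths
            optimizer.zero_grad()
            loss = ddp_model(batch).sum()  # placeholder loss
            loss.backward()
            optimizer.step()
```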
```python
# during forward computation.
# This should be called only once during the whole training period.
if DDP_JOIN_AND_REBUILD_BUCKETS_AVAILABLE and self.reducer._rebuild_buckets():
    logging.info("Reducer buckets have been rebuilt in this iteration.")
```
This should probably be a debug or rank_zero_debug message.
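A minimal sketch of the suggested change, assuming `rank_zero_debug` is importable from `pytorch_lightning.utilities.distributed` (as in Lightning of this era), applied to the snippet quoted above:

```python
from pytorch_lightning.utilities.distributed import rank_zero_debug

if DDP_JOIN_AND_REBUILD_BUCKETS_AVAILABLE and self.reducer._rebuild_buckets():
    # Debug-level and rank-zero-only, so large multi-process jobs
    # don't emit one info line per rank.
    rank_zero_debug("Reducer buckets have been rebuilt in this iteration.")
```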
Force-pushed from f2a21ad to 7b38ba7.
I don't have full context, but I took a stab at answering the questions below
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.
@rohan-varma @carmocca @tchaton how is it going here? 🐰
We recently refactored DDP in #5185 and now rely directly on the PyTorch implementation. I can see that these code lines are already in PyTorch master.
So how can we use join()?
should we close this then? @awaelchli
Just as you would with plain PyTorch, but 1.2 needs to be released first.
Not sure, I think we still need to insert that join context manager somewhere in Lightning, right @rohan-varma?
Glad to see that the PyTorch Lightning implementation now directly relies on the PyTorch DDP implementation! @awaelchli Yes, we would need to insert the context manager somewhere. In this script: https://gist.github.com/rohan-varma/3906e7f07669f0177801a9f753848550 I did it by directly accessing the Lightning DDP override and calling join() on it.
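A rough sketch of that approach; the accessor for the wrapped module and the per-batch call are hypothetical stand-ins, the real version is in the gist above:

```python
# Hypothetical: reach into the trainer for Lightning's DDP override
# (the DDP-wrapped module) and enter its join() context around a
# manually driven loop, so ranks with fewer batches shadow the
# collectives of ranks that are still training.
ddp_override = trainer.model  # assumed accessor to the DDP-wrapped module
with ddp_override.join():
    for batch in train_loader:  # per-rank loaders may be uneven
        run_training_step(batch)  # placeholder for the actual per-batch work
```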
Your example script is very useful! The DDP no_sync is used here, and this context manager is then later used in the training loop when gradient accumulation happens. @justusschock What do you think, should we allow the plugin to return a training loop context, similar to what is done with the training_step_context in the new plugins?
Is it a good idea to make this a plugin hook?
@awaelchli That's probably a good idea, we could also move the DDP sync stuff there as well as the join. The training loop would be a bit cleaner then, since nothing explicitly related to DDP would be there.
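For concreteness, a hypothetical shape for such a hook; the method name `training_loop_context` and the `model` attribute are made up for illustration, not an actual Lightning API:

```python
from contextlib import contextmanager

class TrainingTypePlugin:
    @contextmanager
    def training_loop_context(self):
        yield  # default: no-op for plugins without a special loop context

class DDPPlugin(TrainingTypePlugin):
    def __init__(self, model):
        self.model = model  # the DDP-wrapped module (assumed attribute)

    def training_loop_context(self):
        # For DDP, wrap the whole loop in join() so ranks with uneven
        # inputs don't hang the collectives.
        return self.model.join()
```

The training loop itself would then run under `with plugin.training_loop_context():` regardless of the backend, keeping DDP specifics out of the loop.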
When will uneven inputs be supported, or is there at least a simple way to work around this? Thx
@ananthsub mentioned to me offline that he'd be taking over this PR, so cc'ing him here.
Hey @ananthsub, any updates on this PR? Best,
@rohan-varma thx for your patience, I'll try to push it a bit...
In the linked issue (#3325) we discussed and concluded that at the moment we can't integrate it.
What does this PR do?
Adds support for the DDP join() API to PyTorch Lightning, enabling training with uneven inputs. See pytorch/pytorch#38174 for the PyTorch RFC. Fixes #3325.

It is implemented in a backwards-compatible way by gating the code behind a version check requiring torch >= 1.7.0, where this API is available. Open to discussion on better ways to ensure backwards compatibility. Since PyTorch Lightning overrides the PyTorch DDP implementation, we will likely have to change this code again for PT 1.8, where these APIs have changed somewhat.
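As an illustration of the gating described above: the constant name `DDP_JOIN_AND_REBUILD_BUCKETS_AVAILABLE` appears in the reviewed diff, but the exact comparison below is an assumption:

```python
from distutils.version import LooseVersion

import torch

# join() and the Python-side _rebuild_buckets() are only available
# from torch 1.7.0 onwards, so gate the new code paths on the version.
DDP_JOIN_AND_REBUILD_BUCKETS_AVAILABLE = (
    LooseVersion(torch.__version__) >= LooseVersion("1.7.0")
)
```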
We also introduce the rebuild_buckets feature back to PyTorch Lightning (the logic to enable it was moved to Python in PT 1.7, but was not updated on the Lightning side). This is necessary since the DDP join() API assumes that buckets are rebuilt, which is always true in PT DDP. It also offers a potential performance improvement by ensuring our allreduce order corresponds with gradient-ready order.

Tested using the script at https://gist.github.com/rohan-varma/3906e7f07669f0177801a9f753848550. The join() API is also extensively tested in the PyTorch codebase.

Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:
Did you have fun?
Make sure you had fun coding 🙃