
Error on example model #6517

Closed
denix56 opened this issue Mar 15, 2021 · 9 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on), waiting on author (Waiting on user action, correction, or update), won't fix (This will not be worked on)

Comments

@denix56

denix56 commented Mar 15, 2021

🐛 Bug

I receive this error on every GAN model that I try to execute.

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).

I tried a complete reinstall of PyTorch in a different conda environment, with both 1.8.0 and 1.9.0.dev.
I get the error on both the GAN example and my custom model.
My custom model has only one parameter in each of the networks, which I multiply with the input, but it gives the error too (see the sketch under "To Reproduce" below).

To Reproduce

GAN model from examples
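
For reference, a minimal sketch of the custom model described above (hypothetical names; assuming the two-optimizer training_step signature of Lightning 1.2.x), which triggers the same error when run with two GPUs in ddp:

import torch
import pytorch_lightning as pl


class Scale(torch.nn.Module):
    # single-parameter "network": multiplies the input by a learnable scalar
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(1))

    def forward(self, x):
        return self.weight * x


class MinimalGAN(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.generator = Scale()
        self.discriminator = Scale()

    def training_step(self, batch, batch_idx, optimizer_idx):
        x, _ = batch
        if optimizer_idx == 0:
            # generator step: loss flows through both sub-networks
            return self.discriminator(self.generator(x)).mean()
        # discriminator step: the generator parameter gets no gradient,
        # which is what the DDP reducer complains about
        return self.discriminator(x).mean()

    def configure_optimizers(self):
        opt_g = torch.optim.SGD(self.generator.parameters(), lr=0.1)
        opt_d = torch.optim.SGD(self.discriminator.parameters(), lr=0.1)
        return [opt_g, opt_d]

    def train_dataloader(self):
        # random data just to drive the optimizer steps
        dataset = torch.utils.data.TensorDataset(torch.randn(64, 1), torch.zeros(64))
        return torch.utils.data.DataLoader(dataset, batch_size=8)


if __name__ == "__main__":
    trainer = pl.Trainer(gpus=2, accelerator="ddp", max_epochs=1)
    trainer.fit(MinimalGAN())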

Expected behavior

Environment

PyTorch 1.8.0 (or latest dev)
PyTorch-Lightning 1.2.3

Additional context

I run on 2 NVIDIA V100 GPUs in DDP mode.

@denix56 denix56 added the bug (Something isn't working) and help wanted (Open to be worked on) labels Mar 15, 2021
@awaelchli
Contributor

Fixed by #6460?

@awaelchli
Contributor

awaelchli commented Mar 15, 2021

python pl_examples/domain_templates/generative_adversarial_net.py --num_processes 2 --accelerator ddp_cpu

fails on 1.2.3
works on master

The fix will be released as part of the bugfix release this week.

Workaround for you until then:

Trainer(..., plugins=[DDPPlugin(find_unused_parameters=True)])
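
Spelled out with the import (assuming PL 1.2.x, where DDPPlugin is exposed under pytorch_lightning.plugins):

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# let the DDP reducer skip parameters that receive no gradient in a given step,
# e.g. the generator's parameters during the discriminator optimizer step
trainer = Trainer(
    gpus=2,
    accelerator="ddp",
    plugins=[DDPPlugin(find_unused_parameters=True)],
)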

@denix56
Author

denix56 commented Mar 26, 2021

@awaelchli
I have Lightning 1.2.5 and the bug still exists.

@MarsSu0618

Hi, @awaelchli
I encountered the same problem.

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /opt/conda/conda-bld/pytorch_1579022060824/work/torch/csrc/distributed/c10d/reducer.cpp:514)

My environment:

pytorch: 1.4
pytorch-lightning: 1.1.1

So can I upgrade my version to solve it?

@tchaton
Contributor

tchaton commented Mar 29, 2021

Dear @denix56, @MarsSu0618,

Would you mind sharing a reproducible example for us to debug this behaviour?

And please update to PyTorch 1.6 and to master on Lightning.

Best,
T.C

@Borda
Member

Borda commented Apr 1, 2021

Hi @denix56, @MarsSu0618, could you check with the latest 1.2.6?
Also, I would not recommend relying on PT 1.9dev, as it is not a stable release.

@Borda Borda added the waiting on author (Waiting on user action, correction, or update) label Apr 1, 2021
@awaelchli
Contributor

Would it make sense to catch this PyTorch warning and replace it with a Lightning-friendly warning?
For PyTorch users it is clear that one has to set the flag in DDP but for Lightning users this is hidden away, and they can only know by going to our documentation.

We can catch the warning in forward in the wrapper, then replace it with a simple instruction to set the plugin parameter.
What do you think @SeanNaren @ananthsub @tchaton (tagging some random people who may know what I'm talking about?)
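
A rough sketch of the idea (purely illustrative; the function name and call site are hypothetical, not Lightning's actual internals):

def forward_with_friendly_error(ddp_module, *args, **kwargs):
    # intercept the raw PyTorch reducer error raised by the DDP wrapper and
    # re-raise it with an instruction that points at the Lightning plugin flag
    try:
        return ddp_module(*args, **kwargs)
    except RuntimeError as err:
        if "Expected to have finished reduction" in str(err):
            raise RuntimeError(
                "It looks like some parameters of your LightningModule did not "
                "contribute to the loss. Pass "
                "plugins=[DDPPlugin(find_unused_parameters=True)] to the Trainer."
            ) from err
        raise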

@SeanNaren
Contributor

Considering find_unused_parameters is now True by default, this should solve the issue here, right @awaelchli?

There is a warning printed now since find_unused_parameters is set to True by default, which suggests that people turn it off. If we could catch this and tell people to use accelerator=ddp_find_unused_parameters_false instead, that would be great (once #7224 is merged) :)
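
For reference, the suggested usage would look roughly like this (a sketch, contingent on #7224 landing as proposed):

from pytorch_lightning import Trainer

# opt back out of unused-parameter detection once it is enabled by default
trainer = Trainer(gpus=2, accelerator="ddp_find_unused_parameters_false")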

@stale

stale bot commented May 30, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label May 30, 2021
@stale stale bot closed this as completed Jun 6, 2021