
Error on example model #6517

Closed
denix56 opened this issue Mar 15, 2021 · 9 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on), waiting on author (Waiting on user action, correction, or update), won't fix (This will not be worked on)

Comments

@denix56

denix56 commented Mar 15, 2021

🐛 Bug

I receive this error on every GAN model that I try to execute.

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).

I tried a complete reinstall of PyTorch in a different conda environment, with both 1.8.0 and 1.9.0.dev.
I get the error on both the GAN example and my custom model.
My custom model has only one parameter in each of the networks, which I multiply with the input, but it gives the error too (see the sketch under "To Reproduce" below).

To Reproduce

GAN model from examples
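
For reference, a minimal sketch of the custom model described above (hypothetical names; assuming the two-optimizer training_step signature of Lightning 1.2.x), which triggers the same error when run with two GPUs in ddp:

import torch
import pytorch_lightning as pl


class Scale(torch.nn.Module):
    # single-parameter "network": multiplies the input by a learnable scalar
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(1))

    def forward(self, x):
        return self.weight * x


class MinimalGAN(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.generator = Scale()
        self.discriminator = Scale()

    def training_step(self, batch, batch_idx, optimizer_idx):
        x, _ = batch
        if optimizer_idx == 0:
            # generator step: loss flows through both sub-networks
            return self.discriminator(self.generator(x)).mean()
        # discriminator step: the generator parameter gets no gradient,
        # which is what the DDP reducer complains about
        return self.discriminator(x).mean()

    def configure_optimizers(self):
        opt_g = torch.optim.SGD(self.generator.parameters(), lr=0.1)
        opt_d = torch.optim.SGD(self.discriminator.parameters(), lr=0.1)
        return [opt_g, opt_d]

    def train_dataloader(self):
        # random data just to drive the optimizer steps
        dataset = torch.utils.data.TensorDataset(torch.randn(64, 1), torch.zeros(64))
        return torch.utils.data.DataLoader(dataset, batch_size=8)


if __name__ == "__main__":
    trainer = pl.Trainer(gpus=2, accelerator="ddp", max_epochs=1)
    trainer.fit(MinimalGAN())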

Expected behavior

Environment

PyTorch 1.8.0 (or latest dev)
PyTorch-Lightning 1.2.3

Additional context

I run on 2 NVIDIA V100 GPUs in DDP mode.

@denix56 denix56 added the bug (Something isn't working) and help wanted (Open to be worked on) labels Mar 15, 2021
@awaelchli
Contributor

Fixed by #6460?

@awaelchli
Contributor

awaelchli commented Mar 15, 2021

python pl_examples/domain_templates/generative_adversarial_net.py --num_processes 2 --accelerator ddp_cpu

fails on 1.2.3
works on master

The fix will be released as part of the bugfix release this week.

Workaround for you until then:

Trainer(..., plugins=[DDPPlugin(find_unused_parameters=True)])
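
Spelled out with the import (assuming PL 1.2.x, where DDPPlugin is exposed under pytorch_lightning.plugins):

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# let the DDP reducer skip parameters that receive no gradient in a given step,
# e.g. the generator's parameters during the discriminator optimizer step
trainer = Trainer(
    gpus=2,
    accelerator="ddp",
    plugins=[DDPPlugin(find_unused_parameters=True)],
)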

@denix56
Author

denix56 commented Mar 26, 2021

@awaelchli
I have Lightning 1.2.5 and the bug still exists.

@MarsSu0618

Hi, @awaelchli
I encountered the same problem.

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /opt/conda/conda-bld/pytorch_1579022060824/work/torch/csrc/distributed/c10d/reducer.cpp:514)

My environment:

pytorch: 1.4
pytorch-lightning: 1.1.1

So can I upgrade my version to solve it?

@tchaton
Contributor

tchaton commented Mar 29, 2021

Dear @denix56, @MarsSu0618,

Would you mind sharing a reproducible example for us to debug this behaviour?

And please update to PyTorch 1.6 and to master on Lightning.

Best,
T.C

@Borda
Member

Borda commented Apr 1, 2021

Hi @denix56, @MarsSu0618, could you check with the latest 1.2.6?
Also, I would not recommend relying on PT 1.9dev, as it is not a stable release.

@Borda Borda added the waiting on author (Waiting on user action, correction, or update) label Apr 1, 2021
@awaelchli
Contributor

Would it make sense to catch this PyTorch warning and replace it with a Lightning-friendly warning?
For PyTorch users it is clear that one has to set the flag in DDP but for Lightning users this is hidden away, and they can only know by going to our documentation.

We can catch the warning in forward in the wrapper, then replace it with a simple instruction to set the plugin parameter.
What do you think @SeanNaren @ananthsub @tchaton (tagging some random people who may know what I'm talking about?)
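
A rough sketch of the idea (purely illustrative; the function name and call site are hypothetical, not Lightning's actual internals):

def forward_with_friendly_error(ddp_module, *args, **kwargs):
    # intercept the raw PyTorch reducer error raised by the DDP wrapper and
    # re-raise it with an instruction that points at the Lightning plugin flag
    try:
        return ddp_module(*args, **kwargs)
    except RuntimeError as err:
        if "Expected to have finished reduction" in str(err):
            raise RuntimeError(
                "It looks like some parameters of your LightningModule did not "
                "contribute to the loss. Pass "
                "plugins=[DDPPlugin(find_unused_parameters=True)] to the Trainer."
            ) from err
        raise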

@SeanNaren
Contributor

Considering find_unused_parameters is now True by default, this should solve the issue here, right @awaelchli?

There is a warning printed now since find_unused_parameters is set to True by default, which suggests that people turn it off. If we could catch this and tell people to use accelerator=ddp_find_unused_parameters_false instead, that would be great (once #7224 is merged) :)
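
For reference, the suggested usage would look roughly like this (a sketch, contingent on #7224 landing as proposed):

from pytorch_lightning import Trainer

# opt back out of unused-parameter detection once it is enabled by default
trainer = Trainer(gpus=2, accelerator="ddp_find_unused_parameters_false")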

@stale

stale bot commented May 30, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label May 30, 2021
@stale stale bot closed this as completed Jun 6, 2021