
Default process group is not initialized in setup() function #6318

Closed
dhkim0225 opened this issue Mar 3, 2021 · 4 comments · Fixed by #6506
Labels: bug (Something isn't working) · distributed (Generic distributed-related topic) · help wanted (Open to be worked on) · priority: 0 (High priority task)

Comments


dhkim0225 commented Mar 3, 2021

🐛 Bug

Default process group is not initialized in the DataModule setup() function.

This is a BC-breaking change with PL >= 1.2.0.
With PL == 1.1.8, this code works.

Notebook to reproduce: https://colab.research.google.com/drive/1AHadRi0Bly9OnzrJFv8XmS2T9Y5zklvg?usp=sharing

Expected behavior

fit() should work.
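
For reference, a minimal sketch of the failure pattern (illustrative, not the exact notebook code): any torch.distributed call inside the DataModule's setup() hits the uninitialized default process group on PL >= 1.2.0 under DDP.

```python
import torch
import pytorch_lightning as pl


class MyDataModule(pl.LightningDataModule):
    def setup(self, stage=None):
        # On PL >= 1.2.0 with DDP this raises, because the default process
        # group has not been initialized yet when setup() runs.
        rank = torch.distributed.get_rank()
        print(f"preparing data for rank {rank}")
```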

Environment

* CUDA:
	- GPU:
		- Tesla T4
	- available:         True
	- version:           10.1
* Packages:
	- numpy:             1.19.5
	- pyTorch_debug:     False
	- pyTorch_version:   1.7.1+cu101
	- pytorch-lightning: 1.2.1
	- tqdm:              4.41.1
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- 
	- processor:         x86_64
	- python:            3.7.10
	- version:           #1 SMP Thu Jul 23 08:00:38 PDT 2020
@dhkim0225 dhkim0225 added bug Something isn't working help wanted Open to be worked on labels Mar 3, 2021
@tchaton tchaton added the priority: 1 Medium priority task label Mar 3, 2021

awaelchli commented Mar 3, 2021

I believe this is the PR that changed this behavior: #5858
As a workaround, you should be able to move the code you have in setup() to the dataloader methods.
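
A rough sketch of that workaround, assuming the distributed-dependent logic is only needed when building the dataloader (the DataModule and dataset here are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class MyDataModule(pl.LightningDataModule):
    def setup(self, stage=None):
        # keep setup() free of torch.distributed calls on PL >= 1.2.0
        self.dataset = TensorDataset(torch.randn(64, 4), torch.randint(0, 2, (64,)))

    def train_dataloader(self):
        # the dataloader hooks run after the DDP connection is set up,
        # so the default process group is available here
        if torch.distributed.is_available() and torch.distributed.is_initialized():
            print(f"building dataloader on rank {torch.distributed.get_rank()}")
        return DataLoader(self.dataset, batch_size=8)
```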

SeanNaren commented

Hey @dhkim0225, thanks for making the issue!

@awaelchli to copy the offline discussion we had:

I don't think #5858 is directly the reason; DDP used to override the setup function and call the hook later itself: https://github.com/PyTorchLightning/pytorch-lightning/blob/7f8fdda9a2c43e679e29fca[…]367c55/pytorch_lightning/accelerators/legacy/ddp_accelerator.py
I'm unsure what the fix here should be.

  • Refactor the code so that we init distributed before the setup hook is called, introducing a new function call from trainer -> accelerator -> training type plugin, such as init_distributed, which could be a no-op for single-device plugins (a rough sketch of this option follows the list).
  • Define a new hook self.call_hook("on_after_accelerator_backend_setup", model) which is called after the setup function, once the accelerator has set up distributed via its setup function.
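
A rough sketch of what the first option could look like; init_distributed comes from the suggestion above, while the class and function names are illustrative rather than the actual Lightning code:

```python
import torch.distributed as dist


# Illustrative sketch only -- not the actual Lightning class hierarchy.
class TrainingTypePlugin:
    def init_distributed(self) -> None:
        """No-op for single-device / single-process plugins."""


class DDPStylePlugin(TrainingTypePlugin):
    def init_distributed(self) -> None:
        # A DDP-style plugin would create the default process group here,
        # before the trainer runs the setup() hooks. The env:// init method
        # expects MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE to be set.
        if not dist.is_initialized():
            dist.init_process_group(backend="nccl", init_method="env://")


def call_setup_hooks(plugin: TrainingTypePlugin, datamodule) -> None:
    # trainer -> accelerator -> training type plugin
    plugin.init_distributed()
    datamodule.setup(stage="fit")  # setup() now sees an initialized group
```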

@awaelchli awaelchli added the distributed Generic distributed-related topic label Mar 7, 2021

awaelchli commented Mar 8, 2021

@SeanNaren thanks for providing some suggestions.

The first option sounds reasonable; however, it is challenging to call it at the right time consistently, since not all plugins init the DDP connection at the same time. For example, the DDPPlugin does it in pre_dispatch:

https://github.com/PyTorchLightning/pytorch-lightning/blob/ff1610492788fb3df79d534c369276d06246c368/pytorch_lightning/plugins/training_type/ddp.py#L225

while the DDPSpawn plugin and all its subclassed plugins do it in the spawned subprocess after dispatch:
https://github.com/PyTorchLightning/pytorch-lightning/blob/718074b99afc17204a1973f1bc94befa611ac094/pytorch_lightning/plugins/training_type/ddp_spawn.py#L129

In fact, you can see my old TODO note there, below the init_ddp_connection call, left over from the accelerator refactor.
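
To make the timing difference concrete, here is a heavily simplified pseudocode view of the two orderings (the attribute and method names are made up for illustration, not the actual trainer internals):

```python
# Simplified pseudocode of the two orderings, not the actual trainer code.

def ddp_flow(trainer):
    # DDPPlugin: the process group is created in pre_dispatch,
    # i.e. after accelerator setup but before dispatch.
    trainer.accelerator.setup()    # setup() hooks run around here
    trainer.plugin.pre_dispatch()  # init_ddp_connection() happens here
    trainer.dispatch()


def ddp_spawn_flow(trainer):
    # DDPSpawnPlugin: processes are spawned at dispatch time and the
    # process group is only created inside each child process.
    trainer.accelerator.setup()
    trainer.dispatch()  # spawns subprocesses ...
    # ... each subprocess then calls init_ddp_connection() itself
```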

The second option you mention is not going to work, because the hook after accelerator.setup is too early.

I'm not sure, but at the moment it looks like calling the setup hook would have to be a responsibility of the plugin, which is suboptimal.

SeanNaren commented

Thanks for the patience @dhkim0225, I have a fix in #6506; feel free to try it out. It required a bit of thought/refactoring, but we got to a solution in the end. For anyone else who finds this issue: currently this only works with DDP, not DDP Spawn. This is due to how DDP Spawn is designed, which may be improved in the future.
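
For anyone trying the fix, the two modes are selected via the Trainer accelerator argument (as of PL 1.2.x); this is only a usage sketch, not part of the fix itself:

```python
import pytorch_lightning as pl

# With the fix in #6506, setup() should see an initialized process group
# when running plain DDP; DDP Spawn is not covered yet.
trainer = pl.Trainer(gpus=2, accelerator="ddp")          # covered
# trainer = pl.Trainer(gpus=2, accelerator="ddp_spawn")  # not covered yet
```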
