
Default process group is not initialized in setup() function #6318

Closed
dhkim0225 opened this issue Mar 3, 2021 · 4 comments · Fixed by #6506
Labels: bug (Something isn't working) · distributed (Generic distributed-related topic) · help wanted (Open to be worked on) · priority: 0 (High priority task)

Comments


dhkim0225 commented Mar 3, 2021

🐛 Bug

Default process group is not initialized in the DataModule setup() function.

This is a BC-breaking change with PL >= 1.2.0.
With PL == 1.1.8, this code works.

Notebook to reproduce: https://colab.research.google.com/drive/1AHadRi0Bly9OnzrJFv8XmS2T9Y5zklvg?usp=sharing

Expected behavior

fit() should work.
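
For reference, a minimal sketch of the failure pattern (illustrative, not the exact notebook code): any torch.distributed call inside the DataModule's setup() hits the uninitialized default process group on PL >= 1.2.0 under DDP.

```python
import torch
import pytorch_lightning as pl


class MyDataModule(pl.LightningDataModule):
    def setup(self, stage=None):
        # On PL >= 1.2.0 with DDP this raises, because the default process
        # group has not been initialized yet when setup() runs.
        rank = torch.distributed.get_rank()
        print(f"preparing data for rank {rank}")
```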

Environment

* CUDA:
	- GPU:
		- Tesla T4
	- available:         True
	- version:           10.1
* Packages:
	- numpy:             1.19.5
	- pyTorch_debug:     False
	- pyTorch_version:   1.7.1+cu101
	- pytorch-lightning: 1.2.1
	- tqdm:              4.41.1
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- 
	- processor:         x86_64
	- python:            3.7.10
	- version:           #1 SMP Thu Jul 23 08:00:38 PDT 2020
@dhkim0225 dhkim0225 added bug Something isn't working help wanted Open to be worked on labels Mar 3, 2021
@tchaton tchaton added the priority: 1 Medium priority task label Mar 3, 2021

awaelchli commented Mar 3, 2021

I believe this is the PR that changed this behavior: #5858
As a workaround, you should be able to move the code you have in setup() to the dataloader methods.
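
A rough sketch of that workaround, assuming the distributed-dependent logic is only needed when building the dataloader (the DataModule and dataset here are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class MyDataModule(pl.LightningDataModule):
    def setup(self, stage=None):
        # keep setup() free of torch.distributed calls on PL >= 1.2.0
        self.dataset = TensorDataset(torch.randn(64, 4), torch.randint(0, 2, (64,)))

    def train_dataloader(self):
        # the dataloader hooks run after the DDP connection is set up,
        # so the default process group is available here
        if torch.distributed.is_available() and torch.distributed.is_initialized():
            print(f"building dataloader on rank {torch.distributed.get_rank()}")
        return DataLoader(self.dataset, batch_size=8)
```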

SeanNaren commented

Hey @dhkim0225, thanks for making the issue!

@awaelchli to copy the offline discussion we had:

I don't think #5858 is directly the reason; DDP used to override the setup function and call the hook later itself: https://github.com/PyTorchLightning/pytorch-lightning/blob/7f8fdda9a2c43e679e29fca[…]367c55/pytorch_lightning/accelerators/legacy/ddp_accelerator.py
I'm unsure what the fix here should be.

  • Refactor the code so that we init distributed before the setup hook is called, introducing a new function call from trainer -> accelerator -> training type plugin, such as init_distributed, which could be a no-op for single-device plugins (a rough sketch of this option follows the list).
  • Define a new hook self.call_hook("on_after_accelerator_backend_setup", model) which is called after the setup function, once the accelerator has set up distributed via its setup function.
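
A rough sketch of what the first option could look like; init_distributed comes from the suggestion above, while the class and function names are illustrative rather than the actual Lightning code:

```python
import torch.distributed as dist


# Illustrative sketch only -- not the actual Lightning class hierarchy.
class TrainingTypePlugin:
    def init_distributed(self) -> None:
        """No-op for single-device / single-process plugins."""


class DDPStylePlugin(TrainingTypePlugin):
    def init_distributed(self) -> None:
        # A DDP-style plugin would create the default process group here,
        # before the trainer runs the setup() hooks. The env:// init method
        # expects MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE to be set.
        if not dist.is_initialized():
            dist.init_process_group(backend="nccl", init_method="env://")


def call_setup_hooks(plugin: TrainingTypePlugin, datamodule) -> None:
    # trainer -> accelerator -> training type plugin
    plugin.init_distributed()
    datamodule.setup(stage="fit")  # setup() now sees an initialized group
```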

@awaelchli awaelchli added the distributed Generic distributed-related topic label Mar 7, 2021

awaelchli commented Mar 8, 2021

@SeanNaren thanks for providing some suggestions.

The first option sounds reasonable; however, it is challenging to call it at the right time consistently, since not all plugins init the DDP connection at the same time. For example, the DDPPlugin does it in pre_dispatch:

https://github.com/PyTorchLightning/pytorch-lightning/blob/ff1610492788fb3df79d534c369276d06246c368/pytorch_lightning/plugins/training_type/ddp.py#L225

while the DDPSpawn plugin and all its subclassed plugins do it in the spawned subprocess after dispatch:
https://github.com/PyTorchLightning/pytorch-lightning/blob/718074b99afc17204a1973f1bc94befa611ac094/pytorch_lightning/plugins/training_type/ddp_spawn.py#L129

In fact, you can see my old TODO note there, below the init_ddp_connection call, left over from the accelerator refactor.
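
To make the timing difference concrete, here is a heavily simplified pseudocode view of the two orderings (the attribute and method names are made up for illustration, not the actual trainer internals):

```python
# Simplified pseudocode of the two orderings, not the actual trainer code.

def ddp_flow(trainer):
    # DDPPlugin: the process group is created in pre_dispatch,
    # i.e. after accelerator setup but before dispatch.
    trainer.accelerator.setup()    # setup() hooks run around here
    trainer.plugin.pre_dispatch()  # init_ddp_connection() happens here
    trainer.dispatch()


def ddp_spawn_flow(trainer):
    # DDPSpawnPlugin: processes are spawned at dispatch time and the
    # process group is only created inside each child process.
    trainer.accelerator.setup()
    trainer.dispatch()  # spawns subprocesses ...
    # ... each subprocess then calls init_ddp_connection() itself
```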

The second option you mention is not going to work, because the hook after accelerator.setup is too early.

I'm not sure, but at the moment it looks like calling the setup hook would have to be a responsibility of the plugin, which is suboptimal.

SeanNaren commented

Thanks for the patience @dhkim0225, I have a fix in #6506; feel free to try it out. It required a bit of thought/refactoring, but we got to a solution in the end. For anyone else who finds this issue: currently this only works with DDP, not DDP Spawn. This is due to how DDP Spawn is designed, which may be improved in the future.
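
For anyone trying the fix, the two modes are selected via the Trainer accelerator argument (as of PL 1.2.x); this is only a usage sketch, not part of the fix itself:

```python
import pytorch_lightning as pl

# With the fix in #6506, setup() should see an initialized process group
# when running plain DDP; DDP Spawn is not covered yet.
trainer = pl.Trainer(gpus=2, accelerator="ddp")          # covered
# trainer = pl.Trainer(gpus=2, accelerator="ddp_spawn")  # not covered yet
```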
