Default process group is not initialized in setup() function #6318
Comments
I believe this is the PR that changed this behavior: #5858
Hey @dhkim0225, thanks for making the issue! @awaelchli, to copy the offline discussion we had: I don't think #5858 is directly the reason. DDP used to override the setup function and call the hook later itself: https://github.com/PyTorchLightning/pytorch-lightning/blob/7f8fdda9a2c43e679e29fca[…]367c55/pytorch_lightning/accelerators/legacy/ddp_accelerator.py
@SeanNaren thanks for providing some suggestions. The first option sounds reasonable; however, it is challenging to have it called at the right time consistently, since not all plugins init the DDP connection at the same time. For example, the DDPPlugin does it in pre_dispatch, while the DDPSpawn plugin and all its subclassed plugins do it in the spawned subprocess after dispatch. In fact you can see my old TODO note there below the […] The second option you mention is not going to work, because the hook after […] I'm not sure, but at the moment it looks like calling the setup hook would have to be a responsibility of the plugin, which is suboptimal.
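To make the timing problem above concrete, here is a toy sketch in plain Python. It is not Lightning's code: ToyPlugin, ToyDDP, ToyDDPSpawn and run() are hypothetical names, and the only thing modelled is the ordering described in the comment (one plugin connects in pre_dispatch, the other inside the process started by dispatch).

```python
from typing import Callable, Dict


class ToyPlugin:
    """Hypothetical stand-in for a training-type plugin (not Lightning's API)."""

    def __init__(self) -> None:
        self.group_ready = False  # models "default process group is initialized"

    def init_connection(self) -> None:
        self.group_ready = True

    def pre_dispatch(self) -> None:
        pass

    def dispatch(self, train_fn: Callable[[], None]) -> None:
        train_fn()


class ToyDDP(ToyPlugin):
    def pre_dispatch(self) -> None:
        # Assumption taken from the comment: DDP connects in pre_dispatch.
        self.init_connection()


class ToyDDPSpawn(ToyPlugin):
    def dispatch(self, train_fn: Callable[[], None]) -> None:
        # Assumption taken from the comment: the spawn variant connects
        # inside the spawned worker, i.e. only after dispatch has started.
        def worker() -> None:
            self.init_connection()
            train_fn()

        worker()  # a real implementation would start a subprocess here


def run(plugin: ToyPlugin, setup_at: str) -> bool:
    """Call a toy setup hook at a fixed trainer-side point.

    Returns whether the process group was ready when the hook ran.
    """
    seen: Dict[str, bool] = {}

    def setup() -> None:
        seen["ready"] = plugin.group_ready

    if setup_at == "before_pre_dispatch":
        setup()
    plugin.pre_dispatch()
    if setup_at == "after_pre_dispatch":
        setup()
    plugin.dispatch(lambda: None)
    return seen["ready"]


if __name__ == "__main__":
    for plugin_cls in (ToyDDP, ToyDDPSpawn):
        for point in ("before_pre_dispatch", "after_pre_dispatch"):
            print(plugin_cls.__name__, point, run(plugin_cls(), point))
```

In this toy model, neither fixed trainer-side call point sees an initialized group for the spawn variant, which is the reason given above for the hook call ending up as the plugin's responsibility.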
Thanks for the patience @dhkim0225, I have a fix in #6506; feel free to try it out. It required a bit of thought and refactoring, but we got to a solution in the end. For anyone else who might find this issue: currently this only works with DDP, not DDP Spawn. This is due to how DDP Spawn is designed, which may be improved in the future.
🐛 Bug
Default process group is not initialized in the Datamodule setup() function.
This is a BC-breaking change with PL >= 1.2.0. With PL == 1.1.8, this code works.
Reproduce notebook: https://colab.research.google.com/drive/1AHadRi0Bly9OnzrJFv8XmS2T9Y5zklvg?usp=sharing
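For readers hitting the same error, a defensive pattern is sketched below. It assumes the failing setup() calls torch.distributed functions such as get_rank(), which is what the error message suggests but is not confirmed here. ToyDataModule and rank_and_world_size are hypothetical names for illustration; only the torch.distributed calls (is_available, is_initialized, get_rank, get_world_size) are real PyTorch functions.

```python
from typing import List, Optional, Tuple

try:
    import torch.distributed as dist
except ImportError:  # torch is not installed: behave as a single process
    dist = None


def rank_and_world_size() -> Tuple[int, int]:
    """Return (rank, world_size), or (0, 1) if no default process group exists."""
    if dist is not None and dist.is_available() and dist.is_initialized():
        return dist.get_rank(), dist.get_world_size()
    return 0, 1


class ToyDataModule:
    """Hypothetical stand-in for a datamodule; only setup() is modelled."""

    def __init__(self, num_samples: int) -> None:
        self.num_samples = num_samples
        self.indices: List[int] = []

    def setup(self, stage: Optional[str] = None) -> None:
        rank, world_size = rank_and_world_size()
        # Shard sample indices by rank; with the (0, 1) fallback this
        # process simply keeps every index.
        self.indices = list(range(rank, self.num_samples, world_size))
```

Note that on an affected version this guard would take the single-process branch instead of raising, so every rank would see the full index list. That avoids the crash but does not restore sharding inside setup().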
Expected behavior
fit() should work.
Environment