`transfer_batch_to_device` doesn't work under DP #2350

Comments
Ummm... yeah good point. I'm not sure we can add a hook here. Maybe @awaelchli can look into this.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@edenafek @awaelchli did we add a hook for this now?
No, it is not there yet. This would require a custom scatter/gather in `LightningDataParallel`/`LightningDistributedDataParallel` that the user defines. Here I am not sure what the recommended way is in Lightning. Should the user subclass these classes and init them in the `configure_ddp` hook?
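As a concrete illustration of the custom scatter/gather idea, here is a rough sketch (not Lightning's API): a `DataParallel` subclass whose `scatter` step gives a per-device hook a chance to run. The `HookedDataParallel` name, the assumption that the batch is the first positional argument, and the way the hook is looked up on the wrapped module are all illustrative; how such a wrapper would be registered with Lightning (e.g. via the `configure_ddp` hook mentioned above) depends on the version.

```python
import torch
from torch import nn
from torch.nn.parallel.scatter_gather import scatter_kwargs


class HookedDataParallel(nn.DataParallel):
    """Illustrative only: a DP wrapper whose scatter step can call a user hook."""

    def scatter(self, inputs, kwargs, device_ids):
        # Default DataParallel behaviour: chunk tensors along self.dim and copy
        # each chunk to its target device.
        inputs, kwargs = scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
        # Hypothetical extension point: if the wrapped module defines a transfer
        # hook, let it post-process each per-device chunk (e.g. to move custom,
        # non-tensor batch objects that the default scatter leaves untouched).
        # Assumes the batch is the first positional argument to forward().
        hook = getattr(self.module, "transfer_batch_to_device", None)
        if hook is not None:
            inputs = tuple(
                (hook(chunk[0], torch.device("cuda", device_id)),) + tuple(chunk[1:])
                for chunk, device_id in zip(inputs, device_ids)
            )
        return inputs, kwargs
```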
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
Not sure if the earlier label removal counts towards a new "activity" by the stale bot, so commenting here to indicate that this is not stale and still needs to be addressed.
@awaelchli is this still relevant?
@edenlightning yes. Still not supported for DP AFAIK.
@awaelchli is it possible to support?
Unlikely it can be supported in the near future. The blocker is that DataParallel performs the splitting and transferring of the batch internally, as described above. Users can still provide their own DataParallel wrapper. For example, the torch-geometric people have to import the custom DP module from here: https://github.com/rusty1s/pytorch_geometric/blob/master/torch_geometric/nn/data_parallel.py and use that in Lightning (they would also have to do this without Lightning).
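For reference, the torch-geometric wrapper linked above is used roughly like this outside of any Lightning wiring (a hedged sketch; `TinyModel` and the data list are placeholders):

```python
import torch
from torch import nn
from torch_geometric.data import Data
from torch_geometric.nn import DataParallel as GeometricDataParallel


class TinyModel(nn.Module):
    # Placeholder model: torch_geometric's DataParallel hands each device's
    # sub-batch (a Batch object) to forward().
    def forward(self, batch):
        return batch.x.sum()


model = GeometricDataParallel(TinyModel()).cuda()

# A plain Python list of graph Data objects; the wrapper splits the list across
# the available GPUs itself, which is why Lightning's transfer hook never runs.
data_list = [Data(x=torch.randn(4, 16)) for _ in range(8)]
output = model(data_list)
```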
I am not sure if this belongs here, but it seems related. I can also open another issue if you prefer. For me, when I use DDP, my (custom) `transfer_batch_to_device` is not called. With normal execution on 1 GPU and without an accelerator, the hook is called. However, with DDP, it is not. Or is this the thing you are talking about, which cannot be fixed in the near future? Then what would you recommend us to do to solve this? Add a check in `transfer_batch_to_device` for which accelerator is used?
@rubencart This seems to be unrelated, and should not be a problem for DDP. This part of the code underwent a lot of changes lately. If you don't mind, would you send us a repro example in a new issue? If you ping me there and it is fixable, I will fix it.
This has been updated in the recent refactors, so it won't be a problem now. You can try master; the new version will be officially released next week, I guess.
@rohitgr7 Okay, thanks!
(The issue title was changed from "`transfer_batch_to_device` doesn't work under DP/DDP/DDP2" to "`transfer_batch_to_device` doesn't work under DP".)
@rohitgr7 is the issue still in master?
@edenlightning yes, it is, just with DP, since PyTorch's DataParallel handles the splitting and transferring of batches to the device internally.
@edenlightning Even though this issue was originally reported as a bug, it is really not a bug. We document that DP is not supported for this hook. As mentioned already in #2350 (comment).
@awaelchli maybe in the multi_gpu guide under DP?
@SeanNaren maybe add to your PR?
🐛 Bug
This is discussed under #1756 and I'm opening a separate issue here for visibility.
In the training loop, for DP/DDP/DDP2, we do not move the data to devices ourselves, but instead use the default scatter to transfer data. This results in `transfer_batch_to_device` not being called. See https://github.com/PyTorchLightning/pytorch-lightning/blob/16a7326e5259a3cdd20a508c34a0f84806d88f8e/pytorch_lightning/trainer/training_loop.py#L736-L737
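For context, here is a sketch of how the hook is meant to be overridden on a LightningModule, assuming the `(batch, device)` signature from around the time of this issue (newer versions add a `dataloader_idx` argument); under DP/DDP/DDP2 the default scatter in the linked lines runs instead, so the override is never invoked.

```python
import torch
from pytorch_lightning import LightningModule


class MyModel(LightningModule):
    # Sketch of an override; the signature may differ between Lightning versions.
    def transfer_batch_to_device(self, batch, device):
        # Example: a custom batch layout where one field needs an explicit move
        # that a plain tensor scatter would not perform.
        if isinstance(batch, dict) and "graph" in batch:
            batch["graph"] = batch["graph"].to(device)
            return batch
        # Fall back to the default behaviour for everything else.
        return super().transfer_batch_to_device(batch, device)
```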
Expected behavior
Ideally, we want `transfer_batch_to_device` to work in all settings. If it's not possible at all to override this behavior, at least a run-time warning and/or some warning in the docs should be given.
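A minimal sketch of the requested run-time warning, with illustrative names and wiring (not Lightning's actual implementation): warn when the hook is overridden but DP is selected, since DP's internal scatter bypasses it.

```python
import warnings


def warn_if_hook_bypassed(model, base_cls, accelerator):
    # The hook counts as overridden when the model's class supplies its own
    # implementation instead of inheriting base_cls's default.
    overridden = type(model).transfer_batch_to_device is not base_cls.transfer_batch_to_device
    if accelerator == "dp" and overridden:
        warnings.warn(
            "transfer_batch_to_device is overridden, but DataParallel scatters "
            "batches internally, so the hook will not be called under DP.",
            RuntimeWarning,
        )
```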