`transfer_batch_to_device` doesn't work under DP #2350

Comments
Ummm... yeah good point. I'm not sure we can add a hook here. Maybe @awaelchli can look into this.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@edenafek @awaelchli did we add a hook for this now?
No, it is not there yet. This would require a custom scatter/gather in `LightningDataParallel`/`LightningDistributedDataParallel` that the user defines. Here I am not sure what the recommended way is in Lightning. Should the user subclass these classes and init them in the `configure_ddp` hook?
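As a concrete illustration of the custom scatter/gather idea, here is a rough sketch (not Lightning's API): a `DataParallel` subclass whose `scatter` step gives a per-device hook a chance to run. The `HookedDataParallel` name, the assumption that the batch is the first positional argument, and the way the hook is looked up on the wrapped module are all illustrative; how such a wrapper would be registered with Lightning (e.g. via the `configure_ddp` hook mentioned above) depends on the version.

```python
import torch
from torch import nn
from torch.nn.parallel.scatter_gather import scatter_kwargs


class HookedDataParallel(nn.DataParallel):
    """Illustrative only: a DP wrapper whose scatter step can call a user hook."""

    def scatter(self, inputs, kwargs, device_ids):
        # Default DataParallel behaviour: chunk tensors along self.dim and copy
        # each chunk to its target device.
        inputs, kwargs = scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
        # Hypothetical extension point: if the wrapped module defines a transfer
        # hook, let it post-process each per-device chunk (e.g. to move custom,
        # non-tensor batch objects that the default scatter leaves untouched).
        # Assumes the batch is the first positional argument to forward().
        hook = getattr(self.module, "transfer_batch_to_device", None)
        if hook is not None:
            inputs = tuple(
                (hook(chunk[0], torch.device("cuda", device_id)),) + tuple(chunk[1:])
                for chunk, device_id in zip(inputs, device_ids)
            )
        return inputs, kwargs
```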
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
Not sure if the earlier label removal counts towards a new "activity" by the stale bot, so commenting here to indicate that this is not stale and still needs to be addressed.
@awaelchli is this still relevant?
@edenlightning yes. Still not supported for DP AFAIK.
@awaelchli is it possible to support?
Unlikely it can be supported in the near future. The blocker is that DataParallel performs the splitting and transferring of the batch internally, as described above. Users can still provide their own DataParallel wrapper. For example, the torch-geometric people have to import the custom DP module from here: https://github.com/rusty1s/pytorch_geometric/blob/master/torch_geometric/nn/data_parallel.py and use that in Lightning (they would also have to do this without Lightning).
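For reference, the torch-geometric wrapper linked above is used roughly like this outside of any Lightning wiring (a hedged sketch; `TinyModel` and the data list are placeholders):

```python
import torch
from torch import nn
from torch_geometric.data import Data
from torch_geometric.nn import DataParallel as GeometricDataParallel


class TinyModel(nn.Module):
    # Placeholder model: torch_geometric's DataParallel hands each device's
    # sub-batch (a Batch object) to forward().
    def forward(self, batch):
        return batch.x.sum()


model = GeometricDataParallel(TinyModel()).cuda()

# A plain Python list of graph Data objects; the wrapper splits the list across
# the available GPUs itself, which is why Lightning's transfer hook never runs.
data_list = [Data(x=torch.randn(4, 16)) for _ in range(8)]
output = model(data_list)
```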
I am not sure if this belongs here, but it seems related. I can also open another issue if you prefer. For me, when I use DDP, my (custom) `transfer_batch_to_device` is not called. With normal execution on 1 GPU and without an accelerator, the hook is called. However, with DDP, it is not. Or is this the thing you are talking about, which cannot be fixed in the near future? Then what would you recommend us to do to solve this? Add a check in `transfer_batch_to_device` for which accelerator is used?
@rubencart This seems to be unrelated, and should not be a problem for DDP. This part of the code underwent a lot of changes lately. If you don't mind, would you send us a repro example in a new issue? If you ping me there and it is fixable, I will fix it.
This has been updated in the recent refactors, so it won't be a problem now. You can try master; the new version will be officially released next week, I guess.
@rohitgr7 Okay, thanks!
(The issue title was changed from "`transfer_batch_to_device` doesn't work under DP/DDP/DDP2" to "`transfer_batch_to_device` doesn't work under DP".)
@rohitgr7 is the issue still in master?
@edenlightning yes, it is, just with DP, since PyTorch's DataParallel handles the splitting and transferring of batches to the device internally.
@edenlightning Even though this issue was originally reported as a bug, it is really not a bug. We document that DP is not supported for this hook. As mentioned already in #2350 (comment).
@awaelchli maybe in the multi_gpu guide under DP?
@SeanNaren maybe add to your PR?
🐛 Bug
This is discussed under #1756 and I'm opening a separate issue here for visibility.
In the training loop, for DP/DDP/DDP2, we do not move the data to devices ourselves, but instead use the default scatter to transfer data. This results in `transfer_batch_to_device` not being called. See https://github.com/PyTorchLightning/pytorch-lightning/blob/16a7326e5259a3cdd20a508c34a0f84806d88f8e/pytorch_lightning/trainer/training_loop.py#L736-L737
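For context, here is a sketch of how the hook is meant to be overridden on a LightningModule, assuming the `(batch, device)` signature from around the time of this issue (newer versions add a `dataloader_idx` argument); under DP/DDP/DDP2 the default scatter in the linked lines runs instead, so the override is never invoked.

```python
import torch
from pytorch_lightning import LightningModule


class MyModel(LightningModule):
    # Sketch of an override; the signature may differ between Lightning versions.
    def transfer_batch_to_device(self, batch, device):
        # Example: a custom batch layout where one field needs an explicit move
        # that a plain tensor scatter would not perform.
        if isinstance(batch, dict) and "graph" in batch:
            batch["graph"] = batch["graph"].to(device)
            return batch
        # Fall back to the default behaviour for everything else.
        return super().transfer_batch_to_device(batch, device)
```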
Expected behavior
Ideally, we want `transfer_batch_to_device` to work in all settings. If it's not possible at all to override this behavior, at least a run-time warning and/or some warning in the docs should be given.
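A minimal sketch of the requested run-time warning, with illustrative names and wiring (not Lightning's actual implementation): warn when the hook is overridden but DP is selected, since DP's internal scatter bypasses it.

```python
import warnings


def warn_if_hook_bypassed(model, base_cls, accelerator):
    # The hook counts as overridden when the model's class supplies its own
    # implementation instead of inheriting base_cls's default.
    overridden = type(model).transfer_batch_to_device is not base_cls.transfer_batch_to_device
    if accelerator == "dp" and overridden:
        warnings.warn(
            "transfer_batch_to_device is overridden, but DataParallel scatters "
            "batches internally, so the hook will not be called under DP.",
            RuntimeWarning,
        )
```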