Distributed training performance slowdown when resuming from a checkpoint. #184
Comments
Hi @subhashbylaiah, I see the assumption here is that the model uses In the DDP source code, it uses ray_lightning/ray_lightning/ray_ddp.py Lines 377 to 386 in 6aed848
I am going to test this assumption and will keep you posted here.
@amogkam, my current guess for this issue is as follows. The trainer is using the delayed GPU accelerator, and the checkpoint is a GPU checkpoint. When resuming from the checkpoint, it loads the GPU checkpoint. The slowdown might also be due to loading the GPU checkpoint from the CPU and then moving it to the GPU.
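To make the guess above concrete, here is a minimal sketch of loading a checkpoint onto the CPU first and then moving the restored module to the training device. It uses only `torch.save`, `torch.load` with `map_location`, and `Module.to`; the tiny `Linear` model and the in-memory buffer are stand-ins chosen for illustration, not the model or checkpoint from this issue, and this is not a claim about what ray_lightning does internally.

```python
import io

import torch

# Stand-in for a checkpoint file: a state dict saved to an in-memory buffer.
model = torch.nn.Linear(4, 2)
buffer = io.BytesIO()
torch.save({"state_dict": model.state_dict()}, buffer)
buffer.seek(0)

# map_location="cpu" places every loaded tensor on the CPU,
# whatever device it was saved from.
checkpoint = torch.load(buffer, map_location="cpu")
devices = {t.device.type for t in checkpoint["state_dict"].values()}

# Restore the weights, then move the module to the training device explicitly.
target = torch.device("cuda" if torch.cuda.is_available() else "cpu")
restored = torch.nn.Linear(4, 2)
restored.load_state_dict(checkpoint["state_dict"])
restored.to(target)
param_devices = {p.device.type for p in restored.parameters()}
```

Printing `devices` and `param_devices` after a real resume would show whether the weights ended up where they are expected to be.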
Thanks @JiahaoYao for checking on this issue. Can you please confirm whether you are able to reproduce this issue with the example code?
I am using ray_lightning to distribute training across an 8-node Ray cluster with GPUs. Training slows down significantly (by a factor of 2-3) when resuming from a checkpoint. A new training run averages 35 minutes per epoch, but when I restart training from a previous checkpoint it takes over 90 minutes per epoch. This behavior is consistent. I am using the CLI to submit the job to the remote Ray cluster.
To isolate the problem, I also ran multi-node distributed training using only PyTorch Lightning, and the problem did not occur. Resuming from a checkpoint took about the same time as a fresh training run.
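One way to put numbers on a comparison like this is to record wall-clock time per epoch in both runs and compute the ratio. The sketch below is a hypothetical helper (the names `EpochTimer` and `slowdown` are made up for illustration); its `start` and `stop` methods could be called from a PyTorch Lightning callback's `on_train_epoch_start` and `on_train_epoch_end` hooks.

```python
import time
from statistics import mean


class EpochTimer:
    """Hypothetical helper that records wall-clock seconds per epoch."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._start = None
        self.durations = []

    def start(self):
        # Call at the beginning of an epoch.
        self._start = self._clock()

    def stop(self):
        # Call at the end of an epoch; stores the elapsed time.
        if self._start is None:
            raise RuntimeError("stop() called before start()")
        self.durations.append(self._clock() - self._start)
        self._start = None


def slowdown(fresh, resumed):
    """Ratio of the mean resumed epoch time to the mean fresh epoch time."""
    return mean(resumed) / mean(fresh)
```

With the figures reported above (35 minutes fresh, 90 minutes resumed), `slowdown([35.0], [90.0])` gives roughly 2.57, which matches the stated factor of 2-3.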
The code to reproduce the issue, along with instructions to run it, is here:
https://gist.github.com/subhashbylaiah/6b403339cfaf619c59af403f9740bf29
From my analysis, which I have also shared in the notes on reproducing this, the cause appears to be somehow associated with the precision of the input images.
BTW, the trainer precision is still fp16 in both cases.
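Since the suspicion is about input precision rather than trainer precision, a quick check is to log the dtype and device of each tensor in a batch in both the fresh and the resumed run. The sketch below is an illustration only: `describe_batch` is a hypothetical helper, and the zero-filled tensors stand in for real image batches.

```python
import torch


def describe_batch(batch):
    """Return (dtype, device type) for every tensor in a batch tuple or list."""
    return [(t.dtype, t.device.type) for t in batch if isinstance(t, torch.Tensor)]


# Stand-in batches: the same images in float32 and in float16.
images_fp32 = torch.zeros(2, 3, 8, 8, dtype=torch.float32)
images_fp16 = images_fp32.half()
labels = torch.zeros(2, dtype=torch.long)

report_fp32 = describe_batch((images_fp32, labels))
report_fp16 = describe_batch((images_fp16, labels))
```

Calling the helper on the first batch of each run and comparing the two reports would show whether the inputs differ in precision between a fresh start and a resume.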
Library versions