Collective mismatch at end of training epoch #13997
-
I’m facing an issue where training a Lightning module with DDP on >4 GPUs gets stuck at the end of the first training epoch (I made sure there is no validation epoch). This doesn’t occur with 2 GPUs. I made sure that the dataset is balanced and that the total batch size is equal to the number of GPUs. Detecting unused parameters is on, and there are unused parameters (that’s intentional). I obtained stack traces with TORCH_CPP_LOG_LEVEL=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL. I’m having difficulty understanding these stack traces, since they include >10 layers of PyTorch Lightning calls and I don’t have a good enough understanding of Lightning’s internals. Perhaps someone can glance at this and get a sense of the most likely causes? Stack trace from rank 7:
Stack trace from rank 2 (ranks 0,1,3,4,5,6 are also similar):
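For context, a minimal sketch of the setup (the toy module and dataset below are placeholders for the real ones, and the Trainer arguments assume PL ≥ 1.6, where DDPStrategy is available):

```python
import os
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning.strategies import DDPStrategy

# Verbose distributed logging used to obtain the stack traces below.
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"


class ToyModule(pl.LightningModule):
    """Stand-in for the real model; it has an intentionally unused branch."""

    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(8, 1)
        self.unused = torch.nn.Linear(8, 1)  # never touched in training_step

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.used(x).mean()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(1024, 8))
    loader = DataLoader(dataset, batch_size=1)  # per-GPU batch size 1, so total batch size == number of GPUs

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=8,  # hangs with >4 GPUs, works with 2
        strategy=DDPStrategy(find_unused_parameters=True),  # unused parameters are expected
        limit_val_batches=0,  # no validation epoch
        max_epochs=1,
    )
    trainer.fit(ToyModule(), loader)
```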
Replies: 3 comments
-
Found the issue. Even with find_unused_parameters=True, there needs to be at least one used parameter in every training step.

I had a unique case where, for some batches, no parameters were used at all. This caused the ranks to lose sync. My guess for why this happens is as follows: the ranks with used parameters get stuck on the allreduce in the backward hook, waiting for the rank with no used parameters to catch up. However, the rank with no used parameters never hits a backward hook, and instead proceeds to the next training step, where it eventually joins the allreduce. Since it is now one training step ahead of the other ranks, it runs out of data earlier at the end of the epoch. When that happens, it proceeds to save the model checkpoint while the other ranks are still waiting on the gradient allreduce. The quick workaround was to add a dummy parameter to the model and simply return it in place of the loss in this special case.
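In case it helps anyone, here is a minimal sketch of that workaround (the batch layout and the check for the degenerate case are hypothetical; adapt them to your model):

```python
import torch
import pytorch_lightning as pl


class MyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(8, 1)
        # Dummy parameter: guarantees at least one parameter is "used" every step,
        # so this rank still fires a backward hook and joins the gradient allreduce.
        self.dummy = torch.nn.Parameter(torch.zeros(1))

    def training_step(self, batch, batch_idx):
        x, mask = batch  # hypothetical batch layout: inputs + validity mask
        if not mask.any():
            # Degenerate batch: no real parameters would be used.
            # Return a zero loss that still depends on a parameter.
            return self.dummy.sum() * 0.0
        loss = self.net(x[mask]).mean()  # placeholder for the real loss
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)
```

The `* 0.0` keeps the dummy gradient at zero, so the workaround doesn’t affect the optimization; it only forces this rank to participate in the same collectives as everyone else.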
-
Could you give me more detailed instructions? For example, some detailed code. Thanks!!!
-
Had a similar issue. Realized it was because some of the data samples in my dataset were None, and since I was batching them randomly for each worker, some workers moved through their batches faster than others, causing a difference in iteration counts among workers at the end of the epoch. Realized this after I used
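One way to keep every rank on the same number of iterations per epoch is to drop the None samples before training instead of skipping them on the fly. A rough sketch (this wrapper is just illustrative, not necessarily what I actually used):

```python
from torch.utils.data import Dataset


class FilteredDataset(Dataset):
    """Wraps a dataset and keeps only indices whose samples are not None,
    so every DDP rank sees the same, fixed number of batches per epoch."""

    def __init__(self, base):
        self.base = base
        self.valid_indices = [i for i in range(len(base)) if base[i] is not None]

    def __len__(self):
        return len(self.valid_indices)

    def __getitem__(self, idx):
        return self.base[self.valid_indices[idx]]
```

Precomputing the valid indices can be slow for large datasets, but it guarantees that the DistributedSampler splits a dataset of the same fixed length on every rank.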