
Collective mismatch at end of training epoch #13997


Found the issue. Even with find_unused_parameters=True, there needs to be at least one used parameter every training step.

I had a unique case where, for some batches, no parameters were used at all. This caused the ranks to lose sync. My guess for why this happens is as follows: the ranks with used parameters get stuck on the allreduce in the backward hook, waiting for the rank with no used parameters to catch up. However, the rank with no used parameters never hits a backward hook, and instead proceeds to the next training step, where it eventually joins the allreduce. Since it has now done one more training step than the other ranks, it runs out of data at the end of the epoch.
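
One way to guard against this is to make sure every step builds a graph through at least one parameter, even when the batch produces no real loss term. Below is a minimal sketch under that assumption; `model`, `compute_loss`, and `dataloader` are placeholders, not code from this issue:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical training loop; only the "no used parameters" guard matters here.
ddp_model = DDP(model, find_unused_parameters=True)
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

for batch in dataloader:
    optimizer.zero_grad()
    loss = compute_loss(ddp_model, batch)  # may be None when nothing in the batch applies

    if loss is None:
        # No parameter would be used this step. Add a zero-valued term that
        # still builds an autograd graph through every parameter, so this rank
        # fires its backward hooks and enters the same allreduce as the other
        # ranks instead of racing ahead to the next step.
        loss = sum(p.sum() for p in ddp_model.parameters()) * 0.0

    loss.backward()
    optimizer.step()
```

The multiply-by-zero leaves the parameter values unchanged (gradients are all zero) while still letting the backward pass touch every parameter, which should keep the gradient allreduces aligned across ranks.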

Answer selected by valtsblukis