Add DDP token averaging for equivalent non-parallel training similar to #34191 #34242
Comments
I observed this as well when I was running some experiments (things were close post-fix, but not exact). Would you like to take a stab at a PR? :)
A simple implementation may be:
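A minimal sketch of what such an implementation could look like, assuming `num_items_in_batch` is a scalar tensor holding the per-rank token count (the helper name and its placement are mine, not the actual Trainer code):

```python
import torch
import torch.distributed as dist


def all_reduce_num_items_in_batch(num_items_in_batch: torch.Tensor) -> torch.Tensor:
    """Sum the loss-contributing token counts across all DDP ranks."""
    if dist.is_available() and dist.is_initialized():
        num_items_in_batch = num_items_in_batch.clone()
        dist.all_reduce(num_items_in_batch, op=dist.ReduceOp.SUM)
    return num_items_in_batch


# Sketch of how it would be used when computing the loss:
#   loss = token_loss_sum / all_reduce_num_items_in_batch(num_items_in_batch)
#   loss = loss * dist.get_world_size()  # cancel DDP's gradient averaging across ranks
```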
Although this issue has little impact on the training results, it significantly affects the reproducibility of experiments across different hardware configurations. I hope it can be resolved alongside gradient accumulation. I attempted to use all-reduce during training, but it slowed down the process. Is it possible to calculate the total number of tokens per batch across devices when initializing the DataLoader with accelerate (without compromising compatibility with the existing code)?
That is the issue with it, and why I'm not the biggest fan of that particular solution. We can't, because there are situations where it wouldn't work. The fairseq solution may be the way.
Can confirm the fairseq solution works great; it'll be part of #34283.
I'll leave this open for now. I didn't see significant discrepancies between DDP and non-DDP, but if users have stories or can show where it goes wrong, please post them here for us to dig into.
What we can do then is add it in under a flag which is disabled by default.
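For illustration, the flag-gated version might look roughly like this; the argument name `average_tokens_across_devices` is a placeholder of mine, not necessarily the final API, and it defaults to off so existing runs keep their current behavior:

```python
import torch
import torch.distributed as dist


def resolve_num_items_in_batch(
    num_items_in_batch: torch.Tensor,
    average_tokens_across_devices: bool = False,  # hypothetical flag, disabled by default
) -> torch.Tensor:
    """Return the local token count, or the global count when the user opts in."""
    if (
        average_tokens_across_devices
        and dist.is_available()
        and dist.is_initialized()
    ):
        num_items_in_batch = num_items_in_batch.clone()
        dist.all_reduce(num_items_in_batch, op=dist.ReduceOp.SUM)
    return num_items_in_batch
```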
Awesome, thanks!
Feature request
Token averaging in gradient accumulation was fixed in #34191, but token averaging in DDP seems to have the same issue.
Expected behavior
With all the tokens contributing to the loss in each step (on every GPU, in every gradient accumulation step, and in every microbatch), the total token count is:

$$\sum\limits_{GPUs}\sum\limits_{gas}\sum\limits_{microb} \text{num\_tokens}$$

I believe the loss should be averaged over all of the above tokens at once to be equivalent to non-parallel training.
Current issue
Prior to #34191, the loss/gradients were averaged over $\sum\limits_{GPUs}$ , $\sum\limits_{gas}$ , and $\sum\limits_{microb}$ separately. With the introduction of `num_items_in_batch` in #34191, the loss/gradients are now averaged over $\sum\limits_{GPUs}$ and $\left(\sum\limits_{gas}\sum\limits_{microb}\right)$ separately. However, this still does not seem equivalent to non-parallel training.
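To spell out the mismatch (my notation, not taken from the linked PRs): let $L_g$ be the sum of per-token losses on GPU $g$ over its gradient accumulation steps and microbatches, and $n_g$ the corresponding token count. Per-rank normalization followed by DDP's gradient mean gives the left-hand side, while a single-process run over the same data gives the right-hand side; the two only coincide when every rank happens to see the same number of tokens:

$$\frac{1}{N_{GPUs}}\sum\limits_{GPUs}\frac{L_g}{n_g} \;\neq\; \frac{\sum\limits_{GPUs} L_g}{\sum\limits_{GPUs} n_g} \quad \text{in general}$$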
Can we also incorporate $\sum\limits_{GPUs}$ when determining `num_items_in_batch`? Something like `all_reduce(num_items_in_batch)`?

Motivation
DDP training does not seem fully equivalent to non-parallel training.
related comments: #34191 (comment)
Your contribution
I found a fairseq implementation of this feature:
https://github.com/facebookresearch/fairseq/blob/018621f3cca02ca9de945dc082c3fb1a7f9f2deb/fairseq/trainer.py#L932-L949
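As I read the linked lines, fairseq keeps the loss as an un-normalized sum, all-reduces the per-rank sample size (token count), and rescales the gradients by world_size / total_tokens so that DDP's built-in gradient averaging cancels out. A rough, self-contained sketch of that pattern (the helper name and signature are mine, not fairseq's API):

```python
import torch
import torch.distributed as dist


def step_with_global_token_normalization(model, token_loss_sum, local_num_tokens, optimizer):
    """Sketch of fairseq-style normalization.

    `token_loss_sum` is the un-normalized sum of per-token losses on this rank;
    `local_num_tokens` is how many tokens contributed to it.
    """
    # DDP averages gradients across ranks during backward():
    #   p.grad == (1 / world_size) * sum_over_ranks(grad of local token_loss_sum)
    token_loss_sum.backward()

    is_dist = dist.is_available() and dist.is_initialized()
    world_size = dist.get_world_size() if is_dist else 1
    total_tokens = torch.tensor(float(local_num_tokens), device=token_loss_sum.device)
    if is_dist:
        dist.all_reduce(total_tokens, op=dist.ReduceOp.SUM)

    # Rescale so the final gradient is sum_of_all_token_gradients / total_tokens,
    # which is what a single-process run over the same data would produce.
    scale = world_size / total_tokens.item()
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_(scale)

    optimizer.step()
    optimizer.zero_grad()
```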