[BUG] ZeRO/bf16 grad accumulation in bf16 needs higher precision accumulator #1800

Open
stas00 opened this issue Mar 1, 2022 · 0 comments
Labels
bug Something isn't working

Comments

@stas00
Collaborator

stas00 commented Mar 1, 2022

Describe the bug

Just discovered that the recent implementation of BF16 support accumulates grads in BF16. This may work in some cases, but it is likely to hurt training. Ideally the accumulation should be done in fp32 (and maybe optionally in fp16 if memory is an issue).

It should have little to no impact when no GAS (gradient accumulation steps) is used. But with GAS, whether it comes from PP setups with many microbatches or from manual GAS used to increase the batch size, the accumulated error could be quite significant.
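To illustrate the kind of error meant here, below is a minimal pure-Python sketch. It is not DeepSpeed code: `to_bf16`, `to_fp32`, and `accumulate` are hypothetical helpers that emulate rounding the running sum to bf16 or fp32 after every micro-batch. The constant gradient is a deliberately contrived worst case; real gradients vary in sign and magnitude, so the actual impact will depend on the setup.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest bf16 value (round-to-nearest-even).

    bf16 is the top 16 bits of an fp32 value, so we round the fp32 bit
    pattern and zero the low 16 bits. NaN/inf handling is omitted.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(grads, round_fn) -> float:
    """Sum grads, rounding the accumulator to the target dtype each step."""
    acc = 0.0
    for g in grads:
        acc = round_fn(acc + g)
    return acc


# 1024 micro-batches, each contributing a gradient of 2**-10 (exact in bf16).
grads = [2.0 ** -10] * 1024

print("bf16 accumulator:", accumulate(grads, to_bf16))  # 0.25
print("fp32 accumulator:", accumulate(grads, to_fp32))  # 1.0
```

In this sketch the bf16 accumulator stops growing at 0.25: with only 8 bits of significand, adding 2**-10 to 0.25 lands exactly halfway between two representable values and rounds back down, so every later micro-batch is silently dropped.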

I also had no idea ZeRO/fp16 was accumulating grads in fp16. Again, I'm not sure how much of an impact that has on training; this too will be setup-dependent.

@tjruwase has been working on implementing BF16_optimizer (#1801), which now supports fp32 grad accumulation, so most likely this needs to be backported to ZeRO. Users could then be given 3 choices for the grad accumulator: bf16, fp16, or fp32. The default would be fp32, for best results out of the box when GAS is used, and those who know what they are doing could pick one of the 2 progressively less precise and progressively leaner options.
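A rough sketch of what such a choice could look like, in plain Python rather than DeepSpeed code. Everything here is an assumption made for illustration: the `GradAccumulator` class, its `dtype` argument, and the rounding helpers are hypothetical and do not correspond to any existing DeepSpeed option.

```python
import struct


def _round_fp32(x: float) -> float:
    return struct.unpack("<f", struct.pack("<f", x))[0]


def _round_fp16(x: float) -> float:
    # struct's "e" format is IEEE half precision (round-half-to-even).
    return struct.unpack("<e", struct.pack("<e", x))[0]


def _round_bf16(x: float) -> float:
    # bf16 = top 16 bits of fp32, rounded to nearest even (no NaN/inf handling).
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


_ROUNDERS = {"bf16": _round_bf16, "fp16": _round_fp16, "fp32": _round_fp32}
_BYTES_PER_ELEMENT = {"bf16": 2, "fp16": 2, "fp32": 4}


class GradAccumulator:
    """Hypothetical accumulator with a user-selectable dtype (default fp32)."""

    def __init__(self, num_params: int, dtype: str = "fp32"):
        if dtype not in _ROUNDERS:
            raise ValueError(f"unsupported accumulator dtype: {dtype!r}")
        self.dtype = dtype
        self._round = _ROUNDERS[dtype]
        self.buffer = [0.0] * num_params

    def add(self, grads) -> None:
        """Add one micro-batch of grads, rounding to the accumulator dtype."""
        self.buffer = [self._round(a + g) for a, g in zip(self.buffer, grads)]

    def memory_bytes(self) -> int:
        return len(self.buffer) * _BYTES_PER_ELEMENT[self.dtype]


# Contrived example: 4096 micro-batches of a constant 2**-10 gradient.
for dtype in ("fp32", "fp16", "bf16"):
    acc = GradAccumulator(num_params=1, dtype=dtype)
    for _ in range(4096):
        acc.add([2.0 ** -10])
    print(dtype, acc.buffer[0], acc.memory_bytes(), "bytes")
```

With these contrived inputs the fp32 buffer reaches the exact sum of 4.0, while the fp16 buffer stalls at 2.0 and the bf16 buffer at 0.25, which is the precision ordering the three options would trade against memory.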

Thank you!

@stas00 added the bug label Mar 1, 2022
@stas00 changed the title from "[BUG] ZeRO/bf16 grad accumulation in bf16 needs fixing" to "[BUG] ZeRO/bf16 grad accumulation in bf16 needs higher precision accumulator" Mar 1, 2022