[BUG] ZeRO/bf16 grad accumulation in bf16 needs higher precision accumulator #1800

stas00 · 2022-03-01T06:04:29Z

Describe the bug

I just discovered that the recent implementation of BF16 support accumulates grads in bf16. This may work in some cases, but it is likely to hurt training; ideally the accumulation should be done in fp32 (and maybe optionally in fp16 if memory is an issue).

It should have little to no impact when no gradient accumulation steps (GAS) are used. But with GAS, whether it comes from PP setups with many microbatches or from manual GAS to increase the batch size, the accumulated error could be quite significant.
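To make the concern concrete, here is a rough back-of-the-envelope bound. It assumes round-to-nearest, an increment $g$ with the same sign as the running sum $a$, and it ignores overflow and subnormals; $p$ is the number of significand bits (8 for bf16, 11 for fp16, 24 for fp32).

$$
2^{e} \le |a| < 2^{e+1}
\;\Longrightarrow\;
\operatorname{ulp}(a) = 2^{\,e-(p-1)},
\qquad
\operatorname{fl}(a + g) = a
\quad\text{whenever}\quad
|g| < \tfrac{1}{2}\operatorname{ulp}(a) = 2^{\,e-p}.
$$

So with a bf16 accumulator ($p = 8$), a micro-batch gradient smaller than about $2^{-8}$ to $2^{-9}$ of the running sum is dropped entirely, and larger ones are still rounded coarsely. If the micro-batch gradients have similar magnitude, that threshold is reached after a few hundred accumulation steps; with fp32 ($p = 24$) the same effect only appears at a relative size of about $2^{-24}$.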

I also had no idea that ZeRO / fp16 was accumulating grads in fp16. Again, I'm not sure how much of an impact that has on training; this too will be setup-dependent.

@tjruwase has been working on implementing BF16_optimizer (#1801), which now supports fp32 grad accumulation. So most likely this needs to be backported to ZeRO, and then users should probably get 3 choices for the grad accumulator: bf16, fp16 or fp32. The default would be fp32, for best results out of the box when GAS is used; for those who know what they are doing, the other 2 are progressively less precise and progressively leaner.
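A minimal sketch of what a selectable accumulator dtype with an fp32 default could look like, and how the three options compare on a toy sum. This is not DeepSpeed code: `GradAccumulator`, `ACCUM_DTYPES` and `accumulate` are hypothetical names, the gradient is a single scalar, and the low-precision formats are emulated in pure Python by rounding after every add, so real kernels may behave somewhat differently.

```python
import struct


def _round_fp32(x):
    # Round a Python float (double) to the nearest float32 value.
    return struct.unpack("<f", struct.pack("<f", x))[0]


def _round_fp16(x):
    # Round to the nearest IEEE half-precision value (raises on overflow).
    return struct.unpack("<e", struct.pack("<e", x))[0]


def _round_bf16(x):
    # bfloat16 is the top 16 bits of a float32; apply round-to-nearest-even.
    # Finite inputs only: NaN/inf handling is omitted in this sketch.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Hypothetical option table: name -> (rounding function, bytes per element).
ACCUM_DTYPES = {
    "fp32": (_round_fp32, 4),
    "fp16": (_round_fp16, 2),
    "bf16": (_round_bf16, 2),
}


class GradAccumulator:
    """Toy scalar accumulator whose storage precision is selectable."""

    def __init__(self, dtype="fp32"):  # fp32 as the out-of-the-box default
        self.round, self.bytes_per_element = ACCUM_DTYPES[dtype]
        self.value = 0.0

    def add(self, grad):
        # The running sum is rounded to the storage dtype after every add.
        self.value = self.round(self.value + grad)


def accumulate(dtype, grad, steps):
    acc = GradAccumulator(dtype)
    for _ in range(steps):
        acc.add(grad)
    return acc.value


# Assumed toy setup: the same micro-batch gradient added 1024 times.
# 131 * 2**-17 (about 1e-3) is exactly representable in all three formats.
GRAD = 131 * 2.0 ** -17
STEPS = 1024
EXACT = GRAD * STEPS

errors = {name: abs(accumulate(name, GRAD, STEPS) - EXACT) for name in ACCUM_DTYPES}

for name, err in errors.items():
    print(f"{name}: abs error {err:.6f}, {ACCUM_DTYPES[name][1]} bytes/element")
```

In this toy setup the fp32 accumulator reproduces the exact sum, the fp16 one drifts slightly, and the bf16 one stops growing well short of the true value, which is the GAS failure mode described above. The price of the fp32 default is twice the accumulator memory per element.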

Thank you!

stas00 added the bug Something isn't working label Mar 1, 2022

stas00 changed the title ~~[BUG] ZeRO/bf16 grad accumulation in bf16 needs fixing~~ [BUG] ZeRO/bf16 grad accumulation in bf16 needs higher precision accumulator Mar 1, 2022

stas00 mentioned this issue Jun 27, 2022

[BUG] gradient overflow with fp16 enabled #1773

Closed
