Just discovered that the recent implementation of BF16 support accumulates grads in BF16. This may work in some cases, but it's likely to hurt training. Ideally the accumulation should be done in fp32 (and optionally in fp16 if memory is an issue).
It should have little to no impact when no gradient accumulation steps (GAS) are used. But when GAS is used, either in pipeline-parallel (PP) setups with many microbatches or as manual GAS to increase the batch size, the accumulated error could be quite significant.
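To illustrate how the error can build up, here is a minimal pure-Python sketch (not DeepSpeed code). It emulates bf16 rounding by keeping the top 16 bits of the fp32 bit pattern with round-to-nearest-even, then adds the same small gradient over many accumulation steps, once with a bf16 accumulator and once with an fp32 one. The helper names (`to_bf16`, `to_fp32`, `accumulate`), the gradient value and the step count are all assumptions, picked to exaggerate the effect.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate bf16 for finite inputs: keep the top 16 bits of the fp32
    pattern, rounding to nearest even. NaN/inf are not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def accumulate(grads, cast):
    """Sum grads, rounding the running total with `cast` after every add,
    which mimics an accumulator buffer stored in that dtype."""
    total = cast(0.0)
    for g in grads:
        total = cast(total + g)
    return total


# Assumed toy setup: every microbatch contributes the same bf16 gradient.
steps = 2048                      # hypothetical number of accumulation steps
g = to_bf16(1e-3)                 # per-microbatch gradient, already in bf16
grads = [g] * steps

exact = g * steps                 # reference value computed in fp64
bf16_total = accumulate(grads, to_bf16)
fp32_total = accumulate(grads, to_fp32)

print(f"exact      : {exact}")
print(f"bf16 accum : {bf16_total}  rel. error {abs(bf16_total - exact) / exact:.2%}")
print(f"fp32 accum : {fp32_total}  rel. error {abs(fp32_total - exact) / exact:.2%}")
```

In this toy setup the bf16 accumulator stops growing once the running sum is large enough that a single gradient is less than half of its rounding step, while the fp32 accumulator stays close to the reference. Real gradients vary in sign and magnitude, so the actual impact will depend on the setup.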
I also had no idea that ZeRO / fp16 was accumulating grads in fp16. Again, I'm not quite sure how much of an impact that may have on training; this too will be setup dependent.
@tjruwase has been working on implementing BF16_optimizer in #1801, which now supports fp32 grad accumulation. So most likely this needs to be backported to ZeRO, and then users should probably get 3 choices for the grad accumulator: bf16/fp16/fp32. The default would be fp32, for best results out of the box when GAS is used, and those who know what they are doing could pick one of the 2 progressively less precise and progressively leaner options.
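A rough sketch of what such a choice could look like, with fp32 as the default. Everything here is hypothetical and only meant to make the proposal concrete: `GradAccumDtype`, `parse_grad_accum_dtype` and `accumulator_bytes` are made-up names, not DeepSpeed APIs or config keys.

```python
from enum import Enum
from typing import Optional


class GradAccumDtype(Enum):
    """Hypothetical set of allowed gradient accumulator dtypes."""
    FP32 = "fp32"
    FP16 = "fp16"
    BF16 = "bf16"


# dtype -> (bytes per element, significand bits including the implicit bit)
_DTYPE_PROPS = {
    GradAccumDtype.FP32: (4, 24),
    GradAccumDtype.FP16: (2, 11),
    GradAccumDtype.BF16: (2, 8),
}


def parse_grad_accum_dtype(name: Optional[str] = None) -> GradAccumDtype:
    """Return the accumulator dtype, defaulting to fp32 when unset."""
    if name is None:
        return GradAccumDtype.FP32
    try:
        return GradAccumDtype(name.lower())
    except ValueError:
        allowed = ", ".join(d.value for d in GradAccumDtype)
        raise ValueError(
            f"unknown grad accumulator dtype {name!r}; expected one of: {allowed}"
        ) from None


def accumulator_bytes(num_params: int,
                      dtype: GradAccumDtype = GradAccumDtype.FP32) -> int:
    """Memory needed for one gradient accumulator element per parameter."""
    bytes_per_element, _ = _DTYPE_PROPS[dtype]
    return num_params * bytes_per_element


# Usage: compare the accumulator footprint for a 1B-parameter model.
for dtype in GradAccumDtype:
    size_gib = accumulator_bytes(1_000_000_000, dtype) / 2**30
    bits = _DTYPE_PROPS[dtype][1]
    print(f"{dtype.value}: {size_gib:.2f} GiB, {bits} significand bits")
```

Note that fp16 and bf16 both take 2 bytes per element, so they save the same amount of memory relative to fp32. They differ in precision and dynamic range.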
Thank you!
stas00 changed the title from "[BUG] ZeRO/bf16 grad accumulation in bf16 needs fixing" to "[BUG] ZeRO/bf16 grad accumulation in bf16 needs higher precision accumulator" on Mar 1, 2022