bf16+pipeline parallelism #1801
Conversation
bf16 checkpoint save/load
@stas00, FYI
ed26ef4 to 27e5b95
…d into olruwase/bf16-updates
@@ -981,6 +969,10 @@ def _configure_distributed_model(self, model):
                    hasattr(param,
                            'ds_id') for param in self.module.parameters()):
                self.__check_params(self.module, torch.bfloat16)
            if self.zero_optimization_stage() == 0 and not self.pipeline_parallelism:
                raise NotImplementedError(
                    "When not running ZeRO, BF16 training support is only supported for Pipeline parallelism"
Hello, I wonder why BF16 is only supported for pipeline parallelism or ZeRO stages 1 to 3, since there was no such limit in prior versions.
@kisseternity, apologies for the confusion here. This is a new bf16+pipeline parallelism code path that was written at the last minute for BLOOM model training. The existing restrictions on combining with ZeRO are temporary. We plan to harmonize these combinations and eliminate the confusion.
Thanks for replying. In that case, I think bf16 can still be used without the bf16 optimizer or ZeRO, as before.
Yes, these changes do not affect the previous support for bf16+ZeRO.
bf16_optimizer implementing optimizer state sharding (a.k.a. ZeRO stage 1)
Integration with pipeline parallelism
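As a rough usage sketch of the bf16 + pipeline parallelism path (assumed, not taken from this PR): a placeholder layer list is wrapped in deepspeed.pipe.PipelineModule and initialized with bf16 and ZeRO stage 1 enabled; layer sizes, batch size, and learning rate below are made up for illustration.

```python
# Sketch only: layers, sizes, and hyperparameters are placeholders.
import torch
import deepspeed
from deepspeed.pipe import PipelineModule

# Placeholder model expressed as a layer list and split into pipeline stages.
layers = [torch.nn.Linear(1024, 1024) for _ in range(8)]
model = PipelineModule(layers=layers, num_stages=4)

ds_config = {
    "train_batch_size": 32,
    "bf16": {"enabled": True},          # bf16 training
    "zero_optimization": {"stage": 1},  # optimizer state sharding (bf16_optimizer)
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)
# Training then proceeds through engine.train_batch(data_iter=...), as usual
# for DeepSpeed pipeline engines.
```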