zero3 performance optimizations #3622

Merged
merged 13 commits into microsoft:master from perf1 on Jun 8, 2023
Conversation

BacharL (Collaborator) commented on May 28, 2023

Results:
Single server with 8x V100 GPUs
ZeRO Stage 3, no CPU/disk offload, contiguous gradients enabled
BERT 1.5B, fp16, AdamW optimizer, global batch size 1024

average_perf_per_step (before -> after):
micro batch 8,  comm_overlap=false: 6.86 -> 8.66
micro batch 8,  comm_overlap=true:  6.88 -> 8.75
micro batch 32, comm_overlap=false: 24.87 -> 34.13
micro batch 32, comm_overlap=true:  24.95 -> 32.13

Single server with 8x A100 GPUs
ZeRO Stage 3, no CPU/disk offload, contiguous gradients enabled
BERT 1.5B, fp16, AdamW optimizer, global batch size 1024

average_perf_per_step (before -> after):
micro batch 32, comm_overlap=false: 42 -> 56
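
For context, a DeepSpeed configuration matching the setup above might look like the following (a reconstruction, not the config from this PR; the learning rate is a placeholder):

```python
# Hypothetical DeepSpeed config mirroring the benchmark setup above
# (micro batch 32, comm_overlap=true variant). Key names are real
# DeepSpeed options; the values are illustrative.
ds_config = {
    "train_batch_size": 1024,               # global batch size
    "train_micro_batch_size_per_gpu": 32,   # x8 GPUs, grad accum 4 -> 1024
    "fp16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},  # lr is a placeholder
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,          # the comm_overlap=true runs
        "contiguous_gradients": True,  # enabled in all runs
        # no offload_param / offload_optimizer: CPU/disk offload disabled
    },
}
```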

Remove dead code: params_already_reduced is not used.
Prevent evaluation of debug strings: debug strings are evaluated even when logging is disabled.
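
The fix follows the standard guard-before-format pattern; a minimal sketch using Python's stdlib logging (DeepSpeed's own debug helpers differ in name, and `grad` is assumed to be a torch tensor):

```python
import logging

logger = logging.getLogger(__name__)

def log_grad_norm(grad):
    # Before: the f-string, and grad.norm().item() with its device sync,
    # run unconditionally, even when DEBUG logging is disabled.
    logger.debug(f"grad norm = {grad.norm().item()}")

def log_grad_norm_guarded(grad):
    # After: check the log level first, so the expensive string is
    # only built when it will actually be emitted.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(f"grad norm = {grad.norm().item()}")
```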
Use allreduce instead of reduce-scatter to lower CPU overhead.
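
Roughly, the trade-off looks like this (an illustrative sketch with torch.distributed, not the PR's actual code; the function names and partition layout are assumptions): a single all-reduce over the contiguous gradient buffer replaces the list-based reduce-scatter, moving more bytes but issuing far fewer CPU-side calls.

```python
import torch
import torch.distributed as dist

def reduce_partition_with_reduce_scatter(grad_chunks, my_partition):
    # Before: reduce-scatter needs the input laid out as a list of
    # per-rank chunks; preparing that list adds CPU-side work per step.
    dist.reduce_scatter(my_partition, list(grad_chunks), op=dist.ReduceOp.SUM)

def reduce_partition_with_allreduce(flat_grad_buffer, rank, partition_size):
    # After: one all-reduce over the whole contiguous gradient buffer;
    # each rank then just views its own slice. More data reduced, but
    # far less CPU overhead to launch the collective.
    dist.all_reduce(flat_grad_buffer, op=dist.ReduceOp.SUM)
    start = rank * partition_size
    return flat_grad_buffer.narrow(0, start, partition_size)
```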
@BacharL force-pushed the perf1 branch 2 times, most recently from d6a8711 to bd4d724, on May 29, 2023 at 12:43
Don't check overflow in gradients for every bucket.
Do the overflow check once on the flat gradient buffer just before the optimizer step.
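
A sketch of the idea (illustrative only, not the PR's code; `maybe_step` and the `loss_scaler` API are hypothetical): one finite-check over the flat fp16 gradient buffer right before the step, instead of a check per bucket during reduction.

```python
import torch

def has_overflow(flat_grad_buffer: torch.Tensor) -> bool:
    # Single fused scan of the whole buffer; any NaN/Inf means the
    # fp16 gradients are unusable this step.
    return not torch.isfinite(flat_grad_buffer).all().item()

def maybe_step(optimizer, flat_grad_buffer, loss_scaler):
    # Run the overflow check once, just before optimizer.step(),
    # rather than once per gradient bucket during backward.
    if has_overflow(flat_grad_buffer):
        loss_scaler.update_scale(overflow=True)  # hypothetical scaler API
        return False  # skip this step
    optimizer.step()
    return True
```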
tjruwase (Contributor) commented on Jun 1, 2023

@hablb, thanks for this PR. It looks like it could yield some decent performance gains. To appropriately capture the significance, please consider updating the original post with perf results similar to #1453 (comment).

tjruwase merged commit 0977106 into microsoft:master on Jun 8, 2023
molly-smith pushed a commit that referenced this pull request on Jun 23, 2023
* Remove dead code

params_already_reduced is not used

* Prevent evaluation of debug strings

Debug strings are evaluated even when logging is disabled

* Use contiguous gradients tensor for reduce-scatter between ranks

Use allreduce instead of reduce-scatter; lower CPU overhead.

* Move overflow tracker to optimizer.step

Don't check overflow in gradients for every bucket.
Do the overflow check once on the flat gradient buffer just before the optimizer step.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>