zero3 performance optimizations #3622

BacharL · 2023-05-28T12:35:47Z

Results:
Single server with 8x V100 GPUs
ZeRO Stage3, not using any CPU/disk offload, and contiguous gradients enabled
Bert 1.5B, zero3, fp16, adamw optimizer, global batch 1024

average_perf_per_step:
micro batch 8, comm_overlap=false 6.86 -> 8.66
micro batch 8, comm_overlap=true 6.88 -> 8.75
micro batch 32, comm_overlap=false 24.87 -> 34.13
micro batch 32, comm_overlap=true 24.95 -> 32.13

Single server with 8x A100 GPUs
ZeRO Stage3, not using any CPU/disk offload, and contiguous gradients enabled
Bert 1.5B, zero3, fp16, adamw optimizer, global batch 1024

average_perf_per_step:
micro batch 32, comm_overlap=false 42 -> 56
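
To make the size of the change easier to read, the relative difference can be computed from the figures above. This is a small sketch that uses only the numbers reported in this post. It assumes average_perf_per_step is a throughput-style metric where higher is better, which the post does not state explicitly; `relative_gain` is a helper introduced here for illustration.

```python
# (before, after) pairs copied from the results in this post.
results = {
    "V100, micro batch 8, comm_overlap=false": (6.86, 8.66),
    "V100, micro batch 8, comm_overlap=true": (6.88, 8.75),
    "V100, micro batch 32, comm_overlap=false": (24.87, 34.13),
    "V100, micro batch 32, comm_overlap=true": (24.95, 32.13),
    "A100, micro batch 32, comm_overlap=false": (42.0, 56.0),
}


def relative_gain(before, after):
    """Return the change from before to after as a percentage of before."""
    return (after - before) / before * 100.0


for name, (before, after) in results.items():
    gain = relative_gain(before, after)
    print(f"{name}: {before} -> {after} ({gain:+.1f}%)")
```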

params_already_reduced is not used

Debug strings are evaluated even when logging is disabled
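
The general pattern behind this item can be shown with the standard `logging` module. This is a minimal sketch, not the code changed in stage3.py: the logger name and `expensive_summary` are hypothetical stand-ins for whatever builds the debug string.

```python
import logging

logger = logging.getLogger("zero3_sketch")  # hypothetical logger name
logger.setLevel(logging.WARNING)  # debug logging is disabled

calls = {"count": 0}


def expensive_summary(values):
    """Stand-in for a costly debug string (hypothetical helper)."""
    calls["count"] += 1
    return ", ".join(f"{v:.3f}" for v in values)


grads = [0.1, 0.2, 0.3]

# Eager: the f-string is built before logger.debug() runs,
# so the cost is paid even though nothing is logged.
logger.debug(f"grads: {expensive_summary(grads)}")
eager_calls = calls["count"]

# Guarded: the string is only built when DEBUG is actually enabled.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("grads: %s", expensive_summary(grads))
guarded_calls = calls["count"] - eager_calls
```

With debug logging off, the eager form still calls the helper once, while the guarded form never calls it.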

Use allreduce instead of reduce scatter; this lowers CPU overhead.
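
The two collectives are closely related: after an allreduce every rank holds the full reduced buffer, so taking a rank's own partition of that buffer gives the same values a reduce scatter would deliver to it. The sketch below simulates this with plain Python lists rather than torch.distributed. The function names and the 4-rank layout are illustrative assumptions, not the code in stage3.py.

```python
def simulated_all_reduce(per_rank_buffers):
    """Every rank ends up with the element-wise sum of all buffers."""
    reduced = [sum(vals) for vals in zip(*per_rank_buffers)]
    return [list(reduced) for _ in per_rank_buffers]


def simulated_reduce_scatter(per_rank_buffers):
    """Rank r ends up with only the r-th partition of the element-wise sum."""
    world_size = len(per_rank_buffers)
    reduced = [sum(vals) for vals in zip(*per_rank_buffers)]
    part = len(reduced) // world_size
    return [reduced[r * part:(r + 1) * part] for r in range(world_size)]


world_size = 4
# Each simulated rank holds a flat gradient buffer of 8 elements,
# i.e. 2 elements per partition.
buffers = [[float(rank + i) for i in range(8)] for rank in range(world_size)]

full = simulated_all_reduce(buffers)
part = len(buffers[0]) // world_size

# Slice each rank's own partition out of the allreduce result.
sliced = [full[r][r * part:(r + 1) * part] for r in range(world_size)]
scattered = simulated_reduce_scatter(buffers)
```

In this simulation `sliced` and `scattered` are identical, which is why one collective can stand in for the other when the full buffer is available on every rank.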

Don't check for overflow in the gradients of every bucket. Do the overflow check once, on the flat gradient buffer, just before the optimizer step.
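
The difference between the two approaches can be sketched in plain Python. This is an illustration only: `has_overflow`, the bucket layout, and the toy `optimizer_step` are hypothetical, and the real change operates on tensors in stage3.py. Skipping the step on overflow is assumed here because that is how fp16 training with loss scaling commonly behaves.

```python
import math


def has_overflow(values):
    """Return True if any value is inf or NaN."""
    return any(not math.isfinite(v) for v in values)


# Flat gradient buffer, viewed as three equally sized buckets (illustrative).
flat_grads = [0.5, -1.25, 3.0, float("inf"), 0.0, 2.0]
bucket_size = 2
buckets = [
    flat_grads[i:i + bucket_size]
    for i in range(0, len(flat_grads), bucket_size)
]

# Before: one check per bucket, performed as each bucket is reduced.
per_bucket_checks = [has_overflow(b) for b in buckets]
overflow_per_bucket = any(per_bucket_checks)

# After: a single check on the whole flat buffer.
overflow_once = has_overflow(flat_grads)


def optimizer_step(params, grads, lr=0.1):
    """Toy SGD update; the overflow check runs once, right before the update."""
    if has_overflow(grads):
        return params, False  # skip the step, parameters unchanged
    return [p - lr * g for p, g in zip(params, grads)], True
```

Both approaches reach the same verdict, but the second performs one scan per step instead of one per bucket.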

deepspeed/runtime/zero/stage3.py

tjruwase · 2023-06-01T18:11:57Z

@hablb, thanks for this PR. It looks like it could yield some decent performance gains. To appropriately capture the significance, please consider updating the original post with perf results similar to #1453 (comment).

* Remove dead code

params_already_reduced is not used

* Prevent evaluation of debug strings

Debug strings are evaluated even when logging is disabled

* Use contiguous gradients tensor reduce scatter between ranks

Use allreduce instead of reduce scatter; this lowers CPU overhead.

* move overflow tracker to optimizer.step

Don't check for overflow in the gradients of every bucket. Do the overflow check once, on the flat gradient buffer, just before the optimizer step.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

BacharL added 2 commits May 22, 2023 11:28

Remove dead code

e1e4b75

params_already_reduced is not used

Prevent evaluation of debug strings

e3dbb7a

Debug strings are evaluated even when logging is disabled

BacharL requested review from jeffra, tjruwase, samyam and mrwyattii as code owners May 28, 2023 12:35

BacharL force-pushed the perf1 branch 3 times, most recently from a942dfa to 3a87dfa Compare May 29, 2023 08:35

Use contiguous gradients tensor reduce scatter between ranks

bd4d724

Use allreduce instead of reduce scatter; this lowers CPU overhead.

BacharL force-pushed the perf1 branch 2 times, most recently from d6a8711 to bd4d724 Compare May 29, 2023 12:43

move overflow tracker to optimizer.step

9cf826d

Don't check for overflow in the gradients of every bucket. Do the overflow check once, on the flat gradient buffer, just before the optimizer step.