
ZeRO3, improved parameter all-gather operation #1188

Merged: 39 commits into microsoft:master, Oct 31, 2021

Conversation

zarzen (Contributor) commented Jun 25, 2021

For the bing_bert model with the following configuration (about 5.1B params), forward computation time improved from ~470ms to ~270ms and backward time improved from ~900ms to ~670ms (hardware setup: 1x EC2 p4d.24xlarge instance).

    "bert_model_config": {
        "vocab_size_or_config_json_file": 32003,
        "hidden_size": 2560,
        "num_hidden_layers": 64,
        "num_attention_heads": 40,
        "intermediate_size": 10240,
        "hidden_act": "gelu",
        "hidden_dropout_prob": 0.1,
        "attention_probs_dropout_prob": 0.1,
        "max_position_embeddings": 512,
        "initializer_range": 0.02
    },
  1. Removed the norm computation in debug printing.
  2. Changed _all_gather to be a sync op in fetch_sub_module.
    Reason: the async version is not actually async, because each
    all_gather calls torch.cuda.synchronize() to guarantee that the
    previous communication op has completed.
  3. Added a new function, _allgather_params_coalesced.
    The existing _allgather_params does an explicit memcpy after the
    all-gather op; avoiding that explicit memory copy on the Python
    side improves performance.
  4. Changed _partition_param to use torch.empty.

Notes:
Using the recent PyTorch _all_gather_base function could give a further performance boost, since _all_gather_base avoids the redundant memory copy. Refer to pytorch/pytorch#56315. A sketch of this flat-buffer approach is shown below.
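
Purely as an illustration (this is not the PR's code, and the function and parameter names are made up), a parameter whose shards live in ds_tensor on every rank can be gathered into one pre-allocated flat buffer with _all_gather_base, so no per-rank copy is needed on the Python side:

    import torch
    import torch.distributed as dist

    def allgather_param_flat(ds_tensor, param_numel, group=None):
        """Gather one parameter's shards (one equal-sized, possibly padded
        shard per rank) into a single flat buffer with one collective call."""
        world_size = dist.get_world_size(group=group)
        flat = torch.empty(world_size * ds_tensor.numel(),
                           dtype=ds_tensor.dtype,
                           device=ds_tensor.device)
        # _all_gather_base writes every rank's shard directly into `flat`,
        # whereas the list form of dist.all_gather copies from an internal
        # flattened buffer into each output tensor afterwards.
        dist._all_gather_base(flat, ds_tensor.contiguous(), group=group)
        # Drop any padding that was added when the parameter was partitioned.
        return flat.narrow(0, 0, param_numel)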

1) Removed the norm computation in debug printing.
2) Changed _all_gather to be a sync op in fetch_sub_module.
    Reason: the async version is not actually async, because each
    all_gather calls torch.cuda.synchronize() to guarantee that the
    previous communication op has completed.
3) Added a new function, _allgather_params_split_launch.
    The existing _allgather_params does an explicit memcpy after the
    all-gather op; avoiding that explicit memory copy on the Python
    side improves performance.

Known issue:
    `torch.distributed.all_gather` does an implicit memcpy
    at the end of each `ncclAllGather`.
ghost commented Jun 25, 2021

CLA assistant check
All CLA requirements met.

zarzen marked this pull request as draft June 25, 2021 17:33
tjruwase mentioned this pull request Jun 30, 2021
A micro-benchmark shows the improvement from all-gathering a
transformer layer with 9,834,560 elements in half precision is about
1.1ms on an AWS p4d instance.
Performance improvement of the 5.1B BERT model on aws-p4d:
fwd: 300ms -> 200ms
bwd: 680ms -> 610ms
zarzen (Contributor, Author) commented Jul 1, 2021

@jfc4050 @tjruwase
I have pushed the customized all_gather operation.
The op uses the CUDA stream specified by torch.cuda.stream, and it returns a handle with a wrapped CUDA event, so you can query/synchronize/wait on the communication on the given stream.
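
As a rough sketch only (not the PR's custom op; the class and function names here are hypothetical), such a handle can wrap a CUDA event recorded right after the collective is launched, so callers can later query, wait, or synchronize:

    import torch
    import torch.distributed as dist

    class AllGatherHandle:
        """Wraps a CUDA event recorded after the all-gather launch."""
        def __init__(self, event):
            self._event = event

        def query(self):
            # True once the recorded GPU work has finished.
            return self._event.query()

        def wait(self):
            # Make the caller's current stream wait for the all-gather,
            # without blocking the host.
            torch.cuda.current_stream().wait_event(self._event)

        def synchronize(self):
            # Block the host until the all-gather has completed.
            self._event.synchronize()

    def launch_allgather(flat_out, shard, comm_stream, group=None):
        # Illustrative only: with the stock NCCL process group the collective
        # actually runs on the group's internal stream, so this merely
        # approximates launching it on `comm_stream` as the custom op does.
        with torch.cuda.stream(comm_stream):
            dist._all_gather_base(flat_out, shard, group=group)
            event = torch.cuda.Event()
            event.record(comm_stream)
        return AllGatherHandle(event)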

The micro-benchmark shows about a 1.1ms time reduction for all-gathering a transformer layer with 9.8M params (half precision) per partition on a p4d.24xl instance.

In end-to-end training, the forward time could be further reduced to 200ms for the 5.1B bing-bert model (previously the forward time was around 280ms-300ms).

Looking for suggestions.

zarzen marked this pull request as ready for review July 2, 2021 19:11
zarzen (Contributor, Author) commented Oct 12, 2021

> @zarzen, thanks for your question. We just added the HF unit tests and it is causing failures on this PR and #1170. I am currently investigating the failure on #1170 and will get to this one afterwards. However, if you have bandwidth you can also look into this. The steps to run the HF tests and repro can be found here.

Hey, did you find the reason for the failure on #1170? I saw that PR has passed the tests. I plan to work on a fix this Thursday. It would be nice if you could provide some insights about your fix. Thanks!

tjruwase (Contributor) commented:

@zarzen, thanks for following up, and sorry that I forgot to update you. Yes, I was able to fix the issue. The problem was that two ZeRO context objects were constructed along the way, and a parameter that was gathered by one context was partitioned in the other. The fix is to avoid multiple ZeRO contexts and instead reuse the existing context to register any newly discovered parameter. The core of the fix is here.
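
A minimal sketch of that reuse pattern, purely for illustration (the class and function names below are hypothetical stand-ins, not DeepSpeed's actual API):

    class ZeroParamContext:
        """Hypothetical stand-in for the ZeRO-3 parameter context."""
        def __init__(self):
            self.registered_params = []

        def register_param(self, param):
            self.registered_params.append(param)

    _active_zero_context = None

    def get_zero_context():
        # Reuse the one existing context instead of constructing another;
        # with two live contexts, a parameter gathered by one can end up
        # partitioned by the other, which is the bug described above.
        global _active_zero_context
        if _active_zero_context is None:
            _active_zero_context = ZeroParamContext()
        return _active_zero_context

    def register_new_parameter(param):
        get_zero_context().register_param(param)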

We are fortunate that the HF unit tests were able to expose this issue. I will take a closer look at your unit test failures as well.

zarzen (Contributor, Author) commented Oct 14, 2021

> @zarzen, thanks for following up, and sorry that I forgot to update you. Yes, I was able to fix the issue. The problem was that two ZeRO context objects were constructed along the way, and a parameter that was gathered by one context was partitioned in the other. The fix is to avoid multiple ZeRO contexts and instead reuse the existing context to register any newly discovered parameter. The core of the fix is here.
>
> We are fortunate that the HF unit tests were able to expose this issue. I will take a closer look at your unit test failures as well.

Are you referring to this commit, a75e46, for fixing the multi-context issue?

Does that mean I can wait for #1170 to get merged first?

zarzen (Contributor, Author) commented Oct 15, 2021

The runtime error is thrown from check_gpu_tensors in ProcessGroupNCCL.cpp.
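
For context, that check rejects any tensor handed to an NCCL collective that is not on a GPU; a minimal repro (assuming a process group initialized with the NCCL backend) would look roughly like this:

    import torch
    import torch.distributed as dist

    # Assumes dist.init_process_group("nccl", ...) has already been called.
    world_size = dist.get_world_size()
    cpu_shard = torch.ones(8)  # still on the CPU -- this is the problem
    outputs = [torch.empty(8, device="cuda") for _ in range(world_size)]

    # ProcessGroupNCCL validates the inputs in check_gpu_tensors and raises
    # a RuntimeError because cpu_shard is not a CUDA tensor.
    dist.all_gather(outputs, cpu_shard)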

zarzen (Contributor, Author) commented Oct 22, 2021

Update: I'm able to reproduce the failure on my side; currently working on a fix.

tjruwase and others added 2 commits October 22, 2021 12:23
but it is strange that the ds_tensor hasn't been moved to CUDA
zarzen (Contributor, Author) commented Oct 22, 2021

Hi @tjruwase
I found the test failure is due to the device of ds_tensor: it is on the CPU rather than a CUDA device, which is unexpected. I thought the ds_tensor was guaranteed to be on CUDA when we call allgather_param.
The current fix is ad hoc; I just move the ds_tensor to CUDA here:

local_tensors.append(param.ds_tensor.cuda())

Does this imply other potential bugs, maybe?
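
In the same context as the line above (an illustrative variant only, reusing the same `param` and `local_tensors`), a slightly more defensive version would copy the shard only when it is unexpectedly off the GPU:

    ds_tensor = param.ds_tensor
    if not ds_tensor.is_cuda:
        # Unexpected: the shard should already live on a CUDA device here.
        ds_tensor = ds_tensor.cuda()
    local_tensors.append(ds_tensor)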

tjruwase (Contributor) commented:

> Hi @tjruwase I found the test failure is due to the device of ds_tensor: it is on the CPU rather than a CUDA device, which is unexpected. I thought the ds_tensor was guaranteed to be on CUDA when we call allgather_param. The current fix is ad hoc; I just move the ds_tensor to CUDA here:
>
> local_tensors.append(param.ds_tensor.cuda())
>
> Does this imply other potential bugs, maybe?

Yes, this is actually quite concerning and will require further investigation. But I think it is not a blocker for merging this PR, correct?

zarzen (Contributor, Author) commented Oct 27, 2021

> Yes, this is actually quite concerning and will require further investigation. But I think it is not a blocker for merging this PR, correct?

I think so.

tjruwase enabled auto-merge (squash) October 31, 2021 05:59
tjruwase merged commit c0eeb69 into microsoft:master Oct 31, 2021