Add all-gather coalescing for FSDP/ZeRO1 #5950
Conversation
Also allow using reduce-scatter's scale param in FSDP. (revived #4145)
@jeffhataws let me know when you are done addressing comments; I will take another look.
@@ -295,6 +295,7 @@ def __init__(
     sharding_world_size: Optional[int] = None,
     shard_param_on_dim_0: bool = False,
     pin_layout_in_collective_ops: bool = True,
+    coalesce_all_gather_ops: bool = False,
Do you mind explaining the change in this file? I think coalesce_all_gather_ops is always False in our tests; did you run into issues with your own test?
When coalesce_all_gather_ops is True, the parameter shards are collected into a list and gathered with a single coalesced all-gather command at the end, instead of all-gathering one parameter at a time. It is off by default to avoid changing existing behavior. The code is the same as what we are using in our local fork.
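To make the two paths concrete, here is a minimal sketch. It assumes, per this PR's description, that xm.all_gather can take a list of shards and issue one coalesced all-gather over the whole list; the helper gather_full_params is hypothetical and is not the actual FSDP internals.

```python
import torch_xla.core.xla_model as xm

def gather_full_params(param_shards, coalesce_all_gather_ops=False):
    """Gather full parameters from their shards (illustrative only)."""
    if coalesce_all_gather_ops:
        # Shards are collected into one list and gathered with a single
        # coalesced all-gather at the end (assumes xm.all_gather accepts
        # a list of tensors after this change).
        return xm.all_gather(param_shards, dim=0)
    # Default path: one all-gather op per parameter shard.
    return [xm.all_gather(shard, dim=0) for shard in param_shards]
```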
ReduceContext cc_ctx = GetReduceContext(inputs);
std::vector<xla::XlaOp> result(inputs.size());
for (auto& type_ctx : cc_ctx.contexts) {
If you want to assume there is only one type_ctx, let's not use the for loop and GetReduceContext at all. That way we don't need to handle the token per type.
Let me check with others on this.
Mostly LGTM besides the changes in FSDP. If we didn't change the default behavior of all-gather, the tests should pass, right?
I will look into the reduce-scatter one today; let's try to merge these two PRs soon.
Thanks! I think we should test allgather_coalesced using ResNet on GPU to make sure we don't break it in the future. You can refer to the existing test (line 136 in 2c4983d):
PJRT_DEVICE=CUDA python test/test_train_mp_imagenet_fsdp.py --fake_data --auto_wrap_policy type_based --use_small_fake_sample --num_epochs=1
We can do that in a separate PR.
* Add all-gather and reduce-scatter coalescence support for FSDP. Also allow using reduce-scatter's scale param in FSDP. (revived pytorch#4145)
* clang-format-7 and python lint fixes
* Fix "SyntaxError: 'return' outside function" error
* Code/test fixes to get run_tests.sh to run on CPU
* Fix allgather to be compatible with openxla allgather tuple change without token
* Fix reduce-scatter-coalesce to be compatible with openxla reduce-scatter tuple change without token
* Separate out the reduce-scatter-coalesce changes into a separate PR
* Some cleanups
* Add separate BuildAllGatherCoalesced builder and AllGatherCoalesced class
* Use token_handler.GetInput to capture token
* Clean up
* Clean up
* Switch to GetOperandListWithToken naming for func GetOperandList
This PR adds all-gather coalescing support and uses it in FSDP/ZeRO1 (replacing #5624). It is to be used in conjunction with openxla/xla#5740.
A separate, related PR, #5938, adds reduce-scatter coalescing and also enables using reduce-scatter's scale param in FSDP.
This is a revival of #4145; the comments there will need to be addressed.
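For reference, a hypothetical usage sketch of the new flag, assuming only what the diff above shows (a coalesce_all_gather_ops keyword on the XlaFullyShardedDataParallel constructor, defaulting to False):

```python
import torch
import torch_xla.core.xla_model as xm
from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP

device = xm.xla_device()
model = torch.nn.Linear(1024, 1024).to(device)

# Off by default to preserve existing behavior; opting in gathers
# parameter shards with one coalesced all-gather instead of one
# all-gather op per parameter.
fsdp_model = FSDP(model, coalesce_all_gather_ops=True)
```

Leaving the flag at its default keeps the existing one-all-gather-per-parameter behavior, so current users are unaffected.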