
Add reduce-scatter coalescing for FSDP/ZeRO1 #5956

Merged
JackCaoG merged 4 commits into master from cc_coalesce_reducescatter on Dec 7, 2023

Conversation

jeffhataws (Collaborator) commented Nov 30, 2023

This PR adds reduce-scatter coalescing support and uses it in FSDP/ZeRO1 (replacing #5938). It also enables using reduce-scatter's scale param in FSDP. This PR is a companion to #5950 and is to be used in conjunction with openxla/xla#5740.

This is a revival of #4145. The comments from that PR will need to be addressed.
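
For context, here is a rough sketch of how the coalesced call is intended to be used from FSDP/ZeRO1. This is illustrative only; the helper name and argument values are assumptions, not code from this PR, and list input to xm.reduce_scatter is the new path this PR introduces:

```python
import torch_xla.core.xla_model as xm

def reduce_scatter_grads(grad_tensors, world_size):
  # Passing a list of tensors takes the coalesced path added in this PR:
  # all tensors are reduce-scattered by a single collective instead of one
  # collective per tensor. The scale param averages the result across replicas.
  return xm.reduce_scatter(
      xm.REDUCE_SUM,
      grad_tensors,              # list[Tensor] -> coalesced reduce-scatter
      scale=1.0 / world_size,
      scatter_dim=0,
      shard_count=world_size,
      pin_layout=False)          # pinned layouts hit issues, see below
```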

JackCaoG (Collaborator) commented:

I think my comment for this PR will be very similar to the all-gather one: let's try not to change the default behavior of reduce_scatter. Let me know when I should do another round of review.

JackCaoG (Collaborator) commented Dec 2, 2023

Now that the all-gather one is merged, do you mind resolving the conflict in this PR?

jeffhataws (Collaborator, Author) commented:

FSDP test passes with pin layout disabled (--no_pin_layout_in_collective_ops).

@JackCaoG any idea on where this error comes from or how to resolve it?

F0000 00:00:1701586053.119077 375805 shape.h:169] Check failed: has_layout() element_type: TUPLE tuple_shapes { element_type: S64 dimensions: 4 layout { minor_to_major: 0 } is_dynamic_dimension: false } tuple_shapes { element_type: S64 dimensions: 10 layout { minor_to_major: 0 } is_dynamic_dimension: false }

jeffhataws (Collaborator, Author) commented:

> FSDP test passes with pin layout disabled (--no_pin_layout_in_collective_ops).
>
> @JackCaoG any idea on where this error comes from or how to resolve it?
>
> F0000 00:00:1701586053.119077 375805 shape.h:169] Check failed: has_layout() element_type: TUPLE tuple_shapes { element_type: S64 dimensions: 4 layout { minor_to_major: 0 } is_dynamic_dimension: false } tuple_shapes { element_type: S64 dimensions: 10 layout { minor_to_major: 0 } is_dynamic_dimension: false }

I worked around this by falling back to single-tensor reduce-scatter when not coalescing (bucket size is 0). Now the FSDP test is passing on GPU.
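
For clarity, here is a minimal sketch of the fallback described above. The function name and the omission of bucketing-by-size are assumptions for illustration, not the PR's actual code:

```python
import torch_xla.core.xla_model as xm

def reduce_scatter_maybe_coalesced(tensors, scale, scatter_dim, shard_count,
                                   bucket_size_mb=0):
  # bucket_size_mb == 0 (the default): fall back to one reduce-scatter per
  # tensor, which avoids the tuple-shape layout check that fails above.
  if bucket_size_mb == 0:
    return [
        xm.reduce_scatter(xm.REDUCE_SUM, t, scale, scatter_dim, shard_count)
        for t in tensors
    ]
  # bucket_size_mb > 0: pass the whole list so a single coalesced collective
  # is emitted per bucket (splitting the list into buckets is omitted here).
  return xm.reduce_scatter(xm.REDUCE_SUM, tensors, scale, scatter_dim,
                           shard_count, pin_layout=False)
```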

JackCaoG (Collaborator) commented Dec 4, 2023

Somewhere in the code, XLA asserts that the input shape has a layout, but the tuple does not have a layout... This is the part where I say we should actually test coalescing on the FSDP ResNet test. If there are errors on the XLA side we should fix them. If it doesn't work on GPU and TPU, for this release we can call it a Trainium-specific feature and try to fix it for the next release.

jeffhataws (Collaborator, Author) commented:

> Somewhere in the code, XLA asserts that the input shape has a layout, but the tuple does not have a layout... This is the part where I say we should actually test coalescing on the FSDP ResNet test. If there are errors on the XLA side we should fix them. If it doesn't work on GPU and TPU, for this release we can call it a Trainium-specific feature and try to fix it for the next release.

Yes I agree. Will you merge this for 2.2 then?

JackCaoG (Collaborator) commented Dec 4, 2023

I can merge to unblock you guys (and since it doesn't impact the default behavior), but if we can't verify that it actually works, I am just not going to mention it in the release notes.

JackCaoG (Collaborator) commented Dec 4, 2023

@jeffhataws can you separate out the FSDP change into a separate PR? I can ask @alanwaketan to review that part. I will try to help you land the reduce-scatter one after adding some tests.

jeffhataws (Collaborator, Author) commented:

Strange, locally ./test/run_tests.sh is failing after the latest changes to separate the FSDP change out:

```
======================================================================
FAIL: test_patched_linear_2D_bias (__main__.TestAtenXlaTensor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/jthuynh/pytorch/xla/test/test_operations.py", line 1817, in test_patched_linear_2D_bias
    self.assertTrue(torch.allclose(output.cpu(), output_cpu))
AssertionError: False is not true

======================================================================
FAIL: test_patched_linear_3D (__main__.TestAtenXlaTensor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/jthuynh/pytorch/xla/test/test_operations.py", line 1761, in test_patched_linear_3D
    self.assertTrue(torch.allclose(output.cpu(), output_cpu))
AssertionError: False is not true

======================================================================
FAIL: test_patched_linear_3D_bias (__main__.TestAtenXlaTensor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/jthuynh/pytorch/xla/test/test_operations.py", line 1789, in test_patched_linear_3D_bias
    self.assertTrue(torch.allclose(output.cpu(), output_cpu))
AssertionError: False is not true
```

JackCaoG (Collaborator) commented Dec 4, 2023

Hmm, can you print out the output values? If they are close you can ignore them. It might just be a precision issue, with your local machine and CI using different random seeds.
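
For reference, one way to check whether the mismatch is just precision noise. This is an illustrative snippet meant to be dropped into the failing test; `output` and `output_cpu` come from the test above, and the tolerances are arbitrary:

```python
import torch

# Print the worst element-wise mismatch, then retry allclose with looser
# tolerances to see whether the failure is only a precision issue.
actual = output.cpu()
expected = output_cpu
print("max abs diff:", (actual - expected).abs().max().item())
print("allclose (loose):",
      torch.allclose(actual, expected, rtol=1e-3, atol=1e-4))
```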

jeffhataws (Collaborator, Author) commented:

> Hmm, can you print out the output values? If they are close you can ignore them. It might just be a precision issue, with your local machine and CI using different random seeds.

Something weird with my environment, so let's ignore it. CI seems to be fine.

jeffhataws (Collaborator, Author) commented:

> @jeffhataws can you separate out the FSDP change into a separate PR? I can ask @alanwaketan to review that part. I will try to help you land the reduce-scatter one after adding some tests.

#6024 is the PR for FSDP.

Comment on lines +247 to +248
std::vector<XLATensorPtr> xtensors_out =
GetXlaTensors(outputs, /*want_all=*/true);
JackCaoG (Collaborator) commented:

If you want to handle the output cases, it is better to define a ReduceScatterCoalescedOut instead of merging the logic into a single function.

jeffhataws (Collaborator, Author) replied:

Is it ok if I work on this in another change?

JackCaoG (Collaborator) replied:

That's OK, but can we explicitly error out on the Python API side if output is not None? I'd rather throw an explicit error for cases that we don't test/support yet.

JackCaoG (Collaborator) replied:

We can work on reduce_scatter_out in a separate PR.

jeffhataws (Collaborator, Author) replied:

Thanks Jack. Here's the change to error out if output != None: 9baadf5
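
For readers following along, a sketch of what such a guard could look like on the Python side of xm.reduce_scatter. This is not the code in 9baadf5, just the shape of the check; the error message is illustrative:

```python
def reduce_scatter(reduce_type, input, scale, scatter_dim, shard_count,
                   groups=None, output=None, pin_layout=True):
  # Guard for the coalesced path: the `output` parameter is not supported
  # with list input yet, so fail loudly instead of silently ignoring it.
  if isinstance(input, list) and output is not None:
    raise RuntimeError(
        "reduce_scatter: `output` is not supported for coalesced (list) "
        "input yet.")
  # ... existing single-tensor / coalesced dispatch continues here ...
```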

jeffhataws (Collaborator, Author) replied:

#6058 adds the out-of-place version of reduce-scatter, along with #6059 for out-of-place all-gather.

JackCaoG (Collaborator) commented Dec 5, 2023

Left some comments. @jeffhataws can you rebase this PR and add test cases to https://github.com/pytorch/xla/blob/master/test/test_mp_reduce_scatter.py so we can actually run reduce_scatter with list input on GPU?

Commit messages:
- Also allow using reduce-scatter's scale param in FSDP (revived #4145)
- Fix reduce-scatter-coalesce to be compatible with the openxla reduce-scatter tuple change without token
- Switch to GetOperandListWithToken naming for func GetOperandList
- Add separate BuildReduceScatterCoalesced builder
- Use token_handler.GetInput to consume the token
- If bucket_size_mb is the default 0, reduce-scatter every tensor rather than coalescing
- Fix error checking in xm.reduce_scatter
- Move FSDP changes to another PR
jeffhataws force-pushed the cc_coalesce_reducescatter branch 2 times, most recently from be0d3f7 to 51e9919 on December 5, 2023 at 20:54
jeffhataws force-pushed the cc_coalesce_reducescatter branch from 4624ec5 to 69968e5 on December 6, 2023 at 03:34
jeffhataws (Collaborator, Author) commented:

> Left some comments. @jeffhataws can you rebase this PR and add test cases to https://github.com/pytorch/xla/blob/master/test/test_mp_reduce_scatter.py so we can actually run reduce_scatter with list input on GPU?

Added a test case. It can only handle pin_layout=False at the moment.
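
For reference, a rough shape of such a test, illustrative only; the actual test added to test_mp_reduce_scatter.py may differ in names and values, and list input plus pin_layout=False reflects the behavior discussed above:

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
  device = xm.xla_device()
  world_size = xm.xrt_world_size()
  if world_size < 2:
    return
  # Two tensors of different sizes, reduce-scattered in one coalesced call.
  inputs = [
      torch.ones(world_size * 8, device=device),
      torch.ones(world_size * 16, device=device),
  ]
  outputs = xm.reduce_scatter(
      xm.REDUCE_SUM, inputs, scale=1.0, scatter_dim=0,
      shard_count=world_size, pin_layout=False)
  xm.mark_step()
  # Each shard should contain the all-reduce sum of ones, i.e. world_size.
  for out in outputs:
    expected = torch.full_like(out.cpu(), world_size)
    assert torch.allclose(out.cpu(), expected)

if __name__ == '__main__':
  xmp.spawn(_mp_fn, args=())
```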

JackCaoG (Collaborator) commented Dec 6, 2023

Let me take another look. The 2.2 branch is cut; I will take care of backporting this PR after it is merged.

JackCaoG (Collaborator) left a review comment:

LGTM with one minor comment. If we can explicitly error out on the out != None case and CI passes, I can merge this one.

JackCaoG (Collaborator) left a review comment:

Thanks @jeffhataws !

JackCaoG merged commit ac94781 into master on Dec 7, 2023. 18 checks passed.
ManfeiBai pushed a commit that referenced this pull request Dec 8, 2023
jeffhataws added a commit to jeffhataws/xla that referenced this pull request Dec 8, 2023
jeffhataws added a commit to jeffhataws/xla that referenced this pull request Dec 11, 2023
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023