ZeRO1: Add bucketting logic to control the size of tensors for all-gather/reduce-scatter #6025

jeffhataws · 2023-12-05T05:32:32Z

This PR updates the XLA ZeRO1 implementation to use coalesced all-gather and coalesced reduce-scatter.
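For readers unfamiliar with size-capped bucketing, here is a minimal sketch of the general idea in plain Python. It is not the code from this PR: the function name `bucket_by_size` and the representation of tensors as a list of byte sizes are assumptions made for illustration only.

```python
from typing import List


def bucket_by_size(sizes_bytes: List[int], bucket_cap_bytes: int) -> List[List[int]]:
    """Group consecutive tensor indices into buckets bounded by a byte cap.

    Hypothetical helper: a bucket is closed as soon as adding the next
    tensor would exceed the cap. A tensor larger than the cap still gets
    a bucket of its own, so no tensor is ever dropped.
    """
    buckets: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for idx, nbytes in enumerate(sizes_bytes):
        # Flush the running bucket before it would overflow the cap.
        if current and current_bytes + nbytes > bucket_cap_bytes:
            buckets.append(current)
            current = []
            current_bytes = 0
        current.append(idx)
        current_bytes += nbytes
    # Emit the trailing bucket only if it holds something, so callers
    # never receive an empty bucket.
    if current:
        buckets.append(current)
    return buckets
```

Each resulting bucket would then be handed to one coalesced collective call, which bounds the amount of data in flight per call.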

JackCaoG · 2024-03-12T23:54:32Z

Test crashed at torch_xla::tensor_methods::all_gather_coalesced_out(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, which seems to be a real issue.

JackCaoG · 2024-03-18T17:45:43Z

@alanwaketan can you review this one as well, since you also reviewed the gradient bucketing one?

hgt312

overall LGTM

JackCaoG

mostly lgtm, minor comments.

JackCaoG · 2024-03-21T19:03:41Z

@jeffhataws is this ready for another round of review?

jeffhataws · 2024-03-21T21:15:01Z

> @jeffhataws is this ready for another round of review?

I have a set of cleanups coming in an hour or so. I noticed that we have the code below, which exists only in the all-gather path. I will remove it and have separate bucket_cap_mb settings for all-gather and reduce-scatter in ZeRO1.

    if groups:
      divisor = len(groups[0]) if type(groups[0]) == list else len(groups)
    else:
      divisor = xrt_world_size()
    self._bucket_cap = self._bucket_cap / divisor
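Read in isolation, the snippet above divides the configured bucket cap by the size of the replica group. A minimal standalone sketch of that computation follows, assuming `groups` is a list of rank lists, a flat rank list, or empty/None. The function name `scaled_bucket_cap` and the explicit `world_size` parameter are hypothetical stand-ins for the method context and for `xrt_world_size()`.

```python
from typing import List, Optional, Union

Groups = Optional[Union[List[int], List[List[int]]]]


def scaled_bucket_cap(bucket_cap: float, groups: Groups, world_size: int) -> float:
    """Return the bucket cap divided by the replica-group size.

    Hypothetical helper mirroring the quoted branch structure:
    - list of lists: use the size of the first group
    - flat list:     use the number of ranks listed
    - empty or None: fall back to the world size
    """
    if groups:
        first = groups[0]
        divisor = len(first) if isinstance(first, list) else len(groups)
    else:
        divisor = world_size
    return bucket_cap / divisor
```

One plausible reading, not confirmed in the thread, is that the division was meant to bound the gathered output, which is larger than each rank's input by a factor of the group size.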

jeffhataws · 2024-03-21T23:28:11Z

> @jeffhataws is this ready for another round of review?

@JackCaoG It is ready now for another round. Thanks.

JackCaoG · 2024-03-21T23:50:16Z

@jeffhataws Thanks for the refactoring work!

jeffhataws · 2024-03-22T03:03:41Z

test/test_mp_reduce_scatter.py

+      assert res.cpu().allclose(expected)
+
+    xm.rendezvous(
+        'test_reduce_scatter_list_input_output_bucketized, zero bucket size')


Hi @JackCaoG, does rendezvous allow a comma and a space in the rendezvous key? How come this didn't error out?

If this is not a concern, we can merge this PR.

JackCaoG · 2024-03-22

Looking at the implementation of xla_rendezvous, I think the tag is ignored, so it doesn't really matter.

xla/torch_xla/core/xla_model.py

Lines 1110 to 1115 in 782f05d

    def xla_rendezvous(payload: bytes = b'',
                       ordinals: Optional[List[int]] = None,
                       tag: Optional[str] = None) -> List[bytes]:
      """Share `payload` with all replicas in `ordinals`.

      `tag` is ignored except for logging.

…ther/reduce-scatter (#6025) Co-authored-by: Rahul Solanki <rhsoln@amazon.com> Co-authored-by: guangtai <guangtai@amazon.com> Co-authored-by: Amithrajith Mamidala <amithrm@amazon.com>

…or all-gather/reduce-scatter (#6025) (#6806) Co-authored-by: jeffhataws <jthuynh@amazon.com> Co-authored-by: Rahul Solanki <rhsoln@amazon.com> Co-authored-by: guangtai <guangtai@amazon.com> Co-authored-by: Amithrajith Mamidala <amithrm@amazon.com>

jeffhataws force-pushed the jeffhataws_zero1_fixes2 branch 2 times, most recently from 7c3d92d to 84a509d Compare December 7, 2023 22:01

jeffhataws added the backport_2.2 label Dec 7, 2023

jeffhataws requested review from alanwaketan and JackCaoG December 8, 2023 03:14

jeffhataws force-pushed the jeffhataws_zero1_fixes2 branch from 84a509d to 285a766 Compare December 10, 2023 17:04

jeffhataws force-pushed the jeffhataws_zero1_fixes2 branch from a453257 to 6022c91 Compare March 7, 2024 21:38

JackCaoG added the backport_2.3 label Mar 12, 2024

aws-rhsoln and others added 10 commits March 15, 2024 17:21

add bucketting logic to control the size of tensors for all-gather and reduce-scatter (90eda15)

Yapf lint fixes (46a069a)

handle the case when groups is none (8e79997)

update zero1 (5a87467)

yapf lint fixes (b354c27)

Fix missing curly brackets in assertion msg (22e29d3)

Fixing FAL issue when sharded params are initialized with torch.double (96c61cd); cr: https://code.amazon.com/reviews/CR-112545987

Yapf fixes (6b7ce8f)

Fix indices and variable names (a5de71a)

Checking of <tensor>.numel for output tensors cause error in GPU runtime (77b2ad1)

jeffhataws force-pushed the jeffhataws_zero1_fixes2 branch from a8f050e to 77b2ad1 Compare March 15, 2024 21:16

Avoid passing empty input buckets (ae348b2)

hgt312 reviewed Mar 19, 2024

jeffhataws force-pushed the jeffhataws_zero1_fixes2 branch from 173ef47 to 13965fd Compare March 19, 2024 23:06

Fix indent for 2 lines in ZeRO1 (shard.grad = grad_shard, index += 1) (8586370)

jeffhataws force-pushed the jeffhataws_zero1_fixes2 branch from 13965fd to 8586370 Compare March 19, 2024 23:08

Refactor bucketized all-gather/reduce-scatter functions; add bucket_cap_mb arg (675e7a1)

jeffhataws force-pushed the jeffhataws_zero1_fixes2 branch from ec4b1e0 to 675e7a1 Compare March 20, 2024 16:10

jeffhataws requested a review from hgt312 March 20, 2024 16:16

hgt312 approved these changes Mar 20, 2024

JackCaoG reviewed Mar 20, 2024

Refactor bucketing logic into a class, shared by all-gather/reduce-scatter (d7c9958)

Remove bucket-cap division logic; separate bucket cap for allgather/reducescatter (5006388)

jeffhataws mentioned this pull request Mar 21, 2024

Misc bug fixes in Zero optimizer: handling differentiable argument, optimizer_dtype #6454

Closed

jeffhataws requested a review from JackCaoG March 21, 2024 23:27

JackCaoG approved these changes Mar 21, 2024

JackCaoG merged commit e75677f into master Mar 22, 2024
18 checks passed

JackCaoG mentioned this pull request Mar 22, 2024

2.3 backport PR request list #6676

Closed

jeffhataws deleted the jeffhataws_zero1_fixes2 branch November 22, 2024 23:03
