
Add API to assemble CPU shards to a sharded tensor #5630

Merged — jonb377 merged 4 commits into master from jonbolin-assemble-shards on Oct 5, 2023

Conversation

jonb377 (Collaborator) commented Sep 20, 2023

Currently, to convert CPU shards to an XLAShardedTensor, the API XLAShardedTensor::load_local_shards_ must be used. This loads CPU shards in-place to an existing sharded tensor. A more convenient method for use cases outside of distributed checkpointing is to directly assemble the shards into a global tensor on device.

This PR adds a private _XLAC API _get_global_tensor_from_cpu_shards to allow directly creating a sharded tensor from a list of CPU shards and an OpSharding generated by Mesh::get_op_sharding.

I currently plan to use this new API in a few places (a usage sketch follows the list below):

  • tpu.py::discover_master_worker_ip_address: place each host's IP into a global tensor and pull out the zeroth entry for worker 0. The existing IP discovery API currently doesn't work with SPMD, but it's needed for distributed checkpointing.
  • Distributed data loading: A per-device, sharding-aware dataloader can use this API to efficiently create device data from the loaded CPU shards.
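For illustration, here is a minimal sketch of how the API could be used, assuming the Python surface shown in the diffs later in this conversation (Mesh.get_op_sharding and XLAShardedTensor.from_cpu_shards); the import paths and shard values are illustrative, not taken from this PR:

# Import paths are assumptions; adjust to your torch_xla version.
import torch
import torch_xla.runtime as xr
import torch_xla.experimental.xla_sharding as xs
from torch_xla.experimental.xla_sharded_tensor import XLAShardedTensor

num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(list(range(num_devices)), (num_devices,))

# One CPU shard per device; each shard must already carry any padding.
shards = [torch.arange(4) + 4 * i for i in range(num_devices)]

# Build the OpSharding for a 1D tiled sharding and assemble the global tensor on device.
op_sharding = mesh.get_op_sharding((0,))
global_tensor = XLAShardedTensor.from_cpu_shards(shards, op_sharding)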

@jonb377 jonb377 marked this pull request as ready for review September 20, 2023 20:41
torch_xla/runtime.py — outdated review thread, resolved
@jonb377 jonb377 self-assigned this Sep 21, 2023
// Set a default value for the global shape based on the sharding
// type.
if (sharding.type() == xla::OpSharding::OTHER) {
// Infer the global shape to be the shard shape scaled by the tiling
Contributor: This may not hold true in the case of uneven tiling. Let's make a note on this.

Contributor: Or, actually, for the right-most shard in each dim we can add the sizes, as the padding is always on the last dims.

Contributor: Ok, we can't do that, since the shards are validated with "Input shard shape must include padding: " << shard.sizes().

jonb377 (Collaborator, author):
Let me know if I should revisit this. I decided on this approach because padding can cross over multiple devices, e.g. sharding a tensor with shape (1, 2) on the mesh (1, 4) will have shards with shape (1, 1) with no real data on the last two devices.
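To make that scenario concrete, a hedged sketch continuing the imports from the earlier example (the values are made up): a (1, 2) tensor sharded over a (1, 4) mesh yields four (1, 1) shards, two of which are pure padding, so inferring the global shape as shard shape scaled by the tiling would give (1, 4) and the true shape has to be supplied explicitly.

mesh = xs.Mesh(list(range(4)), (1, 4))
op_sharding = mesh.get_op_sharding((0, 1))
# Four (1, 1) shards; the last two hold only padding, not real data.
shards = [torch.tensor([[1]]), torch.tensor([[2]]),
          torch.tensor([[0]]), torch.tensor([[0]])]
global_tensor = XLAShardedTensor.from_cpu_shards(
    shards, op_sharding, global_shape=torch.Size((1, 2)))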

Collaborator: ping @yeounoh

Contributor: I think for the most part the user should specify the global shape. I'm thinking we should actually make it explicit rather than optional... but if not, this way of handling it, inferring a default shape, is good.

jonb377 (Collaborator, author):
That's a good point... I chose to let it be inferred for convenience: in distributed data loading, for example, the shards are provided on CPU with the correct padded local shape, and deriving the global shape could be difficult for more sophisticated shardings.

I'll go ahead and land with it optional for now, since we're keeping the API private. Thanks Yeounoh!

@@ -87,6 +87,14 @@ def get_op_sharding(self,
Return the OpSharding for the given partition spec. This is an expensive
operation as the mesh grows, so the value is cached for reuse.
"""
partition_spec = _translate_named_partition_spec(self, partition_spec)
Contributor: Thanks for refactoring 👍

yeounoh (Contributor) left a comment: Left some comments.

torch_xla/csrc/init_python_bindings.cpp — two outdated review threads, resolved
@staticmethod
def from_cpu_shards(shards: List[torch.Tensor],
                    sharding: torch_xla._XLAC.OpSharding,
                    global_shape: torch.Size = None):
Collaborator: What's the benefit of providing a global_shape?

jonb377 (Collaborator, author) commented Sep 25, 2023:
If the shards are padded, the global_shape can be used to remove padding from the global tensor.
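As a hedged illustration of this, continuing the earlier sketch (the 7-element shape and shard values are made up): seven elements sharded over four devices require padded shards of two elements each, so without a global_shape the assembled tensor would be inferred as shape (8,); passing torch.Size((7,)) strips the padding.

mesh = xs.Mesh(list(range(4)), (4,))
op_sharding = mesh.get_op_sharding((0,))
# Shards carry the padded local shape (2,); the final element is padding.
shards = [torch.arange(2 * i, 2 * i + 2, dtype=torch.float32) for i in range(4)]
shards[-1][-1] = 0.0  # padding slot, not real data
global_tensor = XLAShardedTensor.from_cpu_shards(
    shards, op_sharding, global_shape=torch.Size((7,)))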

yeounoh (Contributor) left a comment: LGTM

jonb377 (Collaborator, author) commented Oct 5, 2023:
Thanks @yeounoh and @alanwaketan for the review. I'll merge after a TPU CI run.

alanwaketan (Collaborator) left a comment: LGTM.

jonb377 merged commit 3913a77 into master on Oct 5, 2023
19 checks passed
jonb377 deleted the jonbolin-assemble-shards branch on October 5, 2023 at 21:23
Commits in this pull request:

  • Add API to assemble CPU shards to a sharded tensor
  • Handle replicated sharding
  • Move validations into get_op_sharding
  • Improve tests and error handling

This pull request was later referenced by commits pushed by jonb377, qihqi, zpcore, golechwierowicz, and bhavya01, and by commits in the forks ghpvnist/xla and chunnienc/xla.
4 participants