
[SPMD] auto-sharding PoC #6719

Merged: 39 commits merged into master from spmd_auto_alpa on Mar 14, 2024

Conversation

@yeounoh (Contributor) commented Mar 12, 2024

This implements a proof-of-concept (PoC) of auto-sharding on XLA:TPU, as described in #6322.

PyTorch/XLA auto-sharding can be enabled by one of the following:

- Setting the environment variable `XLA_SPMD_AUTO=1`
- Calling the SPMD API at the beginning of your code:

  ```python
  import torch_xla.runtime as xr
  xr.use_spmd(auto=True)
  ```

- Calling `torch.distributed._tensor.distribute_module` with the auto policy on an `xla` device mesh (an end-to-end sketch follows this list):

  ```python
  import torch_xla.runtime as xr
  from torch.distributed._tensor import DeviceMesh, distribute_module
  from torch_xla.distributed.spmd import auto_policy

  device_count = xr.global_runtime_device_count()
  device_mesh = DeviceMesh("xla", list(range(device_count)))

  # Currently, the model should be loaded onto the xla device via distribute_module.
  model = MyModule()  # nn.Module
  sharded_model = distribute_module(model, device_mesh, auto_policy)
  ```
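For context, here is a minimal end-to-end sketch of the `distribute_module` path above; the `MyModule` definition, tensor shapes, and plain SGD step are illustrative assumptions, not part of this PR:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
from torch.distributed._tensor import DeviceMesh, distribute_module
from torch_xla.distributed.spmd import auto_policy


class MyModule(nn.Module):
    # Toy stand-in for a real workload (illustrative only).
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(128, 64)

    def forward(self, x):
        return self.linear(x)


device_count = xr.global_runtime_device_count()
device_mesh = DeviceMesh("xla", list(range(device_count)))

# distribute_module loads the model onto the xla device and lets the
# auto-sharding pass choose the shardings.
sharded_model = distribute_module(MyModule(), device_mesh, auto_policy)

optimizer = torch.optim.SGD(sharded_model.parameters(), lr=0.1)
x = torch.randn(16, 128).to(xm.xla_device())

loss = sharded_model(x).sum()
loss.backward()
optimizer.step()
xm.mark_step()  # cut the lazy-tensor graph; compilation applies the shardings
```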

Some notable limitations that we will address in follow-ups:

- XLA:GPU is not supported
- TPU pod is not supported
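Until those land, a guard along these lines is a reasonable pattern; using `xr.device_type()` to detect the runtime, with the fallback to manual SPMD being an illustrative assumption:

```python
import torch_xla.runtime as xr

# Auto-sharding is a TPU-only (single-host) PoC for now.
if xr.device_type() == "TPU":
    xr.use_spmd(auto=True)
else:
    xr.use_spmd()  # fall back to manual sharding annotations
```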

cc @baoleai

@yeounoh requested a review from @JackCaoG on March 12, 2024 00:21
@yeounoh self-assigned this on March 12, 2024
@yeounoh marked this pull request as draft on March 12, 2024 00:22
@yeounoh force-pushed the spmd_auto_alpa branch 2 times, most recently from 126ceee to 4d568ef on March 12, 2024 00:25
WORKSPACE (review thread resolved; outdated)
setup.py (review thread resolved; outdated)
@yeounoh force-pushed the spmd_auto_alpa branch 2 times, most recently from 6ca8f97 to d6dc442 on March 12, 2024 00:38
@yeounoh force-pushed the spmd_auto_alpa branch 12 times, most recently from 303b239 to d3c1d70 on March 12, 2024 07:34
@yeounoh force-pushed the spmd_auto_alpa branch 4 times, most recently from 968bca4 to eadcae6 on March 14, 2024 18:15
```diff
@@ -226,6 +226,8 @@ function run_xla_op_tests3 {
   run_test "$CDIR/spmd/test_xla_distributed_checkpoint.py"
   run_test "$CDIR/spmd/test_xla_spmd_python_api_interaction.py"
   run_test "$CDIR/spmd/test_dtensor_integration.py"
+  run_test "$CDIR/spmd/test_dtensor_integration2.py"
```
Collaborator
Do we need this on the TPU CI as well, or is it ok to leave out?

@yeounoh (Contributor, Author)
Ohhh, I think it's ok to leave out. Want to run this sanity check on TPU!

@JackCaoG (Collaborator) left a comment

Feel free to address the remaining comments in a follow-up.

@yeounoh merged commit 370089a into master on Mar 14, 2024
18 checks passed
@yeounoh added a commit that referenced this pull request on Mar 14, 2024