Add doc for multinode GPU training. #5704
Conversation
LGTM
```
--nnodes=2 \
--node_rank=1 \
--nproc_per_node=4 \
--rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet_torchrun.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
```
`test_train_mp_imagenet_torchrun.py` -> `pytorch/xla/test/test_train_mp_imagenet.py`
done
Reopening - I still see `test_train_mp_imagenet_torchrun`.
@@ -194,15 +194,65 @@ for more information.

*Warning: GPU support is still highly experimental!*

### Single-node GPU training

To use GPUs with PJRT, simply set `PJRT_DEVICE=GPU` and configure
`GPU_NUM_DEVICES` to the number of devices on the host. For example:

```
PJRT_DEVICE=GPU GPU_NUM_DEVICES=4 python3 xla/test/test_train_mp_imagenet.py --fake_data --batch_size=128 --num_epochs=1
```
Is this still accurate with `GPU_NUM_DEVICES`?
hi @will-cromar and @jonb377, I know in our previous discussion we said we should replace `GPU_NUM_DEVICES` with `LOCAL_WORLD_SIZE`, but I don't think we can do that. The reason is that if we do, then running `PJRT_DEVICE=GPU LOCAL_WORLD_SIZE=2 python -c 'xm.xla_device()'` would hang, because the distributed runtime service expects 2 clients here but we only have 1 process/client. How do you feel about keeping `GPU_NUM_DEVICES` in the single-host-multi-GPU case?
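To make the failure mode concrete, a hedged sketch of the two invocations (the `torch_xla.core.xla_model` import is the standard one; the exact commands are illustrative, not taken verbatim from the thread):

```
# Works: GPU_NUM_DEVICES only sizes the local device pool, so a single
# process can initialize every device on its own.
PJRT_DEVICE=GPU GPU_NUM_DEVICES=2 python -c 'import torch_xla.core.xla_model as xm; print(xm.xla_device())'

# Hangs: LOCAL_WORLD_SIZE=2 makes the distributed runtime service wait
# for two client processes, but only this one ever connects.
PJRT_DEVICE=GPU LOCAL_WORLD_SIZE=2 python -c 'import torch_xla.core.xla_model as xm; print(xm.xla_device())'
```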
That's a good point. Let's leave `GPU_NUM_DEVICES` for single-host-multi-GPU and try to think of a better solution.
Let's leave it in; I don't see a straightforward way around the issue. Do we need to modify the runtime initialization logic in the computation client to account for this?
LGTM
Force-pushed from de2a842 to cb38e99.
cc @zpcore @ManfeiBai for reference.
Overall LGTM
You can also use `torchrun` to initiate the single-node multi-GPU training. For example,

```
PJRT_DEVICE=GPU torchrun --nnodes 1 --nproc-per-node ${NUM_GPU_DEVICES} xla/test/test_train_mp_imagenet.py --fake_data --batch_size=128 --num_epochs=1
```
Does torchrun set all the appropriate environment variables even for single-host? `MASTER_ADDR` is the one I'm curious about.
Here if I don't specify `--rdzv_endpoint`, `MASTER_ADDR` will not be set by torchrun. In our code, if it's not set, we default to localhost.
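A minimal sketch of the fallback being described, assuming an ordinary environment-variable lookup (variable names aside, this is hypothetical, not the actual torch_xla source):

```
import os

# If torchrun was launched without --rdzv_endpoint, MASTER_ADDR is unset;
# default to localhost so single-node runs still initialize correctly.
master_addr = os.environ.get("MASTER_ADDR", "localhost")
```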
### Multi-node GPU training

**Note that this feature only works for CUDA 12+**. Similar to how PyTorch uses multi-node training, you can run the command as below:
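For context, the command the hunk goes on to show, reconstructed here from the fragment quoted later in this thread (the leading `PJRT_DEVICE=GPU torchrun \` line is an assumption, since the quote cuts it off):

```
PJRT_DEVICE=GPU torchrun \
--nnodes=2 \
--node_rank=0 \
--nproc_per_node=4 \
--rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
```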
Does the cuda 12 constraint also apply to the single-node case?
AFAIK, the CUDA 12 constraint only applies to the multi-node case.
- `--nnodes`: the number of GPU machines to use.
- `--node_rank`: the index of the current GPU machine. The value can be 0, 1, ..., ${NUMBER_GPU_VM}-1.
- `--nproc_per_node`: the number of GPU devices to use on the current machine.
- `--rdzv_endpoint`: the endpoint of the GPU machine with node_rank==0, in the form `<host>:<port>`. The `host` will be the internal IP address. The port can be any available port on the machine.
Maybe we can link to the torchrun docs here as well.
done
```
--nnodes=2 \
--node_rank=0 \
--nproc_per_node=4 \
--rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
```
cc @will-cromar, I saw https://github.com/pytorch/xla/pull/5732/files. Should the flag be `--ddp` now instead?
Thanks for the catch Jon!
* add doc for multinode training. * reworded a bit * fix comments * emphasize that cuda12+ is needed.
Will do some testing first. Once the feature is more stable, I'll merge this PR.