Add doc for multinode GPU training. #5704

Merged · 4 commits · Oct 27, 2023
docs/pjrt.md — 50 changes: 48 additions & 2 deletions

@@ -194,15 +194,61 @@ for more information.

*Warning: GPU support is still highly experimental!*

### Single-node GPU training

To use GPUs with PJRT, simply set `PJRT_DEVICE=GPU` and configure
`GPU_NUM_DEVICES` to the number of devices on the host. For example:

```
PJRT_DEVICE=GPU GPU_NUM_DEVICES=4 python3 xla/test/test_train_mp_imagenet.py --fake_data --batch_size=128 --num_epochs=1
```

**Collaborator:** Is this still accurate with `GPU_NUM_DEVICES`?

**Collaborator (Author):** Hi @will-cromar and @jonb377, I know we said in our previous discussion that we should replace `GPU_NUM_DEVICES` with `LOCAL_WORLD_SIZE`, but I don't think we can do that. The reason is that if we did, then running

`PJRT_DEVICE=GPU LOCAL_WORLD_SIZE=2 python -c 'xm.xla_device()'`

would hang, because the distributed runtime service expects 2 clients here but we only have 1 process/client. How do you feel about keeping `GPU_NUM_DEVICES` for the single-host multi-GPU case?

**Collaborator:** That's a good point. Let's leave `GPU_NUM_DEVICES` for single-host multi-GPU and try to think of a better solution.

**Collaborator:** Let's leave it in; I don't see a straightforward way around the issue. Do we need to modify the runtime initialization logic in the computation client to account for this?
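As a quick sanity check of this setup, a minimal sketch (assuming a `torch_xla` installation with PJRT GPU support, with the environment variables set as in the command above):

```python
# With PJRT_DEVICE=GPU and GPU_NUM_DEVICES set, each spawned process
# should see an XLA device backed by a local GPU.
import torch_xla.core.xla_model as xm

device = xm.xla_device()
print(device)  # e.g. xla:0
```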

Currently, only a single host is supported; multi-host GPU cluster support
will be added in a future release.
You can also use `torchrun` to launch single-node multi-GPU training. For example,

```
PJRT_DEVICE=GPU torchrun --nnodes 1 --nproc-per-node ${NUM_GPU_DEVICES} xla/test/test_train_mp_imagenet.py --fake_data --batch_size=128 --num_epochs=1
```

**Collaborator:** Does torchrun set all the appropriate environment variables even for single-host? `MASTER_ADDR` is the one I'm curious about.

**Collaborator (Author):** Here, if I don't specify `--rdzv_endpoint`, `MASTER_ADDR` will not be set by torchrun. In our code, if it's not set, we default to localhost.

In the above example, `--nnodes` specifies how many machines (physical machines or VMs) to use (here it is 1, since we are doing single-node training), and `--nproc-per-node` specifies how many GPU devices to use.
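To illustrate the `MASTER_ADDR` fallback mentioned in the review above, a hedged sketch of the described behavior (not the actual torch_xla implementation; the default port is a placeholder):

```python
# When torchrun is launched without --rdzv_endpoint, it does not set
# MASTER_ADDR; the runtime then falls back to localhost, which is fine
# for the single-node case.
import os

master_addr = os.environ.get("MASTER_ADDR", "localhost")
master_port = os.environ.get("MASTER_PORT", "12355")  # placeholder default
print(f"coordinator at {master_addr}:{master_port}")
```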

### Multi-node GPU training

**Note that this feature only works for CUDA 12+.** Similar to multi-node training in PyTorch, you can run a command like the one below:
**Collaborator:** Does the CUDA 12 constraint also apply to the single-node case?

**Collaborator (Author):** AFAIK, the CUDA 12 constraint only applies to the multi-node case.
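Since the constraint applies only to the multi-node case, a quick way to check which CUDA version your PyTorch build was compiled against (a minimal sketch, assuming PyTorch is installed):

```python
import torch

# Prints the CUDA version this PyTorch build targets, e.g. "12.1";
# multi-node GPU training requires 12+.
print(torch.version.cuda)
```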


```
PJRT_DEVICE=GPU torchrun \
--nnodes=${NUMBER_GPU_VM} \
--node_rank=${CURRENT_NODE_RANK} \
--nproc_per_node=${NUMBER_LOCAL_GPU_DEVICES} \
--rdzv_endpoint=<internal_ip_address:port> multinode_training.py
```

- `--nnodes`: how many GPU machines to use.
- `--node_rank`: the index of the current GPU machine. The value can be 0, 1, ..., ${NUMBER_GPU_VM}-1.
- `--nproc_per_node`: the number of GPU devices to use on the current machine.
- `--rdzv_endpoint`: the endpoint of the GPU machine with node_rank==0, in the form `<host>:<port>`. The `host` will be the internal IP address. The `port` can be any available port on the machine.
**Collaborator** (on lines +226 to +229): Maybe we can link to the torchrun docs here as well.

**Collaborator (Author):** Done.
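See the [torchrun documentation](https://pytorch.org/docs/stable/elastic/run.html) for more details on these flags.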


For example, to train on two GPU machines, machine_0 and machine_1, run the following on the first machine (machine_0):

```
PJRT_DEVICE=GPU torchrun \
--nnodes=2 \
--node_rank=0 \
--nproc_per_node=4 \
--rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
```

**Collaborator:** cc @will-cromar, I saw https://github.com/pytorch/xla/pull/5732/files. Should the flag be `--ddp` now instead?

**Collaborator (Author):** Thanks for the catch, Jon!

On the second GPU machine (machine_1), run

```
PJRT_DEVICE=GPU torchrun \
--nnodes=2 \
--node_rank=1 \
--nproc_per_node=4 \
--rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
```

**Collaborator:** test_train_mp_imagenet_torchrun.py -> pytorch/xla/test/test_train_mp_imagenet.py

**Collaborator (Author):** Done.

**Collaborator:** Reopening - I still see test_train_mp_imagenet_torchrun.

The differences between the two commands above are `--node_rank` and, potentially, `--nproc_per_node` if you want to use a different number of GPU devices on each machine. Everything else is identical.
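For reference, here is a minimal sketch of a script that could be launched with the torchrun commands above (an illustration, not the actual `test_train_mp_imagenet.py`; it assumes a torch_xla build that supports the `xla://` init method):

```python
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # noqa: F401  (registers the `xla` process group backend)


def main():
    # torchrun sets RANK, WORLD_SIZE, etc.; the xla:// init method picks them up.
    dist.init_process_group("xla", init_method="xla://")
    device = xm.xla_device()

    # One all-reduce across every process on every node verifies that the
    # multi-node rendezvous worked.
    t = torch.ones(1, device=device)
    dist.all_reduce(t)
    xm.mark_step()  # materialize the lazy XLA computation
    print(f"rank {dist.get_rank()}: {t.item()}")


if __name__ == "__main__":
    main()
```

Each rank should print a value equal to the total number of processes across all nodes, confirming that cross-node collectives work.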

## Differences from XRT
