Add doc for multinode GPU training. #5704
Conversation
LGTM
```
--nnodes=2 \
--node_rank=1 \
--nproc_per_node=4 \
--rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet_torchrun.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
```
`test_train_mp_imagenet_torchrun.py` -> `pytorch/xla/test/test_train_mp_imagenet.py`
done
Reopening - I still see `test_train_mp_imagenet_torchrun`.
@@ -194,15 +194,65 @@ for more information.

*Warning: GPU support is still highly experimental!*

### Single-node GPU training

To use GPUs with PJRT, simply set `PJRT_DEVICE=GPU` and configure
`GPU_NUM_DEVICES` to the number of devices on the host. For example:

```
PJRT_DEVICE=GPU GPU_NUM_DEVICES=4 python3 xla/test/test_train_mp_imagenet.py --fake_data --batch_size=128 --num_epochs=1
```
Is this still accurate with `GPU_NUM_DEVICES`?
hi @will-cromar and @jonb377, I know in our previous discussion we said we should replace `GPU_NUM_DEVICES` with `LOCAL_WORLD_SIZE`, but I don't think we can do that. The reason is that if we do, then running `PJRT_DEVICE=GPU LOCAL_WORLD_SIZE=2 python -c 'xm.xla_device()'` would hang, because the distributed runtime service expects 2 clients here but we only have 1 process/client. How do you feel about keeping `GPU_NUM_DEVICES` in the single-host-multi-GPU case?
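To make the failure mode concrete, a hedged sketch of the two invocations (the `torch_xla.core.xla_model` import is the standard one; the exact commands are illustrative, not taken verbatim from the thread):

```
# Works: GPU_NUM_DEVICES only sizes the local device pool, so a single
# process can initialize every device on its own.
PJRT_DEVICE=GPU GPU_NUM_DEVICES=2 python -c 'import torch_xla.core.xla_model as xm; print(xm.xla_device())'

# Hangs: LOCAL_WORLD_SIZE=2 makes the distributed runtime service wait
# for two client processes, but only this one ever connects.
PJRT_DEVICE=GPU LOCAL_WORLD_SIZE=2 python -c 'import torch_xla.core.xla_model as xm; print(xm.xla_device())'
```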
That's a good point. Let's leave `GPU_NUM_DEVICES` for single-host-multi-GPU and try to think of a better solution.
Let's leave it in; I don't see a straightforward way around the issue. Do we need to modify the runtime initialization logic in the computation client to account for this?
LGTM
Force-pushed from de2a842 to cb38e99.
cc @zpcore @ManfeiBai for reference.
Overall LGTM
You can also use `torchrun` to initiate the single-node multi-GPU training. For example,

```
PJRT_DEVICE=GPU torchrun --nnodes 1 --nproc-per-node ${NUM_GPU_DEVICES} xla/test/test_train_mp_imagenet.py --fake_data --batch_size=128 --num_epochs=1
```
Does torchrun set all the appropriate environment variables even for single-host? `MASTER_ADDR` is the one I'm curious about.
Here if I don't specify `--rdzv_endpoint`, `MASTER_ADDR` will not be set by torchrun. In our code, if it's not set, we default to localhost.
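A minimal sketch of the fallback being described, assuming an ordinary environment-variable lookup (variable names aside, this is hypothetical, not the actual torch_xla source):

```
import os

# If torchrun was launched without --rdzv_endpoint, MASTER_ADDR is unset;
# default to localhost so single-node runs still initialize correctly.
master_addr = os.environ.get("MASTER_ADDR", "localhost")
```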
### Multi-node GPU training

**Note that this feature only works for CUDA 12+**. Similar to how PyTorch uses multi-node training, you can run the command as below:
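For context, the command the hunk goes on to show, reconstructed here from the fragment quoted later in this thread (the leading `PJRT_DEVICE=GPU torchrun \` line is an assumption, since the quote cuts it off):

```
PJRT_DEVICE=GPU torchrun \
--nnodes=2 \
--node_rank=0 \
--nproc_per_node=4 \
--rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
```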
Does the cuda 12 constraint also apply to the single-node case?
AFAIK, the CUDA 12 constraint only applies to the multi-node case.
- `--nnodes`: the number of GPU machines to use.
- `--node_rank`: the index of the current GPU machine. The value can be 0, 1, ..., ${NUMBER_GPU_VM}-1.
- `--nproc_per_node`: the number of GPU devices to use on the current machine.
- `--rdzv_endpoint`: the endpoint of the GPU machine with node_rank==0, in the form `<host>:<port>`. The `host` will be the internal IP address. The port can be any available port on the machine.
Maybe we can link to the torchrun docs here as well.
done
```
--nnodes=2 \
--node_rank=0 \
--nproc_per_node=4 \
--rdzv_endpoint="<MACHINE_0_IP_ADDRESS>:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1
```
cc @will-cromar, I saw https://github.com/pytorch/xla/pull/5732/files. Should the flag be `--ddp` now instead?
Thanks for the catch Jon!
* add doc for multinode training. * reworded a bit * fix comments * emphasize that cuda12+ is needed.
Will do some testing first. Once the feature is more stable, I'll merge this PR.