Merge --pjrt_distributed flag with --ddp flag. #5732

Merged · 1 commit into master · Oct 26, 2023

Conversation

@will-cromar (Collaborator)

These flags were only separate to support XRT. Now that XRT is gone, these options should always be used together.
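As a rough illustration of the merged behavior (my sketch, not the repository's actual code; the model and argument handling are placeholders), a single --ddp flag can now gate both the process-group initialization that --pjrt_distributed used to control and the DDP wrapping:

```python
# Illustrative sketch only: --ddp now implies the PJRT process-group
# setup that --pjrt_distributed previously gated on its own.
import argparse

import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the 'xla' backend

parser = argparse.ArgumentParser()
parser.add_argument('--ddp', action='store_true')  # single flag, post-merge
args = parser.parse_args()

device = xm.xla_device()
model = torch.nn.Linear(10, 10).to(device)  # placeholder model

if args.ddp:
    # With XRT gone, this init is always wanted alongside DDP,
    # so it no longer needs a separate flag.
    dist.init_process_group('xla', init_method='xla://')
    model = torch.nn.parallel.DistributedDataParallel(model)
```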

@JackCaoG (Collaborator)

Is this option tested in the CI?

@will-cromar (Collaborator, Author)

No, but I ran both scripts both ways locally.

@JackCaoG (Collaborator)

Maybe we should add it to the GPU CI; just let it run for a couple of steps and make sure it works. If we provide an option, we might as well test it.

@will-cromar merged commit 8072a44 into master on Oct 26, 2023 · 18 checks passed
@vanbasten23 (Collaborator) commented Oct 27, 2023

When using --ddp in this command: root@xiowei-gpu:/ansible# PJRT_DEVICE=GPU torchrun --nnodes 1 --nproc-per-node 4 pytorch/xla/test/test_train_mp_imagenet.py --fake_data --batch_size=128 --num_epochs=1 --ddp, the GlobalRate remains 0:

| Training Device=xla:0/1 Epoch=1 Step=320 Loss=0.04107 Rate=0.00 GlobalRate=0.00 Time=21:42:42
| Training Device=xla:0/3 Epoch=1 Step=320 Loss=0.04107 Rate=0.00 GlobalRate=0.00 Time=21:42:42
| Training Device=xla:0/2 Epoch=1 Step=320 Loss=0.04107 Rate=0.00 GlobalRate=0.00 Time=21:42:42
| Training Device=xla:0/0 Epoch=1 Step=320 Loss=0.04107 Rate=0.00 GlobalRate=0.00 Time=21:42:42

Previously, the GlobalRate was much higher:

| Training Device=xla:0/0 Epoch=1 Step=320 Loss=0.00804 Rate=419.54 GlobalRate=247.63 Time=21:46:35
| Training Device=xla:0/1 Epoch=1 Step=320 Loss=0.00804 Rate=419.51 GlobalRate=247.71 Time=21:46:35
| Training Device=xla:0/2 Epoch=1 Step=320 Loss=0.00804 Rate=419.50 GlobalRate=247.70 Time=21:46:35
| Training Device=xla:0/3 Epoch=1 Step=320 Loss=0.00804 Rate=419.49 GlobalRate=247.69 Time=21:46:35

For torchrun, do we need to call dist.init_process_group('xla', init_method='xla://') even if we don't necessarily use DDP? @will-cromar
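For reference, a minimal sketch (my illustration, not code from the repo; the model, optimizer, and dummy forward pass are placeholders) of the torchrun setup the question describes: the process group is initialized for the xla backend, but the model is left unwrapped rather than put in DistributedDataParallel:

```python
# Launched e.g. with: torchrun --nnodes 1 --nproc-per-node 4 train.py
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the 'xla' backend

# init_method='xla://' derives rank and world size from the PJRT runtime.
dist.init_process_group('xla', init_method='xla://')

device = xm.xla_device()
model = torch.nn.Linear(10, 10).to(device)  # plain model, no DDP wrapper
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

loss = model(torch.randn(8, 10, device=device)).sum()
loss.backward()
# Without DDP, gradient reduction happens here: xm.optimizer_step()
# all-reduces gradients across replicas before stepping the optimizer.
xm.optimizer_step(optimizer)
```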

jonb377 pushed a commit that referenced this pull request Oct 31, 2023
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
ManfeiBai pushed a commit that referenced this pull request Nov 29, 2023
ManfeiBai pushed a commit that referenced this pull request Nov 29, 2023
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024