Correct the multinode training doc #5747
Conversation
@@ -73,7 +76,7 @@ def _train_update(device, step, loss, tracker, epoch, writer):


 def train_mnist(flags, **kwargs):
-  if flags.ddp:
+  if flags.ddp or flags.pjrt_distributed:
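For context, the combined check gates the distributed process-group setup so that torchrun-launched (PJRT) runs take the same path as `--ddp` runs. A minimal sketch of the pattern, with a hypothetical helper name and assuming the `xla://` init method used by the PJRT runtime:

```python
import torch.distributed as dist
import torch_xla.distributed.xla_backend  # side-effect import: registers the 'xla' backend


def maybe_init_process_group(ddp: bool, pjrt_distributed: bool) -> None:
    # Initialize the XLA process group when either flag is set, so both
    # xmp.spawn-launched DDP runs and torchrun-launched multinode runs
    # rendezvous through the same code path.
    if ddp or pjrt_distributed:
        dist.init_process_group("xla", init_method="xla://")
```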
Ah interesting... Do we need a dedicated flag still, or can we just also check for torchrun some other way? I saw dist.is_torchelastic_launched elsewhere in the codebase: https://github.com/pytorch/xla/blob/83778f0/torch_xla/_internal/rendezvous.py#L20
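A sketch of that alternative, assuming `dist.is_torchelastic_launched()` (which checks the `TORCHELASTIC_RUN_ID` environment variable set by torchrun); the helper name is hypothetical:

```python
import torch.distributed as dist


def should_init_xla_process_group(ddp_flag: bool) -> bool:
    # Hypothetical alternative to a dedicated --pjrt_distributed flag:
    # detect a torchrun/torchelastic launch directly instead of relying on
    # the user to pass an extra flag.
    return ddp_flag or dist.is_torchelastic_launched()
```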
LGTM, nothing blocking. Thanks Xiongfei!
Thanks for the review!
* fix Jon's comment
* add pjrt_distributed flag back.
* updated the doc
* fix typo
* fix typo
This PR adds the `--pjrt_distributed` flag back due to Merge `--pjrt_distributed` flag with `--ddp` flag. #5732 (comment).