Merge --pjrt_distributed flag with --ddp flag. #5732

Merged · 1 commit into master · Oct 26, 2023

Conversation

@will-cromar (Collaborator)

These flags were only separate to support XRT. Now that XRT is gone, these options should always be used together.
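As a rough illustration of the merged behavior (my sketch, not the repository's actual code; the model and argument handling are placeholders), a single --ddp flag can now gate both the process-group initialization that --pjrt_distributed used to control and the DDP wrapping:

```python
# Illustrative sketch only: --ddp now implies the PJRT process-group
# setup that --pjrt_distributed previously gated on its own.
import argparse

import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the 'xla' backend

parser = argparse.ArgumentParser()
parser.add_argument('--ddp', action='store_true')  # single flag, post-merge
args = parser.parse_args()

device = xm.xla_device()
model = torch.nn.Linear(10, 10).to(device)  # placeholder model

if args.ddp:
    # With XRT gone, this init is always wanted alongside DDP,
    # so it no longer needs a separate flag.
    dist.init_process_group('xla', init_method='xla://')
    model = torch.nn.parallel.DistributedDataParallel(model)
```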

@JackCaoG (Collaborator)

Is this option tested in the CI?

@will-cromar (Collaborator, Author)

No, but I ran both scripts both ways locally.

@JackCaoG (Collaborator)

Maybe we should add it to the GPU CI; just let it run for a couple of steps and make sure it works. If we provide an option, we might as well test it.

@will-cromar merged commit 8072a44 into master on Oct 26, 2023 · 18 checks passed
@vanbasten23 (Collaborator) commented Oct 27, 2023

When using --ddp in this command: root@xiowei-gpu:/ansible# PJRT_DEVICE=GPU torchrun --nnodes 1 --nproc-per-node 4 pytorch/xla/test/test_train_mp_imagenet.py --fake_data --batch_size=128 --num_epochs=1 --ddp, the GlobalRate remains 0:

| Training Device=xla:0/1 Epoch=1 Step=320 Loss=0.04107 Rate=0.00 GlobalRate=0.00 Time=21:42:42
| Training Device=xla:0/3 Epoch=1 Step=320 Loss=0.04107 Rate=0.00 GlobalRate=0.00 Time=21:42:42
| Training Device=xla:0/2 Epoch=1 Step=320 Loss=0.04107 Rate=0.00 GlobalRate=0.00 Time=21:42:42
| Training Device=xla:0/0 Epoch=1 Step=320 Loss=0.04107 Rate=0.00 GlobalRate=0.00 Time=21:42:42

Previously, the GlobalRate was much higher:

| Training Device=xla:0/0 Epoch=1 Step=320 Loss=0.00804 Rate=419.54 GlobalRate=247.63 Time=21:46:35
| Training Device=xla:0/1 Epoch=1 Step=320 Loss=0.00804 Rate=419.51 GlobalRate=247.71 Time=21:46:35
| Training Device=xla:0/2 Epoch=1 Step=320 Loss=0.00804 Rate=419.50 GlobalRate=247.70 Time=21:46:35
| Training Device=xla:0/3 Epoch=1 Step=320 Loss=0.00804 Rate=419.49 GlobalRate=247.69 Time=21:46:35

For torchrun, do we need to call dist.init_process_group('xla', init_method='xla://') even if we don't necessarily use DDP? @will-cromar
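For reference, a minimal sketch (my illustration, not code from the repo; the model, optimizer, and dummy forward pass are placeholders) of the torchrun setup the question describes: the process group is initialized for the xla backend, but the model is left unwrapped rather than put in DistributedDataParallel:

```python
# Launched e.g. with: torchrun --nnodes 1 --nproc-per-node 4 train.py
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the 'xla' backend

# init_method='xla://' derives rank and world size from the PJRT runtime.
dist.init_process_group('xla', init_method='xla://')

device = xm.xla_device()
model = torch.nn.Linear(10, 10).to(device)  # plain model, no DDP wrapper
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

loss = model(torch.randn(8, 10, device=device)).sum()
loss.backward()
# Without DDP, gradient reduction happens here: xm.optimizer_step()
# all-reduces gradients across replicas before stepping the optimizer.
xm.optimizer_step(optimizer)
```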

jonb377 pushed a commit that referenced this pull request Oct 31, 2023
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
ManfeiBai pushed a commit that referenced this pull request Nov 29, 2023
ManfeiBai pushed a commit that referenced this pull request Nov 29, 2023
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024