Fix the missing parameter error when running mp_imagenet with torchrun #5729

Merged Oct 25, 2023 (2 commits)

Conversation

vanbasten23 (Collaborator) commented Oct 25, 2023:

Currently, running the test with torchrun:

PJRT_DEVICE=GPU torchrun --nproc_per_node=4 --nnodes=1 --node_rank=1 --rdzv_endpoint="10.164.0.13:12355" pytorch/xla/test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=128 --num_epochs=1

fails with the following error:

Traceback (most recent call last):
  File "pytorch/xla/test/test_train_mp_imagenet.py", line 378, in <module>
    _mp_fn(FLAGS)
TypeError: _mp_fn() missing 1 required positional argument: 'flags'

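This PR fixes it by passing the local rank, read via xu.getenv_as(xenv.LOCAL_RANK, int), as the first argument to _mp_fn.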
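The failure mode can be reproduced without torch_xla. The following is a minimal sketch, assuming a stand-in `_mp_fn` that only shares the `(index, flags)` signature with the test; the `flags` dict is a hypothetical substitute for `FLAGS`.

```python
# Stand-in for the test's entry point; the real function runs training.
def _mp_fn(index, flags):
    return f"rank {index}: batch_size={flags['batch_size']}"


flags = {"batch_size": 128}  # hypothetical substitute for FLAGS

# Calling with only the flags binds them to `index` and leaves `flags` unset,
# which raises the same TypeError as in the traceback above.
try:
    _mp_fn(flags)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Supplying an explicit index first satisfies the signature.
result = _mp_fn(0, flags)
print(result)
```

The second call shows the shape of the fix: the caller has to provide a per-process index before the flags.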

jonb377 (Collaborator) left a comment:

LGTM, thanks Xiongfei

@@ -375,6 +376,6 @@ def _mp_fn(index, flags):

 if __name__ == '__main__':
   if dist.is_torchelastic_launched():
-    _mp_fn(FLAGS)
+    _mp_fn(xu.getenv_as(xenv.LOCAL_RANK, int), FLAGS)
jonb377 (Collaborator):

This is set by torchrun, right? Just confirming my understanding.

vanbasten23 (Collaborator, Author):

Yes
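For context, torchrun exports `LOCAL_RANK` to every worker process it starts, so the launch path can take the process index from the environment. Below is a rough sketch using only the standard library. `getenv_as` here is a simplified stand-in written for this example, not the torch_xla `xu.getenv_as` implementation, and the `TORCHELASTIC_RUN_ID` check is an assumption approximating what `dist.is_torchelastic_launched()` tests. `resolve_index` is a hypothetical helper.

```python
import os


def getenv_as(name, type_, default=None):
    """Read an environment variable and convert it to type_ (simplified stand-in)."""
    value = os.environ.get(name)
    return default if value is None else type_(value)


def launched_by_torchrun():
    # Assumption: torchrun sets TORCHELASTIC_RUN_ID in each worker's environment.
    return os.environ.get("TORCHELASTIC_RUN_ID") is not None


def resolve_index(spawn_index=0):
    """Use LOCAL_RANK under torchrun, otherwise the index given by the caller."""
    if launched_by_torchrun():
        return getenv_as("LOCAL_RANK", int, default=0)
    return spawn_index


# Simulate the environment of the third worker on a node.
os.environ["TORCHELASTIC_RUN_ID"] = "example"
os.environ["LOCAL_RANK"] = "2"
print(resolve_index())
```

With `--nproc_per_node=4`, each of the four workers would see a different `LOCAL_RANK` in the range 0 to 3.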

vanbasten23 (Collaborator, Author):

Thanks for the review!

@vanbasten23 vanbasten23 merged commit 294610a into master Oct 25, 2023
18 checks passed
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
pytorch#5729)

* Fix the missing parameter error when running mp_imagenet with torchrun

* made it local rank
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
pytorch#5729)

golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
#5729)

bhavya01 pushed a commit that referenced this pull request Apr 22, 2024
#5729)

2 participants