Fix run_demo(demo_model_parallel, world_size) issue #2367
Conversation
Changing the value of `world_size` affects the device assignment. Please update the `dev` calculation in the model class to reflect the new `world_size` value.
In the function `demo_model_parallel`, `dev0` and `dev1` are computed in a way that assigns two distinct GPUs to each process. This is achieved by doubling the rank and applying a modulo operation with twice the `world_size`. Assuming 8 GPUs, `world_size` is set to 4, leading to the creation of 4 processes, each of which is allocated two distinct GPUs. For instance, the first process (process 0) is assigned GPUs 0 and 1, the second process (process 1) is assigned GPUs 2 and 3, and so forth.
@subramen I've updated the calculation in a simple way to take the prior division into account. Now `dev0` and `dev1` are calculated as `dev0 = (rank * 2) % (world_size * 2)` and `dev1 = (rank * 2 + 1) % (world_size * 2)`.
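Concretely, a minimal plain-Python sketch of that mapping (the 8-GPU / 4-process numbers are just the example from the comment above, not part of the tutorial code):

```python
# Device assignment with one process per GPU pair,
# assuming 8 GPUs, so world_size = n_gpus // 2 = 4.
world_size = 4

for rank in range(world_size):
    dev0 = (rank * 2) % (world_size * 2)
    dev1 = (rank * 2 + 1) % (world_size * 2)
    print(f"process {rank} -> GPUs {dev0} and {dev1}")

# process 0 -> GPUs 0 and 1
# process 1 -> GPUs 2 and 3
# process 2 -> GPUs 4 and 5
# process 3 -> GPUs 6 and 7
```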
@subramen going back on it, does the `% (world_size * 2)` make any difference here, given that `rank` is always less than `world_size`?
Yes, you don't actually need the `% (world_size * 2)`; plain `dev0 = rank * 2` and `dev1 = rank * 2 + 1` should work well now, assuming half as many processes as there are GPUs.
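A quick sanity check of that point (plain Python; the `n_gpus = 8` value is only the running example from this thread): since every rank satisfies `rank < world_size`, it follows that `rank * 2 + 1 < world_size * 2`, so the modulo never changes the result.

```python
# With world_size = n_gpus // 2, the modulo is a no-op:
# rank < world_size implies rank * 2 + 1 < world_size * 2.
n_gpus = 8                  # assumed GPU count for this example
world_size = n_gpus // 2

for rank in range(world_size):
    with_mod = ((rank * 2) % (world_size * 2), (rank * 2 + 1) % (world_size * 2))
    without_mod = (rank * 2, rank * 2 + 1)
    assert with_mod == without_mod, (rank, with_mod, without_mod)
print("modulo makes no difference for any rank")
```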
Fixes #1750
Description
Fixes the `run_demo(demo_model_parallel, world_size)` issue as described in #1750.
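For context, here is a condensed, self-contained sketch of the fixed flow. It follows the structure of the tutorial's `ToyMpModel` and `demo_model_parallel`, but the rendezvous settings, backend choice, and tensor shapes below are illustrative assumptions rather than the exact tutorial code:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


def setup(rank, world_size):
    # Illustrative rendezvous settings; the tutorial has its own setup() helper.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)


class ToyMpModel(nn.Module):
    # Two linear layers pipelined across two devices, as in the tutorial.
    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.net1 = nn.Linear(10, 10).to(dev0)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5).to(dev1)

    def forward(self, x):
        x = self.relu(self.net1(x.to(self.dev0)))
        return self.net2(x.to(self.dev1))


def demo_model_parallel(rank, world_size):
    setup(rank, world_size)
    # Each process drives a pair of adjacent GPUs; no modulo is needed
    # once world_size is half the GPU count.
    dev0 = rank * 2
    dev1 = rank * 2 + 1
    ddp_mp_model = DDP(ToyMpModel(dev0, dev1))

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_mp_model.parameters(), lr=0.001)

    outputs = ddp_mp_model(torch.randn(20, 10))  # ends up on dev1
    labels = torch.randn(20, 5).to(dev1)
    optimizer.zero_grad()
    loss_fn(outputs, labels).backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    assert n_gpus >= 2, f"Requires at least 2 GPUs, got {n_gpus}"
    # The fix: spawn one process per GPU *pair*.
    world_size = n_gpus // 2
    mp.spawn(demo_model_parallel, args=(world_size,), nprocs=world_size, join=True)
```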
Checklist
cc @mrshenli @osalpekar @H-Huang @kwen2501