
Add image for better explanation to FSDP tutorial #2644

Merged
5 commits merged into pytorch:main on Nov 13, 2023

Conversation

@ChanBong (Contributor) commented Nov 4, 2023

Fixes #2613

Description

The tutorial lacked an explanation of what goes on behind parameter sharding.

Checklist

  • The issue that is being fixed is referenced in the description (see above "Fixes #ISSUE_NUMBER")
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR is fixing are added to this pull request
  • No unnecessary issues are included in this pull request.

cc @wconstab @osalpekar @H-Huang @kwen2501 @sekyondaMeta @svekars @carljparker @NicolasHug @kit1980 @subramen


pytorch-bot bot commented Nov 4, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2644

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 077d1c0 with merge base f05f050:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@@ -46,6 +46,15 @@ At a high level FSDP works as follow:
* Run reduce_scatter to sync gradients
* Discard parameters.

The key insight behind full parameter sharding is that we can decompose the all-reduce operations in DDP into separate reduce-scatter and all-gather operations.
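
For readers following the diff: the decomposition mentioned in this added sentence can be illustrated with torch.distributed collectives. The sketch below is an editorial illustration, not part of the tutorial change; the function name is hypothetical, and it assumes a process group is already initialized and that the flattened tensor's length is divisible by the world size.

```python
import torch
import torch.distributed as dist

def allreduce_via_reduce_scatter_all_gather(grad: torch.Tensor) -> torch.Tensor:
    """Produce the same result as dist.all_reduce(grad) (SUM), in two steps."""
    world_size = dist.get_world_size()
    flat = grad.flatten()

    # Step 1: reduce-scatter. Each rank receives the summed values of one
    # 1/world_size slice of the tensor (its "shard").
    shard = torch.empty(flat.numel() // world_size, dtype=flat.dtype, device=flat.device)
    dist.reduce_scatter_tensor(shard, flat, op=dist.ReduceOp.SUM)

    # Step 2: all-gather. Concatenating every rank's shard reproduces the fully
    # reduced tensor on all ranks, i.e. the all-reduce result.
    out = torch.empty_like(flat)
    dist.all_gather_into_tensor(out, shard)
    return out.view_as(grad)
```
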
Contributor (review comment):

I am not sure that this is the correct statement.

Even though an all-reduce can be decomposed as a reduce-scatter and all-gather, the current phrasing might suggest that DDP's gradient all-reduce is being decomposed into a gradient reduce-scatter and gradient all-gather. However, FSDP actually all-gathers parameters.

Whether or not this decomposition of all-reduce into reduce-scatter and all-gather is the key insight is not obvious to me. If we show this decomposition, we probably want more exposition.

Contributor Author (@ChanBong) replied:

I agree. Adding this picture would demand a clear explanation.

I am unsure what to write. If you can suggest something or direct me to where I can read about this topic, that'd be very helpful.

Contributor replied:

@awgu can you help @ChanBong with this?

Contributor replied:

Maybe something like the following:

One way to view FSDP's sharding is to decompose the DDP gradient all-reduce into reduce-scatter and all-gather. In particular, FSDP reduce-scatters gradients such that each rank has a shard of the gradients in backward, updates the corresponding shard of the parameters in the optimizer step, and all-gathers them in the next forward.

< Figure >
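
To make the suggested wording concrete, here is a minimal per-iteration sketch using torch.distributed collectives. It is an editorial illustration rather than FSDP's actual implementation: the names sharded_train_step, flat_param_shard, shard_optimizer, and compute_loss are hypothetical placeholders, the full flat parameter is assumed to be already sharded evenly across ranks, and gradients are summed rather than averaged for simplicity.

```python
import torch
import torch.distributed as dist

def sharded_train_step(flat_param_shard, shard_optimizer, compute_loss):
    world_size = dist.get_world_size()

    # All-gather: every rank temporarily materializes the full (unsharded)
    # flat parameter from the per-rank shards before the forward pass.
    full_param = torch.empty(flat_param_shard.numel() * world_size,
                             dtype=flat_param_shard.dtype,
                             device=flat_param_shard.device)
    dist.all_gather_into_tensor(full_param, flat_param_shard.detach())
    full_param.requires_grad_(True)

    # Forward and backward run against the full parameters, so the full
    # gradient ends up in full_param.grad on every rank.
    loss = compute_loss(full_param)
    loss.backward()

    # Reduce-scatter: each rank keeps only the reduced gradient for its own
    # shard (summed here; in practice gradients are typically averaged).
    grad_shard = torch.empty_like(flat_param_shard)
    dist.reduce_scatter_tensor(grad_shard, full_param.grad, op=dist.ReduceOp.SUM)

    # Optimizer step updates only the local shard; the updated shards are
    # all-gathered again at the start of the next forward pass.
    flat_param_shard.grad = grad_shard
    shard_optimizer.step()
    shard_optimizer.zero_grad()
```

In the real FSDP implementation these collectives are issued per wrapped submodule and overlapped with computation, but the ordering (all-gather before forward, reduce-scatter in backward, shard-local optimizer step) matches the wording above.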

Contributor Author (@ChanBong) replied:

Sounds good. Thanks

@awgu (Contributor) left a comment:

Sounds good to me!

Review thread on intermediate_source/FSDP_tutorial.rst resolved (outdated).
Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com>
@NicolasHug (Member) left a comment:

LGTM, thanks @ChanBong and @awgu for the review

@NicolasHug NicolasHug changed the title Add image for better explanation Add image for better explanation to FSDP tutorial Nov 13, 2023
@svekars svekars merged commit dc448c2 into pytorch:main Nov 13, 2023
21 checks passed
Successfully merging this pull request may close these issues.

💡 [REQUEST] - <GETTING STARTED WITH FULLY SHARDED DATA PARALLEL(FSDP)>
5 participants