Add image for better explanation to FSDP tutorial (#2644)
* Add image for better explanation
* Edit explanation for fsdp sharding
---------

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com>
4 people authored Nov 13, 2023
1 parent f05f050 commit dc448c2
Showing 2 changed files with 9 additions and 0 deletions.
Binary file added _static/img/distributed/fsdp_sharding.png
9 changes: 9 additions & 0 deletions intermediate_source/FSDP_tutorial.rst
@@ -46,6 +46,15 @@ At a high level FSDP works as follows:
* Run reduce_scatter to sync gradients
* Discard parameters.

One way to view FSDP's sharding is to decompose the DDP gradient all-reduce into reduce-scatter and all-gather. Specifically, during the backward pass, FSDP reduce-scatters the gradients so that each rank ends up holding one shard of the summed gradients. Each rank then updates its corresponding shard of the parameters in the optimizer step. Finally, in the subsequent forward pass, FSDP performs an all-gather to collect and combine the updated parameter shards, reconstructing the full parameters.

.. figure:: /_static/img/distributed/fsdp_sharding.png
   :width: 100%
   :align: center
   :alt: FSDP allreduce

   FSDP Allreduce
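
This decomposition can be sketched directly with ``torch.distributed`` collectives. The snippet below is a minimal illustration, not FSDP's actual implementation: it assumes a process group is already initialized (for example via ``torchrun``), that ``full_grad`` holds ``world_size`` times as many elements as ``param_shard``, and it uses a plain SGD update with a hypothetical learning rate.

.. code-block:: python

   import torch
   import torch.distributed as dist

   def sharded_step(full_grad, param_shard, lr=0.1):
       world_size = dist.get_world_size()

       # Backward: reduce-scatter sums gradients across ranks and leaves
       # each rank with only its shard of the result.
       grad_shard = torch.empty_like(param_shard)
       dist.reduce_scatter_tensor(grad_shard, full_grad, op=dist.ReduceOp.SUM)

       # Optimizer step: each rank updates only its own parameter shard.
       param_shard -= lr * grad_shard

       # Next forward: all-gather reassembles the full, updated parameters
       # from the per-rank shards.
       full_param = torch.empty(world_size * param_shard.numel(),
                                dtype=param_shard.dtype,
                                device=param_shard.device)
       dist.all_gather_into_tensor(full_param, param_shard)
       return full_param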

How to use FSDP
---------------
Here we use a toy model to run training on the MNIST dataset for demonstration purposes. The APIs and logic can be applied to training larger models as well.
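
Before the full example, a minimal sketch of the wrapping pattern may help. ``ToyNet`` below is a hypothetical stand-in for the tutorial's model, and the snippet assumes a launch via ``torchrun`` so that ``LOCAL_RANK`` and the process-group environment variables are set.

.. code-block:: python

   import os
   import torch
   import torch.nn as nn
   import torch.distributed as dist
   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

   dist.init_process_group("nccl")  # rank and world size come from torchrun
   local_rank = int(os.environ["LOCAL_RANK"])
   torch.cuda.set_device(local_rank)

   class ToyNet(nn.Module):  # hypothetical placeholder for the tutorial's model
       def __init__(self):
           super().__init__()
           self.net = nn.Sequential(
               nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

       def forward(self, x):
           return self.net(x)

   # Wrapping in FSDP shards the parameters across all ranks; each rank
   # materializes full parameters only around its forward/backward passes.
   model = FSDP(ToyNet().to(local_rank))
   optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)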
