Add image for better explanation to FSDP tutorial (#2644)
* Add image for better explanation
* Edit explanation for fsdp sharding
---------

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com>
4 people authored Nov 13, 2023
1 parent f05f050 commit dc448c2
Showing 2 changed files with 9 additions and 0 deletions.
Binary file added _static/img/distributed/fsdp_sharding.png
9 changes: 9 additions & 0 deletions intermediate_source/FSDP_tutorial.rst
@@ -46,6 +46,15 @@ At a high level FSDP works as follows:
* Run reduce_scatter to sync gradients
* Discard parameters.

One way to view FSDP's sharding is to decompose the DDP gradient all-reduce into reduce-scatter and all-gather. Specifically, during the backward pass, FSDP reduce-scatters the gradients so that each rank ends up holding one shard of the summed gradients. Each rank then updates its corresponding shard of the parameters in the optimizer step. Finally, in the subsequent forward pass, FSDP performs an all-gather to collect and combine the updated parameter shards, reconstructing the full parameters.

.. figure:: /_static/img/distributed/fsdp_sharding.png
   :width: 100%
   :align: center
   :alt: FSDP allreduce

   FSDP Allreduce
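
This decomposition can be sketched directly with ``torch.distributed`` collectives. The snippet below is a minimal illustration, not FSDP's actual implementation: it assumes a process group is already initialized (for example via ``torchrun``), that ``full_grad`` holds ``world_size`` times as many elements as ``param_shard``, and it uses a plain SGD update with a hypothetical learning rate.

.. code-block:: python

   import torch
   import torch.distributed as dist

   def sharded_step(full_grad, param_shard, lr=0.1):
       world_size = dist.get_world_size()

       # Backward: reduce-scatter sums gradients across ranks and leaves
       # each rank with only its shard of the result.
       grad_shard = torch.empty_like(param_shard)
       dist.reduce_scatter_tensor(grad_shard, full_grad, op=dist.ReduceOp.SUM)

       # Optimizer step: each rank updates only its own parameter shard.
       param_shard -= lr * grad_shard

       # Next forward: all-gather reassembles the full, updated parameters
       # from the per-rank shards.
       full_param = torch.empty(world_size * param_shard.numel(),
                                dtype=param_shard.dtype,
                                device=param_shard.device)
       dist.all_gather_into_tensor(full_param, param_shard)
       return full_param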

How to use FSDP
---------------
Here we use a toy model to run training on the MNIST dataset for demonstration purposes. The APIs and logic can be applied to training larger models as well.
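
Before the full example, a minimal sketch of the wrapping pattern may help. ``ToyNet`` below is a hypothetical stand-in for the tutorial's model, and the snippet assumes a launch via ``torchrun`` so that ``LOCAL_RANK`` and the process-group environment variables are set.

.. code-block:: python

   import os
   import torch
   import torch.nn as nn
   import torch.distributed as dist
   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

   dist.init_process_group("nccl")  # rank and world size come from torchrun
   local_rank = int(os.environ["LOCAL_RANK"])
   torch.cuda.set_device(local_rank)

   class ToyNet(nn.Module):  # hypothetical placeholder for the tutorial's model
       def __init__(self):
           super().__init__()
           self.net = nn.Sequential(
               nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

       def forward(self, x):
           return self.net(x)

   # Wrapping in FSDP shards the parameters across all ranks; each rank
   # materializes full parameters only around its forward/backward passes.
   model = FSDP(ToyNet().to(local_rank))
   optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)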
