enhancements and fixes for FSDP and DeepSpeed #532

pacman100 · 2022-07-19T11:59:54Z

What does this PR do?

Enables FSDP's StateDictType functionality for checkpointing full and sharded model and optimizer states.
Enables DeepSpeed checkpointing to save and load full and sharded model, optimizer, and scheduler states.
Fixes "Issues with saving model/optimizer and loading them back" (#285), which is related to DeepSpeed, and the optimizer issue from "Don't unwrap in save_state()" (#489) when using DeepSpeed.
Renames all FSDP arguments with an fsdp prefix for clarity and readability.
Fixes accelerator.wait_for_everyone to account for FSDP.

One can now use accelerator.save_state(output_dir) and accelerator.load_state(input_dir) without having to deal with the complexities of DeepSpeed and FSDP.
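A minimal sketch of the save/load round trip described above. The helper name `save_and_restore` is hypothetical and not part of the PR; it only assumes an object exposing `save_state(output_dir)` and `load_state(input_dir)` as named in the description.

```python
from pathlib import Path


def save_and_restore(accelerator, checkpoint_dir):
    """Save the training state to `checkpoint_dir`, then load it back.

    `accelerator` is assumed to be an `accelerate.Accelerator` whose model and
    optimizer were already passed through `prepare()`. Only the two methods
    named in the PR description are used here.
    """
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)

    # One call covers whatever backend is active (plain, FSDP, or DeepSpeed).
    accelerator.save_state(str(checkpoint_dir))

    # Later, typically in a resumed run, restore from the same directory.
    accelerator.load_state(str(checkpoint_dir))
    return checkpoint_dir
```

In a real script, `accelerator = Accelerator()` and `model, optimizer = accelerator.prepare(model, optimizer)` would come before this call; the sketch leaves them out so it stays independent of any particular model.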

ToDo

Write tests
Run experiments to verify
Update the docs related to this PR

HuggingFaceDocBuilderDev · 2022-07-19T12:02:55Z

The documentation is not available anymore as the PR was closed or merged.

sgugger

Thanks for your PR! I left a few comments, but mainly take care of not breaking backward compatibility by changing public functions.

src/accelerate/checkpointing.py

src/accelerate/commands/config/cluster.py

src/accelerate/commands/launch.py

src/accelerate/utils/dataclasses.py

1. Adding deprecation args and warnings in launcher for FSDP.
2. Handling old configs to work with new launcher args wrt FSDP.
3. Reverting changes to public methods in `checkpointing.py` and handling it in `Accelerator`.
4. Explicitly writing the defaults of various FSDP options in `dataclasses` for readability.
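Items 1 and 2 above can be illustrated with a small sketch of the usual pattern: keep accepting the old spelling, warn, and forward the value to the new name. The flag names in `RENAMED_FSDP_ARGS` and both helper functions are assumptions made for illustration, not the launcher's actual code.

```python
import argparse
import warnings

# Hypothetical old -> new names, chosen only to illustrate the fsdp prefix.
RENAMED_FSDP_ARGS = {
    "offload_params": "fsdp_offload_params",
    "min_num_params": "fsdp_min_num_params",
    "sharding_strategy": "fsdp_sharding_strategy",
}


def build_parser():
    parser = argparse.ArgumentParser()
    for old, new in RENAMED_FSDP_ARGS.items():
        parser.add_argument(f"--{new}", default=None, type=str)
        # The old spelling is still parsed, but hidden from --help.
        parser.add_argument(f"--{old}", default=None, type=str, help=argparse.SUPPRESS)
    return parser


def resolve_deprecated(args):
    """Warn about old flags and copy their values to the new names."""
    for old, new in RENAMED_FSDP_ARGS.items():
        old_value = getattr(args, old)
        if old_value is None:
            continue
        warnings.warn(f"`--{old}` is deprecated; use `--{new}` instead.", FutureWarning)
        # An explicitly passed new flag wins over the deprecated one.
        if getattr(args, new) is None:
            setattr(args, new, old_value)
    return args


def migrate_config(config):
    """Return a copy of an old-style config dict using the new key names."""
    return {RENAMED_FSDP_ARGS.get(key, key): value for key, value in config.items()}
```

Hiding the old flags with `argparse.SUPPRESS` keeps the help output focused on the new names while existing commands and config files continue to work.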

1. FSDP wrapped model being added to the `_models`.
2. Not passing the env variables when args are None.
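The second fix above (skipping environment variables for unset arguments) might look roughly like the following sketch. The function name, the attribute names, and the environment variable names are all hypothetical and only illustrate the idea.

```python
import os
from types import SimpleNamespace

# Hypothetical mapping from launcher argument to environment variable.
FSDP_ENV_VARS = {
    "fsdp_offload_params": "FSDP_OFFLOAD_PARAMS",
    "fsdp_min_num_params": "FSDP_MIN_NUM_PARAMS",
    "fsdp_sharding_strategy": "FSDP_SHARDING_STRATEGY",
}


def build_launch_env(args, base_env=None):
    """Copy the base environment and add only the FSDP options that were set."""
    env = dict(os.environ if base_env is None else base_env)
    for attr, var in FSDP_ENV_VARS.items():
        value = getattr(args, attr, None)
        # Exporting str(None) would hand the literal string "None" to the
        # child process, so unset options are skipped entirely.
        if value is not None:
            env[var] = str(value)
    return env


# Usage: only the option that was actually provided reaches the environment.
example_args = SimpleNamespace(fsdp_offload_params=None, fsdp_min_num_params=2000)
example_env = build_launch_env(example_args, base_env={})
```

Skipping unset values lets the consumer of these variables fall back to its own defaults instead of having to parse a placeholder string.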

sgugger

Thanks for iterating!

1. Removes the MRPC data files and relies directly on HF datasets, as running from within the `tests` folder was throwing a `file not found` error. Updates `mocked_dataloaders` as a result.
2. Adds `test_performance.py`, `test_memory.py` and `test_checkpointing.py` for multi-GPU FSDP and DeepSpeed tests.

pacman100 · 2022-07-26T07:55:48Z

make test_deepspeed results below (~6 minutes):

make test_fsdp results below (~4 minutes):

sgugger

LGTM, thanks for working on this!

checkpointing enhancements and fixes for FSDP and DeepSpeed

0639d1c

sgugger reviewed Jul 19, 2022

pacman100 added 4 commits July 20, 2022 12:06

fixes

a59223f

resolving comments

2109a2d

adding FSDP for all the collective operations

94967ba

pacman100 changed the title ~~checkpointing enhancements and fixes for FSDP and DeepSpeed~~ enhancements and fixes for FSDP and DeepSpeed Jul 20, 2022

sgugger approved these changes Jul 20, 2022

pacman100 marked this pull request as ready for review July 22, 2022 15:14

pacman100 added 12 commits July 22, 2022 20:58

Merge branch 'main' into smangrul/saving-and-loading-utilities

b7515d9

reverting mocked_dataloader changes

3d25ac5

adding FSDP tests

0645a09

data files revert

ce4aa1e

excluding fsdp tests from tests_core

0bbb981

try 2

35301bc

adding time delay to avoid torchrun from crashing at times leading …

6d85654

…which causing flaky behaviour

reducing the time of tests

9a8103b

fixes

77d6a77

fix

f42a6e0

fixes and reduce time further

b585109

reduce time further and minor fixes

d159a53

adding a deepspeed basic e2e test for single gpu setup

436e06b

sgugger approved these changes Jul 26, 2022

pacman100 merged commit 0c6bdc2 into huggingface:main Jul 26, 2022

pacman100 deleted the smangrul/saving-and-loading-utilities branch July 27, 2022 04:20
