-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enhancements and fixes for FSDP and DeepSpeed #532
enhancements and fixes for FSDP and DeepSpeed #532
Conversation
The documentation is not available anymore as the PR was closed or merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your PR! I left a few comments, but mainly take care of not breaking backward compatibility by changing public functions.
1. Adding deprecation args and warnings in launcher for FSDP 2. Handling old configs to work with new launcher args wrt FSDP. 3. Reverting changes to public methods in `checkpointing.py` and handling it in `Accelerator` 4. Explicitly writing the defaults of various FSDP options in `dataclasses` for readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating!
1. Removes mrpc datafiles and directly relies on HF datasets as it was throwing `file not found` error when running from within `tests` folder. Updating `moke_dataloaders` as a result. 2. adding `test_performance.py`, `test_memory.py` and `test_checkpointing.py` for multi-gpu FSDP and DeepSpeed tests
…which causing flaky behaviour
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for working on this!
What does this PR do?
StateDictType
functionality of FSDP for checkpointingFull
andSharded
model and optimizer states.Full
andSharded
model, optimizer and scheduler states.fsdp
prefix for clarity and readabilityaccelerator.wait_for_everyone
to account for FSDPOne can now use
accelerator.save_state(output_dir)
andaccelerator.load_state(input_dir)
without having to deal with complexities of DeepSpeed and FSDP.ToDo