FSDP with full state dict #7487
Conversation
…oint_consolidate Update test_all_gather_grad.py
This reverts commit 9d4a2b8.
This reverts commit 0d23d75.
This reverts commit 70fe5da.
This reverts commit a9aae99.
This reverts commit ea74906.
This reverts commit bf70e43.
This reverts commit f172101.
This reverts commit 536c132.
This reverts commit 3a9fde9.
This reverts commit 7a369f4.
This reverts commit 8222dc9.
This reverts commit 6c095b2.
This reverts commit 250d0aa.
This reverts commit 8651d54.
This reverts commit dcdcd29.
""" | ||
|
||
model = TestFSDPModel() | ||
ck = ModelCheckpoint(save_last=True) |
Thanks @SeanNaren for the fix!!! I tried with ModelCheckpoint before, but used ModelCheckpoint(dirpath=tmpdir, save_last=True); I wonder why removing dirpath makes it work.
awesome work @shuyingsunshine21 and @SeanNaren !
if not self.on_gpu:
    raise MisconfigurationException(
this could be easily unit tested
Sounds great, adding a unit test for it.
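A minimal sketch of such a test, assuming Lightning's BoringModel test helper and that the plugin raises MisconfigurationException from setup_distributed when no GPU is available (the test name and the assumption that the error surfaces during fit are illustrative, not taken from the diff):

```python
import pytest

from pytorch_lightning import Trainer
from pytorch_lightning.utilities.exceptions import MisconfigurationException
from tests.helpers.boring_model import BoringModel  # Lightning test helper (assumed path)


def test_fully_sharded_plugin_raises_on_cpu(tmpdir):
    """FSDP should refuse to run without GPUs; the check fires in setup_distributed, not in __init__."""
    trainer = Trainer(default_root_dir=tmpdir, fast_dev_run=True, plugins="fsdp")
    with pytest.raises(MisconfigurationException):
        trainer.fit(BoringModel())
```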
# as precision_plugin is dependent on training_type_plugin, make sure
# that we first select training_type_plugin, then precision_plugin
return acc_cls(
    training_type_plugin=self.training_type_plugin,
    precision_plugin=self.precision_plugin,
)
Let's try to move the precision plugin inside the training type plugin before we add the next major plugin.
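For context, the ordering matters because Python evaluates call arguments left to right, so the lazily built training_type_plugin property resolves before precision_plugin reads from it. A minimal, Lightning-free illustration (all names below are made up for the demo):

```python
calls = []


def fake_property(name):
    """Stand-in for the connector's lazy plugin properties."""
    calls.append(name)
    return name


def acc_cls(training_type_plugin, precision_plugin):
    return training_type_plugin, precision_plugin


# Keyword argument values are evaluated in the order they are written,
# so listing training_type_plugin first guarantees it is built first.
acc_cls(
    training_type_plugin=fake_property("training_type_plugin"),
    precision_plugin=fake_property("precision_plugin"),
)
assert calls == ["training_type_plugin", "precision_plugin"]
```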
trainer = Trainer(
    default_root_dir=tmpdir,
    fast_dev_run=True,
    plugins="fsdp",
)
assert isinstance(trainer.accelerator.training_type_plugin, DDPFullyShardedPlugin)
I thought it's not supported on CPU?
Can't we evaluate compatibility in the accelerator connector?
Nice catch here. It passed because our assertion for GPU happens in setup_distributed, not when the Trainer is initialized. I think we could do it in the accelerator connector, but my personal feeling is that that file is becoming too large and complex, so I try to avoid adding extra logic there; it feels like it is the specific plugin/accelerator strategy's responsibility to check this once the environment is set up.
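A rough sketch of where that check lives, assuming the plugin subclasses DDPPlugin as the diff suggests; the error message here is illustrative, not the exact text from the PR:

```python
from pytorch_lightning.plugins import DDPPlugin
from pytorch_lightning.utilities.exceptions import MisconfigurationException


class DDPFullyShardedPlugin(DDPPlugin):
    def setup_distributed(self) -> None:
        # Fail fast here rather than in Trainer.__init__: the accelerator
        # connector stays lean and the plugin owns its own requirements.
        if not self.on_gpu:
            raise MisconfigurationException(
                "Fully Sharded Data Parallel training requires GPUs; "  # message is an assumption
                "run on a machine with CUDA devices or drop plugins='fsdp'."
            )
        super().setup_distributed()
```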
Interesting plugin, great effort from everybody. I disabled auto-merge in case you want to address some of the comments before merging, but IMO this is unblocked and ready to go. Thanks!
What does this PR do?
Co-authored-by: @SeanNaren and @shuyingsunshine21
Integrates FSDP, #6152
Discussed with @SeanNaren: for V1, we only support usage where the user configures the wrap strategy in configure_sharded_model, and we currently do not support sharded checkpointing. A sketch of this usage pattern is shown below.
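A minimal sketch of that usage, assuming fairscale's wrap utility; the module, layer sizes, and Trainer arguments below are made up purely for illustration:

```python
import torch.nn as nn
import pytorch_lightning as pl
from fairscale.nn import wrap  # FSDP wrapping helper; picks up the plugin's wrap context


class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Plain modules here; sharding is applied later in configure_sharded_model.
        self.block = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 2))

    def configure_sharded_model(self):
        # Called by the FSDP plugin once the distributed environment is up,
        # so each wrapped submodule becomes a FullyShardedDataParallel instance.
        self.block = wrap(self.block)


# Enable the plugin via the Trainer; V1 of this integration saves full
# (non-sharded) checkpoints only.
trainer = pl.Trainer(gpus=2, precision=16, plugins="fsdp")
```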
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃