FSDP with full state dict #7487

shuyingsunshine21 · 2021-05-11T17:17:13Z

What does this PR do?

Co-authored-by: @SeanNaren and @shuyingsunshine21

Integrates FSDP, #6152

As discussed with @SeanNaren, V1 only supports the case where the user configures the wrap strategy in configure_sharded_model. Sharded checkpointing is not supported yet.
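To make the "full state dict" versus "sharded checkpointing" distinction concrete, here is a toy, framework-free sketch. It is not how FairScale or Lightning implement FSDP: `shard_state` and `consolidate_full_state` are hypothetical helpers, and parameters are modelled as plain Python lists rather than tensors.

```python
from typing import Dict, List

StateDict = Dict[str, List[float]]


def shard_state(full: StateDict, world_size: int) -> List[StateDict]:
    """Split every parameter into `world_size` contiguous chunks (toy sharding)."""
    shards: List[StateDict] = [{} for _ in range(world_size)]
    for name, values in full.items():
        chunk = -(-len(values) // world_size)  # ceiling division
        for rank in range(world_size):
            shards[rank][name] = values[rank * chunk:(rank + 1) * chunk]
    return shards


def consolidate_full_state(shards: List[StateDict]) -> StateDict:
    """Concatenate per-rank chunks in rank order to rebuild the full state dict."""
    full: StateDict = {}
    for shard in shards:
        for name, values in shard.items():
            full.setdefault(name, []).extend(values)
    return full


# Hypothetical parameters, used only for illustration.
full_state = {"layer.weight": [0.1, 0.2, 0.3, 0.4, 0.5], "layer.bias": [1.0, 2.0]}
shards = shard_state(full_state, world_size=2)
restored = consolidate_full_state(shards)
```

In this picture, a full-state-dict checkpoint corresponds to saving `restored` once, while sharded checkpointing would mean each rank saving only its own entry of `shards`.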

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

shuyingsunshine21 · 2021-05-21T15:46:25Z

tests/plugins/test_ddp_fully_sharded_with_full_state_dict.py

+    """
+
+    model = TestFSDPModel()
+    ck = ModelCheckpoint(save_last=True)


Thanks @SeanNaren for the fix! I tried ModelCheckpoint before, but as ModelCheckpoint(dirpath=tmpdir, save_last=True). I wonder why removing dirpath works.

ananthsub

awesome work @shuyingsunshine21 and @SeanNaren !

awaelchli · 2021-05-21T21:07:49Z

pytorch_lightning/plugins/training_type/fully_sharded.py

+        if not self.on_gpu:
+            raise MisconfigurationException(


this could be easily unit tested

Sounds great, adding a unit test for it.
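A unit test for this guard could look roughly like the sketch below. It is self-contained on purpose: `ToyFullyShardedPlugin` is a hypothetical stand-in for the real plugin, `MisconfigurationException` is redefined locally rather than imported from Lightning, and the error message is an assumption.

```python
class MisconfigurationException(Exception):
    """Local stand-in for Lightning's exception of the same name."""


class ToyFullyShardedPlugin:
    """Hypothetical, stripped-down plugin; only the GPU guard is modelled."""

    def __init__(self, on_gpu: bool) -> None:
        self.on_gpu = on_gpu

    def setup_distributed(self) -> None:
        if not self.on_gpu:
            raise MisconfigurationException(
                "Fully sharded training requires a GPU (illustrative message)."
            )


def test_guard_raises_on_cpu() -> None:
    plugin = ToyFullyShardedPlugin(on_gpu=False)
    try:
        plugin.setup_distributed()
    except MisconfigurationException as err:
        assert "GPU" in str(err)
    else:
        raise AssertionError("expected MisconfigurationException")


def test_guard_passes_on_gpu() -> None:
    # Should not raise when a GPU is reported as available.
    ToyFullyShardedPlugin(on_gpu=True).setup_distributed()


test_guard_raises_on_cpu()
test_guard_passes_on_gpu()
```

In a real test suite the try/except would more likely be written with `pytest.raises`; the plain form is used here only to keep the sketch dependency-free.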

awaelchli · 2021-05-21T21:11:01Z

pytorch_lightning/trainer/connectors/accelerator_connector.py

+        # as precision_plugin is dependent on training_type_plugin, make sure
+        # that we first select training_type_plugin, then precision_plugin
        return acc_cls(
-            precision_plugin=self.precision_plugin,
            training_type_plugin=self.training_type_plugin,
+            precision_plugin=self.precision_plugin,
        )


Let's try to move the precision inside before we add the next major plugin.
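The reordering in the diff above matters because Python evaluates call arguments left to right, so the order of keyword arguments decides which property runs first. The sketch below shows the effect with a toy connector. `ToyConnector`, `make_accelerator`, and the plugin name strings are all hypothetical and do not mirror Lightning's actual selection logic.

```python
class ToyConnector:
    """Hypothetical connector with lazily selected plugins."""

    def __init__(self) -> None:
        self._training_type_plugin = None
        self.calls = []

    @property
    def training_type_plugin(self) -> str:
        self.calls.append("training_type_plugin")
        if self._training_type_plugin is None:
            self._training_type_plugin = "fully_sharded"
        return self._training_type_plugin

    @property
    def precision_plugin(self) -> str:
        self.calls.append("precision_plugin")
        # The result depends on whether the training type plugin was chosen already.
        if self._training_type_plugin == "fully_sharded":
            return "fully_sharded_native_amp"
        return "native_amp"


def make_accelerator(**plugins: str) -> dict:
    """Stand-in for the accelerator constructor; just records what it received."""
    return plugins


connector = ToyConnector()
accelerator = make_accelerator(
    training_type_plugin=connector.training_type_plugin,  # evaluated first
    precision_plugin=connector.precision_plugin,  # sees the selected plugin
)
```

With the arguments in the opposite order, the toy precision property would run before any training type plugin had been selected and fall back to the generic choice. Moving precision selection inside the training type plugin would remove that implicit ordering dependency.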

awaelchli · 2021-05-21T21:17:49Z

tests/plugins/test_ddp_fully_sharded_with_full_state_dict.py

+    trainer = Trainer(
+        default_root_dir=tmpdir,
+        fast_dev_run=True,
+        plugins="fsdp",
+    )
+    assert isinstance(trainer.accelerator.training_type_plugin, DDPFullyShardedPlugin)
+


I thought it's not supported on CPU?
Can't we evaluate compatibility in the accelerator connector?

Nice catch. It passed because our GPU assertion happens in setup_distributed, not when the Trainer is initialized. We could do the check in the accelerator connector, but my personal feeling is that the file is becoming too large and complex, so I would rather avoid adding logic there. I think it is the specific plugin/accelerator strategy's responsibility to check once the environment is set up.
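The two timings being discussed can be compared in a small sketch. Everything here is hypothetical: `ToyTrainer` and `DeferredCheckPlugin` are stand-ins, and the `eager_check` flag only models what validating in the accelerator connector would look like.

```python
class MisconfigurationException(Exception):
    """Local stand-in for Lightning's exception of the same name."""


class DeferredCheckPlugin:
    """Validates the device only when the distributed environment is set up."""

    def __init__(self, on_gpu: bool) -> None:
        self.on_gpu = on_gpu

    def setup_distributed(self) -> None:
        if not self.on_gpu:
            raise MisconfigurationException("GPU required (illustrative message).")


class ToyTrainer:
    """Hypothetical trainer; `eager_check` models validating in the connector."""

    def __init__(self, on_gpu: bool, eager_check: bool = False) -> None:
        if eager_check and not on_gpu:
            raise MisconfigurationException("GPU required (illustrative message).")
        self.plugin = DeferredCheckPlugin(on_gpu)

    def fit(self) -> None:
        self.plugin.setup_distributed()


# Deferred: construction succeeds on CPU, so an isinstance assertion would pass.
trainer = ToyTrainer(on_gpu=False)
constructed_on_cpu = isinstance(trainer.plugin, DeferredCheckPlugin)

# The failure only surfaces once fitting triggers setup_distributed().
try:
    trainer.fit()
    failed_at_fit = False
except MisconfigurationException:
    failed_at_fit = True
```

The trade-off is where the knowledge lives: the deferred form keeps each plugin responsible for its own requirements, while the eager form reports the problem earlier at the cost of more logic in the connector.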

awaelchli

interesting plugin
great effort from everybody

awaelchli · 2021-05-21T21:23:54Z

Disabled auto-merge in case you want to address some of the comments before merging, but IMO this is unblocked and ready to go. Thanks.

Shuying Sun and others added 30 commits March 23, 2021 12:06

Fix some test errors

89f284d

Merge branch 'master' of https://github.com/PyTorchLightning/pytorch-…

80cfbff

…lightning pull latest code

checkpoint consolidation

536c132

Update ddp_spawn.py

f172101

Update test_metric_result_integration.py

bf70e43

Update test_results.py

ea74906

Update utils.py

a9aae99

Update utils.py

70fe5da

Update test_all_gather_grad.py

0d23d75

Update test_all_gather_grad.py

ca6f98b

Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkp…

c5053da

…oint_consolidate Update test_all_gather_grad.py

Update test_results.py

9d4a2b8

Revert "Update test_results.py"

7635b4f

This reverts commit 9d4a2b8.

Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine2…

d64f90c

…1-checkpoint_consolidate" This reverts commit c5053da, reversing changes made to 0d23d75.

Revert "Update test_all_gather_grad.py"

dcdcd29

This reverts commit 0d23d75.

Revert "Update utils.py"

8651d54

This reverts commit 70fe5da.

Revert "Update utils.py"

15f4b9e

This reverts commit a9aae99.

Revert "Update test_results.py"

250d0aa

This reverts commit ea74906.

Revert "Update test_metric_result_integration.py"

6c095b2

This reverts commit bf70e43.

Revert "Update ddp_spawn.py"

8222dc9

This reverts commit f172101.

Revert "checkpoint consolidation"

3a9fde9

This reverts commit 536c132.

Revert "Revert "checkpoint consolidation""

7a369f4

This reverts commit 3a9fde9.

Revert "Revert "Revert "checkpoint consolidation"""

b4a0b9e

This reverts commit 7a369f4.

Merge branch 'master' of https://github.com/PyTorchLightning/pytorch-…

5cf1db1

…lightning

Revert "Revert "Update ddp_spawn.py""

0ce7e05

This reverts commit 8222dc9.

Revert "Revert "Update test_metric_result_integration.py""

fe9736d

This reverts commit 6c095b2.

Revert "Revert "Update test_results.py""

c314ef6

This reverts commit 250d0aa.

Revert "Revert "Update utils.py""

c3feda0

This reverts commit 8651d54.

Revert "Revert "Update test_all_gather_grad.py""

c759477

This reverts commit dcdcd29.

Merge branch 'master' of https://github.com/shuyingsunshine21/pytorch…

7a8e540

…-lightning

Shuying Sun and others added 11 commits May 19, 2021 17:37

test

af31f6a

fix

a044f58

Merge branch 'master' of https://github.com/PyTorchLightning/pytorch-…

78897a2

…lightning into fsdp

update

5fac05c

testing remove special for multi gpu

8fb38dc

assert gpu

53be8f4

add assertion for gpu

2fdab94

fix

ffb985d

Re-enable special test, use ModelCheckpoint

3189f4c

Fix paths

c33fbcb

Fix path passing

0067aeb

SeanNaren enabled auto-merge (squash) May 21, 2021 15:13

SeanNaren added the ready PRs ready to be merged label May 21, 2021

shuyingsunshine21 commented May 21, 2021

ananthsub approved these changes May 21, 2021

awaelchli reviewed May 21, 2021

awaelchli approved these changes May 21, 2021

awaelchli disabled auto-merge May 21, 2021 21:23

carmocca reviewed May 21, 2021

pytorch_lightning/plugins/training_type/fully_sharded.py (outdated, resolved)

Shuying Sun and others added 7 commits May 21, 2021 18:59

test

320c35d

test

5ea37a5

fix test

9e599c2

fix

6c96625

pre-commit format

32cc552

[pre-commit.ci] auto fixes from pre-commit.com hooks

25329fe

for more information, see https://pre-commit.ci

Merge branch 'fsdp' of https://github.com/shuyingsunshine21/pytorch-l…

f43a36f

…ightning into fsdp

ethanwharris merged commit 299f2c4 into Lightning-AI:master May 24, 2021

SeanNaren mentioned this pull request Jun 1, 2021

Add FSDP docs #7791

Merged

11 tasks
