Remove call_configure_sharded_model lifecycle property #9612

Merged
merged 7 commits into Lightning-AI:master on Sep 24, 2021

Conversation

ananthsub
Contributor

@ananthsub ananthsub commented Sep 20, 2021

What does this PR do?

Part of #8722

Changes:

  • Remove the property guarding the hook: the Trainer will always call configure_sharded_model in each of fit, validate, test, etc. This avoids subtle side effects from prior runs and makes the call order consistent.
  • In turn, we strongly recommend that users implement configure_sharded_model idempotently. The update to TestFSDPModel demonstrates how one can check whether the layers are already wrapped with FSDP and return early if so (see the sketch below). This is in the same spirit as Avoid rewrapping LightningModules in plugins #8593, but in user land.
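
A minimal sketch of the recommended idempotent pattern, assuming a fairscale FSDP wrapper as in the updated TestFSDPModel; the module layout and class name here are illustrative, not the actual test code:

```python
import torch.nn as nn
import pytorch_lightning as pl
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP


class IllustrativeShardedModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Plain submodule; wrapping is deferred to configure_sharded_model.
        self.layer = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 2))

    def configure_sharded_model(self) -> None:
        # After this PR the hook runs on every fit/validate/test/predict call,
        # so return early if a previous run already wrapped the layers.
        if isinstance(self.layer, FSDP):
            return
        self.layer = FSDP(self.layer)
```

Wrapping inside the hook (rather than in __init__) keeps the model definition cheap to instantiate, and the isinstance guard makes repeated hook calls across Trainer stages a no-op.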

@SeanNaren

Does your PR introduce any breaking changes? If yes, please list them.

  • Calls configure_sharded_model unconditionally
  • This removes the property on the accelerator/training type plugin
  • Removes support for call_configure_sharded_model_hook on the LightningModule (which is not officially part of the LightningModule API); a migration sketch follows this list
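
A hedged before/after migration sketch: the attribute name comes from this PR, while the exact prior semantics (using the attribute as a guard around the hook) are inferred from the description above; the toy model is illustrative.

```python
import torch.nn as nn
import pytorch_lightning as pl


class Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 4)


model = Model()

# Before this PR, the hook was guarded and code could toggle this attribute to
# control whether configure_sharded_model would run again. The attribute is
# removed here, so setting it no longer has any effect.
model.call_configure_sharded_model_hook = False

# After this PR, the Trainer calls configure_sharded_model unconditionally on
# every fit/validate/test run; make the hook itself idempotent instead
# (see the sketch in the Changes section above).
```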

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@ananthsub ananthsub added the distributed, refactor, and breaking change labels Sep 20, 2021
@ananthsub ananthsub added this to the v1.5 milestone Sep 20, 2021
@ananthsub ananthsub changed the title Removec call_configure_sharded_model lifecycle property Remove call_configure_sharded_model lifecycle property Sep 20, 2021
@codecov

codecov bot commented Sep 20, 2021

Codecov Report

Merging #9612 (2562888) into master (eb6aa7a) will decrease coverage by 4%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #9612    +/-   ##
=======================================
- Coverage      93%     89%    -4%     
=======================================
  Files         179     179            
  Lines       15307   15306     -1     
=======================================
- Hits        14199   13583   -616     
- Misses       1108    1723   +615     

Contributor

@SeanNaren SeanNaren left a comment


Multiple calls to configure_sharded_model should be fine for DeepSpeed as well, since there is a guard in place to not re-partition parameters that have already been partitioned: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L499
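
The general shape of such a guard, as an illustrative sketch only (the marker attribute and the no-op body are hypothetical, not DeepSpeed's actual code):

```python
import torch.nn as nn


def partition_parameters_once(module: nn.Module) -> None:
    # Parameters handled by a previous call are skipped, so invoking the
    # setup path from configure_sharded_model repeatedly is harmless.
    for param in module.parameters():
        if getattr(param, "_already_partitioned", False):  # hypothetical marker
            continue
        # ... the real partitioning/sharding of `param` would happen here ...
        param._already_partitioned = True
```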

@ananthsub is it important to update the docs now? or should we wait till further changes are made to support a persistent model state across Trainer stages?

@mergify mergify bot added the ready label Sep 22, 2021
Contributor

@tchaton tchaton left a comment


LGTM!

@tchaton tchaton enabled auto-merge (squash) September 22, 2021 12:18
@tchaton
Contributor

tchaton commented Sep 22, 2021

Hey @ananthsub,

Any progress on the failing Azure tests?

Best,
T.C

@ananthsub
Contributor Author

Hey @ananthsub,

Any progress on the failing Azure tests?

Best,
T.C

@tchaton I'm unable to reproduce the failures locally. Is there any interleaving of tests that could cause this to fail in CI but not locally?

@awaelchli
Contributor

awaelchli commented Sep 23, 2021

@ananthsub I can reproduce locally by running py.test -v tests/plugins.

The first test that fails seems to be test_deepspeed_skip_backward_raises.

When calling the tests individually, they pass. This is not unfamiliar, as we have had this a few times before. Since distributed logic is all globally shared, e.g. in the torch package, things can leak from one test to the next (that is my interpretation).

tests/plugins/test_deepspeed_plugin.py::test_deepspeed_skip_backward_raises FAILED [ 54%]
tests/plugins/test_deepspeed_plugin.py::test_deepspeed_warn_train_dataloader_called SKIPPED (Requires: [Special execution]) [ 54%]
tests/plugins/test_deepspeed_plugin.py::test_deepspeed_setup_train_dataloader SKIPPED (Requires: [Special execution]) [ 55%]
tests/plugins/test_double_plugin.py::test_double_precision[DoublePrecisionBoringModel] FAILED [ 56%]
tests/plugins/test_double_plugin.py::test_double_precision[DoublePrecisionBoringModelNoForward] FAILED [ 56%]
tests/plugins/test_double_plugin.py::test_double_precision[DoublePrecisionBoringModelComplexBuffer] PASSED [ 57%]
tests/plugins/test_double_plugin.py::test_double_precision_ddp FAILED
tests/plugins/test_sharded_plugin.py::test_configure_ddp FAILED [ 77%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded[DDPShardedPlugin] FAILED [ 77%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded[DDPSpawnShardedPlugin] FAILED [ 78%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded_reduce_buffer_size[1-params0-0] FAILED [ 78%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded_reduce_buffer_size[1-params1-128] FAILED [ 79%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded_reduce_buffer_size[2-params0-0] FAILED [ 80%]
tests/plugins/test_sharded_plugin.py::test_custom_kwargs_sharded_reduce_buffer_size[2-params1-128] FAILED [ 80%]
tests/plugins/test_sharded_plugin.py::test_block_backward_sync PASSED [ 81%]
tests/plugins/test_single_device_plugin.py::test_single_cpu PASSED [ 81%]
tests/plugins/test_single_device_plugin.py::test_single_gpu FAILED

Looking at the error messages, I see something disturbing:

tests/plugins/test_sharded_plugin.py:303:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pytorch_lightning/plugins/training_type/sharded.py:46: in configure_ddp
    LightningShardedDataParallel(self.model),
../../anaconda3/envs/pl/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py:240: in wrapper
    print_rank_0(f'Before initializing {module.__class__.__name__}',

The error message is originating from deepspeed 🤣 No clue how that can be. Need to investigate the code path.

@ananthsub
Contributor Author

This is not unfamiliar, as we have had this a few times before. Since distributed logic is all globally shared, e.g. in the torch package, things can leak from one test to the next (that is my interpretation).

Do you remember how this was solved before?

@carmocca
Contributor

carmocca commented Sep 24, 2021

Do you remember how this was solved before?

There is no solution without something like #8080. The workaround is to run the tests per process (which is what special_tests.sh does) and/or to unroll parametrizations.
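
A hedged sketch of the per-process workaround; the runner is illustrative (not the actual special_tests.sh), and the test IDs are taken from the failing run above:

```python
# run_isolated.py -- illustrative helper, not part of the repository
import subprocess
import sys

# Each test gets a fresh interpreter, so globally shared distributed/DeepSpeed
# state cannot leak from one test into the next.
TESTS = [
    "tests/plugins/test_deepspeed_plugin.py::test_deepspeed_skip_backward_raises",
    "tests/plugins/test_sharded_plugin.py::test_configure_ddp",
]

exit_code = 0
for test in TESTS:
    result = subprocess.run([sys.executable, "-m", "pytest", "-v", test])
    exit_code = exit_code or result.returncode

sys.exit(exit_code)
```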

@tchaton tchaton merged commit 41e3be1 into Lightning-AI:master Sep 24, 2021