[BUG] `estimated_stepping_batches` requires distributed comms in `configure_optimizers` for `DeepSpeedStrategy` #13350

SeanNaren · 2022-06-21T14:46:03Z

What does this PR do?

This PR fixes `estimated_stepping_batches` for the `DeepSpeedStrategy`.

This solution is optimal; however, it touches internals, which may prove error-prone in the future. We do not use `model_to_device` because we do not want to move the entire model to the device when using DeepSpeed Stage 3; we rely on DeepSpeed to manage device sharding and assignment.
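The trade-off described above (and in the commit "Swap to setting the device manually to exclude moving weights to device") can be illustrated with a small, dependency-free sketch. Everything here is a hypothetical stand-in: `ToyParam`, `ToyModule`, `move_model_to_device` and `set_device_only` are not Lightning or DeepSpeed APIs, and the real change in `src/pytorch_lightning/strategies/deepspeed.py` may look different. The sketch only shows the idea of recording the target device on the module while leaving the weights where they are.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyParam:
    """Stand-in for a tensor; it only tracks which device it lives on."""
    device: str = "cpu"


@dataclass
class ToyModule:
    """Hypothetical stand-in for a module with a private device attribute."""
    params: List[ToyParam] = field(default_factory=lambda: [ToyParam(), ToyParam()])
    _device: str = "cpu"  # device the module reports, e.g. to collective ops

    @property
    def device(self) -> str:
        return self._device


def move_model_to_device(module: ToyModule, root_device: str) -> None:
    """Usual path (assumed): move every parameter, then record the device."""
    for param in module.params:
        param.device = root_device
    module._device = root_device


def set_device_only(module: ToyModule, root_device: str) -> None:
    """Alternative sketched here: record the device but leave the weights alone,
    so a sharding framework can decide later where each weight is placed."""
    module._device = root_device


sharded = ToyModule()
set_device_only(sharded, "cuda:0")

moved = ToyModule()
move_model_to_device(moved, "cuda:0")

print(sharded.device, [p.device for p in sharded.params])
print(moved.device, [p.device for p in moved.params])
```

In this toy model both modules report `cuda:0` as their device, but only the second one has relocated its parameters. Writing to a private attribute is what makes the approach fragile, which matches the caveat about touching internals.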

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

cc @Borda @SeanNaren @awaelchli @rohitgr7 @akihironitta

src/pytorch_lightning/strategies/deepspeed.py

Borda

lgtm

…figure_optimizers` for `DeepSpeedStrategy` (#13350)

* update NGC docker (#13136)
* update docker
* Apply suggestions from code review Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* Decouple pulling legacy checkpoints from existing GHA workflows and docker files (#13185)
* Add pull-legacy-checkpoints action
* Replace pulls with the new action and script
* Simplify
* Merge pull request #13250 from PyTorchLightning/ci/rm-base CI: Remove simple test `ci_test-base.yml`
* Update rich requirement from !=10.15.*,<=12.0.0,>=10.2.2 to >=10.2.2,!=10.15.0.a,<13.0.0 in /requirements (#13047)
* Update rich requirement in /requirements Updates the requirements on [rich](https://github.com/willmcgugan/rich) to permit the latest version. - [Release notes](https://github.com/willmcgugan/rich/releases) - [Changelog](https://github.com/Textualize/rich/blob/master/CHANGELOG.md) - [Commits](Textualize/rich@v10.2.2...v12.4.1) --- updated-dependencies: - dependency-name: rich dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>
* Fix torch.distributed._sharded_tensor DeprecationWarning (#13261)
* update tutorials (#13268)
* [BUG] `estimated_stepping_batches` requires distributed comms in `configure_optimizers` for `DeepSpeedStrategy` (#13350)
* Update torchmetrics requirement from <=0.7.2,>=0.4.1 to >=0.4.1,<0.9.2 in /requirements (#13275) Update torchmetrics requirement in /requirements Updates the requirements on [torchmetrics](https://github.com/PyTorchLightning/metrics) to permit the latest version. - [Release notes](https://github.com/PyTorchLightning/metrics/releases) - [Changelog](https://github.com/PyTorchLightning/metrics/blob/master/CHANGELOG.md) - [Commits](Lightning-AI/torchmetrics@v0.4.1...v0.9.1) --- updated-dependencies: - dependency-name: torchmetrics dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Fix mypy errors for model summary utilities (#13384)
* rename org Lightning AI
* Modified python version check to accommodate for legacy version styles (#13420) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> (cherry picked from commit b332b66)
* Call `set_epoch` for distributed batch samplers (#13396) Co-authored-by: Jirka <jirka.borovec@seznam.cz> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> (cherry picked from commit 2dd332f)
* _RICH_AVAILABLE
* _FAIRSCALE_AVAILABLE
* _BAGUA_AVAILABLE
* redefine
* chlog spaces
* CI: Fix `fatal: unsafe repository` (#13515)
* update release date
* CI: azure rename
* Restore log step during restart (#13467) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
* remove redundant test
* Update CI setup (#13291)
* drop mamba
* use legacy GPU machines
* fix schema check

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Co-authored-by: Sean Naren <sean@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Martino Sorbaro <martinosorb@users.noreply.github.com>

SeanNaren added 2 commits June 21, 2022 15:35

Fixes deepspeed and estimated_stepping_batches

68af578

Swap to setting the device manually to exclude moving weights to device

7262fa9

SeanNaren added the `bug` (Something isn't working) and `strategy: deepspeed` labels Jun 21, 2022

SeanNaren added this to the pl:1.6.x milestone Jun 21, 2022

SeanNaren requested review from tchaton and awaelchli as code owners June 21, 2022 14:46

SeanNaren self-assigned this Jun 21, 2022

SeanNaren requested review from justusschock, kaushikb11, williamFalcon, Borda, carmocca and rohitgr7 as code owners June 21, 2022 14:46

SeanNaren changed the title ~~Fix/deepspeed~~ [BUG] estimated_stepping_batches requires distributed comms in configure_optimizers for DeepSpeedStrategy Jun 21, 2022

SeanNaren and others added 2 commits June 21, 2022 15:48

Add CHANGELOG.md

fc2bfea

[pre-commit.ci] auto fixes from pre-commit.com hooks

519ef01

for more information, see https://pre-commit.ci

justusschock approved these changes Jun 21, 2022

src/pytorch_lightning/strategies/deepspeed.py (review thread resolved)

carmocca approved these changes Jun 21, 2022

mergify bot added the `ready` (PRs ready to be merged) label Jun 21, 2022

Borda approved these changes Jun 21, 2022

SeanNaren merged commit 89e2e69 into master Jun 21, 2022

SeanNaren deleted the fix/deepspeed branch June 21, 2022 16:48

SeanNaren mentioned this pull request Jun 23, 2022

Fix FSDP [1/n] Refactor update_properties in device mixins to not use self.apply #13387

Closed

12 tasks

rohitgr7 mentioned this pull request Jul 1, 2022

Weekly patch release v1.6.5 #13481

Merged

12 tasks

rohitgr7 pushed a commit that referenced this pull request Jul 1, 2022

[BUG] estimated_stepping_batches requires distributed comms in `con…

541392f

…figure_optimizers` for `DeepSpeedStrategy` (#13350)
