Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] estimated_stepping_batches requires distributed comms in configure_optimizers for DeepSpeedStrategy #13350

Merged
merged 4 commits into from
Jun 21, 2022

Conversation

SeanNaren
Copy link
Contributor

@SeanNaren SeanNaren commented Jun 21, 2022

What does this PR do?

Closes #12317.

This is a solution to fixing estimated_stepping_batches for the DeepSpeedStrategy.

This solution is optimal, however, touches internals which may prove error-prone in the future. The reason we do not use model_to_device is that we do not want to move the entire model to the device when using DeepSpeed Stage 3. We rely on DeepSpeed to manage device sharding/assignment.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

cc @Borda @SeanNaren @awaelchli @rohitgr7 @akihironitta

@SeanNaren SeanNaren added bug Something isn't working strategy: deepspeed labels Jun 21, 2022
@SeanNaren SeanNaren added this to the pl:1.6.x milestone Jun 21, 2022
@SeanNaren SeanNaren self-assigned this Jun 21, 2022
@SeanNaren SeanNaren changed the title Fix/deepspeed [BUG] estimated_stepping_batches requires distributed comms in configure_optimizers for DeepSpeedStrategy Jun 21, 2022
@mergify mergify bot added the ready PRs ready to be merged label Jun 21, 2022
Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@SeanNaren SeanNaren merged commit 89e2e69 into master Jun 21, 2022
@SeanNaren SeanNaren deleted the fix/deepspeed branch June 21, 2022 16:48
@rohitgr7 rohitgr7 mentioned this pull request Jul 1, 2022
12 tasks
rohitgr7 pushed a commit that referenced this pull request Jul 1, 2022
lexierule pushed a commit that referenced this pull request Jul 12, 2022
* update NGC docker (#13136)

* update docker
* Apply suggestions from code review

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Decouple pulling legacy checkpoints from existing GHA workflows and docker files (#13185)

* Add pull-legacy-checkpoints action
* Replace pulls with the new action and script
* Simplify

* Merge pull request #13250 from PyTorchLightning/ci/rm-base

CI: Remove simple test `ci_test-base.yml`

* Update rich requirement from !=10.15.*,<=12.0.0,>=10.2.2 to >=10.2.2,!=10.15.0.a,<13.0.0 in /requirements (#13047)

* Update rich requirement in /requirements

Updates the requirements on [rich](https://github.com/willmcgugan/rich) to permit the latest version.
- [Release notes](https://github.com/willmcgugan/rich/releases)
- [Changelog](https://github.com/Textualize/rich/blob/master/CHANGELOG.md)
- [Commits](Textualize/rich@v10.2.2...v12.4.1)

---
updated-dependencies:
- dependency-name: rich
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* Fix torch.distributed._sharded_tensor DeprecationWarning (#13261)

* update tutorials (#13268)

* [BUG] `estimated_stepping_batches` requires distributed comms in `configure_optimizers` for `DeepSpeedStrategy` (#13350)

* Update torchmetrics requirement from <=0.7.2,>=0.4.1 to >=0.4.1,<0.9.2 in /requirements (#13275)

Update torchmetrics requirement in /requirements

Updates the requirements on [torchmetrics](https://github.com/PyTorchLightning/metrics) to permit the latest version.
- [Release notes](https://github.com/PyTorchLightning/metrics/releases)
- [Changelog](https://github.com/PyTorchLightning/metrics/blob/master/CHANGELOG.md)
- [Commits](Lightning-AI/torchmetrics@v0.4.1...v0.9.1)

---
updated-dependencies:
- dependency-name: torchmetrics
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix mypy errors for model summary utilities (#13384)

* rename org Lightning AI

* Modified python version check to accommodate for legacy version styles (#13420)

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

(cherry picked from commit b332b66)

* Call `set_epoch` for distributed batch samplers (#13396)

Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

(cherry picked from commit 2dd332f)

* _RICH_AVAILABLE

* _FAIRSCALE_AVAILABLE

* _BAGUA_AVAILABLE

* redefine

* chlog spaces

* CI: Fix `fatal: unsafe repository` (#13515)

* update release date

* CI: azure rename

* Restore log step during restart (#13467)

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* remove redundant test

* Update CI setup (#13291)

* drop mamba
* use legacy GPU machines

* fix schema check

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Co-authored-by: Sean Naren <sean@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Martino Sorbaro <martinosorb@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working ready PRs ready to be merged strategy: deepspeed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can't use estimated_stepping_batches in configure_optimizers with DDP
4 participants