
Update deepspeed requirement from <0.6.0 to <0.7.0 #13191

Closed
carmocca opened this issue May 31, 2022 · 4 comments · Fixed by #13859
Labels: ci (Continuous Integration), strategy: deepspeed

carmocca (Contributor) commented May 31, 2022

🚀 Feature

Re-land #13048
Blocked by a failing Lite test:

pytorch_lightning/lite/lite.py:407: in _run_impl
    return run_method(*args, **kwargs)
pytorch_lightning/lite/lite.py:412: in _run_with_strategy_setup
    return run_method(*args, **kwargs)
tests/lite/test_lite.py:402: in run
    model, optimizer = self.setup(model, optimizer)
pytorch_lightning/lite/lite.py:173: in setup
    model, optimizers = self._strategy._setup_model_and_optimizers(model, list(optimizers))
pytorch_lightning/strategies/deepspeed.py:414: in _setup_model_and_optimizers
    self.model, optimizer = self._setup_model_and_optimizer(model, optimizers[0])
pytorch_lightning/strategies/deepspeed.py:426: in _setup_model_and_optimizer
    deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
/usr/local/lib/python3.9/dist-packages/deepspeed/__init__.py:120: in initialize
    engine = DeepSpeedEngine(args=args,
/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/partition_parameters.py:377: in wrapper
    if not hasattr(module, "_ds_child_entered"):
/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/engine.py:432: in __getattr__
    if name in dir(self):
/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:1847: in __dir__
    parameters = list(self._parameters.keys())
/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/engine.py:432: in __getattr__
    if name in dir(self):
/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:1847: in __dir__
    parameters = list(self._parameters.keys())
E   RecursionError: maximum recursion depth exceeded while calling a Python object
!!! Recursion detected (same locals & position)
=========================== short test summary info ============================
FAILED tests/lite/test_lite.py::test_deepspeed_multiple_models - RecursionErr...
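
For context, the recursion here is mutual: DeepSpeedEngine.__getattr__ calls dir(self), and torch.nn.Module.__dir__ reads self._parameters, which falls back into __getattr__ when that attribute has not been assigned on the instance yet. A minimal sketch of the pattern in plain Python (the class below is a stand-in for illustration, not DeepSpeed's actual code):

class Engine:
    # Stand-in for DeepSpeedEngine.__getattr__, which consults dir(self)
    # to decide whether a lookup can be forwarded to a wrapped module.
    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name in dir(self):  # dir() invokes __dir__ below
            return object.__getattribute__(self, name)
        raise AttributeError(name)

    # Stand-in for torch.nn.Module.__dir__, which reads self._parameters.
    def __dir__(self):
        # _parameters was never assigned, so this lookup fails and falls
        # back to __getattr__, which calls dir(self) again, and so on.
        return list(self._parameters.keys())

try:
    Engine().some_attribute
except RecursionError as e:
    print(e)  # maximum recursion depth exceeded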

If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.

  • Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.

  • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.

cc @carmocca @akihironitta @Borda @SeanNaren @awaelchli @rohitgr7

carmocca (Contributor, Author) commented Jul 26, 2022

The test fails in the 0.6.0 to 0.6.4 releases with the following error:

        for mw_b, mw_a in zip(state_dict.values(), model.state_dict().values()):
>           assert not torch.allclose(mw_b, mw_a)
E           RuntimeError: The size of tensor a (0) must match the size of tensor b (32) at non-singleton dimension 1

lite/test_lite.py:420: RuntimeError

And in 0.6.5+ with the error reported in the top post. I will try git bisecting deepspeed to see if I can find the patch that changed the error.
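
For the RuntimeError above, a minimal sketch of the failing comparison, assuming the weight read back after restore comes back empty on this rank (e.g. partitioned away), which matches the sizes in the error message; the tensors are made up for illustration:

import torch

# torch.allclose requires broadcastable shapes, so comparing an empty
# tensor against a full (4, 32) weight raises rather than returning False.
mw_b = torch.empty(0)      # weight as restored (hypothetically partitioned to size 0)
mw_a = torch.randn(4, 32)  # full weight held by the model

try:
    torch.allclose(mw_b, mw_a)
except RuntimeError as e:
    print(e)  # The size of tensor a (0) must match the size of tensor b (32)...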

carmocca (Contributor, Author) commented Jul 26, 2022

Git bisect points to microsoft/DeepSpeed#1915 for the RecursionError and microsoft/DeepSpeed#1453 for the RuntimeError.

awaelchli (Contributor) commented

I opened microsoft/DeepSpeed#2139 for the recursion error in 0.6.5. It can be reproduced with pure PyTorch+DeepSpeed, and I don't know if the change was intentional, but I suspect not.

carmocca (Contributor, Author) commented

#13863 uncovered a new error:

../../src/pytorch_lightning/plugins/precision/deepspeed.py:115: in optimizer_step
    return deepspeed_engine.step(**kwargs)
DeepSpeed/deepspeed/runtime/engine.py:1887: in step
    self._take_model_step(lr_kwargs)
DeepSpeed/deepspeed/runtime/engine.py:1788: in _take_model_step
    self.optimizer.step()
DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py:1640: in step
    self.check_overflow()
DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py:1895: in check_overflow
    self._check_overflow(partition_gradients)
DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py:1800: in _check_overflow
    self.overflow = self.has_overflow(partition_gradients)
DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py:1819: in has_overflow
    overflow = self.local_overflow if self.cpu_offload else self.has_overflow_partitioned_grads_serial(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <deepspeed.runtime.zero.stage_1_and_2.DeepSpeedZeroOptimizer object at 0x7fe5d0f230a0>

    def has_overflow_partitioned_grads_serial(self):
        for i in range(len(self.bit16_groups)):
>           for j, grad in enumerate(self.averaged_gradients[i]):
E           KeyError: 0

DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py:1812: KeyError

Running
PL_RUN_STANDALONE_TESTS=1 pytest -v models/test_hooks.py::test_trainer_model_hook_system_fit[False-kwargs3]

git bisect points to microsoft/DeepSpeed#1801.
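
The KeyError is consistent with averaged_gradients being a dict keyed by parameter-group index that is still empty when step() reaches the overflow check; a minimal sketch of that pattern (the names mirror the traceback, the setup is made up):

# averaged_gradients is keyed by group index and is normally populated
# during backward; if the overflow check runs first, indexing group 0
# raises KeyError: 0 rather than IndexError.
bit16_groups = [["w0", "w1"]]  # one parameter group (placeholder contents)
averaged_gradients = {}        # empty in this failure mode

try:
    for i in range(len(bit16_groups)):
        for j, grad in enumerate(averaged_gradients[i]):
            pass
except KeyError as e:
    print(f"KeyError: {e}")  # KeyError: 0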

@carmocca carmocca modified the milestones: pl:future, pl:1.7 Jul 27, 2022