
Update deepspeed requirement from <0.6.0 to <0.7.0 #13191

Closed
carmocca opened this issue May 31, 2022 · 4 comments · Fixed by #13859
Labels: ci (Continuous Integration), strategy: deepspeed

carmocca (Contributor) commented May 31, 2022

🚀 Feature

Re-land #13048
Blocked by a failing Lite test:

pytorch_lightning/lite/lite.py:407: in _run_impl
    return run_method(*args, **kwargs)
pytorch_lightning/lite/lite.py:412: in _run_with_strategy_setup
    return run_method(*args, **kwargs)
tests/lite/test_lite.py:402: in run
    model, optimizer = self.setup(model, optimizer)
pytorch_lightning/lite/lite.py:173: in setup
    model, optimizers = self._strategy._setup_model_and_optimizers(model, list(optimizers))
pytorch_lightning/strategies/deepspeed.py:414: in _setup_model_and_optimizers
    self.model, optimizer = self._setup_model_and_optimizer(model, optimizers[0])
pytorch_lightning/strategies/deepspeed.py:426: in _setup_model_and_optimizer
    deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
/usr/local/lib/python3.9/dist-packages/deepspeed/__init__.py:120: in initialize
    engine = DeepSpeedEngine(args=args,
/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/partition_parameters.py:377: in wrapper
    if not hasattr(module, "_ds_child_entered"):
/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/engine.py:432: in __getattr__
    if name in dir(self):
/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:1847: in __dir__
    parameters = list(self._parameters.keys())
/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/engine.py:432: in __getattr__
    if name in dir(self):
/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:1847: in __dir__
    parameters = list(self._parameters.keys())
E   RecursionError: maximum recursion depth exceeded while calling a Python object
!!! Recursion detected (same locals & position)
=========================== short test summary info ============================
FAILED tests/lite/test_lite.py::test_deepspeed_multiple_models - RecursionErr...
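
For context, the recursion here is mutual: DeepSpeedEngine.__getattr__ calls dir(self), and torch.nn.Module.__dir__ reads self._parameters, which falls back into __getattr__ when that attribute has not been assigned on the instance yet. A minimal sketch of the pattern in plain Python (the class below is a stand-in for illustration, not DeepSpeed's actual code):

class Engine:
    # Stand-in for DeepSpeedEngine.__getattr__, which consults dir(self)
    # to decide whether a lookup can be forwarded to a wrapped module.
    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name in dir(self):  # dir() invokes __dir__ below
            return object.__getattribute__(self, name)
        raise AttributeError(name)

    # Stand-in for torch.nn.Module.__dir__, which reads self._parameters.
    def __dir__(self):
        # _parameters was never assigned, so this lookup fails and falls
        # back to __getattr__, which calls dir(self) again, and so on.
        return list(self._parameters.keys())

try:
    Engine().some_attribute
except RecursionError as e:
    print(e)  # maximum recursion depth exceeded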

If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.

  • Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.

  • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.

cc @carmocca @akihironitta @Borda @SeanNaren @awaelchli @rohitgr7

carmocca (Contributor, Author) commented Jul 26, 2022

The test fails in the 0.6.0 to 0.6.4 releases with the following error:

        for mw_b, mw_a in zip(state_dict.values(), model.state_dict().values()):
>           assert not torch.allclose(mw_b, mw_a)
E           RuntimeError: The size of tensor a (0) must match the size of tensor b (32) at non-singleton dimension 1

lite/test_lite.py:420: RuntimeError

And in 0.6.5+ with the error reported in the top post. I will try git bisecting deepspeed to see if I can find the patch that changed the error.
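
For the RuntimeError above, a minimal sketch of the failing comparison, assuming the weight read back after restore comes back empty on this rank (e.g. partitioned away), which matches the sizes in the error message; the tensors are made up for illustration:

import torch

# torch.allclose requires broadcastable shapes, so comparing an empty
# tensor against a full (4, 32) weight raises rather than returning False.
mw_b = torch.empty(0)      # weight as restored (hypothetically partitioned to size 0)
mw_a = torch.randn(4, 32)  # full weight held by the model

try:
    torch.allclose(mw_b, mw_a)
except RuntimeError as e:
    print(e)  # The size of tensor a (0) must match the size of tensor b (32)...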

carmocca (Contributor, Author) commented Jul 26, 2022

Git bisect points to microsoft/DeepSpeed#1915 for the RecursionError and microsoft/DeepSpeed#1453 for the RuntimeError.

awaelchli (Contributor) commented

I opened microsoft/DeepSpeed#2139 for the recursion error in 0.6.5. It can be reproduced with pure PyTorch+DeepSpeed, and I don't know if the change was intentional, but I suspect not.

carmocca (Contributor, Author) commented

#13863 uncovered a new error:

../../src/pytorch_lightning/plugins/precision/deepspeed.py:115: in optimizer_step
    return deepspeed_engine.step(**kwargs)
DeepSpeed/deepspeed/runtime/engine.py:1887: in step
    self._take_model_step(lr_kwargs)
DeepSpeed/deepspeed/runtime/engine.py:1788: in _take_model_step
    self.optimizer.step()
DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py:1640: in step
    self.check_overflow()
DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py:1895: in check_overflow
    self._check_overflow(partition_gradients)
DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py:1800: in _check_overflow
    self.overflow = self.has_overflow(partition_gradients)
DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py:1819: in has_overflow
    overflow = self.local_overflow if self.cpu_offload else self.has_overflow_partitioned_grads_serial(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <deepspeed.runtime.zero.stage_1_and_2.DeepSpeedZeroOptimizer object at 0x7fe5d0f230a0>

    def has_overflow_partitioned_grads_serial(self):
        for i in range(len(self.bit16_groups)):
>           for j, grad in enumerate(self.averaged_gradients[i]):
E           KeyError: 0

DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py:1812: KeyError

Running
PL_RUN_STANDALONE_TESTS=1 pytest -v models/test_hooks.py::test_trainer_model_hook_system_fit[False-kwargs3]

git bisect points to microsoft/DeepSpeed#1801.
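
The KeyError is consistent with averaged_gradients being a dict keyed by parameter-group index that is still empty when step() reaches the overflow check; a minimal sketch of that pattern (the names mirror the traceback, the setup is made up):

# averaged_gradients is keyed by group index and is normally populated
# during backward; if the overflow check runs first, indexing group 0
# raises KeyError: 0 rather than IndexError.
bit16_groups = [["w0", "w1"]]  # one parameter group (placeholder contents)
averaged_gradients = {}        # empty in this failure mode

try:
    for i in range(len(bit16_groups)):
        for j, grad in enumerate(averaged_gradients[i]):
            pass
except KeyError as e:
    print(f"KeyError: {e}")  # KeyError: 0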

@carmocca carmocca modified the milestones: pl:future, pl:1.7 Jul 27, 2022