Update deepspeed requirement from <0.6.0 to <0.7.0 #13191
Comments
The test fails in the 0.6.0 to 0.6.4 releases with the following error:

```
    for mw_b, mw_a in zip(state_dict.values(), model.state_dict().values()):
>       assert not torch.allclose(mw_b, mw_a)
E       RuntimeError: The size of tensor a (0) must match the size of tensor b (32) at non-singleton dimension 1

lite/test_lite.py:420: RuntimeError
```

And in 0.6.5+ it fails with the error reported in the top post. I will try git bisecting DeepSpeed to see if I can find the patch that changed the error.
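For reference, the failing assertion appears to check that an optimizer step actually changed the model weights relative to a snapshot. A simplified sketch of that kind of check is below; it is not the exact `lite/test_lite.py` test, and the model, optimizer, and data are assumptions:

```python
# Simplified sketch of the kind of weight-change check in the failing test
# (NOT the exact lite/test_lite.py test); model, optimizer, and data are assumptions.
import copy
import torch

model = torch.nn.Linear(32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# snapshot the weights before training
state_dict = copy.deepcopy(model.state_dict())

loss = model(torch.rand(4, 32)).sum()
loss.backward()
optimizer.step()

# after an optimizer step, every parameter should differ from its snapshot
for mw_b, mw_a in zip(state_dict.values(), model.state_dict().values()):
    assert not torch.allclose(mw_b, mw_a)
```

The reported size-mismatch error suggests the saved and current state dicts no longer have matching tensor shapes under the affected DeepSpeed releases, so the comparison itself fails before the assertion can be evaluated.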
Git bisect points to microsoft/DeepSpeed#1915 for the
I opened microsoft/DeepSpeed#2139 for the recursion error in 0.6.5. It can be reproduced with pure PyTorch+DeepSpeed, and I don't know if the change was intentional, but I suspect not.
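For reference, a standalone repro of this kind is typically a short pure PyTorch + DeepSpeed script along the following lines. This is only a sketch under assumptions: the model, config values, and ZeRO stage are placeholders, not the script attached to the linked issue.

```python
# Minimal sketch of a pure PyTorch + DeepSpeed script (NOT the actual repro from
# microsoft/DeepSpeed#2139); the model, config values, and ZeRO stage are assumptions.
# DeepSpeed needs a CUDA device and is usually started via the `deepspeed` launcher.
import torch
import deepspeed

model = torch.nn.Linear(32, 2)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "Adam", "params": {"lr": 0.1}},
}

# deepspeed.initialize wraps the model in a DeepSpeedEngine and builds the ZeRO optimizer
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

batch = torch.rand(1, 32, dtype=torch.half, device=engine.device)
loss = engine(batch).sum()
engine.backward(loss)
engine.step()
```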
#13863 uncovered a new error:

```
../../src/pytorch_lightning/plugins/precision/deepspeed.py:115: in optimizer_step
    return deepspeed_engine.step(**kwargs)
DeepSpeed/deepspeed/runtime/engine.py:1887: in step
    self._take_model_step(lr_kwargs)
DeepSpeed/deepspeed/runtime/engine.py:1788: in _take_model_step
    self.optimizer.step()
DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py:1640: in step
    self.check_overflow()
DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py:1895: in check_overflow
    self._check_overflow(partition_gradients)
DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py:1800: in _check_overflow
    self.overflow = self.has_overflow(partition_gradients)
DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py:1819: in has_overflow
    overflow = self.local_overflow if self.cpu_offload else self.has_overflow_partitioned_grads_serial(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <deepspeed.runtime.zero.stage_1_and_2.DeepSpeedZeroOptimizer object at 0x7fe5d0f230a0>

    def has_overflow_partitioned_grads_serial(self):
        for i in range(len(self.bit16_groups)):
>           for j, grad in enumerate(self.averaged_gradients[i]):
E           KeyError: 0

DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py:1812: KeyError
```

Running git bisect points to microsoft/DeepSpeed#1801
🚀 Feature
Re-land #13048
Blocked by a failing Lite test (see the failure details in the comments above).
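For context, the re-land itself amounts to relaxing the DeepSpeed pin from `<0.6.0` to `<0.7.0`. A minimal sketch of that change, assuming the pin lives in a requirements file such as `requirements/strategies.txt` (the file path is an assumption):

```
# hypothetical requirements file entry; the file path is an assumption
# before
deepspeed<0.6.0
# after
deepspeed<0.7.0
```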
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Lite: Enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.
Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.
Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.
cc @carmocca @akihironitta @Borda @SeanNaren @awaelchli @rohitgr7