deepspeed.zero.Init causes infinite recursion error #2139
Comments
@awaelchli, thanks for reporting this issue.
We provide a wrapper around the user's code in Lightning Lite. It looks something like this:

    class Lite(LightningLite):
        def run(self):
            # user's code goes here (model, training loop, etc.)
            model = ...
            model, optimizer = self.setup(model, optimizer)
            # train model

    if __name__ == "__main__":
        lite = Lite(accelerator="gpu", devices=2)
        lite.run()

We have a DeepSpeed integration. Without any major changes to their code, the user can turn DeepSpeed on by changing this:

    Lite(accelerator="gpu", devices=2, strategy=DeepspeedStrategy(stage=3, ...))

Internally, we wrap the run method with the deepspeed.zero.Init context. It would be great if this could still be supported. If not, we'd need to introduce a context manager in Lite that the user has to call over their model instantiation. cc @carmocca
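For illustration, the fallback mentioned above (a context manager that covers only model instantiation) could look roughly like the sketch below. Everything in it is hypothetical: `LiteSketch`, `model_init`, and `recording_context` are made-up names, not part of Lightning Lite or DeepSpeed, and `recording_context` is a placeholder for the real sharding context so the example runs without either library installed.

```python
from contextlib import contextmanager, nullcontext

events = []  # records the order of operations for the demo


@contextmanager
def recording_context():
    # Hypothetical placeholder for the real context (e.g. deepspeed.zero.Init).
    events.append("enter")
    try:
        yield
    finally:
        events.append("exit")


class LiteSketch:
    """Hypothetical stand-in for a Lite-style wrapper; not the real API."""

    def __init__(self, init_context_factory=nullcontext):
        # A strategy would supply the factory; the default is a no-op context.
        self._init_context_factory = init_context_factory

    def model_init(self):
        # Context the user enters only around model instantiation.
        return self._init_context_factory()

    def setup(self, model):
        # Runs outside the context, unlike wrapping the whole run() method.
        events.append("setup")
        return model

    def run(self):
        with self.model_init():
            events.append("build model")
            model = object()  # placeholder for the user's model
        return self.setup(model)


LiteSketch(recording_context).run()
print(events)
```

The point of the narrower scope is that the setup step, which is where the engine would be initialized, happens after the context has exited.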
@awaelchli, thanks for sharing this context. Honestly, I don't know why it worked previously as wrapping
@awaelchli can you give #2150 a try? In our local tests it seems to fix your issue.
@jeffra Thank you very much for the quick response on the issue. I just tried the fix in my simple test and it works well! ❤️
@awaelchli please re-open if this issue isn't resolved for you. The related PR is now merged into master.
Fantastic. Thank you for resolving this so quickly!
Describe the bug
When the deepspeed.zero.Init context wraps not only the model but also the deepspeed.initialize call, a RecursionError is raised. This happens in deepspeed 0.6.5 but NOT in 0.6.4. It blocks the integration with Lightning Lite, where until now we wrapped the entire run() method with the context.
To Reproduce
Output (may need to press ctrl+c on hang):
Expected behavior
This worked in 0.6.4, so my assumption is that the change was unintentional. Git blame points to #1915. We weren't able to spot exactly which lines caused it, but suspect the getattr changes on the deepspeed engine.
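To make the suspicion about getattr concrete, here is a generic sketch of how a forwarding `__getattr__` can recurse without end. This is not DeepSpeed's code and is not confirmed as the cause here; `Engine`, `SafeEngine`, and `lookup_outcome` are hypothetical names used only to show the class of failure.

```python
class Engine:
    """Hypothetical wrapper that forwards unknown attributes to self.module."""

    def __init__(self, module=None, set_module=True):
        if set_module:
            self.module = module

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails. If "module" itself
        # was never set, reading self.module re-enters __getattr__ forever.
        return getattr(self.module, name)


class SafeEngine(Engine):
    def __getattr__(self, name):
        # Guard: read the instance dict directly, so a missing "module"
        # raises AttributeError instead of recursing.
        module = self.__dict__.get("module")
        if module is None:
            raise AttributeError(name)
        return getattr(module, name)


def lookup_outcome(engine, name):
    """Return a label describing what happens when `name` is looked up."""
    try:
        getattr(engine, name)
        return "ok"
    except RecursionError:
        return "RecursionError"
    except AttributeError:
        return "AttributeError"


print(lookup_outcome(Engine(set_module=False), "foo"))
print(lookup_outcome(SafeEngine(set_module=False), "foo"))
```

In this sketch the unguarded class hits Python's recursion limit, while the guarded one fails fast with an AttributeError.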
ds_report output
System info (please complete the following information):
Launcher context
I'm launching using torch.multiprocessing for simplicity in reproducing, but the bug is unrelated to how it is getting launched.
Docker context
No docker
Additional context