Support wrapping multiple models in Accelerator.accumulate() #1708
Conversation
This PR assumes all models have the same gradient accumulation steps, which might not always be the case.
@pacman100 But can accelerate handle different gradient accumulation steps for different models without this PR? My hope is that this PR will not break any old behavior. And when I train a GAN, I can do something like the sketch below.
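A minimal sketch of that GAN usage (not the commenter's original snippet, which is not preserved here). It assumes `generator`, `discriminator`, their optimizers, `dataloader`, the loss helpers `d_loss_fn`/`g_loss_fn`, and `latent_dim` are defined elsewhere, and that the models, optimizers, and dataloader have gone through `accelerator.prepare()`:

```python
import torch

for real_images in dataloader:
    # Wrap both models so gradient synchronization is skipped for each of them
    # on every step except the last one of an accumulation window.
    with accelerator.accumulate(generator, discriminator):
        noise = torch.randn(real_images.size(0), latent_dim, device=real_images.device)
        fake_images = generator(noise)

        # Discriminator step (fake images detached so only D receives these gradients).
        d_loss = d_loss_fn(discriminator(real_images), discriminator(fake_images.detach()))
        accelerator.backward(d_loss)
        d_optimizer.step()      # prepared optimizers only really step when gradients sync
        d_optimizer.zero_grad()

        # Generator step.
        g_loss = g_loss_fn(discriminator(fake_images))
        accelerator.backward(g_loss)
        g_optimizer.step()
        g_optimizer.zero_grad()
```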
While this PR is good, I recall more is needed for this to work. Let me dig through my stuff.
The issue here is that the gradient synchronization will be off. We were waiting on feedback from the torch team on whether our solution was a "bug" or a feature, and since they haven't responded we'll go with feature and I'll open a PR today. So sadly no, it's not as simple as you would hope.
@muellerzr I don't quite get the thing about "bug or feature" (maybe I need more background), but isn't disabling gradient synchronization exactly what one needs for gradient accumulation? At least that is the case for my use case.
@yuxinyuan please see #1726
@muellerzr I went through #1726 and I'm a little bit confused by the example use case there. What is the difference between using `no_sync()` manually like that and just using `accumulate()`?
I guess, in general, skipping gradient synchronization is mostly needed for gradient accumulation anyway. And regarding this PR, I don't think it conflicts with existing features. It just gives more flexibility to `Accelerator.accumulate()`.
I also tried the script below. It seems to me that the `no_sync()` there could just be replaced with `torch.no_grad()`.
It can be, but it really shouldn't be, I don't think. The difference is that `no_sync()` only skips the gradient synchronization across processes; the backward pass still runs and gradients still accumulate locally. So it's more than `no_grad`, because gradients are still calculated.
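For illustration only (not from this thread): a minimal sketch of how plain PyTorch DDP's `no_sync()` is typically used for gradient accumulation, assuming an already-initialized process group and placeholder `batches`, `optimizer`, and `loss_fn`:

```python
import contextlib
from torch.nn.parallel import DistributedDataParallel as DDP

def train_with_accumulation(ddp_model: DDP, batches, optimizer, loss_fn, accumulation_steps: int = 4):
    for step, (inputs, targets) in enumerate(batches):
        last_micro_step = (step + 1) % accumulation_steps == 0
        # no_sync() skips the cross-process gradient all-reduce; unlike torch.no_grad(),
        # gradients are still computed and added into each parameter's .grad.
        context = contextlib.nullcontext() if last_micro_step else ddp_model.no_sync()
        with context:
            loss = loss_fn(ddp_model(inputs), targets) / accumulation_steps
            loss.backward()
        if last_micro_step:
            optimizer.step()       # gradients were synchronized only on this step
            optimizer.zero_grad()
```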
@muellerzr I might not have been clear enough here. What I mean is that I understand the importance of disabling synchronization in gradient accumulation. I just cannot picture a scenario where we actually need to disable it outside of gradient accumulation. Anyway, back to this PR: what are the current shortcomings or limitations, in your opinion?
I am also facing this issue while training multiple ControlNets. Accelerate is not able to accumulate multiple models. Is there any workaround until this problem is solved?
Thanks @yuxinyuan, thinking on it more, let's go with this as well. A few nits in terms of variable names; otherwise this looks great to me 🤗
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
Awesome! Sending to @sgugger for a final review :)
Thanks for your contribution!
Sorry, but I don't quite get the result: does accelerate support accumulation over multiple models on multiple GPUs with correct gradient synchronization now?
@muellerzr @yuxinyuan So, as asked above: does accelerate support accumulation over multiple models on multiple GPUs with correct gradient synchronization now? More specifically, if I am training a diffusion model and a text encoder, can I just do `accelerator.accumulate(unet, text_encoder)`?
As I'm preparing to fix this issue in the diffusers example training scripts, I would also like to know whether the results are considered "correct".
Yes, that is all you need.
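A minimal sketch of that diffusion setup (names such as `unet`, `text_encoder`, `optimizer`, `train_dataloader`, and `compute_loss` are placeholders; the models, optimizer, and dataloader are assumed to have gone through `accelerator.prepare()`):

```python
for batch in train_dataloader:
    # Both models are passed in, so gradient synchronization is skipped for
    # each of them until the final step of an accumulation window.
    with accelerator.accumulate(unet, text_encoder):
        loss = compute_loss(unet, text_encoder, batch)  # placeholder loss computation
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```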
This PR updates `Accelerator.accumulate()` so that users can pass in multiple models into the context manager for skipping gradient sync (by calling `no_sync()`).
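In short (a condensed sketch; `model`, `model_a`, and `model_b` are placeholder names), the existing single-model form keeps working, and the new form simply accepts several models:

```python
# Pre-existing usage, unchanged by this PR.
with accelerator.accumulate(model):
    ...

# New: multiple models in one call; each has its gradient sync skipped
# on non-final accumulation steps.
with accelerator.accumulate(model_a, model_b):
    ...
```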