How to handle gradient accumulation with multiple models? #668
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Re-opening this issue again. For doing grad accum with multiple models, the idea would be:

```python
with accumulate(model1) as _, accumulate(model2) as _:
    training_step()
```
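For reference, a sketch of what that nested usage could look like in a full loop (later comments report that this can desynchronize under DDP). It assumes the models, optimizers, and dataloader have all gone through `accelerator.prepare`, and `compute_loss` is a hypothetical helper standing in for the forward passes:

```python
for batch in dataloader:
    with accelerator.accumulate(model1):
        with accelerator.accumulate(model2):
            loss = compute_loss(model1, model2, batch)  # hypothetical helper
            accelerator.backward(loss)
            # Prepared optimizers only actually step on accumulation boundaries,
            # so step()/zero_grad() are safe to call every iteration.
            optimizer1.step()
            optimizer2.step()
            optimizer1.zero_grad()
            optimizer2.zero_grad()
```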
Does this currently work? Or is that a feature request, meaning it currently wouldn't work, but would work in the future? Sorry, I got a bit confused by the feature request tag and your comment.

Very interested in this. I'm training two models at once and can only use batch sizes of less than 5 on my machine, so gradient accumulation would be great.

… save a little VRAM

I "solved" it by creating one `Accelerator` per model.
I think in this case writing the accumulation yourself may be more flexible:

```python
loss = loss / gradient_accumulation_steps
accelerator.backward(loss)
if (index + 1) % gradient_accumulation_steps == 0:
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```
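Extending that manual pattern to the two-model case from this issue might look like the sketch below (`compute_loss` is a hypothetical helper, and one scheduler per optimizer is assumed). Note that under DDP this syncs gradients on every backward rather than only at accumulation boundaries, so it trades some speed for simplicity:

```python
for index, batch in enumerate(dataloader):
    loss = compute_loss(model1, model2, batch) / gradient_accumulation_steps
    accelerator.backward(loss)
    if (index + 1) % gradient_accumulation_steps == 0:
        optimizer1.step()
        optimizer2.step()
        scheduler1.step()
        scheduler2.step()
        optimizer1.zero_grad()
        optimizer2.zero_grad()
```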
This is pretty simple actually, just use `with accelerator.accumulate(model1), accelerator.accumulate(model2):`. That is how `with` works: the following code runs inside both contexts, so you can simply combine them with a comma.
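Spelled out, the comma form is equivalent to nesting the two context managers. A sketch assuming a single optimizer over both models' parameters (`compute_loss` is a hypothetical helper):

```python
for batch in dataloader:
    with accelerator.accumulate(model1), accelerator.accumulate(model2):
        loss = compute_loss(model1, model2, batch)  # hypothetical helper
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```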
This apparently is not working. I printed AdamW statistics of the parameter groups from different models, and one of them goes out of sync between GPUs with this setup, which from my point of view should not happen in DDP.

After wrapping the models in a SuperModel module they no longer go out of sync.
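One way to check for this kind of divergence yourself is to compare a cheap parameter fingerprint across ranks. A sketch using raw torch.distributed (the helper name is illustrative):

```python
import torch
import torch.distributed as dist

def check_sync(model, name="model"):
    # The sum of all parameters is a cheap fingerprint of the weights;
    # after an optimizer step under DDP it should match on every rank.
    local = sum(p.detach().float().sum() for p in model.parameters()).reshape(1)
    gathered = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    if dist.get_rank() == 0:
        print(name, [t.item() for t in gathered])
```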
TL;DR: don't do gradient accumulation with multiple models. Wrap them in a wrapper model and do the accelerator calls on that, moving the relevant forward logic inside the wrapper. Edit: creating an accelerator for each model as @LvanderGoten suggests could also work. Personally I prefer the wrapper model.
@eliphatfs would you please show your solution (wrapping the models together) in pseudo code?
Basically, if you have this in your main training loop:

```python
states = text_encoder(input_ids)
pred = unet(noisy_latents, states, timesteps)
loss = F.mse_loss(pred, targets)
# now loss.backward() will corrupt gradients if you are using accumulation on multi-gpu
```

Change it into:

```python
class SuperModel(nn.Module):
    def __init__(self, unet: UNet2DConditionModel, text_encoder: nn.Module) -> None:
        super().__init__()
        self.unet = unet
        self.text_encoder = text_encoder

    def forward(self, input_ids, noisy_latents, timesteps):
        states = self.text_encoder(input_ids)
        return self.unet(noisy_latents, states, timesteps)
```

When constructing models, construct a `SuperModel` and use it in the loop:

```python
pred = supermodel(input_ids, noisy_latents, timesteps)
loss = F.mse_loss(pred, targets)
```

You may also need to change the final saving:

```python
supermodel: SuperModel = accelerator.unwrap_model(supermodel)
supermodel.text_encoder.save_pretrained(os.path.join(args.output_dir, 'text_encoder'))
supermodel.unet.save_pretrained(os.path.join(args.output_dir, 'unet'))
```

I don't have a good idea yet of how to handle LoRA layers. It seems that LoRA layers on multiple modules cause more problems, since only …
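On the construction side, a sketch of how the wrapper might be wired up. The loading calls follow the usual diffusers/transformers pattern and `args.model_path` is illustrative; the key point is that only the wrapper goes through `prepare()`:

```python
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel

unet = UNet2DConditionModel.from_pretrained(args.model_path, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(args.model_path, subfolder="text_encoder")

supermodel = SuperModel(unet, text_encoder)
# DDP now sees a single model, so gradient sync covers both submodules.
supermodel, optimizer, dataloader = accelerator.prepare(supermodel, optimizer, dataloader)
```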
You mean create two accelerator objects and use nested accumulate in the training loop?

```python
with accel_1.accumulate(model1):
    with accel_2.accumulate(model2):
        training_step()
```
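Filling that idea out, an untested sketch of the two-accelerator variant (whether two `Accelerator` instances cooperate correctly is exactly the open question here; `compute_loss` is a hypothetical helper):

```python
from accelerate import Accelerator

accel_1 = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
accel_2 = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)

model1, optimizer1 = accel_1.prepare(model1, optimizer1)
model2, optimizer2 = accel_2.prepare(model2, optimizer2)

for batch in dataloader:
    with accel_1.accumulate(model1):
        with accel_2.accumulate(model2):
            loss = compute_loss(model1, model2, batch)  # hypothetical helper
            accel_1.backward(loss)
            optimizer1.step()
            optimizer2.step()
            optimizer1.zero_grad()
            optimizer2.zero_grad()
```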
Does it now support gradient accumulation for multiple models?

Can we use gradient accumulation for multiple models in distributed training?

Yes, just wrap them all in the accumulate function as shown in the PR linked earlier.
Don't forget to delete `with accelerator.accumulate(unet):` and the `gradient_accumulation_steps=args.gradient_accumulation_steps` argument to `Accelerator(...)` if you switch to the manual approach, guys.
To do gradient accumulation with accelerate we wrap the model in the `accelerator.accumulate` context. But what would be the right way to achieve this when multiple models are involved?

For example, when training latent diffusion models we have 3 separate models: a VAE, a text encoder, and a UNet, as you can see in this script. Of these, only the text_encoder is being trained (but the others could be trained as well).

The obvious way to do this would be to create a wrapper model, but I'm curious to know whether this can be achieved without the wrapper model.
cc @muellerzr