How to handle gradient accumulation with multiple models? #668
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Re-opening this issue again. For doing grad accum with multiple models, the idea would be:

```python
with accumulate(model1) as _, accumulate(model2) as _:
    training_step()
```
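For reference, a sketch of what that nested usage could look like in a full loop (later comments report that this can desynchronize under DDP). It assumes the models, optimizers, and dataloader have all gone through `accelerator.prepare`, and `compute_loss` is a hypothetical helper standing in for the forward passes:

```python
for batch in dataloader:
    with accelerator.accumulate(model1):
        with accelerator.accumulate(model2):
            loss = compute_loss(model1, model2, batch)  # hypothetical helper
            accelerator.backward(loss)
            # Prepared optimizers only actually step on accumulation boundaries,
            # so step()/zero_grad() are safe to call every iteration.
            optimizer1.step()
            optimizer2.step()
            optimizer1.zero_grad()
            optimizer2.zero_grad()
```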
Does this currently work? Or is that a feature request, meaning it currently wouldn't work, but would work in the future? Sorry, I got a bit confused by the feature request tag and your comment.

Very interested in this. I'm training two models at once and can only use batch sizes of less than 5 on my machine, so gradient accumulation would be great.

… save a little VRAM

I "solved" it by creating one `Accelerator` per model.
I think in this case writing the accumulation yourself may be more flexible:

```python
loss = loss / gradient_accumulation_steps
accelerator.backward(loss)
if (index + 1) % gradient_accumulation_steps == 0:
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```
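Extending that manual pattern to the two-model case from this issue might look like the sketch below (`compute_loss` is a hypothetical helper, and one scheduler per optimizer is assumed). Note that under DDP this syncs gradients on every backward rather than only at accumulation boundaries, so it trades some speed for simplicity:

```python
for index, batch in enumerate(dataloader):
    loss = compute_loss(model1, model2, batch) / gradient_accumulation_steps
    accelerator.backward(loss)
    if (index + 1) % gradient_accumulation_steps == 0:
        optimizer1.step()
        optimizer2.step()
        scheduler1.step()
        scheduler2.step()
        optimizer1.zero_grad()
        optimizer2.zero_grad()
```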
This is pretty simple actually, just use `with accelerator.accumulate(model1), accelerator.accumulate(model2):`. That is how `with` works: the following code runs inside both contexts, so you can simply combine them with a comma.
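Spelled out, the comma form is equivalent to nesting the two context managers. A sketch assuming a single optimizer over both models' parameters (`compute_loss` is a hypothetical helper):

```python
for batch in dataloader:
    with accelerator.accumulate(model1), accelerator.accumulate(model2):
        loss = compute_loss(model1, model2, batch)  # hypothetical helper
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```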
This apparently is not working. I printed AdamW statistics of the parameter groups from different models, and one of them goes out of sync between GPUs with this setup, which from my point of view should not happen in DDP.

After wrapping the models in a SuperModel module they no longer go out of sync.
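One way to check for this kind of divergence yourself is to compare a cheap parameter fingerprint across ranks. A sketch using raw torch.distributed (the helper name is illustrative):

```python
import torch
import torch.distributed as dist

def check_sync(model, name="model"):
    # The sum of all parameters is a cheap fingerprint of the weights;
    # after an optimizer step under DDP it should match on every rank.
    local = sum(p.detach().float().sum() for p in model.parameters()).reshape(1)
    gathered = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    if dist.get_rank() == 0:
        print(name, [t.item() for t in gathered])
```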
TL;DR: don't do gradient accumulation with multiple models. Wrap them in a wrapper model and do the accelerator calls on that, moving the relevant forward logic inside the wrapper. Edit: creating an accelerator for each model as @LvanderGoten suggests could also work. Personally I prefer the wrapper model.
@eliphatfs would you please show your solution (wrapping the models together) in pseudo code?
Basically, if you have this in your main training loop:

```python
states = text_encoder(input_ids)
pred = unet(noisy_latents, states, timesteps)
loss = F.mse_loss(pred, targets)
# now loss.backward() will corrupt gradients if you are using accumulation on multi-gpu
```

Change it into:

```python
class SuperModel(nn.Module):
    def __init__(self, unet: UNet2DConditionModel, text_encoder: nn.Module) -> None:
        super().__init__()
        self.unet = unet
        self.text_encoder = text_encoder

    def forward(self, input_ids, noisy_latents, timesteps):
        states = self.text_encoder(input_ids)
        return self.unet(noisy_latents, states, timesteps)
```

When constructing models, construct a `SuperModel` and use it in the loop:

```python
pred = supermodel(input_ids, noisy_latents, timesteps)
loss = F.mse_loss(pred, targets)
```

You may also need to change the final saving:

```python
supermodel: SuperModel = accelerator.unwrap_model(supermodel)
supermodel.text_encoder.save_pretrained(os.path.join(args.output_dir, 'text_encoder'))
supermodel.unet.save_pretrained(os.path.join(args.output_dir, 'unet'))
```

I don't have a good idea yet of how to handle LoRA layers. It seems that LoRA layers on multiple modules cause more problems, since only …
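On the construction side, a sketch of how the wrapper might be wired up. The loading calls follow the usual diffusers/transformers pattern and `args.model_path` is illustrative; the key point is that only the wrapper goes through `prepare()`:

```python
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel

unet = UNet2DConditionModel.from_pretrained(args.model_path, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(args.model_path, subfolder="text_encoder")

supermodel = SuperModel(unet, text_encoder)
# DDP now sees a single model, so gradient sync covers both submodules.
supermodel, optimizer, dataloader = accelerator.prepare(supermodel, optimizer, dataloader)
```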
You mean create two accelerator objects and use nested accumulate in the training loop?

```python
with accel_1.accumulate(model1):
    with accel_2.accumulate(model2):
        training_step()
```
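Filling that idea out, an untested sketch of the two-accelerator variant (whether two `Accelerator` instances cooperate correctly is exactly the open question here; `compute_loss` is a hypothetical helper):

```python
from accelerate import Accelerator

accel_1 = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
accel_2 = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)

model1, optimizer1 = accel_1.prepare(model1, optimizer1)
model2, optimizer2 = accel_2.prepare(model2, optimizer2)

for batch in dataloader:
    with accel_1.accumulate(model1):
        with accel_2.accumulate(model2):
            loss = compute_loss(model1, model2, batch)  # hypothetical helper
            accel_1.backward(loss)
            optimizer1.step()
            optimizer2.step()
            optimizer1.zero_grad()
            optimizer2.zero_grad()
```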
Does it now support gradient accumulation for multiple models?

Can we use gradient accumulation for multiple models in distributed training?

Yes, just wrap them all in the accumulate function as shown in the PR linked earlier.
Don't forget to delete `with accelerator.accumulate(unet):` and the `gradient_accumulation_steps=args.gradient_accumulation_steps` argument to `Accelerator(...)` if you switch to the manual approach, guys.
To do gradient accumulation with accelerate we wrap the model in the `accelerator.accumulate` context. But what would be the right way to achieve this when multiple models are involved?

For example, when training latent diffusion models we have 3 separate models: a VAE, a text encoder, and a UNet, as you can see in this script. Of these, only the text_encoder is being trained (but the others could be trained as well).

The obvious way to do this would be to create a wrapper model, but I'm curious to know whether this can be achieved without the wrapper model.
cc @muellerzr