
How to handle gradient accumulation with multiple models? #668

Closed
patil-suraj opened this issue Aug 31, 2022 · 17 comments

@patil-suraj

To do gradient accumulation with Accelerate, we wrap the model in the accelerator.accumulate context. But what would be the right way to achieve this when multiple models are involved?

For example, when training latent diffusion models we have three separate models: a VAE, a text encoder, and a UNet, as you can see in this script. Of these, only the text_encoder is being trained (though the others could be trained as well).

The obvious way to do this would be to create a wrapper model, but I'm curious to know whether this can be achieved without one.

cc @muellerzr

@muellerzr muellerzr self-assigned this Aug 31, 2022
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Oct 9, 2022
@patil-suraj patil-suraj reopened this Oct 17, 2022
@patil-suraj
Author

Re-opening this issue. For doing grad accum with accelerator.accumulate with two models (both being trained), can we use two context managers, like this:

with accelerator.accumulate(model1), accelerator.accumulate(model2):
    training_step()

@muellerzr muellerzr added the "feature request" label Oct 17, 2022
@Lime-Cakes

Re-opening this issue. For doing grad accum with accelerator.accumulate with two models (both being trained), can we use two context managers, like this:

with accelerator.accumulate(model1), accelerator.accumulate(model2):
    training_step()

Does this currently work? Or is it a feature request, meaning it currently wouldn't work but would in the future? Sorry, I got a bit confused by the feature request tag and your comment.

@pfeatherstone

Very interested in this. I'm training two models at once and can only use batch sizes of less than 5 on my machine, so gradient accumulation would be great.

@LvanderGoten

I "solved" it by creating one Accelerator per model. If you use only one and register the models via accelerator.prepare(model1, ..., modelN) at least one of the models is not learning anything. This might be a bug.

@meng-wenlong

I think writing the accumulation yourself may be more flexible in this case. Accelerator.accumulate() is not necessary; just write code like:

loss = loss / gradient_accumulation_steps  # scale so accumulated gradients average out
accelerator.backward(loss)
if (index + 1) % gradient_accumulation_steps == 0:
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
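
For context, a sketch of the same pattern inside a full loop (dataloader, model, and compute_loss are placeholder names, not from the comment above):

optimizer.zero_grad()
for index, batch in enumerate(dataloader):
    loss = compute_loss(model, batch)  # placeholder forward pass and loss
    loss = loss / gradient_accumulation_steps
    accelerator.backward(loss)
    if (index + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()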

@PrinceRay7

This is pretty simple actually: just use "with accelerator.accumulate(model1), accelerator.accumulate(model2):". That is how "with" works: the code that follows runs inside both contexts, so you can simply combine them with a comma.

@eliphatfs

This is pretty simple actually: just use "with accelerator.accumulate(model1), accelerator.accumulate(model2):". That is how "with" works: the code that follows runs inside both contexts, so you can simply combine them with a comma.

This apparently does not work. I printed AdamW statistics of the parameter groups from the different models, and one of them goes out of sync between GPUs with this setup, which from my point of view should not happen in DDP.

Quoting from the PyTorch forums:

That said, if a single call to backward involves gradient accumulation for more than 1 DDP wrapped module, then you’ll have to use a different process group for each of them to avoid interference.
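
For reference, a sketch of what giving each module its own process group could look like in raw PyTorch (torch.distributed.new_group and DDP's process_group argument are real APIs; the surrounding setup is illustrative):

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
ranks = list(range(dist.get_world_size()))

# A dedicated process group per module keeps their gradient all-reduces
# from interfering with each other during accumulation.
pg_unet = dist.new_group(ranks=ranks)
pg_text = dist.new_group(ranks=ranks)

unet = DDP(unet, process_group=pg_unet)
text_encoder = DDP(text_encoder, process_group=pg_text)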

Format:

Sync [GPU ID]
Min Max Mean of parameter group exp_sq_avg.sqrt()
...

[screenshot: with the dual accumulate contexts, one model's optimizer statistics diverge across GPUs]

After wrapping the models in a SuperModel module, they no longer go out of sync.

[screenshot: after the SuperModel wrapper, the statistics stay in sync across GPUs]

@eliphatfs

eliphatfs commented Jun 2, 2023

TL;DR: don't do gradient accumulation with multiple separately prepared models. Wrap them in a wrapper model, do the accelerator setup on that, and move the relevant forward logic inside the wrapper.

Edit: creating an accelerator for each model, as @LvanderGoten suggests, could also work. Personally I prefer the wrapper model.

@mahdip72

mahdip72 commented Jun 29, 2023

@eliphatfs would you please show your solution (wrapping the models together) in pseudocode?
I am working on training ControlNet + SD modules together.

@eliphatfs

eliphatfs commented Jun 30, 2023

Basically, if you have this in your main training loop:

states = text_encoder(input_ids)
pred = unet(noisy_latents, states, timesteps)
loss = F.mse_loss(pred, targets)
# now loss.backward() will corrupt gradients if you are using accumulation on multi-GPU

Change it into:

class SuperModel(nn.Module):
    def __init__(self, unet: UNet2DConditionModel, text_encoder: nn.Module) -> None:
        super().__init__()
        self.unet = unet
        self.text_encoder = text_encoder

    def forward(self, input_ids, noisy_latents, timesteps):
        # Run both sub-modules inside a single forward so DDP sees one graph.
        states = self.text_encoder(input_ids)
        return self.unet(noisy_latents, states, timesteps)

When constructing the models, build a SuperModel once you are done setting up the individual modules. When calling accelerator.prepare, only pass the SuperModel; the same goes for the optimizer and gradient clipping (though these may be less important).
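
For instance, a sketch of that setup (args, dataloader, and the AdamW settings are assumptions, not from the original comment):

import torch
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
supermodel = SuperModel(unet, text_encoder)
optimizer = torch.optim.AdamW(supermodel.parameters(), lr=args.learning_rate)

# Only the wrapper goes through prepare, so Accelerate/DDP sees a single model.
supermodel, optimizer, dataloader = accelerator.prepare(supermodel, optimizer, dataloader)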
Then, in the main loop, replace the two lines with a single call to the SuperModel's forward:

pred = supermodel(input_ids, noisy_latents, timesteps)
loss = F.mse_loss(pred, targets)

You may also need to change the final saving:

supermodel: SuperModel = accelerator.unwrap_model(supermodel)
supermodel.text_encoder.save_pretrained(os.path.join(args.output_dir, 'text_encoder'))
supermodel.unet.save_pretrained(os.path.join(args.output_dir, 'unet'))

I don't yet have a good idea of how to handle LoRA layers. It seems that LoRA layers on multiple modules cause more problems, since only the AttnProcLayers get prepare-d.

@mahdip72

I "solved" it by creating one Accelerator per model. If you use only one and register the models via accelerator.prepare(model1, ..., modelN) at least one of the models is not learning anything. This might be a bug.

You mean create two accelerator objects and use nested accumulate contexts in the training loop?

with accel_1.accumulate(model1):
    with accel_2.accumulate(model2):
        training_step()

@cfeng16

cfeng16 commented Aug 20, 2023

Does it now support gradient accumulation for multiple models?

@hkunzhe
Contributor

hkunzhe commented Aug 23, 2023

Does it now support gradient accumulation for multiple models?

I think #1708 should fix it, according to this comment.

@muellerzr muellerzr removed the "feature request" label Oct 5, 2023
@cfeng16

cfeng16 commented Oct 29, 2023

Can we use gradient accumulation for multiple models in distributed training?

@muellerzr
Collaborator

muellerzr commented Oct 29, 2023

Yes, just wrap them all in the accumulate call, as shown in the PR linked earlier.
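
A sketch of that usage, assuming the multi-model accumulate signature added by the PR (compute_loss is a hypothetical helper):

from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
model1, model2, optimizer, dataloader = accelerator.prepare(model1, model2, optimizer, dataloader)

for batch in dataloader:
    # Pass every model being trained so sync is handled for all of them.
    with accelerator.accumulate(model1, model2):
        loss = compute_loss(model1, model2, batch)  # hypothetical helper
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()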

@Chao0511

Chao0511 commented Jan 4, 2024

I think writing the accumulation yourself may be more flexible in this case. Accelerator.accumulate() is not necessary; just write code like:

loss = loss / gradient_accumulation_steps  # scale so accumulated gradients average out
accelerator.backward(loss)
if (index + 1) % gradient_accumulation_steps == 0:
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

Don't forget to delete the with accelerator.accumulate(unet): wrapper and the gradient_accumulation_steps=args.gradient_accumulation_steps argument when constructing the Accelerator, guys.
