
[BUG] train_text_to_image_lora.py not support Multi-nodes or Multi-gpus training. #4046

Closed
WindVChen opened this issue Jul 11, 2023 · 31 comments
Labels
stale Issues that haven't received updates

Comments

@WindVChen

In train_text_to_image_lora.py, I notice that the LoRA parameters are extracted into an AttnProcsLayers class:

518    lora_layers = AttnProcsLayers(unet.attn_processors)

And only lora_layers is wrapped by DistributedDataParallel in the following code:

670    lora_layers, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
            lora_layers, optimizer, train_dataloader, lr_scheduler
          )

In the training loop, however, lora_layers is never called explicitly; only the unet is used:

776    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample

My question: when using multiple GPUs or multiple machines, will the gradients actually be averaged across all processes this way?

It is true that in each process the gradients are backpropagated into unet.attn_processors, and those gradients are shared with lora_layers, so the optimizer can update the weights. However, since the forward pass actually goes through unet.attn_processors rather than the wrapped lora_layers, can the gradients be correctly averaged? From here, it seems that a wrapped module has a different forward than the original module's forward.

I am not very familiar with the torch.nn.parallel.DistributedDataParallel wrapper, and I worry that the current code in train_text_to_image_lora.py will lead to different LoRA weights in different processes (if the gradients fail to be synchronized across processes).

Hope to find some help here, thank you.
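For reference, one way to check this is to compare a LoRA parameter across processes after a few optimizer steps. A minimal sketch (assuming the accelerator and lora_layers objects from the script, launched with more than one process):

import torch

# Hypothetical check: gather one LoRA parameter from every process and compare.
# If DDP were averaging gradients, all copies would stay identical.
param = next(accelerator.unwrap_model(lora_layers).parameters()).detach()
gathered = accelerator.gather(param.flatten()[None, :])  # shape: (num_processes, numel)
if accelerator.is_main_process:
    print("LoRA weights identical across ranks:",
          torch.allclose(gathered[0], gathered[-1]))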

@WindVChen WindVChen changed the title Question about DistributedDataParallel in train_text_to_image_lora.py. [BUG!] train_text_to_image_lora.py not support Multi-nodes or Multi-gpus training. Jul 12, 2023
@WindVChen WindVChen changed the title [BUG!] train_text_to_image_lora.py not support Multi-nodes or Multi-gpus training. [BUG] train_text_to_image_lora.py not support Multi-nodes or Multi-gpus training. Jul 12, 2023
@WindVChen
Author

After carefully printing out the gradients and weights in the different processes, it seems clear that the current LoRA training script does not work for multi-node or multi-GPU training: the gradients are not averaged across processes, which then leads to different LoRA weights in different processes.

Below is a printout from a 2-GPU run:

[screenshot: per-process printout of process/epoch/step, LoRA gradient, and LoRA weight]

The first three numbers denote process/epoch/step, and the two tensors that follow are the gradient and the weight of lora_layers.layers[31].to_out_lora.up (a linear layer). The gradients are obviously different across processes. Thus, after three steps, it turns into:

[screenshot: the LoRA weights now also differ between the two processes]

Obviously, the LoRA weights are different in different processes.

I think this is a bug, since the script does not document this limitation. People who use the script for multi-process training will effectively end up with single-process results.
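(The kind of printout above can be produced with a small debugging snippet along these lines; a hypothetical reconstruction, with epoch and step taken from the training loop.)

# Hypothetical debugging snippet, placed right after the backward call:
# print the gradient and weight of one LoRA linear layer on every process.
layer = accelerator.unwrap_model(lora_layers).layers[31].to_out_lora.up
print(accelerator.process_index, epoch, step,
      layer.weight.grad.flatten()[:5],   # differs across processes -> no averaging
      layer.weight.data.flatten()[:5])   # weights drift apart after a few steps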

@patrickvonplaten
Contributor

cc @williamberman @sayakpaul

@sayakpaul
Member

@WindVChen

Thanks for elaborating on this!

From a brief look, it seems like passing unet to the prepare step rather than just the LoRA layers might fix this.

Could you confirm this once?

@WindVChen
Author

Hi @sayakpaul ,

Yes. Passing unet to the prepare step quickly fixes it. However, it also brings some inconvenience: 1) the batch size has to be halved, or training runs out of memory; 2) the intermediate checkpoints are much larger (~3 GB compared to the original ~3 MB), since the whole unet structure is stored.

I wonder if there is a more elegant way to solve it without sacrificing memory and storage. I also tried wrapping each "LoRAAttnProcessor" manually, but found it required a lot of source code modification, so I gave up 😢.
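On the checkpoint-size part, it should still be possible to save only the LoRA weights even when the whole unet is passed to prepare, by unwrapping it before saving. A rough sketch (assuming the save_attn_procs method from the diffusers loaders mixin on the UNet, and the args/accelerator objects from the script):

# Save only the attention-processor (LoRA) weights instead of the full UNet.
if accelerator.is_main_process:
    unwrapped_unet = accelerator.unwrap_model(unet)
    unwrapped_unet.save_attn_procs(args.output_dir)  # small LoRA-only checkpoint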

@sayakpaul
Member

Thanks for sharing!

I also tried wrapping each "LoRAAttnProcessor" manually, but found it required a lot of source code modification, so I gave up

Could you expand a bit more on this?

@WindVChen
Author

Yes.

It seems the problem above happens because the training script uses the unwrapped unet for the forward pass, not the wrapped lora_layers. So my earlier idea was to wrap every LoRAAttnProcessor in the unet one by one (directly via accelerator.prepare) and then replace the original unwrapped processors in the unet with the wrapped ones. That way, perhaps the gradients would be synchronized across processes during training.

However, since I'm not quite familiar with torch.nn.parallel.DistributedDataParallel, I'm not sure whether this solution can work.

@sayakpaul
Member

Hmm. I will defer to what @williamberman has to point out here. Also cc'ing @muellerzr from the accelerate team.

@muellerzr
Contributor

When using Accelerate, any model that gets gradient/weight updates should be passed to .prepare. Do note that gradient accumulation currently won't work with multiple models; we're adding that very soon, but I hope that helps.

@WindVChen
Author

Hi @muellerzr ,

Could you expand a bit more on "gradient accumulation currently won't work with multiple models"? Are there any blogs or issues related to this? I am planning to use gradient accumulation with multiple models. 😅

@muellerzr
Contributor

No one has actually raised this "issue" yet. Once the PR is opened (probably today or tomorrow) I'll link to it, but basically multiple forward passes followed by multiple backward passes (one per model's loss, for instance) lead to some headaches in torch distributed.

@WindVChen
Author

OK, thanks.

Just to be sure: does that mean a GAN training script like the one below will cause bugs?

"Suppose gradient_accumulation is set to 2"
optimizer_gen = optim(generator.parameters())
optimizer_disc = optim(discriminator.parameters())
with accelerator.accumulate(generator):
        outputs = optimizer_gen(input)
        loss = loss_func(outputs)
        loss.backward()
        optimizer_gen.step()
        optimizer_gen.zero_grad()

        outputs = optimizer_disc(input)
        loss = loss_func(outputs)
        loss.backward()
        optimizer_disc.step()
        optimizer_disc.zero_grad()

@muellerzr
Contributor

Yes, because technically the discriminator should also be under accumulate; as written, the discriminator might still be updated every step. (Though again, we're working on support for that.)

@WindVChen
Author

Ah, I see. Maybe I can double-wrap the pipeline as a temporary fix before the PR lands? Like this:

with accelerator.accumulate(generator):
    with accelerator.accumulate(discriminator):
        outputs = generator(input)
        ...

@muellerzr
Contributor

Double wrapping also does not work, hence the need for a more complex solution. (see discussion here: huggingface/accelerate#1708). So just wait a day or two :)

@WindVChen
Author

OK, thanks a lot. Look forward to the solution. 😊

@eliphatfs
Contributor

I have a solution that is not elegant but works: wrapping everything in another network.

class SuperNet(torch.nn.ModuleDict):
    def forward(self, text_encoder, unet, batch, class_labels, noisy_model_input, timesteps):
        # Get the text embedding for conditioning
        encoder_hidden_states = encode_prompt(
            text_encoder,
            batch["input_ids"],
            None
        )
        # Predict the noise residual
        return unet(
            noisy_model_input, timesteps, encoder_hidden_states, class_labels=class_labels
        ).sample

Pass only this module to accelerate, and in the training loop call this module instead of the original code.

This originated as a solution for training the text encoder and unet simultaneously for DreamBooth. You can look here: huggingface/accelerate#668 (comment)
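The wiring would look roughly like the sketch below. The key point is that the trainable modules are registered inside the wrapper, only the wrapper goes through prepare, and only the wrapper is called in the loop (names are illustrative):

# Hypothetical usage of the SuperNet above.
super_net = SuperNet({"text_encoder": text_encoder, "unet": unet})
super_net, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    super_net, optimizer, train_dataloader, lr_scheduler
)

# In the training loop, call the DDP-wrapped super_net, not unet directly:
model_pred = super_net(text_encoder, unet, batch, class_labels,
                       noisy_model_input, timesteps)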

@WindVChen
Author

Hi @eliphatfs ,

Thanks for sharing. Just to confirm: the code given solves the gradient accumulation problem for multiple models under DDP, not the problem of running the LoRA code on multiple GPUs/nodes, right? (There are two problems mentioned in this issue 😂)

@eliphatfs
Contributor

eliphatfs commented Jul 13, 2023


You can do similar things for LoRA. This is from one of my custom pipelines, but it can be quickly adapted to any. Note that the base class is AttnProcsLayers, so you can use it in place of the original AttnProcsLayers and it still supports the inference loaders.

class SuperNet(AttnProcsLayers):
    def forward(self, image_encoder, unet, image, noisy_model_input, timesteps):
        encoded = image_encoder(image)
        return unet(
            noisy_model_input,
            timesteps,
            encoder_hidden_states=encoded.last_hidden_state,
            class_labels=noise_image_embeddings(encoded.image_embeds, 0)
        ).sample

I think the key is to make sure that:

  1. No mutable (trainable) parameters are accessed outside the DDP wrapper's forward, and all of them are parameters of the wrapped module.
  2. Only one DDP wrapper exists per process group (or accelerator instance).
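For the LoRA script in this issue, a minimal sketch of how such a wrapper could be wired in (illustrative; the forward signature here follows my custom pipeline above):

# Hypothetical wiring: the wrapper is built from unet.attn_processors, so the
# LoRA parameters are registered inside the single DDP-wrapped module.
super_net = SuperNet(unet.attn_processors)
super_net, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    super_net, optimizer, train_dataloader, lr_scheduler
)

# Training loop: call super_net so DDP's gradient synchronization actually runs.
model_pred = super_net(image_encoder, unet, image, noisy_model_input, timesteps)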

@WindVChen
Author

Thanks for the further description.

But as I understand it, this may still not solve the problem listed here? Any thoughts on that problem, or am I missing something?

@eliphatfs
Contributor

eliphatfs commented Jul 14, 2023

You are not passing the complete unet into accelerate; it is passed as an argument at forward time, so its parameters will not be stored in the checkpoint. You do have to make sure all optimized parameters are registered in the SuperNet, of course.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Aug 11, 2023
@eliphatfs
Contributor

Can anyone verify whether this has been fixed in accelerate?

@thuanz123
Contributor

thuanz123 commented Aug 16, 2023

Hi, so the quick fix for this is to include the unet as well as the lora_layers in the prepare step, right?

Edit: I mean the multi-node/multi-GPU training support for LoRA, not the gradient accumulation for multiple models.

@sayakpaul
Member

Cc: @muellerzr

@hkunzhe

hkunzhe commented Aug 21, 2023

Any updates?

@muellerzr
Contributor

This should be fixed by passing multiple models with accelerator.accumulate, yes @hkunzhe
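With that support in place, the earlier GAN example would look roughly like the sketch below (a sketch on a recent accelerate version; generator, discriminator, and the optimizers are the illustrative names from above):

# Both models are passed to accumulate, so neither steps early during accumulation.
with accelerator.accumulate(generator, discriminator):
    gen_loss = loss_func(generator(input))
    accelerator.backward(gen_loss)
    optimizer_gen.step()
    optimizer_gen.zero_grad()

    disc_loss = loss_func(discriminator(input))
    accelerator.backward(disc_loss)
    optimizer_disc.step()
    optimizer_disc.zero_grad()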

@hkunzhe

hkunzhe commented Aug 21, 2023

This should be fixed by passing multiple models with accelerator.accumulate, yes @hkunzhe

I got it!

@williamberman
Contributor

Hi, yes @WindVChen, this is correct. The issue is that accelerate's prepare works by wrapping the passed class in DDP, and then we're supposed to call the returned DDP class. Similarly, accelerate's mixed precision works by monkey-patching the forward method of the passed-in class.

Any script where we use the AttnProcsLayers class will not work properly with accelerate, because that class just holds the given parameters; it isn't actually used as part of the model.

I fixed this for the dreambooth lora script here: #3778

We should really remove the AttnProcsLayers class and always pass the top-level model to accelerator.prepare. I'm going to open an issue to document this better, but unfortunately I can't get to it right away, as these cross-training-script refactors are relatively involved.
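The direction of that fix, roughly sketched (not the exact diff from #3778): prepare the unet itself and give the optimizer only the trainable LoRA parameters, with the other names taken from the training script.

import torch

# Sketch: the model that actually runs the forward pass is the one DDP wraps.
lora_parameters = [p for p in unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(lora_parameters, lr=args.learning_rate)

unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    unet, optimizer, train_dataloader, lr_scheduler
)

# The training loop now goes through the DDP-wrapped unet, so gradients are synced.
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample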

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ai1361720220000

This should be fixed by passing multiple models with accelerator.accumulate, yes @hkunzhe

Hi @muellerzr ,
What should be done when it comes to accelerator.prepare()? Should all the models be passed into prepare at once, or is passing only one of them at a time okay? That is, accelerator.prepare(model1, model2, optimizer, train_dataloader, lr_scheduler) versus accelerator.prepare(model1, optimizer, train_dataloader, lr_scheduler) and accelerator.prepare(model2, optimizer, train_dataloader, lr_scheduler)?

@muellerzr
Contributor

All of them should be put into prepare, specifically all the ones that expect to have their gradients updated. Those same ones should then also be passed to accumulate.

You can send both into prepare at the same time.
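In code, roughly (illustrative names):

# One prepare call with both models, then both under accumulate.
model1, model2, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model1, model2, optimizer, train_dataloader, lr_scheduler
)

with accelerator.accumulate(model1, model2):
    ...  # forward/backward/step for each model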
