
train_dreambooth_lora_sdxl.py cannot resume training from checkpoint!! model frozen!! #5840

Closed
yuxu915 opened this issue Nov 17, 2023 · 11 comments
Labels
bug (Something isn't working) · stale (Issues that haven't received updates)

Comments

@yuxu915

yuxu915 commented Nov 17, 2023

Describe the bug

When resuming training from an intermediate LoRA checkpoint, the script stops updating the model (i.e., subsequent checkpoints remain identical to the checkpoint that was resumed from).
To reproduce the bug, just turn on the --resume_from_checkpoint flag.
All experimental settings are based on the default configurations, using the latest version of the Diffusers library.
Thanks for the help.
@patrickvonplaten @sayakpaul @yiyixuxu @DN6
Maybe related to https://github.com/huggingface/diffusers/issues/5004
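For context, a simplified sketch of how --resume_from_checkpoint is typically handled in the diffusers example scripts (an illustration, not the exact code of train_dreambooth_lora_sdxl.py): the script looks for the newest checkpoint-* directory in the output dir, restores the full training state with accelerator.load_state, and continues from the recovered global step.

# Simplified illustration of the resume logic in the diffusers example scripts
# (not the exact code from train_dreambooth_lora_sdxl.py).
import os

def resume_global_step(accelerator, output_dir, resume_from_checkpoint):
    if resume_from_checkpoint == "latest":
        dirs = [d for d in os.listdir(output_dir) if d.startswith("checkpoint")]
        dirs = sorted(dirs, key=lambda d: int(d.split("-")[1]))
        path = dirs[-1] if dirs else None
    else:
        path = os.path.basename(resume_from_checkpoint)

    if path is None:
        return 0  # no checkpoint found, start a fresh run

    # Restores model, optimizer, LR scheduler and RNG state saved via accelerator.save_state
    accelerator.load_state(os.path.join(output_dir, path))
    return int(path.split("-")[1])  # global step to continue training from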

Reproduction

https://colab.research.google.com/drive/17zNvqJZ8ChJaYZr6XIfsJBduKtb5FbOT#scrollTo=N14_vgURsNMY

Logs

No response

System Info

  • diffusers version: 0.24.0.dev0
  • Platform: Linux-3.10.0-1160.76.1.el7.x86_64-x86_64-with-glibc2.10
  • Python version: 3.8.5
  • PyTorch version (GPU?): 2.0.0+cu117 (True)
  • Huggingface_hub version: 0.16.4
  • Transformers version: 4.33.0
  • Accelerate version: 0.20.3
  • xFormers version: 0.0.18
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

@yuxu915 yuxu915 added the bug Something isn't working label Nov 17, 2023
@sayakpaul
Member

This PR should resolve these issues: #5388. Could you please check that?

@yuxu915
Author

yuxu915 commented Nov 17, 2023

hi, thanks for your kind response, @sayakpaul. However, I tried the PR in https://github.com/huggingface/diffusers/pull/5388, and the results do not seem as satisfactory as with the main branch (the output does not look like the training dog at all, even after 1500 training steps). All training settings are based on the default configurations in https://github.com/younesbelkada/diffusers/blob/b21064f68ffad648455da116ba4b6bb669d1a223/examples/dreambooth/README_sdxl.md?plain=1#L79.
It would be really nice if you could help debug this on the main branch, thanks. 😊

@sayakpaul
Member

Cc: @younesbelkada for the configs he tried.

@younesbelkada
Contributor

younesbelkada commented Nov 17, 2023

@yuxu915 do you by any chance use --use-gradient-checkpointing? Can you share the full command so I can try to repro?

@yuxu915
Author

yuxu915 commented Nov 17, 2023

hi, @younesbelkada, thanks for helping, but I cannot find --use-gradient-checkpointing in train_dreambooth_lora_sdxl.py 🤔️. I trained on https://github.com/younesbelkada/diffusers. Am I still using the wrong repo?
My training command is as follows:

export MODEL_NAME="stable-diffusion-xl-base-1.0/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="datasets/image_instance/dog_1"
export OUTPUT_DIR="lora-trained-xl"
export VAE_PATH="model/stable-diffusion-xl-base-1.0/sdxl-vae-fp16-fix"

accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --pretrained_vae_model_name_or_path=$VAE_PATH \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=256 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-5 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=500 \
  --checkpointing_steps=100 \
  --seed="0" \
  --mixed_precision="fp16" 

@yuxu915
Author

yuxu915 commented Nov 19, 2023

hi, @younesbelkada, do you mean --gradient_checkpointing? I turned it on, but it seems to give the same results as before.

@younesbelkada
Contributor

Hi @yuxu915,
I had a look at the training scripts in detail. It appears that in PEFT we initialize the LoRA layers differently than in diffusers.
In PEFT we use kaiming with a=sqrt(5): https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py#L158, whereas in diffusers we use torch.nn.init.normal with std = 1 / rank: https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/lora.py#L223.
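For illustration, here is a minimal sketch of the two initialization schemes (the layer shapes are made up; see the linked PEFT and diffusers sources for the actual code):

# Minimal sketch of the two LoRA init schemes discussed above
# (shapes are illustrative, not taken from either library).
import math
import torch.nn as nn

rank, in_features, out_features = 4, 320, 320
lora_down = nn.Linear(in_features, rank, bias=False)   # "lora_A" in PEFT terms
lora_up = nn.Linear(rank, out_features, bias=False)    # "lora_B" in PEFT terms

# PEFT-style init: kaiming uniform with a=sqrt(5) on the down projection, zeros on the up
nn.init.kaiming_uniform_(lora_down.weight, a=math.sqrt(5))
nn.init.zeros_(lora_up.weight)

# diffusers-style init: normal with std = 1 / rank on the down projection, zeros on the up
nn.init.normal_(lora_down.weight, std=1 / rank)
nn.init.zeros_(lora_up.weight)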
If one uses the default hyper-parameters, the model indeed struggles to converge after 500 steps; I managed to get a nice convergence by using a higher LR (2e-4) and a cosine LR scheduler. Below is the full command that I used:

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="lora-trained-xl"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"
export CUDA_VISIBLE_DEVICES="2"

accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --pretrained_vae_model_name_or_path=$VAE_PATH \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=2 \
  --gradient_accumulation_steps=4 \
  --learning_rate=2e-4 \
  --report_to="wandb" \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub 

And the images that I get after ~150 steps:

[screenshot: generated samples after ~150 steps]

@younesbelkada
Contributor

Results after ~470 steps:

[screenshot: generated samples after ~470 steps]

@younesbelkada
Contributor

Also confirmed that it works even when using gradient_checkpointing with the same config:

[screenshot: generated samples with gradient_checkpointing enabled]

@yuxu915
Author

yuxu915 commented Nov 21, 2023

hi, @younesbelkada, thanks for your kind response. I tried your training command and got results like:

[image: generated samples]

The results do not appear entirely similar to the images in the training set; I will try more hyperparameter combinations to get better results. Another problem: I tried to save intermediate LoRAs during training by setting --checkpointing_steps to 25. However, at inference time I load each LoRA in turn and generate images, and these images are different from the ones generated during validation (see wandb) and are not similar to the images in the training set. The inference script is:

from huggingface_hub.repocard import RepoCard
from diffusers import DiffusionPipeline
import torch

base_model_id = '/model/stable-diffusion-xl-base-1.0/stable-diffusion-xl-base-1.0'
pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16)

for step in range(25, 501, 25):
    pipe.load_lora_weights(f"/diffusers/examples/dreambooth/lora-trained-xl/checkpoint-{step}")
    image = pipe("A picture of a sks dog in a bucket", num_inference_steps=25).images[0]
    image.save(f"sks_dog_{step}.png")
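One thing that may be worth double-checking in that loop (a sketch, assuming the unload_lora_weights helper is available in the installed diffusers version): unload the previous adapter before loading the next checkpoint so the weights of successive checkpoints do not stack, and move the fp16 pipeline to the GPU as in the validation runs:

# Sketch of the same loop with explicit adapter unloading and GPU placement
# (assumes unload_lora_weights exists in the installed diffusers version).
from diffusers import DiffusionPipeline
import torch

base_model_id = '/model/stable-diffusion-xl-base-1.0/stable-diffusion-xl-base-1.0'
pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16)
pipe.to("cuda")  # fp16 weights are intended for GPU inference

for step in range(25, 501, 25):
    pipe.unload_lora_weights()  # drop the adapter loaded in the previous iteration
    pipe.load_lora_weights(f"/diffusers/examples/dreambooth/lora-trained-xl/checkpoint-{step}")
    image = pipe("A picture of a sks dog in a bucket", num_inference_steps=25).images[0]
    image.save(f"sks_dog_{step}.png")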


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Dec 26, 2023
@github-actions github-actions bot closed this as completed Jan 3, 2024