
train_dreambooth_lora_sdxl.py cannot resume training from checkpoint!! model frozen!! #5840

Closed
yuxu915 opened this issue Nov 17, 2023 · 11 comments
Labels
bug (Something isn't working) · stale (Issues that haven't received updates)

Comments

@yuxu915

yuxu915 commented Nov 17, 2023

Describe the bug

When resuming training from an intermediate LoRA checkpoint, the script stops updating the model (i.e., subsequent checkpoints remain identical to the checkpoint that was resumed from).
To reproduce the bug, just turn on the --resume_from_checkpoint flag.
All experimental settings are based on the default configurations, using the latest version of the Diffusers library.
Thanks for the help.
@patrickvonplaten @sayakpaul @yiyixuxu @DN6
Maybe related to https://github.com/huggingface/diffusers/issues/5004
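For context, a simplified sketch of how --resume_from_checkpoint is typically handled in the diffusers example scripts (an illustration, not the exact code of train_dreambooth_lora_sdxl.py): the script looks for the newest checkpoint-* directory in the output dir, restores the full training state with accelerator.load_state, and continues from the recovered global step.

# Simplified illustration of the resume logic in the diffusers example scripts
# (not the exact code from train_dreambooth_lora_sdxl.py).
import os

def resume_global_step(accelerator, output_dir, resume_from_checkpoint):
    if resume_from_checkpoint == "latest":
        dirs = [d for d in os.listdir(output_dir) if d.startswith("checkpoint")]
        dirs = sorted(dirs, key=lambda d: int(d.split("-")[1]))
        path = dirs[-1] if dirs else None
    else:
        path = os.path.basename(resume_from_checkpoint)

    if path is None:
        return 0  # no checkpoint found, start a fresh run

    # Restores model, optimizer, LR scheduler and RNG state saved via accelerator.save_state
    accelerator.load_state(os.path.join(output_dir, path))
    return int(path.split("-")[1])  # global step to continue training from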

Reproduction

https://colab.research.google.com/drive/17zNvqJZ8ChJaYZr6XIfsJBduKtb5FbOT#scrollTo=N14_vgURsNMY

Logs

No response

System Info

  • diffusers version: 0.24.0.dev0
  • Platform: Linux-3.10.0-1160.76.1.el7.x86_64-x86_64-with-glibc2.10
  • Python version: 3.8.5
  • PyTorch version (GPU?): 2.0.0+cu117 (True)
  • Huggingface_hub version: 0.16.4
  • Transformers version: 4.33.0
  • Accelerate version: 0.20.3
  • xFormers version: 0.0.18
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

@yuxu915 yuxu915 added the bug Something isn't working label Nov 17, 2023
@sayakpaul
Member

This PR should resolve these issues: #5388. Could you please check that?

@yuxu915
Author

yuxu915 commented Nov 17, 2023

hi, thanks for your kind response, @sayakpaul. However, I tried the PR in https://github.com/huggingface/diffusers/pull/5388, and the results do not seem as satisfactory as with the main branch (the output does not look like the training dog at all, even after 1500 training steps). All training settings are based on the default configurations in https://github.com/younesbelkada/diffusers/blob/b21064f68ffad648455da116ba4b6bb669d1a223/examples/dreambooth/README_sdxl.md?plain=1#L79.
It would be really nice if you could help debug this on the main branch, thanks. 😊

@sayakpaul
Member

Cc: @younesbelkada for the configs he tried.

@younesbelkada
Contributor

younesbelkada commented Nov 17, 2023

@yuxu915 do you by any chance use --use-gradient-checkpointing? Can you share the full command so I can try to repro?

@yuxu915
Author

yuxu915 commented Nov 17, 2023

hi, @younesbelkada, thanks for helping, but I cannot find --use-gradient-checkpointing in train_dreambooth_lora_sdxl.py 🤔️. I trained on https://github.com/younesbelkada/diffusers. Am I still using the wrong repo?
My training command is as follows:

export MODEL_NAME="stable-diffusion-xl-base-1.0/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="datasets/image_instance/dog_1"
export OUTPUT_DIR="lora-trained-xl"
export VAE_PATH="model/stable-diffusion-xl-base-1.0/sdxl-vae-fp16-fix"

accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --pretrained_vae_model_name_or_path=$VAE_PATH \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=256 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-5 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=500 \
  --checkpointing_steps=100 \
  --seed="0" \
  --mixed_precision="fp16" 

@yuxu915
Author

yuxu915 commented Nov 19, 2023

hi, @younesbelkada, do you mean --gradient_checkpointing? I turned it on, but it seems to give the same results as before.

@younesbelkada
Contributor

Hi @yuxu915,
I had a look at the training scripts in detail. It appears that in PEFT we initialize the LoRA layers differently than in diffusers.
In PEFT we use kaiming with a=sqrt(5): https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py#L158, whereas in diffusers we use torch.nn.init.normal with std = 1 / rank: https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/lora.py#L223.
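For illustration, here is a minimal sketch of the two initialization schemes (the layer shapes are made up; see the linked PEFT and diffusers sources for the actual code):

# Minimal sketch of the two LoRA init schemes discussed above
# (shapes are illustrative, not taken from either library).
import math
import torch.nn as nn

rank, in_features, out_features = 4, 320, 320
lora_down = nn.Linear(in_features, rank, bias=False)   # "lora_A" in PEFT terms
lora_up = nn.Linear(rank, out_features, bias=False)    # "lora_B" in PEFT terms

# PEFT-style init: kaiming uniform with a=sqrt(5) on the down projection, zeros on the up
nn.init.kaiming_uniform_(lora_down.weight, a=math.sqrt(5))
nn.init.zeros_(lora_up.weight)

# diffusers-style init: normal with std = 1 / rank on the down projection, zeros on the up
nn.init.normal_(lora_down.weight, std=1 / rank)
nn.init.zeros_(lora_up.weight)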
If one uses the default hyper-parameters, the model indeed struggles to converge after 500 steps; I managed to get a nice convergence by using a higher LR (2e-4) and a cosine LR scheduler. Below is the full command that I used:

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="lora-trained-xl"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"
export CUDA_VISIBLE_DEVICES="2"

accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --pretrained_vae_model_name_or_path=$VAE_PATH \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=2 \
  --gradient_accumulation_steps=4 \
  --learning_rate=2e-4 \
  --report_to="wandb" \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub 

And the images that I get after ~150 steps:

[screenshot: generated samples after ~150 steps]

@younesbelkada
Contributor

Results after ~470 steps:

[screenshot: generated samples after ~470 steps]

@younesbelkada
Contributor

Also confirmed that it works even when using gradient_checkpointing with the same config:

[screenshot: generated samples with gradient_checkpointing enabled]

@yuxu915
Author

yuxu915 commented Nov 21, 2023

hi, @younesbelkada, thanks for your kind response. I tried your training command and got results like:

[image: generated samples]

The results do not appear entirely similar to the images in the training set; I will try more hyperparameter combinations to get better results. Another problem: I tried to save intermediate LoRAs during training by setting --checkpointing_steps to 25. However, at inference time I load each LoRA in turn and generate images, and these images are different from the ones generated during validation (see wandb) and are not similar to the images in the training set. The inference script is:

from huggingface_hub.repocard import RepoCard
from diffusers import DiffusionPipeline
import torch

base_model_id = '/model/stable-diffusion-xl-base-1.0/stable-diffusion-xl-base-1.0'
pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16)

for step in range(25, 501, 25):
    pipe.load_lora_weights(f"/diffusers/examples/dreambooth/lora-trained-xl/checkpoint-{step}")
    image = pipe("A picture of a sks dog in a bucket", num_inference_steps=25).images[0]
    image.save(f"sks_dog_{step}.png")
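One thing that may be worth double-checking in that loop (a sketch, assuming the unload_lora_weights helper is available in the installed diffusers version): unload the previous adapter before loading the next checkpoint so the weights of successive checkpoints do not stack, and move the fp16 pipeline to the GPU as in the validation runs:

# Sketch of the same loop with explicit adapter unloading and GPU placement
# (assumes unload_lora_weights exists in the installed diffusers version).
from diffusers import DiffusionPipeline
import torch

base_model_id = '/model/stable-diffusion-xl-base-1.0/stable-diffusion-xl-base-1.0'
pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16)
pipe.to("cuda")  # fp16 weights are intended for GPU inference

for step in range(25, 501, 25):
    pipe.unload_lora_weights()  # drop the adapter loaded in the previous iteration
    pipe.load_lora_weights(f"/diffusers/examples/dreambooth/lora-trained-xl/checkpoint-{step}")
    image = pipe("A picture of a sks dog in a bucket", num_inference_steps=25).images[0]
    image.save(f"sks_dog_{step}.png")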


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Dec 26, 2023
@github-actions github-actions bot closed this as completed Jan 3, 2024