train_dreambooth_lora_sdxl.py cannot resume training from checkpoint: the model stays frozen #5840
Comments
This PR should resolve these issues: #5388. Could you please check that?
Hi @sayakpaul, thanks for your kind response. However, I tried the PR at https://github.com/huggingface/diffusers/pull/5388, and the results do not seem as satisfactory as with the main branch: the output does not look like the training dog at all, even after 1500 training steps. All training settings are based on the default configuration in https://github.com/younesbelkada/diffusers/blob/b21064f68ffad648455da116ba4b6bb669d1a223/examples/dreambooth/README_sdxl.md?plain=1#L79.
Cc: @younesbelkada for the configs he tried.
@yuxu915 do you use by any chance
Hi @younesbelkada, thanks for helping, but I cannot find
Hi @younesbelkada, do you mean
Hi @yuxu915,
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="lora-trained-xl"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"
export CUDA_VISIBLE_DEVICES="2"
accelerate launch train_dreambooth_lora_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--pretrained_vae_model_name_or_path=$VAE_PATH \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--instance_prompt="a photo of sks dog" \
--resolution=1024 \
--train_batch_size=2 \
--gradient_accumulation_steps=4 \
--learning_rate=2e-4 \
--report_to="wandb" \
--lr_scheduler="cosine" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_prompt="A photo of sks dog in a bucket" \
--validation_epochs=25 \
--seed="0" \
--push_to_hub
And the images that I get after ~150 steps:
Hi @younesbelkada, thanks for your kind response. I tried your training command and got results like:
base_model_id = '/model/stable-diffusion-xl-base-1.0/stable-diffusion-xl-base-1.0'
for step in range(25, 501, 25):
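The body of the loop above did not survive in this thread. As a rough sketch only (not the commenter's actual code), a per-checkpoint comparison could look like the following. It assumes that checkpoints were written every 25 steps into `checkpoint-<step>` folders under the output directory, that each folder contains a `pytorch_lora_weights.safetensors` file, and that a CUDA GPU is available. The helper names `checkpoint_dirs` and `render_checkpoints` are hypothetical.

```python
from pathlib import Path


def checkpoint_dirs(output_dir, first=25, last=500, every=25):
    """List the expected checkpoint folders (assumed layout: checkpoint-<step>)."""
    return [Path(output_dir) / f"checkpoint-{step}" for step in range(first, last + 1, every)]


def render_checkpoints(base_model_id, output_dir, prompt, image_dir="samples",
                       weight_name="pytorch_lora_weights.safetensors"):
    """Generate one image per saved checkpoint with a fixed seed (needs a GPU)."""
    # Imported lazily so the path helper above works without these packages.
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        base_model_id, torch_dtype=torch.float16
    ).to("cuda")
    Path(image_dir).mkdir(parents=True, exist_ok=True)

    for ckpt in checkpoint_dirs(output_dir):
        if not (ckpt / weight_name).is_file():
            continue  # skip steps that were never saved
        # weight_name is an assumption about what the training script wrote.
        pipe.load_lora_weights(str(ckpt), weight_name=weight_name)
        generator = torch.Generator(device="cuda").manual_seed(0)
        image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]
        image.save(Path(image_dir) / f"{ckpt.name}.png")
        pipe.unload_lora_weights()  # reset before loading the next checkpoint
```

Using the same seed for every checkpoint keeps the images comparable, so visible differences should come from the LoRA weights rather than from sampling noise.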
Describe the bug
When training is resumed from an intermediate LoRA checkpoint, the model stops updating (i.e., later checkpoints remain identical to the checkpoint that training was resumed from).
To reproduce the bug, just turn on the --resume_from_checkpoint flag. All experimental settings are based on the default configurations, using the latest version of the Diffusers library.
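One way to confirm the symptom described above without generating images is to hash the LoRA weight file in each checkpoint folder and look for consecutive checkpoints that are byte-identical. The sketch below is only an illustration: it assumes the `checkpoint-<step>` folder layout and the `pytorch_lora_weights.safetensors` file name, and `unchanged_checkpoints` is a hypothetical helper, not part of Diffusers.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def unchanged_checkpoints(output_dir, weight_name="pytorch_lora_weights.safetensors"):
    """Return (previous, current) checkpoint names whose weight files are byte-identical."""
    # Keep only checkpoint-<number> folders that contain the (assumed) weight file.
    ckpts = sorted(
        (
            p
            for p in Path(output_dir).glob("checkpoint-*")
            if p.name.split("-")[-1].isdigit() and (p / weight_name).is_file()
        ),
        key=lambda p: int(p.name.split("-")[-1]),
    )
    digests = [file_digest(p / weight_name) for p in ckpts]
    return [
        (prev.name, cur.name)
        for prev, cur, d_prev, d_cur in zip(ckpts, ckpts[1:], digests, digests[1:])
        if d_prev == d_cur
    ]
```

Calling `unchanged_checkpoints("lora-trained-xl")` after a resumed run would list every pair of neighbouring checkpoints whose weights did not change. Identical files are a strong hint that no update was applied between those two saves.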
Thanks for your help.
@patrickvonplaten @sayakpaul @yiyixuxu @DN6
Maybe related to https://github.com/huggingface/diffusers/issues/5004
Reproduction
https://colab.research.google.com/drive/17zNvqJZ8ChJaYZr6XIfsJBduKtb5FbOT#scrollTo=N14_vgURsNMY
Logs
No response
System Info
diffusers version: 0.24.0.dev0
Who can help?
No response