
Add support for fine-tuning with LoRA (text2image example) #2002

Closed
wants to merge 57 commits

Conversation

sayakpaul
Member

@sayakpaul sayakpaul commented Jan 16, 2023

Most of it is the same as #1884. I guess the only script that needs reviewing is examples/text_to_image/train_text_to_image_lora.py.

@sayakpaul sayakpaul requested a review from patil-suraj January 16, 2023 09:26
@sayakpaul
Member Author

sayakpaul commented Jan 16, 2023

With the following:

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch --gpu_ids="0," \
   ./train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME --caption_column="text" \
  --resolution=512 --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=100 --checkpointing_steps=5000 \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42 \
  --save_sample_prompt="cute Sundar Pichai creature" --report_to="wandb"

the run still leads to:

Steps:   0%|                                                                                                                                                  | 1/83300 [00:08<204:55:05,  8.86s/it, lr=0.0001, step_loss=0.209]Traceback (most recent call last):
  File "./train_text_to_image_lora.py", line 891, in <module>
    main()
  File "./train_text_to_image_lora.py", line 819, in main
    accelerator.backward(loss)
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/accelerate/accelerator.py", line 1316, in backward
    loss.backward(**kwargs)
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 14.56 GiB total capacity; 12.59 GiB already allocated; 500.44 MiB free; 13.01 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

With xformers enabled, it still leads to:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 14.56 GiB total capacity; 12.59 GiB already allocated; 474.44 MiB free; 13.01 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The above experiments were run on a single T4 machine.

On a V100 with xformers, it works. Logs will be here: https://wandb.ai/sayakpaul/stable_diffusion_ft_lora/runs/0b88cwxc.
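
For anyone hitting the same OOM on a 16 GB T4, the usual mitigations would be gradient checkpointing and 8-bit Adam (the latter needs bitsandbytes installed). A minimal sketch, assuming this LoRA script exposes the same --gradient_checkpointing and --use_8bit_adam flags as the existing train_text_to_image.py example; that is an assumption, not something verified here:

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

# Sketch only: --gradient_checkpointing and --use_8bit_adam are taken from the
# non-LoRA train_text_to_image.py example and may not be wired up in this script yet.
accelerate launch --gpu_ids="0," \
  ./train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME --caption_column="text" \
  --resolution=512 --random_flip \
  --train_batch_size=1 \
  --gradient_checkpointing \
  --use_8bit_adam \
  --enable_xformers_memory_efficient_attention \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42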

@sayakpaul
Member Author

When I tried enabling mixed-precision on T4, it led to:

Traceback (most recent call last):
  File "./train_text_to_image_lora.py", line 891, in <module>
    main()
  File "./train_text_to_image_lora.py", line 822, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/accelerate/accelerator.py", line 1373, in clip_grad_norm_
    self.unscale_gradients()
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/accelerate/accelerator.py", line 1336, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 282, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 210, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
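
The error comes from the GradScaler seeing fp16 gradients on the trainable parameters. A minimal sketch of the usual workaround, assuming the trainable LoRA parameters are collected in something like lora_layers while unet/vae/text_encoder stay frozen (the variable names here are assumptions about the script, not quotes from it): keep the frozen models in fp16 but keep the LoRA parameters in fp32 so unscale_ has fp32 gradients to work with.

import torch

# Frozen models can stay in half precision to save memory under mixed precision.
weight_dtype = torch.float16
unet.to(accelerator.device, dtype=weight_dtype)
vae.to(accelerator.device, dtype=weight_dtype)
text_encoder.to(accelerator.device, dtype=weight_dtype)

# The trainable LoRA parameters must stay in fp32, otherwise GradScaler.unscale_
# raises "Attempting to unscale FP16 gradients." (assumed name: lora_layers)
for param in lora_layers.parameters():
    param.data = param.data.to(torch.float32)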

@sayakpaul
Member Author

sayakpaul commented Jan 17, 2023

@patil-suraj the training is completed and the results look good: https://wandb.ai/sayakpaul/stable_diffusion_ft_lora/reports/LoRA-fine-tuning-of-text2image--VmlldzozMzUxNjI5

Let me know if it makes sense to continue to work on this PR and add LoRA support formally to our text2image fine-tuning script. Happy to take care of it :)

Update: Talked to Suraj offline. I will continue working on this PR and let y'all know (@patrickvonplaten @patil-suraj) when it's ready for reviews.

@sayakpaul sayakpaul self-assigned this Jan 17, 2023
@patil-suraj
Contributor

Thanks a lot for working on this! Feel free to continue on the PR. We could add the train_lora_text_to_image.py script under the text_to_image directory.

@sayakpaul
Member Author

Things seem to be working on both T4 and V100.

My command:

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch \
  train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME --caption_column="text" \
  --resolution=512 --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=100 --checkpointing_steps=5000 \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42 \
  --enable_xformers_memory_efficient_attention \
  --validation_prompt="cute Sundar Pichai creature" --report_to="wandb" \
  --output_dir="sd-model-finetuned-lora-v100" \
  --push_to_hub && sudo shutdown now

The final weights will be pushed to https://huggingface.co/sayakpaul/sd-model-finetuned-lora-v100/tree/main and an experimentation run is available here: https://wandb.ai/sayakpaul/text2image-fine-tune/runs/782txylu (currently running). Once these are done, I will update the appropriate sections in the README.
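
For completeness, a minimal inference sketch for the pushed weights, assuming the script saves them in the attention-processor format that UNet2DConditionModel.load_attn_procs understands, as the DreamBooth LoRA example from #1884 does:

import torch
from diffusers import StableDiffusionPipeline

# Load the frozen base model in fp16 and add the LoRA layers on top of its UNet.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
pipe.unet.load_attn_procs("sayakpaul/sd-model-finetuned-lora-v100")

image = pipe("cute Sundar Pichai creature", num_inference_steps=30).images[0]
image.save("pokemon-lora.png")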

@sayakpaul sayakpaul marked this pull request as ready for review January 18, 2023 09:48
@sayakpaul
Member Author

Closing this PR since the merge conflicts are a little too brutal to resolve. I will create a fresh PR.

@sayakpaul sayakpaul closed this Jan 18, 2023
@sayakpaul sayakpaul deleted the feat/lora-fit branch January 18, 2023 17:55
@sayakpaul
Member Author

The fresh PR is #2031.
