LoRA training for sdxl on diffusers CUDA out of memory? #4368

frankchieng · 2023-07-30T09:38:07Z

Describe the bug

When I train a LoRA with DeepSpeed ZeRO stage 2 and offload the optimizer states and parameters to the CPU, I get: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 14.75 GiB total capacity; 14.06 GiB already allocated; 8.81 MiB free; 14.38 GiB reserved in total by PyTorch)
Why is the GPU VRAM usage still so high when I run DeepSpeed to save memory?
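
To read the numbers in that error, here is a minimal sketch that pulls the sizes out of the message and compares reserved with allocated memory. The helper name `parse_oom` is hypothetical, and the regular expressions assume the message format shown in this thread (other PyTorch versions may word it differently).

```python
import re

# Conversion factors to MiB for the units that appear in the message.
UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}

def parse_oom(message: str) -> dict:
    """Extract the sizes (in MiB) reported in a PyTorch CUDA OOM message."""
    size = r"([\d.]+) (KiB|MiB|GiB)"
    patterns = {
        "requested": rf"Tried to allocate {size}",
        "total": rf"{size} total capacity",
        "allocated": rf"{size} already allocated",
        "free": rf"{size} free",
        "reserved": rf"{size} reserved",
    }
    sizes = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[key] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes

msg = (
    "CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 14.75 GiB total capacity; "
    "14.06 GiB already allocated; 8.81 MiB free; 14.38 GiB reserved in total by PyTorch)"
)
sizes = parse_oom(msg)
# Memory held by the caching allocator but not backing any live tensor.
cached_but_unused = sizes["reserved"] - sizes["allocated"]
print(f"allocated: {sizes['allocated']:.0f} MiB of {sizes['total']:.0f} MiB")
print(f"reserved but not allocated: {cached_but_unused:.0f} MiB")
```

If the gap between reserved and allocated is small, as it is here, most of the card is held by live tensors rather than by allocator cache.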

Reproduction

accelerate config

accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --instance_data_dir="dog" \
  --output_dir="lora-trained-xl" \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --enable_xformers_memory_efficient_attention \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=5 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub

Logs

No response

System Info

Who can help?

No response

aycaecemgul · 2023-08-03T11:54:03Z

I had the same problem on a T4, but then I noticed this note:

> Our experiments were conducted on a single 40GB A100 GPU.

Even though I wanted to train the text encoder, I removed --train_text_encoder and used all the optimizations, and I still couldn't make it work.

!accelerate launch /content/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of zwx man" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --gradient_accumulation_steps=1 \
  --train_batch_size=1 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1000 \
  --validation_prompt="A photo of zwx man" \
  --validation_epochs=25 \
  --gradient_checkpointing \
  --seed="0" \
  --num_validation_images=4 \
  --use_8bit_adam \
  --checkpointing_steps=500 \
  --enable_xformers_memory_efficient_attention

LOGS

2023-08-03 11:41:45.710843: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-08-03 11:41:51.320662: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
08/03/2023 11:41:53 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'thresholding', 'clip_sample_range', 'variance_type', 'dynamic_thresholding_ratio'} was not found in config. Values will be initialized to default values.
08/03/2023 11:43:11 - INFO - __main__ - ***** Running training *****
08/03/2023 11:43:11 - INFO - __main__ -   Num examples = 10
08/03/2023 11:43:11 - INFO - __main__ -   Num batches each epoch = 10
08/03/2023 11:43:11 - INFO - __main__ -   Num Epochs = 100
08/03/2023 11:43:11 - INFO - __main__ -   Instantaneous batch size per device = 1
08/03/2023 11:43:11 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
08/03/2023 11:43:11 - INFO - __main__ -   Gradient Accumulation steps = 1
08/03/2023 11:43:11 - INFO - __main__ -   Total optimization steps = 1000
Steps:   1% 10/1000 [00:29<47:17,  2.87s/it, loss=0.15, lr=0.0001]
08/03/2023 11:43:40 - INFO - __main__ - Running validation...
 Generating 4 images with prompt: A photo of zwx man.
{'add_watermarker'} was not found in config. Values will be initialized to default values.
Loading pipeline components...:   0% 0/7 [00:00<?, ?it/s]
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...:  57% 4/7 [00:00<00:00, 35.93it/s]
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100% 7/7 [00:00<00:00, 38.76it/s]
{'solver_type', 'algorithm_type', 'lower_order_final', 'thresholding', 'variance_type', 'dynamic_thresholding_ratio', 'solver_order', 'lambda_min_clipped'} was not found in config. Values will be initialized to default values.
Traceback (most recent call last):
  File "/content/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1351, in <module>
    main(args)
  File "/content/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1090, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_condition.py", line 970, in forward
    sample = upsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_blocks.py", line 2156, in forward
    hidden_states = resnet(hidden_states, temb)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 599, in forward
    hidden_states = self.nonlinearity(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 396, in forward
    return F.silu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2059, in silu
    return torch._C._nn.silu(input)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.75 GiB total capacity; 14.22 GiB already allocated; 832.00 KiB free; 14.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps:   1% 10/1000 [04:40<7:42:44, 28.04s/it, loss=0.15, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--instance_data_dir=/content/lora-sd-xl', '--output_dir=/content/drive/MyDrive/stable_diffusion_weights/berkay-xl', '--mixed_precision=fp16', '--instance_prompt=a photo of zwx man', '--resolution=1024', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--learning_rate=1e-4', '--gradient_accumulation_steps=1', '--train_batch_size=1', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=1000', '--validation_prompt=A photo of zwx man', '--validation_epochs=25', '--gradient_checkpointing', '--seed=0', '--num_validation_images=4', '--use_8bit_adam', '--checkpointing_steps=500', '--enable_xformers_memory_efficient_attention']' returned non-zero exit status 1.
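
The error message above points at `PYTORCH_CUDA_ALLOC_CONF` and `max_split_size_mb`. A minimal sketch of setting it from Python follows; the helper `alloc_conf` is hypothetical and the value 128 is only an illustrative assumption, not a setting anyone in this thread tested. In the log above, reserved memory (14.59 GiB) is only slightly larger than allocated memory (14.22 GiB), so this setting may not help in this case.

```python
import os

def alloc_conf(max_split_size_mb: int) -> str:
    """Build the option string for the PyTorch CUDA caching allocator."""
    return f"max_split_size_mb:{max_split_size_mb}"

# Assumption: 128 is an arbitrary example value.
# Set the variable before the process makes its first CUDA allocation;
# doing it before importing torch is the simplest way to be safe.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = alloc_conf(128)
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

When launching through `accelerate launch`, the same variable can be exported in the shell instead, since child processes inherit the environment.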

patrickvonplaten · 2023-08-03T15:06:26Z

Do we know which layer is the memory bottleneck here? Maybe flash attention v2 can come to our rescue. I think it'll be in xformers 0.0.21, which should be released soon: facebookresearch/xformers#795
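
One rough way to reason about the attention part of that question is to estimate the size of the score matrix that a naive attention implementation materializes, and that memory-efficient kernels avoid storing in full. The sketch below is only arithmetic. The (heads, tokens) pairs are assumptions for a 1024px image with a 128x128 latent, not measurements from this thread.

```python
def attention_scores_mib(heads: int, tokens: int, bytes_per_elem: int = 2, batch: int = 1) -> float:
    """Size in MiB of a full (batch, heads, tokens, tokens) attention score tensor."""
    return batch * heads * tokens * tokens * bytes_per_elem / 2**20

# Assumed attention resolutions and head counts (hypothetical, for illustration only).
assumed_layers = [
    (10, 64 * 64),  # 64x64 feature map, 10 heads
    (20, 32 * 32),  # 32x32 feature map, 20 heads
]
for heads, tokens in assumed_layers:
    size = attention_scores_mib(heads, tokens)
    print(f"heads={heads:2d} tokens={tokens:4d} -> {size:.0f} MiB in fp16")
```

The quadratic growth in `tokens` is why the highest-resolution attention layers dominate this estimate.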

frankchieng · 2023-08-04T06:03:38Z

Both xformers 0.0.20 and scaled dot-product attention work fine when I test with the kohya scripts. I don't know if it's an optimizer problem: I chose AdaFactor instead of AdamW there, and it works well on a T4 with 16GB of GPU RAM.

aycaecemgul · 2023-08-04T06:48:39Z

> Both xformers 0.0.20 and scaled dot-product attention work fine when I test with the kohya scripts. I don't know if it's an optimizer problem: I chose AdaFactor instead of AdamW there, and it works well on a T4 with 16GB of GPU RAM.

Can you share your training command and torch version? Did you change the optimizer manually? I couldn't find it as a parameter in the script.

patrickvonplaten · 2023-08-04T10:48:55Z

Also related: #4377. This should give us a nice memory-saving boost. I will update today.

frankchieng · 2023-08-05T03:49:36Z

> > Both xformers 0.0.20 and scaled dot-product attention work fine when I test with the kohya scripts. I don't know if it's an optimizer problem: I chose AdaFactor instead of AdamW there, and it works well on a T4 with 16GB of GPU RAM.
>
> Can you share your training command and torch version? Did you change the optimizer manually? I couldn't find it as a parameter in the script.

The Colab torch version is 2.0.1+cu118, but I installed xformers 0.0.20 for the Colab T4.
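
For the AdaFactor versus AdamW question, a back-of-the-envelope sketch of optimizer state size may help. It assumes Adam-style optimizers keep two moment tensors per parameter, and that Adafactor without momentum keeps one row vector and one column vector per 2-D weight. The LoRA shapes are hypothetical (rank 4 on a 1280x1280 layer), and the function names are made up for illustration.

```python
from math import prod

def adam_state_elems(shape: tuple) -> int:
    """Adam/AdamW: first and second moment, one value each per parameter element."""
    return 2 * prod(shape)

def adafactor_state_elems(shape: tuple) -> int:
    """Adafactor with factored second moment and no momentum (assumed settings)."""
    if len(shape) < 2:
        return prod(shape)  # 1-D parameters are not factored
    *lead, rows, cols = shape
    return prod(lead) * (rows + cols)

# Hypothetical LoRA pair: down-projection (4, 1280) and up-projection (1280, 4).
lora_shapes = [(4, 1280), (1280, 4)]
adam_total = sum(adam_state_elems(s) for s in lora_shapes)
adafactor_total = sum(adafactor_state_elems(s) for s in lora_shapes)
print(f"Adam state elements:      {adam_total}")
print(f"Adafactor state elements: {adafactor_total}")
```

Since only the adapter weights are optimized in LoRA training, both totals are small next to the frozen base model, which suggests the optimizer choice may not be the main factor here, though the thread does not settle this.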

BrynCooke · 2023-08-05T15:49:45Z

I still can't train on a 4090. It seems to allocate past 24 GB at the "Loading unet" step.

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="lora-trained-xl"

accelerate launch train_dreambooth_lora_sdxl.py \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --use_8bit_adam \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=1 \
  --seed="0"

Logs:

[2023-08-05 16:44:41,308] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-05 16:44:42,906] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
08/05/2023 16:44:43 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'thresholding', 'variance_type', 'clip_sample_range', 'dynamic_thresholding_ratio'} was not found in config. Values will be initialized to default values.
08/05/2023 16:44:49 - INFO - __main__ - ***** Running training *****
08/05/2023 16:44:49 - INFO - __main__ -   Num examples = 5
08/05/2023 16:44:49 - INFO - __main__ -   Num batches each epoch = 5
08/05/2023 16:44:49 - INFO - __main__ -   Num Epochs = 1
08/05/2023 16:44:49 - INFO - __main__ -   Instantaneous batch size per device = 1
08/05/2023 16:44:49 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
08/05/2023 16:44:49 - INFO - __main__ -   Gradient Accumulation steps = 4
08/05/2023 16:44:49 - INFO - __main__ -   Total optimization steps = 1
Steps: 100%|██████████| 1/1 [00:03<00:00,  3.51s/it, loss=0.0105, lr=0.0001]
08/05/2023 16:44:53 - INFO - __main__ - Running validation...
 Generating 4 images with prompt: A photo of sks dog in a bucket.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100%|██████████| 7/7 [00:00<00:00, 148.56it/s]
{'variance_type', 'solver_type', 'lower_order_final', 'solver_order', 'dynamic_thresholding_ratio', 'lambda_min_clipped', 'thresholding', 'algorithm_type'} was not found in config. Values will be initialized to default values.
/home/bryn/git/stable-diffusion/src/diffusers/image_processor.py:65: RuntimeWarning: invalid value encountered in cast
  images = (images * 255).round().astype("uint8")
Model weights saved in lora-trained-xl/pytorch_lora_weights.bin
Loaded unet as UNet2DConditionModel from `unet` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded text_encoder_2 as CLIPTextModelWithProjection from `text_encoder_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded text_encoder as CLIPTextModel from `text_encoder` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100%|██████████| 7/7 [00:02<00:00,  3.37it/s]
{'variance_type', 'solver_type', 'lower_order_final', 'solver_order', 'dynamic_thresholding_ratio', 'lambda_min_clipped', 'thresholding', 'algorithm_type'} was not found in config. Values will be initialized to default values.
Loading unet.
100%|██████████| 25/25 [00:03<00:00,  6.93it/s]
Traceback (most recent call last):
  File "/home/bryn/git/stable-diffusion/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1368, in <module>
    main(args)
  File "/home/bryn/git/stable-diffusion/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1327, in main
    images = [
  File "/home/bryn/git/stable-diffusion/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1328, in <listcomp>
    pipeline(args.validation_prompt, num_inference_steps=25, generator=generator).images[0]
  File "/home/bryn/git/stable-diffusion/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/bryn/git/stable-diffusion/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py", line 845, in __call__
    image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
  File "/home/bryn/git/stable-diffusion/src/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/home/bryn/git/stable-diffusion/src/diffusers/models/autoencoder_kl.py", line 270, in decode
    decoded = self._decode(z).sample
  File "/home/bryn/git/stable-diffusion/src/diffusers/models/autoencoder_kl.py", line 257, in _decode
    dec = self.decoder(z)
  File "/home/bryn/git/stable-diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bryn/git/stable-diffusion/src/diffusers/models/vae.py", line 271, in forward
    sample = up_block(sample, latent_embeds)
  File "/home/bryn/git/stable-diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bryn/git/stable-diffusion/src/diffusers/models/unet_2d_blocks.py", line 2336, in forward
    hidden_states = upsampler(hidden_states)
  File "/home/bryn/git/stable-diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bryn/git/stable-diffusion/src/diffusers/models/resnet.py", line 169, in forward
    hidden_states = self.conv(hidden_states)
  File "/home/bryn/git/stable-diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bryn/git/stable-diffusion/src/diffusers/models/lora.py", line 102, in forward
    return F.conv2d(x, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 23.65 GiB total capacity; 21.32 GiB already allocated; 605.38 MiB free; 22.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps: 100%|██████████| 1/1 [00:49<00:00, 49.31s/it, loss=0.0105, lr=0.0001]
Traceback (most recent call last):
  File "/home/bryn/git/stable-diffusion/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/bryn/git/stable-diffusion/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/bryn/git/stable-diffusion/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/home/bryn/git/stable-diffusion/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/bryn/git/stable-diffusion/venv/bin/python', 'train_dreambooth_lora_sdxl.py', '--enable_xformers_memory_efficient_attention', '--gradient_checkpointing', '--use_8bit_adam', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--instance_data_dir=dog', '--output_dir=lora-trained-xl', '--mixed_precision=fp16', '--instance_prompt=a photo of sks dog', '--resolution=1024', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--learning_rate=1e-4', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=1', '--validation_prompt=A photo of sks dog in a bucket', '--validation_epochs=1', '--seed=0']' returned non-zero exit status 1.

EDIT: Adding --pretrained_vae_model_name_or_path "madebyollin/sdxl-vae-fp16-fix" made the issues go away.
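
A rough size check on the failing allocation in the traceback above, as a sketch under stated assumptions: the OOM happened in a VAE decoder upsampling convolution that asked for 1024.00 MiB. The tensor shape below is a guess (batch 1, 256 channels, 1024x1024 pixels) and is not stated anywhere in the thread; the helper `tensor_mib` is hypothetical.

```python
from math import prod

# Bytes per element for the two dtypes compared here.
DTYPE_BYTES = {"float16": 2, "float32": 4}

def tensor_mib(shape: tuple, dtype: str) -> float:
    """Size in MiB of a dense tensor with the given shape and dtype."""
    return prod(shape) * DTYPE_BYTES[dtype] / 2**20

# Assumed activation shape for the decoder stage that failed (a guess).
assumed_shape = (1, 256, 1024, 1024)
for dtype in DTYPE_BYTES:
    print(f"{dtype}: {tensor_mib(assumed_shape, dtype):.0f} MiB")
```

Under that assumed shape, a float32 activation matches the 1024.00 MiB request and a float16 one would need half as much, which would be consistent with the fp16-fix VAE lowering peak memory during validation. The thread itself does not confirm this explanation.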

sayakpaul · 2023-08-09T08:59:41Z

https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/SDXL_DreamBooth_LoRA_.ipynb should fix all these problems.

frankchieng added the bug label on Jul 30, 2023.

patrickvonplaten mentioned this issue on Aug 4, 2023 in "[SDXL] Allow SDXL LoRA to be run with less than 16GB of VRAM" #4470 (merged).

sayakpaul closed this as completed on Aug 9, 2023.
