
LoRA training for sdxl on diffusers CUDA out of memory? #4368

Closed
frankchieng opened this issue Jul 30, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@frankchieng

Describe the bug

When I train a LoRA with DeepSpeed ZeRO Stage 2, offloading optimizer states and parameters to the CPU, I get: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 14.75 GiB total capacity; 14.06 GiB already allocated; 8.81 MiB free; 14.38 GiB reserved in total by PyTorch)
Why is GPU VRAM usage still this high when I run DeepSpeed to save memory?

Reproduction

accelerate config

accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --instance_data_dir="dog" \
  --output_dir="lora-trained-xl" \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --enable_xformers_memory_efficient_attention \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=5 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub

Logs

No response

System Info


Who can help?

No response

@frankchieng frankchieng added the bug Something isn't working label Jul 30, 2023
@aycaecemgul

I had the same problem on a T4, but then I noticed this note:
"Our experiments were conducted on a single 40GB A100 GPU."

Even though I wanted to train the text encoder, I removed --train_text_encoder and enabled all the optimizations, and I still couldn't make it work.

!accelerate launch /content/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of zwx man" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --gradient_accumulation_steps=1 \
  --train_batch_size=1 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1000 \
  --validation_prompt="A photo of zwx man" \
  --validation_epochs=25 \
  --gradient_checkpointing \
  --seed="0" \
  --num_validation_images=4 \
  --use_8bit_adam \
  --checkpointing_steps=500 \
  --enable_xformers_memory_efficient_attention

LOGS

2023-08-03 11:41:45.710843: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-08-03 11:41:51.320662: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
08/03/2023 11:41:53 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'thresholding', 'clip_sample_range', 'variance_type', 'dynamic_thresholding_ratio'} was not found in config. Values will be initialized to default values.
08/03/2023 11:43:11 - INFO - __main__ - ***** Running training *****
08/03/2023 11:43:11 - INFO - __main__ -   Num examples = 10
08/03/2023 11:43:11 - INFO - __main__ -   Num batches each epoch = 10
08/03/2023 11:43:11 - INFO - __main__ -   Num Epochs = 100
08/03/2023 11:43:11 - INFO - __main__ -   Instantaneous batch size per device = 1
08/03/2023 11:43:11 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
08/03/2023 11:43:11 - INFO - __main__ -   Gradient Accumulation steps = 1
08/03/2023 11:43:11 - INFO - __main__ -   Total optimization steps = 1000
Steps:   1% 10/1000 [00:29<47:17,  2.87s/it, loss=0.15, lr=0.0001] 08/03/2023 11:43:40 - INFO - __main__ - Running validation... 
 Generating 4 images with prompt: A photo of zwx man.
{'add_watermarker'} was not found in config. Values will be initialized to default values.

Loading pipeline components...:   0% 0/7 [00:00<?, ?it/s]Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.

Loading pipeline components...:  57% 4/7 [00:00<00:00, 35.93it/s]Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100% 7/7 [00:00<00:00, 38.76it/s]
{'solver_type', 'algorithm_type', 'lower_order_final', 'thresholding', 'variance_type', 'dynamic_thresholding_ratio', 'solver_order', 'lambda_min_clipped'} was not found in config. Values will be initialized to default values.
Traceback (most recent call last):
  File "/content/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1351, in <module>
    main(args)
  File "/content/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1090, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_condition.py", line 970, in forward
    sample = upsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_blocks.py", line 2156, in forward
    hidden_states = resnet(hidden_states, temb)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 599, in forward
    hidden_states = self.nonlinearity(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 396, in forward
    return F.silu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2059, in silu
    return torch._C._nn.silu(input)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.75 GiB total capacity; 14.22 GiB already allocated; 832.00 KiB free; 14.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps:   1% 10/1000 [04:40<7:42:44, 28.04s/it, loss=0.15, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--instance_data_dir=/content/lora-sd-xl', '--output_dir=/content/drive/MyDrive/stable_diffusion_weights/berkay-xl', '--mixed_precision=fp16', '--instance_prompt=a photo of zwx man', '--resolution=1024', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--learning_rate=1e-4', '--gradient_accumulation_steps=1', '--train_batch_size=1', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=1000', '--validation_prompt=A photo of zwx man', '--validation_epochs=25', '--gradient_checkpointing', '--seed=0', '--num_validation_images=4', '--use_8bit_adam', '--checkpointing_steps=500', '--enable_xformers_memory_efficient_attention']' returned non-zero exit status 1.
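
The OOM message in the log above suggests setting max_split_size_mb when reserved memory is much larger than allocated memory. A quick way to check whether that applies is to pull the numbers out of the message and compare them. The following is a minimal sketch using only the standard library; the helper name `parse_oom` and the regular expression are illustrative assumptions, not part of PyTorch.

```python
import re

# Size fields as they appear in the CUDA OOM message quoted in the log.
_FIELD = re.compile(
    r"([\d.]+) (KiB|MiB|GiB) (total capacity|already allocated|free|reserved)"
)
_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_oom(message: str) -> dict:
    """Return the sizes reported in an OOM message, converted to MiB."""
    return {
        name: float(value) * _TO_MIB[unit]
        for value, unit, name in _FIELD.findall(message)
    }


# Message copied from the T4 log in this thread.
msg = (
    "CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.75 GiB total "
    "capacity; 14.22 GiB already allocated; 832.00 KiB free; 14.59 GiB "
    "reserved in total by PyTorch)"
)
stats = parse_oom(msg)

# Memory PyTorch has reserved from the driver but is not currently using.
cached_but_unused = stats["reserved"] - stats["already allocated"]
share_of_card = cached_but_unused / stats["total capacity"]
print(f"reserved - allocated: {cached_but_unused:.0f} MiB "
      f"({share_of_card:.1%} of the card)")
```

For this log the gap between reserved and allocated memory is a few hundred MiB, a small share of the 14.75 GiB card, so the GPU looks genuinely full rather than fragmented, and tuning max_split_size_mb alone seems unlikely to be enough.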

@patrickvonplaten
Contributor

Do we know which layer is the memory bottleneck here? Maybe flash attention v2 can come to our rescue. I think it'll be in xformers 0.0.21, which should be released soon: facebookresearch/xformers#795

@frankchieng
Author

Both xformers 0.0.20 and scaled dot-product attention work fine when I test with the kohya scripts. I don't know if the problem is the optimizer type: there I chose AdaFactor instead of AdamW, and it works well on a T4 with 16GB of GPU RAM.
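
On the AdaFactor versus AdamW question, a rough way to see why the optimizer choice can matter for memory is to count the state each one keeps for a single weight matrix. This is a sketch under stated assumptions: AdamW stores two full-size moment buffers per weight, Adafactor is assumed to run without momentum so that it stores only a factored second moment (one row vector and one column vector), and the layer shape is hypothetical. With LoRA only the small adapter matrices are trainable, so optimizer state may be a minor part of the total, and the kohya scripts may differ from the diffusers script in other ways.

```python
def adamw_state_elems(rows: int, cols: int) -> int:
    """AdamW keeps a first and a second moment buffer, each the weight's size."""
    return 2 * rows * cols


def adafactor_state_elems(rows: int, cols: int) -> int:
    """Adafactor without momentum keeps a factored second moment for a matrix:
    one accumulator per row plus one per column."""
    return rows + cols


# Hypothetical trainable matrix; the shape is an illustration only.
rows, cols = 1280, 1280

adamw = adamw_state_elems(rows, cols)
adafactor = adafactor_state_elems(rows, cols)
print(f"AdamW state elements:     {adamw:,}")
print(f"Adafactor state elements: {adafactor:,}")
print(f"ratio: {adamw / adafactor:.0f}x")
```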

@aycaecemgul
Copy link

aycaecemgul commented Aug 4, 2023

> Both xformers 0.0.20 and scaled dot-product attention work fine when I test with the kohya scripts. I don't know if the problem is the optimizer type: there I chose AdaFactor instead of AdamW, and it works well on a T4 with 16GB of GPU RAM.

Can you share your training command and torch version? Did you manually change the optimizer? I couldn't find it as a parameter in the script.

@patrickvonplaten
Contributor

Also related: #4377. This should give us a nice memory-saving boost. I will update today.

@frankchieng
Author

> > Both xformers 0.0.20 and scaled dot-product attention work fine when I test with the kohya scripts. I don't know if the problem is the optimizer type: there I chose AdaFactor instead of AdamW, and it works well on a T4 with 16GB of GPU RAM.

> Can you share your training command and torch version? Did you manually change the optimizer? I couldn't find it as a parameter in the script.

The Colab torch version is 2.0.1+cu118, but I installed xformers 0.0.20 for the Colab T4.

@BrynCooke

BrynCooke commented Aug 5, 2023

I still can't train on a 4090. It seems to allocate past 24 GB at the "Loading unet" step.

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="lora-trained-xl"

accelerate launch train_dreambooth_lora_sdxl.py \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --use_8bit_adam \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=1 \
  --seed="0"

Logs:

[2023-08-05 16:44:41,308] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-05 16:44:42,906] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
08/05/2023 16:44:43 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'thresholding', 'variance_type', 'clip_sample_range', 'dynamic_thresholding_ratio'} was not found in config. Values will be initialized to default values.
08/05/2023 16:44:49 - INFO - __main__ - ***** Running training *****
08/05/2023 16:44:49 - INFO - __main__ -   Num examples = 5
08/05/2023 16:44:49 - INFO - __main__ -   Num batches each epoch = 5
08/05/2023 16:44:49 - INFO - __main__ -   Num Epochs = 1
08/05/2023 16:44:49 - INFO - __main__ -   Instantaneous batch size per device = 1
08/05/2023 16:44:49 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
08/05/2023 16:44:49 - INFO - __main__ -   Gradient Accumulation steps = 4
08/05/2023 16:44:49 - INFO - __main__ -   Total optimization steps = 1
Steps: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.51s/it, loss=0.0105, lr=0.0001]08/05/2023 16:44:53 - INFO - __main__ - Running validation... 
 Generating 4 images with prompt: A photo of sks dog in a bucket.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 148.56it/s]
{'variance_type', 'solver_type', 'lower_order_final', 'solver_order', 'dynamic_thresholding_ratio', 'lambda_min_clipped', 'thresholding', 'algorithm_type'} was not found in config. Values will be initialized to default values.
/home/bryn/git/stable-diffusion/src/diffusers/image_processor.py:65: RuntimeWarning: invalid value encountered in cast
  images = (images * 255).round().astype("uint8")
Model weights saved in lora-trained-xl/pytorch_lora_weights.bin
Loaded unet as UNet2DConditionModel from `unet` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded text_encoder_2 as CLIPTextModelWithProjection from `text_encoder_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded text_encoder as CLIPTextModel from `text_encoder` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.37it/s]
{'variance_type', 'solver_type', 'lower_order_final', 'solver_order', 'dynamic_thresholding_ratio', 'lambda_min_clipped', 'thresholding', 'algorithm_type'} was not found in config. Values will be initialized to default values.
Loading unet.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:03<00:00,  6.93it/s]
Traceback (most recent call last):
  File "/home/bryn/git/stable-diffusion/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1368, in <module>
    main(args)
  File "/home/bryn/git/stable-diffusion/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1327, in main
    images = [
  File "/home/bryn/git/stable-diffusion/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1328, in <listcomp>
    pipeline(args.validation_prompt, num_inference_steps=25, generator=generator).images[0]
  File "/home/bryn/git/stable-diffusion/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/bryn/git/stable-diffusion/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py", line 845, in __call__
    image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
  File "/home/bryn/git/stable-diffusion/src/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/home/bryn/git/stable-diffusion/src/diffusers/models/autoencoder_kl.py", line 270, in decode
    decoded = self._decode(z).sample
  File "/home/bryn/git/stable-diffusion/src/diffusers/models/autoencoder_kl.py", line 257, in _decode
    dec = self.decoder(z)
  File "/home/bryn/git/stable-diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bryn/git/stable-diffusion/src/diffusers/models/vae.py", line 271, in forward
    sample = up_block(sample, latent_embeds)
  File "/home/bryn/git/stable-diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bryn/git/stable-diffusion/src/diffusers/models/unet_2d_blocks.py", line 2336, in forward
    hidden_states = upsampler(hidden_states)
  File "/home/bryn/git/stable-diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bryn/git/stable-diffusion/src/diffusers/models/resnet.py", line 169, in forward
    hidden_states = self.conv(hidden_states)
  File "/home/bryn/git/stable-diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bryn/git/stable-diffusion/src/diffusers/models/lora.py", line 102, in forward
    return F.conv2d(x, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 23.65 GiB total capacity; 21.32 GiB already allocated; 605.38 MiB free; 22.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:49<00:00, 49.31s/it, loss=0.0105, lr=0.0001]
Traceback (most recent call last):
  File "/home/bryn/git/stable-diffusion/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/bryn/git/stable-diffusion/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/bryn/git/stable-diffusion/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/home/bryn/git/stable-diffusion/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/bryn/git/stable-diffusion/venv/bin/python', 'train_dreambooth_lora_sdxl.py', '--enable_xformers_memory_efficient_attention', '--gradient_checkpointing', '--use_8bit_adam', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--instance_data_dir=dog', '--output_dir=lora-trained-xl', '--mixed_precision=fp16', '--instance_prompt=a photo of sks dog', '--resolution=1024', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--learning_rate=1e-4', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=1', '--validation_prompt=A photo of sks dog in a bucket', '--validation_epochs=1', '--seed=0']' returned non-zero exit status 1.

EDIT: Adding --pretrained_vae_model_name_or_path "madebyollin/sdxl-vae-fp16-fix" made the issues go away.
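
The traceback above fails inside the VAE decoder's upsampler while trying to allocate 1024.00 MiB. The thread does not say why swapping the VAE helped, but a back-of-the-envelope size check shows how much the decoding precision matters at 1024x1024. In this sketch the tensor shape (1, 256, 1024, 1024) is an assumption chosen for illustration; it is not read from the trace. If the replacement VAE lets decoding stay in fp16 instead of fp32 (also an assumption), each such activation would take half the space.

```python
def tensor_mib(shape, bytes_per_element: int) -> float:
    """Size in MiB of a dense tensor with the given shape and element width."""
    elements = 1
    for dim in shape:
        elements *= dim
    return elements * bytes_per_element / 2**20


# Assumed activation shape: batch 1, 256 channels, full 1024x1024 resolution.
shape = (1, 256, 1024, 1024)

fp32_mib = tensor_mib(shape, 4)  # 4 bytes per float32 element
fp16_mib = tensor_mib(shape, 2)  # 2 bytes per float16 element
print(f"float32: {fp32_mib:.0f} MiB, float16: {fp16_mib:.0f} MiB")
```

Under these assumptions a single float32 activation is the same size as the failed 1024.00 MiB allocation, which is more than the 605.38 MiB the log reports as free.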

@sayakpaul
Member

https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/SDXL_DreamBooth_LoRA_.ipynb should fix all these problems.
