
SDXL LoRA training extremely slow on RTX 4090 #1288

Closed

etha302 opened this issue Jul 28, 2023 · 140 comments
@etha302

etha302 commented Jul 28, 2023

As the title says, training a LoRA for SDXL on a 4090 is painfully slow. It needs at least 15-20 seconds to complete a single step, so it is impossible to train. I don't know whether I am doing something wrong, but here are screenshots of my settings.
It is also using the full 24 GB of VRAM, yet it is so slow that even the GPU fans are not spinning.
[screenshots of the training settings attached]

@etha302
Author

etha302 commented Jul 28, 2023

A little faster with these optimizer args: scale_parameter=False relative_step=False warmup_init=False
but still slow, especially for a 4090; I can't imagine how slow it would be on slower cards. So I think this has to be addressed, unless I am doing something terribly wrong.
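
For context on what those flags change: a minimal sketch, assuming kohya-ss forwards --optimizer_args as keyword arguments to the Hugging Face Adafactor implementation (the model and learning rate below are placeholders, not values from the screenshots):

import torch
from transformers.optimization import Adafactor

model = torch.nn.Linear(8, 8)   # placeholder standing in for the LoRA parameters

optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,                    # with relative_step=False an explicit lr is needed
    scale_parameter=False,      # no parameter-scale-based lr scaling
    relative_step=False,        # fixed lr instead of Adafactor's time-based schedule
    warmup_init=False,          # no built-in warmup
)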

@AIrtistry

Same problem for me with a 4080.

@v0xie

v0xie commented Jul 29, 2023

Here's a configuration file that's been working well for me on a 4090. Outputs a ~5 MB LoRA that works with A1111 and ComfyUI.

The parameters to tweak are:

  • T_max : set to the total number of steps
  • train_batch_size: set according to your dataset size (1-4 works, 4+ works but is slow)
  • gradient_accumulation_steps: higher is usually faster, not sure what the upper bound is

What I like to do is start up Tensorboard and do a couple of test runs with different train_batch_size/gradient_accumulation_step combinations and see what settings train the fastest with the size of my dataset.

This Rentry article was used as reference: https://rentry.co/ProdiAgy

As far as I can tell, IA^3 training does not work for SDXL yet, even though it is anecdotally much faster to train than other LoRA types.

As a benchmark, I can train a LoKr LoRA with a dataset size of 20 for 250 epochs (1 repeat/epoch) in about 1.5 hours.

{  
  "LoRA_type": "LyCORIS/LoKr",
  "adaptive_noise_scale": 0,
  "additional_parameters": "--network_train_unet_only --lr_scheduler_type \"CosineAnnealingLR\" --lr_scheduler_args \"T_max=975\" \"eta_min=0.000\"",
  "block_alphas": "",
  "block_dims": "",
  "block_lr_zero_threshold": "",
  "bucket_no_upscale": true,
  "bucket_reso_steps": 32,
  "cache_latents": true,
  "cache_latents_to_disk": true,
  "caption_dropout_every_n_epochs": 0.0,
  "caption_dropout_rate": 0,
  "caption_extension": ".txt",
  "clip_skip": "1",
  "color_aug": false,
  "conv_alpha": 64,
  "conv_block_alphas": "",
  "conv_block_dims": "",
  "conv_dim": 64,
  "decompose_both": false,
  "dim_from_weights": false,
  "down_lr_weight": "",
  "enable_bucket": true,
  "epoch": 300,
  "factor": -1,
  "flip_aug": false,
  "full_bf16": true,
  "full_fp16": false,
  "gradient_accumulation_steps": 4.0,
  "gradient_checkpointing": true,
  "keep_tokens": 1,
  "learning_rate": 1.0,
  "logging_dir": "",
  "lora_network_weights": "",
  "lr_scheduler": "cosine",
  "lr_scheduler_num_cycles": "",
  "lr_scheduler_power": "",
  "lr_warmup": 0,
  "max_bucket_reso": 2048,
  "max_data_loader_n_workers": "0",
  "max_resolution": "1024,1024",
  "max_timestep": 1000,
  "max_token_length": "75",
  "max_train_epochs": "",
  "mem_eff_attn": false,
  "mid_lr_weight": "",
  "min_bucket_reso": 256,
  "min_snr_gamma": 3,
  "min_timestep": 0,
  "mixed_precision": "bf16",
  "model_list": "custom",
  "module_dropout": 0,
  "multires_noise_discount": 0.2,
  "multires_noise_iterations": 8,
  "network_alpha": 64,
  "network_dim": 64,
  "network_dropout": 0,
  "no_token_padding": false,
  "noise_offset": 0.0357,
  "noise_offset_type": "Original",
  "num_cpu_threads_per_process": 8,
  "optimizer": "Prodigy",
  "optimizer_args": "\"betas=0.9,0.999\" \"d0=1e-2\" \"d_coef=1.0\" \"weight_decay=0.400\" \"use_bias_correction=False\" \"safeguard_warmup=False\"",
  "output_dir": "",
  "output_name": "",
  "persistent_data_loader_workers": false,
  "pretrained_model_name_or_path": "stabilityai/stable-diffusion-xl-base-1.0",
  "prior_loss_weight": 1.0,
  "random_crop": false,
  "rank_dropout": 0,
  "reg_data_dir": "",
  "resume": "",
  "sample_prompts": "",
  "sample_sampler": "euler_a",
  "save_every_n_epochs": 5,
  "save_every_n_steps": 0,
  "save_last_n_steps": 0,
  "save_last_n_steps_state": 0,
  "save_model_as": "safetensors",
  "save_precision": "bf16",
  "save_state": false,
  "scale_v_pred_loss_like_noise_pred": false,
  "scale_weight_norms": 1,
  "sdxl": true,
  "sdxl_cache_text_encoder_outputs": false,
  "sdxl_no_half_vae": true,
  "seed": "31337",
  "shuffle_caption": false,
  "stop_text_encoder_training": 0,
  "text_encoder_lr": 1.0,
  "train_batch_size": 1,
  "train_data_dir": "",
  "train_on_input": true,
  "training_comment": "",
  "unet_lr": 1.0,
  "unit": 1,
  "up_lr_weight": "",
  "use_cp": true,
  "use_wandb": false,
  "v2": false,
  "v_parameterization": false,
  "vae_batch_size": 0,
  "wandb_api_key": "",
  "weighted_captions": false,
  "xformers": true
}
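
For picking T_max, the usual back-of-the-envelope step count looks roughly like the sketch below; all numbers are placeholders, and this is not claimed to reproduce the exact T_max=975 in the config above, since trainers differ in how they count gradient-accumulation steps.

images = 20        # dataset size
repeats = 1        # repeats per image per epoch
epochs = 250       # planned epochs
batch_size = 1     # train_batch_size
grad_accum = 4     # gradient_accumulation_steps

steps_per_epoch = (images * repeats) // (batch_size * grad_accum)
total_steps = steps_per_epoch * epochs
print(total_steps)  # 1250 with these placeholder numbers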

@etha302
Author

etha302 commented Jul 29, 2023

Here's a configuration file that's been working well for me on a 4090. Outputs a ~5 MB LoRA that works with A1111 and ComfyUI.
[...]
As a benchmark, I can train a LoKr LoRA with a dataset size of 20 for 250 epochs (1 repeat/epoch) in about 1.5 hours.

Thanks for sharing the workflow, will definitely give it a try. I am only a little concerned about the network size, not sure if it's good for training realistic subjects. Also, 1.5 hours is still a lot for training a LoRA, and this has to get better over time; imagine people with a 3070, 3080, 4060 and so on, they will be training one LoRA all day. Above the 4090 you are entering the enterprise class, and then we are talking 5000-10000 USD for one card, which most people including me cannot afford. However, the LoRA with the settings I used above, although it took hours and hours to train, gave very decent results, especially for a first try.
Edit: I am also looking forward to actually start fine-tuning the SDXL base model, once that --no_half_vae error is fixed.

@bmaltais
Owner

SDXL training is also quite slow on my 3090... but not that slow. Not sure why it is this slow for you.

@etha302
Author

etha302 commented Jul 31, 2023

SDXL training is also quite slow on my 3090... but not that slow. Not sure why it is this slow for you.

No idea. I tried everything. I trained another LoRA with 6000 steps and it took 13 hours to complete, about 7 seconds per step. The results aren't bad at all, but I just don't know why it is so slow. It also uses 24 GB of VRAM no matter what I change. This starts with caching latents: first it goes fast, and after a few seconds, when VRAM usage reaches 24 GB, it gets super slow, and then it stays like that through the whole training.

@FurkanGozukara
Contributor

SDXL training is also quite slow on my 3090... but not that slow. Not sure why it is this slow for you.

1.4 seconds/it for me

RTX 3090 Ti - batch size 1, gradient accumulation 1

On an RTX 3060 it is about 2.8 seconds/it,

but on the RTX 3060 I train at network rank 32, while on the RTX 3090 Ti I use network rank 256.

@MrPlatnum

I don't even get past the "caching latents" stage. My PC just completely freezes when the 24 GB are full.
Also on a 4090.

@etha302
Author

etha302 commented Aug 1, 2023

I don't even get past the "caching latents" stage. My PC just completely freezes when the 24 GB are full. Also on a 4090.

Interesting, so it is even worse for you. This is why I haven't even attempted to use reg images yet, because this is probably what would happen. What I tried yesterday: reinstalling xformers, CUDA, new cuDNN .dlls, different Nvidia drivers, and nothing. I don't think something is really wrong with the script, because not everyone has the same problem. What I will try today: reinstall Windows and install everything from scratch, and I will report back if this solves the problem.
If you have Discord, add me: davidk35, and I will send you my .json files to try.

@FurkanGozukara
Contributor

FurkanGozukara commented Aug 1, 2023

I did a full test today

RTX 3060 is 2.4 second / it : https://twitter.com/GozukaraFurkan/status/1686296023751094273

RTX 3090 TI is 1.23 second / it : https://twitter.com/GozukaraFurkan/status/1686305740401541121

@etha302
Author

etha302 commented Aug 1, 2023

I did a full test today

RTX 3060 is 2.4 second / it : https://twitter.com/GozukaraFurkan/status/1686296023751094273

RTX 3090 TI is 1.23 second / it : https://twitter.com/GozukaraFurkan/status/1686305740401541121

So after reinstalling Windows, the Kohya speed on the 4090 is exactly the same, around 1.9-2 s/it. That is much slower than a 3090 and barely any faster than a 3060.

@etha302
Author

etha302 commented Aug 1, 2023

It is clearly a 4090-specific problem at this point; I have been talking to many people and most have the same problem. Different drivers don't help, and neither do new cuDNN .dll files. Choosing the AdamW 8-bit optimizer uses only 12 GB of VRAM and the speed is around 1.3 s/it, so still very slow for a 4090, and the results probably aren't going to be good for SDXL. At this point I really don't know what else to do.

@wen020

wen020 commented Aug 2, 2023

Do the training pictures have to be 1024x1024?

@bmaltais
Owner

bmaltais commented Aug 2, 2023

Does the training picture have to be 1024x1024?

They don't have to be square... But the total number of pixels in the image should be equal to or greater than 1024 x 1024.

@Thom293

Thom293 commented Aug 2, 2023

Here's a configuration file that's been working well for me on a 4090. Outputs a ~5 MB LoRA that works with A1111 and ComfyUI.
[...]
As a benchmark, I can train a LoKr LoRA with a dataset size of 20 for 250 epochs (1 repeat/epoch) in about 1.5 hours.

I used this method and it worked. I trained my first LoRA. Thank you very much. Running a second one now. However, on a 4090 it is extremely slow.

20 images, 300 epochs, 10 repeats has taken most of a day. I'd love some advice on how to make it quicker. This seems really slow. I haven't trained a LoRA before, but training a TI I could do 50+ images on a mobile 3080 16 GB in a few hours.

I'm around epoch 250 and this is what the speed shows:

[screenshot of training speed]

@etha302
Author

etha302 commented Aug 2, 2023

I used this method and it worked. I trained my first LoRA. Thank you very much. Running a second one now. However, on a 4090 it is extremely slow.

20 images, 300 epochs, 10 repeats has taken most of a day. I'd love some advice on how to make it quicker.

[screenshot of training speed]

It's even slower for you than for me. I get around 2 s/it, which is still ridiculously slow. Everything points to Nvidia drivers at this point; I am testing on Linux today and will report back later.

@MiloMindbender

I am having a similar issue: on a 4090, SDXL LoRA training is going at about 1.82 s/it and using all 24 GB of VRAM even with a batch size of 1. This is the first SDXL training I have tried, and a new computer with a 4090 I have not used for training before, so I'm wondering if this speed and VRAM use is normal for a 4090?

settingsV2.txt

@brianiup

brianiup commented Aug 3, 2023

I'm seeing this also with my 4090: all 24 GB of VRAM get used up, CUDA usage is at 99%, and it's training at 10.2 s/it at 1024x1024 resolution.

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="E:/Models/Stable-Diffusion/Checkpoints/SDXL1.0/sd_xl_base_1.0.safetensors" --train_data_dir="E:/Models/Stable-Diffusion/Training/Lora/test\img" --resolution="1024,1024" --output_dir="E:/Models/Stable-Diffusion/Training/Lora/test\model" --logging_dir="E:/Models/Stable-Diffusion/Training/Lora/test\log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0001 --unet_lr=0.0001 --network_dim=256 --output_name="test" --lr_scheduler_num_cycles="5" --no_half_vae --learning_rate="0.0001" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="5800" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --seed="12345" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="4" --bucket_reso_steps=64 --save_state --xformers --bucket_no_upscale --noise_offset=0.0357 --sample_sampler=euler_a --sample_prompts="E:/Models/Stable-Diffusion/Training/Lora/test\model\sample\prompt.txt" --sample_every_n_steps="100"

Version: v21.8.5
Torch 2.0.1+cu118
Torch detected GPU: NVIDIA GeForce RTX 4090 VRAM 24564 Arch (8, 9) Cores 128

Even when it generates sample prompt images it's way slower than in automatic1111:

height: 1024
width: 1024
sample_steps: 40
scale: 8.0

28%|██████████████████████▌ | 11/40 [00:18<00:49, 1.70s/it]

in A1111 I get 10 it/s with Euler a at that resolution.

@FurkanGozukara
Contributor

I'm seeing this also with my 4090: all 24 GB of VRAM get used up, CUDA usage is at 99%, and it's training at 10.2 s/it at 1024x1024 resolution.
[...]
in A1111 I get 10 it/s with Euler a at that resolution.

Wow, this is terrible.

If you are one of my Patreon supporters, I would like to connect to your PC and try to help.

@6b6a72

6b6a72 commented Aug 4, 2023

At work so I can't grab any specifics right now - one thing to check is shared GPU memory usage.

I haven't had tons of time to experiment yet, but in my case, using regularization images from Unsplash that I resized to a max length/height of 2048 pushed the GPU into shared VRAM during latent caching. Once shared GPU memory is in use, performance suffers greatly. Latent caching is speedy for maybe 80-100 steps, then it's like the horrible times y'all are sharing (e.g., 10 s/it).

Changing to 1024x1024 images I generated myself with SDXL gets latent caching to a decent 5-8 it/s, and training with my settings was yielding about 1.2 it/s for a standard LoRA.

My numbers might be off a little bit - I can test more later - but check shared GPU memory and make sure it's not being used. Open Task Manager and check the Performance tab.
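
If you would rather check this from Python than from Task Manager, a minimal sketch, assuming a reasonably recent PyTorch build:

import torch

free, total = torch.cuda.mem_get_info()   # (free, total) dedicated VRAM in bytes
print(f"dedicated VRAM: {total / 2**30:.1f} GiB, free: {free / 2**30:.1f} GiB")
# If Task Manager shows "Shared GPU memory" climbing while the free value here
# is near zero, caching/training has spilled over into system RAM.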

@brianiup

brianiup commented Aug 4, 2023

I turned bucketing off and went to fixed-size 1024x1024 images, and now I am getting speeds of 1.22 s/it. For me it seems to have something to do with the combination of buckets and the Adafactor optimizer; I am still trying different combos, but turning bucketing off is a drastic improvement.
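
If you want to try the same fixed-size approach, a hypothetical pre-processing sketch with Pillow; the folder names and file pattern are placeholders:

from pathlib import Path
from PIL import Image, ImageOps

SRC = Path("img_src")    # placeholder input folder
DST = Path("img_1024")   # placeholder output folder
DST.mkdir(exist_ok=True)

for path in SRC.glob("*.png"):           # adjust the pattern to your files
    img = Image.open(path).convert("RGB")
    img = ImageOps.fit(img, (1024, 1024), Image.LANCZOS)  # center-crop + resize
    img.save(DST / path.name)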

@FurkanGozukara
Contributor

I turned bucketing off and went to fixed-size 1024x1024 images, and now I am getting speeds of 1.22 s/it. For me it seems to have something to do with the combination of buckets and the Adafactor optimizer; I am still trying different combos, but turning bucketing off is a drastic improvement.

Nice info.

I also saw someone getting errors due to a bucketing system bug.

@6b6a72

6b6a72 commented Aug 5, 2023

Leaving buckets enabled and unchecking "Don't upscale bucket resolution" under Advanced is providing some acceptable results.

I tested with 10 source images, all at least 1600x1600, and the original-resolution Unsplash pics (the smallest of which is 1155x1732) for regularization. No shared memory usage (though it was close) during caching, at around 4-7 it/s. Training steps actually ran the fastest I've seen yet, at a very steady 1.53-1.54 it/s.

Not a permanent solution, but it's working and I haven't seen anything better yet, so ... YMMV

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket
  --min_bucket_reso=256 --max_bucket_reso=1024
  --pretrained_model_name_or_path="D:/sd/models/sdxl_v1/sd_xl_base_1.0.safetensors"
  --train_data_dir="output/img" --reg_data_dir="D:/unsplash/woman"
  --resolution="1024,1024" --output_dir="output/model"
  --logging_dir="output/log" --network_alpha="48"
  --save_model_as=safetensors --network_module=networks.lora --unet_lr=0.0001
  --network_train_unet_only --network_dim=96 --output_name="txmo_xl"
  --lr_scheduler_num_cycles="3" --cache_text_encoder_outputs --no_half_vae --full_bf16
  --learning_rate="0.0001" --lr_scheduler="constant_with_warmup" --lr_warmup_steps="240"
  --train_batch_size="1" --max_train_steps="2400" --save_every_n_epochs="1"
  --mixed_precision="bf16" --save_precision="bf16" --cache_latents --optimizer_type="Adafactor"
  --optimizer_args scale_parameter=False relative_step=False warmup_init=False
  --max_data_loader_n_workers="0" --bucket_reso_steps=64 --xformers --noise_offset=0.0357

@brianiup

brianiup commented Aug 6, 2023

At work so I can't grab any specifics right now - one thing to check is shared GPU memory usage.
[...]

Yes, I've noticed that a few times; I've seen total VRAM usage get into the 50 GB range with my 24 GB 4090, and when that happens iterations slow to around 50 s/it.

@Thom293

Thom293 commented Aug 6, 2023

So confusing. Now it won't use all of my VRAM; it stops at 14 GB. Anyone know a setting that would fix that? The other config I tried used all my VRAM and got 1.2 it/s, but it produced black images and didn't work.

[screenshot of GPU memory usage]

[screenshot of GPU memory usage]

What driver version are y'all using? I had read not to update to the latest one, but now I am not so sure.

@brianiup

brianiup commented Aug 7, 2023

People who are having issues: are you on Nvidia graphics driver 536.67 by any chance? I am, and it looks like Nvidia has a known issue about this:

https://us.download.nvidia.com/Windows/536.67/536.67-win11-win10-release-notes.pdf

"This driver implements a fix for creative application stability issues seen during heavy
memory usage. We’ve observed some situations where this fix has resulted in
performance degradation when running Stable Diffusion and DaVinci Resolve. This will
be addressed in an upcoming driver release. [4172676]"

I reverted my drivers to the 535.98 Studio version and now I am seeing drastically better performance; even with buckets enabled I am getting 1.19 it/s at 1024x1024 resolution.

@Thom293

Thom293 commented Aug 7, 2023

I am having a similar issue, on 4090 SDXL lora training it is going about 1.82s/it and using all 24gb of ram even with a batch size of 1. This is the first SDXL training I have tried, and a new computer with 4090 I have not used for training before so I'm wondering if this speed an RAM use is normal for 4090?

settingsV2.txt

Thank you for sharing this. Do you get images from this method? I get only grey or black at 30 epochs. I will try 250.

@MartinTremblay

Training has become unusable on my 4090. Not sure what is happening.

[screenshot of training speed]

@MartinTremblay

All settings are the same, but with a 1.5 model:

[screenshot of training speed with the SD 1.5 model]

@FurkanGozukara
Contributor

FurkanGozukara commented Sep 21, 2023

I am testing batch size 4 right now - rank 32 LoRA.

2.99 seconds/it - the GPU is not free though, I have some other things open.

So the speed-up is not linear:

from about 1.20 seconds/it at batch size 1 to an effective 1.33 it/second at batch size 4 - roughly a 60% speed-up.

@MMaster

MMaster commented Sep 21, 2023

That really looks like the 4090 can get faster than the 3090 with a higher batch size. I'm getting 2.76 s/it with batch size 6 on a rank 128 LoRA (with xformers, bucketing, gradient checkpointing & U-Net-only training).

@FurkanGozukara
Contributor

FurkanGozukara commented Sep 21, 2023

That really looks like 4090 can get faster than 3090 with higher batch size .. I'm getting 2.76s/it with BS6 on rank 128 lora (with xformers, bucketing, grad. checkpointing & unet only training).

Settings may make a difference.

Here is my current config, attached:

MonkaroAI.json.txt

@DarkAlchy

In reality, comparing the 4090 vs the 3090 (not the Ti, but close enough), the hardware specs say we should be about 57% faster. Not sure where this has gone, but Nvidia drivers are a lot of it.

@FurkanGozukara
Contributor

In reality the 4090 vs the 3090 (not ti but close enough) the hardware spec shows we should be 57% faster. Not sure where this has gone but Nvidia drivers are a lot of it.

True.

Nvidia still hasn't fixed it.

@MMaster

MMaster commented Sep 22, 2023

I've tried many different settings, and the 4090 is really much faster than the numbers Furkan posted when higher batch sizes are used. I haven't tested his exact settings because I don't have data prepared to not use bucketing.
The closest I'm getting is 3.09 s/it with batch size 7 (~0.44 s per image) or 2.82 s/it at batch size 6 (~0.47 s per image) with full text encoder + U-Net training, while the 3090 gets 2.99 s/it with batch size 4 (~0.74 s per image). That comes out to the 4090 being about 60% faster than the 3090.

After everything I've tested, I'm convinced this has nothing to do with drivers, but with the fact that the CPU can't give the GPU enough work at lower batch sizes. Even the person who has a 4090 and the same CPU as Furkan had similar numbers, because on the 3090 Ti the CPU is already at the edge of what it can do in this process, since only a single CPU thread is used during training.
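
For clarity, the per-image numbers above come from dividing seconds per iteration by the batch size; a tiny sketch using the values quoted in this thread:

runs = [
    ("4090, batch size 7", 3.09, 7),
    ("4090, batch size 6", 2.82, 6),
    ("3090, batch size 4", 2.99, 4),
]

for name, sec_per_it, batch in runs:
    per_image = sec_per_it / batch   # seconds per image, fairer when batch sizes differ
    print(f"{name}: {per_image:.2f} s per image")

# Prints roughly 0.44, 0.47 and 0.75 s per image, consistent with the ~60% gap
# quoted above when comparing batch 6 on the 4090 with batch 4 on the 3090.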

@FurkanGozukara
Contributor

FurkanGozukara commented Sep 22, 2023

I've tried many different settings and 4090 is really much faster than the numbers Furkan sent when higher batch numbers are used. [...]

If what you are saying were accurate, the RTX 4090 wouldn't be faster on RunPod (Unix) :)

I tested with batch size 1 on RunPod:

1.5 it/second - xFormers on but gradient checkpointing off, full DreamBooth training.

With the same settings on a RunPod RTX 3090 I get 1 it/second - so that looks pretty accurate.

Maybe I should make a video of this.

Here is my tweet about this: https://twitter.com/GozukaraFurkan/status/1704457802997969119

@MMaster

MMaster commented Sep 22, 2023

You can't compare Windows and Unix, because Windows actually uses 3D acceleration to render the desktop. That's why we are comparing the 3090 and the 4090 both on Windows. As you wrote before, even the 3090 is faster on Unix than on Windows.

@FurkanGozukara
Contributor

You can't compare windows and unix because windows actually uses 3D acceleration to render the desktop. That's why we are comparing both 3090 with 4090 on windows. As you wrote before even 3090 is faster in unix than in windows.

True,

but I think this is mostly due to the Triton package; we don't have it on Windows :/

Also, the difference in RTX 3090 speed on Unix is not as significant as for the RTX 4090.

@MMaster

MMaster commented Sep 22, 2023

I have been working with Linux professionally for more than 20 years. I am a low-level developer of high-performance apps, and I have also written kernel code. Performance monitoring and testing is my expertise. I can tell you that Linux can get more out of a CPU than Windows, because Windows is cluttered with background services that you have no control over.

It's just not comparable, even more so if we are talking about a CPU bottleneck. I'm not at all surprised that you can get more out of a 4090 on Linux if the CPU is the bottleneck.

Everything that we've seen, all the comparisons and monitoring I've done and written up before, points to the fact that under Windows the 4090 is not fully utilized until you get to at least batch size 3, and the only thing that is fully utilized at those small batch sizes is the single CPU thread running the training Python process. That was the only thing running at 100%. I also confirmed this suspicion by comparing batch size 1 and batch size 2, which both achieved the same speed in seconds per iteration.

If I had ever seen the process go above 100%, or if I hadn't seen it at 100% the whole time during batch size 1 and batch size 2 training, I wouldn't be so convinced, but this definitely points a huge finger at the one thing that is limiting it.

However, even with the knowledge I have, I still know there is a lot I don't know, so I can't be 100% sure. But if I were the dev of kohya_ss, the CPU bottleneck would definitely be the primary thing I would try to solve, since as soon as that stops being the limiting factor (batch size 3+) the GPU starts performing as expected.

There is really nothing more I can add or test, as everything just points to this single process using 100% CPU. The only thing I could do is go into the code and try to figure out whether there is any way to make it run in multiple processes so it can utilize multiple cores during training. But that is a long way off: first understanding exactly how it works and then figuring out how it can be split effectively, and I'm not sure I will have the time and energy to do that .. we will see.
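
For anyone who wants to reproduce the CPU observation described above, a minimal monitoring sketch with psutil; the PID is a placeholder for the running training process:

import psutil

pid = 12345                              # placeholder: PID of sdxl_train_network.py
proc = psutil.Process(pid)

for _ in range(10):
    # cpu_percent() is not normalized by core count, so 100 means one
    # fully saturated core, which is the suspected bottleneck here.
    cpu = proc.cpu_percent(interval=1.0)
    print(f"training process CPU: {cpu:.0f}% (100% = one core)")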

@MMaster

MMaster commented Sep 22, 2023

I've just tested with exactly the same settings Furkan had for batch size 4, but with bucketing enabled (which should be slower AFAIK), and the results are much better than what I had before, probably because of the AdamW8bit optimizer (I used Adafactor before) and not using full bf16; the other settings are similar to what I've used.

With these settings I'm getting 1.60 s/it at batch size 4, which is almost twice the performance Furkan reported on the 3090. The GPU is fully utilized, running at 99%.

But still, as soon as I lower the batch size to 1 or 2, it behaves the same as before. Both have the exact same speed of about 1-1.1 s/it, and with batch size 1 the GPU is only 40% utilized, while with batch size 2 it is about 80% utilized at most.

I'm also thinking of downclocking my CPU to see whether it lowers the training speed accordingly, which would be another confirmation that it's CPU-bound, but currently I'm only connected remotely, so that will have to wait.

@FurkanGozukara
Contributor

I've just tested with exactly the same settings Furkan had for batch size 4, but with bucketing enabled, and the results are much better than what I had before. [...]

Have you tested how many cores batch size 4 uses and how many cores batch size 1 uses?

I can't test at the moment.

Also, this doesn't explain the batch size 1 difference that we see on Unix:

on Unix the RTX 4090 gets 1.5x the speed of the RTX 3090,

but not on Windows when the batch size is 1.

@MMaster

MMaster commented Sep 22, 2023

Just as before, it never uses more than 1 core; batch sizes 1 & 2 use 100% of 1 core all the time, and batch size 4 uses 70-90% of a single CPU core.

As I tried to explain, there is no point in comparing Windows and Linux: they are different OSes with different CPU scheduling, different interrupt handling, different mitigations against CPU vulnerabilities like Meltdown, Spectre & Downfall (which may even be completely turned off on those Linux machines, which would mean the CPU can do maybe even 2x the work compared to running with those mitigations), etc.

A simple explanation of the difference, if it really is CPU-bound:

  • The 3090 is close to, if not already hitting, the CPU limit on Windows, but on Linux the CPU is more efficient and there is also no 3D desktop rendering, so you get some performance boost from both.
  • The 4090 is definitely hitting the CPU limit on Windows, and as I mentioned, on an optimized Linux box the CPU performance can be as much as twice that of Windows if all the vulnerability mitigations are turned off (which makes sense on machines used for GPU workloads); Linux also has a different task scheduler, different interrupt handling, etc., all of which can and will influence high-performance applications.

In any case, if the single CPU core can push much more work to the GPU, then performance will of course be much better on the 4090, and the difference in the Windows-to-Linux performance gain between the 3090 and the 4090 is easily explained by the fact that on Windows the 3090 is being used closer to its limit than the 4090.

@DarkAlchy

DarkAlchy commented Sep 23, 2023

I disagree with most (most, but not all) of what I have just read. Where I have an issue is that we are not comparing apples to apples. Grab a 3090. Grab a 4090. Same computer and specs, the VERY same machine (in other words, RunPod is a no-no). Run a training session. Remove the 4090, replace it with the 3090, run the exact same dataset, and I bet a fiver to a doughnut you will see what is being glossed over.

I have to agree that Python being limited to a single core is total bullshit these days, with 96-core and even 128-core CPUs coming to home consumers next year. Seriously, that global interpreter lock nonsense has to go.

@MMaster

MMaster commented Sep 23, 2023

That is why I was comparing a Windows 4090 to a Windows 3090 and not mixing in Linux. The explanation of possible Windows-vs-Linux differences came up only because Furkan kept bringing it up. As I expected, it has derailed the discussion.

I don't have a 3090, and just running RunPod and saying this one does X it/s and that one does Y it/s doesn't tell us anything about why it happens without proper monitoring (real GPU utilization, CPU use per process per thread, etc.).

I believe I have proper monitoring. Everything since the very first post I made here still points to a CPU bottleneck. Even if we have different software and different systems, if I can see that their 3090 does ~1 it/s at batch size 1 while doing almost 3 s/it at batch size 4, while my 4090 does ~1 it/s at batch size 1 and 1.6 s/it at batch size 4 (with the same settings), it shows me that the 4090 really can be utilized properly on Windows once the CPU stops being the limiting factor.

Anyway, from my side there is nothing more we can discover by just testing runs and comparing. There have been absolutely zero metrics pointing away from a CPU bottleneck, and everything since the beginning of my testing, including the new results, keeps confirming it. The next steps are to properly profile the software to figure out where the time is spent, or to do a proof-of-concept implementation that splits the training across multiple processes. Trust me, the Python GIL is not going anywhere.
Edit: This doesn't mean it's impossible to write a native C/C++ library that does truly parallel multithreading and exposes an API usable from Python, which I believe is already done for some parts of Stable Diffusion, but it looks like that's not the case for this part of SD training - maybe there are technical reasons for it, I don't know.

Thank you to everyone who actually tested their performance and shared their results. I don't know whether the devs of this repo are able to fix an issue like this, or whether this really is just a UI on top of the original kohya-ss sd-scripts. Maybe this discussion is in completely the wrong place for anyone relevant to be able to help.

@JAssertz

If your issue still isn't resolved, downgrade your GPU driver to version 531; from reading the patch notes, somewhere around 535 they broke training speed.

@DarkAlchy

if your issue isn't resolved still downgrade the driver version for your gpu to version 531 from reading patch notes in ~535 ish they broke training speed.

Yes they did, and I mean by a ton. Try 200+ seconds per it and still climbing, but back on 531.xx it's back to normal. Driver issues are already proven just by that, then there's bottlenecking, then Kohya code, and who knows what else.

@DarkAlchy

@MMaster This is why I feel for bmaltais, because at least 90% of the tickets are not the fault of the GUI. The slowness in this ticket is not his fault, or the GUI's, either. People don't understand that this is just a more human interface to the actual working program/scripts; it does nothing more than take all of our input and throw it out on the command line for Kohya to run with. Prior to this GUI, that was all done by hand. This is why I used the scripts directly in my test that showed the slowdowns exist in the actual program. TBH, this is a nice discussion, but it isn't really the fault of the GUI and shouldn't even be here; it belongs over on the sd-scripts side. I know it is confusing as all get-out to me too.

@FurkanGozukara
Contributor

@MMaster This is why I feel for bmaltais, because at least 90% of the tickets are not the fault of the GUI. [...]

True,

all the back end comes from the scripts;

the GUI just turns actions into commands for the scripts.

@Wonderflex

if your issue isn't resolved still downgrade the driver version for your gpu to version 531 from reading patch notes in ~535 ish they broke training speed.

Yes they did, and I mean by a ton. Try 200+ seconds per it and still climbing, but back on 531.xx back to the norm. Driver issues already proven just by that, then bottlenecking, then Kohya code and who knows what else.

I'm running a 4090 with a 7950X processor. Currently I'm getting 1.13 it/s using the settings from the Aitrepreneur SDXL LoRA training video: https://youtu.be/N_zhQSx2Q3c?si=pjGpVkzbzERdnfdS

Should rolling back the drivers give a boost?

@DarkAlchy

I'm running a 4090 with a 7950X processor. Currently I'm getting 1.13 it/s using the settings from the Aitrepreneur SDXL LoRA training video: https://youtu.be/N_zhQSx2Q3c?si=pjGpVkzbzERdnfdS

Should rolling back the drivers give a boost?

If you're on Windows, the answer is a resounding yes. His settings are normal settings, and for me, on Windows, an 1800-step Kohya training (I have since switched to OneTrainer, which doesn't use steps) takes about 12-16 minutes. For 3k steps it is less than 25 minutes. I can't give an accurate number now as I forget precisely, but roll back to 531.79 if you're on Windows and see for yourself. Worst case, it takes 20 minutes out of your life to roll back and test, and if it doesn't help, just upgrade the drivers back to where they were.

@Wonderflex

I'm running a 4090 with a 7950x processor. Currently Im getting 1.13 it/s using the settings from the aitrepreneur SDXL LoRA training video: https://youtu.be/N_zhQSx2Q3c?si=pjGpVkzbzERdnfdS
Should rolling back the drivers give a boost?

If on Windows the answer is a resounding yes. His settings are normal settings and for me, on Windows, an 1800 step Kohya training (I have switched to OneTrainer that doesn't use steps) takes about 12-16m. For 3k steps it is less than 25m. I can't give an accurate number now as I forget precisely, but roll back to 531.79 if on Windows and see for yourself. Worst case it takes 20 mins out of your life to roll back and test then if it doesn't just upgrade the drives back to where they were.

Thanks for the tip. I rolled back to 531.79 and it moved me to 1.21 s/it instead of 1.13. That put me at 53 minutes instead of 56 minutes for 2600 steps, which is meh. Now it makes me wonder if I have a problem with my Kohya configuration (although I swore I followed his settings), because less than 25 minutes for 3k steps is substantially better than what I'm seeing.

If it makes a difference I'm using these settings:

Train batch size = 1
Mixed precision = bf16
Number of CPU threads per core 2
Cache latents
LR scheduler = constant
Optimizer = Adafactor with scale_parameter=False relative_step=False warmup_init=False
Learning rate of 0.0003
LR warmup = 0
Enable buckets
Text encoder learning rate = 0.0003
Unet learning rate - 0.0003
No half VAE
Network rank = 256
Network alpha = 1

@etha302
Author

etha302 commented Sep 26, 2023

If you're on Windows, the answer is a resounding yes. His settings are normal settings, and for me, on Windows, an 1800-step Kohya training (I have since switched to OneTrainer, which doesn't use steps) takes about 12-16 minutes. [...]

OneTrainer looks promising, definitely going to try it out when I get home. Did you get good results? Also, care to share some settings you used?

@DarkAlchy

OneTrainer looks promising, definitely going to try it out when I get home. Did you get good results? Also, care to share some settings you used?

It has presets built in; SDXL 1.0 LoRA is one of them. It is mainly set up for fine-tuning but is branching out. They are adding a lot, and with the addition (coming in beta right now) of Adafactor it will be super sweet.

@DarkAlchy


Thanks for the tip. I rolled back to 531.79 and it moved me to 1.21s/it instead of 1.13. This put me at 53 minutes instead of 56 minutes for 2600 steps, which is meh. Now it makes me wonder if I have a problem with my Kohya configuration (although I swore I followed his settings), because less than 25 minutes for 3k steps is substantially better than what I'm seeing.

If it makes a difference I'm using these settings:

Train batch size = 1 Mixed precision = bf16 Number of CPU threads per core 2 Cache latents LR scheduler = constant Optimizer = Adafactor with scale_parameter=False relative_step=False warmup_init=False Learning rate of 0.0003 LR warmup = 0 Enable buckets Text encoder learning rate = 0.0003 Unet learning rate - 0.0003 No half VAE Network rank = 256 Network alpha = 1

The higher the rank the worse it is, and I have never done any LoRA/LoCon/whatever that needed rank 256. I use 32, and NO, reducing the LoRA after the fact is not an option: I tried that for one of my releases, and no matter what I did it destroyed the LoRA, in that it became nothing like what I trained. I learned my lesson. The rest of your settings seem fine, so you are at the max your system and drivers will allow.

@etha302
Author

etha302 commented Sep 26, 2023

It has presets built in; SDXL 1.0 LoRA is one of them. It is mainly set up for fine-tuning but is branching out. They are adding a lot, and with the addition (coming in beta right now) of Adafactor it will be super sweet.

Yeah, I just trained my first LoRA with it, haha; it clearly overtrained, but OK, I wasn't really sure what I was doing, and the UI looks sweet. Also, since you mentioned before that we are blaming bmaltais for no reason, you are definitely right. I am the author of this topic, so should I close it and open another one on sd-scripts?

@etha302 etha302 closed this as completed Sep 26, 2023
@etha302
Author

etha302 commented Sep 26, 2023

Closed because we were blaming the wrong person for the slow performance on the RTX 4090; I will open another issue on sd-scripts. Sorry, bmaltais.

@DarkAlchy

Thank you.

@etha302
Author

etha302 commented Sep 26, 2023

kohya-ss/sd-scripts#834

Let's continue here.
