SDXL LoRA training extremely slow on RTX 4090 #1288
A little faster with these optimizer args: scale_parameter=False relative_step=False warmup_init=False
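For context, here is a minimal sketch of what those arguments mean, assuming the optimizer behind the scenes is the transformers Adafactor implementation (an assumption on my part; kohya's wrapper may differ). With relative_step=False an explicit learning rate is required:

```python
# Minimal sketch, assuming the transformers Adafactor optimizer receives
# these args. The Linear layer is just a stand-in for the LoRA parameters;
# the learning rate is illustrative, not a recommendation.
import torch
from transformers.optimization import Adafactor

params = torch.nn.Linear(8, 8).parameters()  # placeholder for LoRA weights
optimizer = Adafactor(
    params,
    lr=1e-4,                # explicit LR; required once relative_step=False
    scale_parameter=False,  # do not scale the LR by the parameter RMS
    relative_step=False,    # use the fixed LR instead of a step-based one
    warmup_init=False,      # no built-in warmup schedule
)
```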
Same problem for me on a 4080.
Here's a configuration file that's been working well for me on a 4090. Outputs a ~5 MB LoRA that works with A1111 and ComfyUI. The parameters to tweak are:
What I like to do is start up TensorBoard and do a couple of test runs with different train_batch_size/gradient_accumulation_steps combinations to see which settings train fastest for the size of my dataset. I used this Rentry article as a reference: https://rentry.co/ProdiAgy As far as I can tell, IA^3 training does not work for SDXL yet, and that is anecdotally much faster than training other LoRA types. As a benchmark, I can train a LoKr LoRA with a dataset of 20 images for 250 epochs (1 repeat per epoch) in about 1.5 hours.
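To illustrate that workflow, here is a minimal sketch (my own, not the poster's script) of enumerating train_batch_size/gradient_accumulation_steps combinations that keep the effective batch size constant; each combination would then get a short benchmark run watched in TensorBoard. The effective batch size of 8 is only an example value.

```python
# Keep the effective batch (train_batch_size * gradient_accumulation_steps)
# fixed and time a short run for each combination to find the fastest s/it.
effective_batch = 8  # illustrative target, not a recommendation
combos = [(bs, effective_batch // bs) for bs in (1, 2, 4, 8)
          if effective_batch % bs == 0]
for train_batch_size, gradient_accumulation_steps in combos:
    print(f"test run: train_batch_size={train_batch_size}, "
          f"gradient_accumulation_steps={gradient_accumulation_steps}")
```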
Thanks for sharing the workflow, I'll definitely give it a try. I'm only a little concerned about the network size; not sure if it's good for training realistic subjects. Also, 1.5 hours is still a lot for training a LoRA, and this has to get better over time. Imagine people with a 3070, 3080, 4060 and so on: they would be training one LoRA the whole day. Above the 4090 you are entering enterprise class, and then we are talking 5,000-10,000 USD for one card, which most people, including me, cannot afford. That said, the LoRA trained with the settings I used above, although it took hours and hours, gave very decent results, especially for a first try.
SDXL training is also quite slow on my 3090... but not that slow. Not sure why it is this slow for you.
No idea. I tried everything. I trained another LoRA with 6,000 steps and it took 13 hours to complete, about 7 seconds per step. The results aren't bad at all, but I just don't know why it is so slow. It also uses 24 GB of VRAM no matter what I change, and this starts with caching latents: at first it goes fast, then after a few seconds, when VRAM usage hits 24 GB, it gets super slow and stays that way through the whole training.
1.4 s/it for me on an RTX 3090 Ti with batch size 1, gradient accumulation 1. On an RTX 3060 it is about 2.8 s/it, but on the 3060 I train at network rank 32, while on the 3090 Ti I use rank 256.
I don't even get past the "caching latents" stage. My PC completely freezes when the 24 GB are full.
Interesting, so it is even worse for you. This is why I haven't even attempted to use regularization images yet, because this is probably what would happen. What I tried yesterday: reinstalling xformers, CUDA, new cuDNN DLLs, different NVIDIA drivers, and nothing. Maybe something is really wrong with the script, but I don't think so, because not everyone has the same problem. What I will try today: reinstalling Windows and installing everything from scratch, and I will report back if this solves the problem.
I did a full test today. RTX 3060 is 2.4 s/it: https://twitter.com/GozukaraFurkan/status/1686296023751094273 RTX 3090 Ti is 1.23 s/it: https://twitter.com/GozukaraFurkan/status/1686305740401541121
So after reinstalling Windows and kohya, speed on the 4090 is exactly the same, around 1.9-2 s/it. That is much slower than a 3090 and barely any faster than a 3060.
It is clearly a 4090-specific problem at this point. I have been talking to many people and most have the same issue. Different drivers don't help, and neither do new cuDNN DLL files. Choosing the AdamW 8-bit optimizer uses only 12 GB of VRAM and speed is around 1.3 s/it, so still very slow for a 4090, and the results probably aren't going to be good for SDXL. At this point I really don't know what else to do.
Does the training picture have to be 1024x1024?
They don't have to be square, but the total number of pixels in the image should be equal to or greater than 1024 x 1024.
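A small sketch of that rule of thumb (assuming Pillow is available; the threshold comes straight from the comment above): images don't need to be square, but width times height should be at least 1024 x 1024 pixels.

```python
# Check that an image has at least as many pixels as a 1024x1024 square.
from PIL import Image

def has_enough_pixels(path: str, min_pixels: int = 1024 * 1024) -> bool:
    with Image.open(path) as img:
        width, height = img.size
    return width * height >= min_pixels

# e.g. a 1024x1536 portrait passes, while 768x768 (589,824 px) does not.
```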
I used this method and it worked; I trained my first LoRA, thank you very much. I'm running a second one now. However, on a 4090 it is extremely slow: 20 images, 300 epochs, 10 repeats has taken most of a day. I'd love some advice on how to make it quicker; this seems really slow. I haven't trained a LoRA before, but training a TI I could do 50+ images on a mobile 3080 16 GB in a few hours. I'm around epoch 250 and this is what the speed shows:
It's even slower for you than for me. I get around 2 s/it, which is still ridiculously slow. Everything points to NVIDIA drivers at this point; I'm testing on Linux today and will report back later.
I am having a similar issue: on a 4090, SDXL LoRA training is going at about 1.82 s/it and using all 24 GB of VRAM even with a batch size of 1. This is the first SDXL training I have tried, and a new computer with a 4090 I have not used for training before, so I'm wondering whether this speed and VRAM use is normal for a 4090?
I'm seeing this also with my 4090: all 24 GB of VRAM get used up, CUDA usage is at 99%, and it's training at 10.2 s/it at 1024x1024 resolution.
Even when it generates sample prompt images it's way slower than in AUTOMATIC1111: height: 1024 28%|██████████████████████▌ | 11/40 [00:18<00:49, 1.70s/it]. In A1111 I get 10 it/s at that resolution with Euler a.
Wow, this is terrible. If you are my Patreon supporter I would like to connect to your PC and try to help.
I'm at work so I can't grab any specifics right now, but one thing to check is shared GPU memory usage. I haven't had tons of time to experiment yet, but in my case, using regularization images from Unsplash that I resized to a max length/height of 2048 pushed the GPU to use shared VRAM during latent caching. Once shared GPU memory is in use, performance suffers greatly. Latent caching is speedy for maybe 80-100 steps, then it hits the horrible times y'all are sharing (e.g., 10 s/it). Switching to 1024x1024 images I generated myself with SDXL gets latent caching to a decent 5-8 it/s, and training with whatever my settings are was yielding about 1.2 it/s for a standard LoRA. My numbers might be off a little (I can test more later), but check shared GPU memory and make sure it's not being used. Open Task Manager and check the Performance tab.
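A minimal downscaling sketch along the lines of the comment above (my own example, not the poster's exact workflow): shrink oversized regularization images so their longer side is at most 1024 px before caching latents. The folder names and the 1024 target are assumptions; adjust them to your dataset.

```python
# Downscale images whose longer side exceeds 1024 px, preserving aspect
# ratio, so latent caching is less likely to spill into shared GPU memory.
from pathlib import Path
from PIL import Image

SRC = Path("reg_images_raw")    # hypothetical input folder
DST = Path("reg_images_1024")   # hypothetical output folder
DST.mkdir(exist_ok=True)

for img_path in sorted(SRC.glob("*.jpg")):
    with Image.open(img_path) as img:
        img = img.convert("RGB")
        img.thumbnail((1024, 1024), Image.LANCZOS)  # in-place, keeps ratio
        img.save(DST / img_path.name, quality=95)
```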
I turned bucketing off and went to fixed-size 1024x1024 images, and now I am getting speeds of 1.22 s/it. It seems to have something to do with buckets and the Adafactor optimizer for me; I'm still trying different combos, but turning bucketing off was a drastic improvement.
Nice info. I also saw someone getting errors due to a bug in the bucketing system.
Leaving buckets enabled and unchecking that option, I tested with 10 source images, all at least 1600x1600, plus the original-resolution Unsplash pics (the smallest of which is 1155x1732) for regularization. No shared memory usage during caching (though it was close), at roughly 4-7 it/s. Training steps actually ran the fastest I've seen yet, at a very steady 1.53-1.54 it/s. Not a permanent solution, but it's working and I haven't seen anything better yet, so... YMMV.
Yes, I've noticed that a few times. I've seen total VRAM usage get into the 50 GB range with my 24 GB 4090; when that happens, speed drops to around 50 s/it.
So confusing. Now it won't use all of my VRAM; it stops at 14 GB. Does anyone know a setting that would fix that? The other config I tried used all my VRAM and got 1.2 s/it, but it produced black images and didn't work. What driver version are y'all using? I had read not to update to the latest one, but now I am not so sure.
People who are having issues: are you on NVIDIA graphics driver 536.67 by any chance? I am, and it looks like NVIDIA has a known issue about this: https://us.download.nvidia.com/Windows/536.67/536.67-win11-win10-release-notes.pdf ("This driver implements a fix for creative application stability issues seen during heavy [...]"). I reverted my drivers to the 535.98 Studio version and now I am seeing drastically better performance; even with buckets enabled I am getting 1.19 it/s at 1024x1024 resolution.
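If you want to confirm which driver is actually loaded before and after rolling back, a small NVML query works; this assumes the nvidia-ml-py package is installed, and older releases may return bytes instead of str.

```python
# Print the currently loaded NVIDIA driver version, e.g. 535.98 or 536.67.
import pynvml

pynvml.nvmlInit()
version = pynvml.nvmlSystemGetDriverVersion()
if isinstance(version, bytes):  # older pynvml releases return bytes
    version = version.decode()
print(version)
pynvml.nvmlShutdown()
```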
Thank you for sharing this. Do you get usable images from this method? I get only grey or black at 30 epochs. I will try 250.
I am testing batch size 4 right now with a rank 32 LoRA: 2.99 s/it. The GPU is not free though (I have some other things open), so the speedup is not linear: effective throughput went from about 1.20 s/it at batch size 1 (~0.83 images/s) to about 1.33 images/s at batch size 4, roughly a 60% speedup.
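The arithmetic behind that comparison, written out (numbers taken from the comment above): converting s/it plus batch size into images per second shows where the roughly 60% figure comes from.

```python
# Effective throughput = batch_size / seconds_per_iteration.
def images_per_second(sec_per_it: float, batch_size: int) -> float:
    return batch_size / sec_per_it

base = images_per_second(1.20, 1)  # ~0.83 images/s at batch size 1
big = images_per_second(2.99, 4)   # ~1.34 images/s at batch size 4
print(f"speedup: {big / base:.2f}x")  # ~1.60x, i.e. the ~60% quoted above
```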
That really looks like the 4090 can get faster than the 3090 with a higher batch size. I'm getting 2.76 s/it at batch size 6 on a rank 128 LoRA (with xformers, bucketing, gradient checkpointing, and U-Net-only training).
Settings may make a difference here; my current config is attached.
In reality, comparing the 4090 to the 3090 (not the Ti, but close enough), the hardware specs show it should be about 57% faster (the 4090 has roughly 56% more CUDA cores). Not sure where that has gone, but NVIDIA drivers are a lot of it.
True, NVIDIA still didn't fix it.
I've tried many different settings, and the 4090 really is much faster than the numbers Furkan sent once higher batch sizes are used. I haven't tested his exact settings because I don't have data prepared to train without bucketing. After everything I've tested, I'm convinced this has nothing to do with drivers; rather, the CPU can't feed the GPU enough work at lower batch sizes. Even the person with a 4090 and the same CPU as Furkan got numbers similar to Furkan's, because on a 3090 Ti the CPU is already at the edge of what it can do in this process, since only a single CPU thread is used during training.
If what you are saying were accurate, the RTX 4090 wouldn't run faster on RunPod (Linux) :) I tested with batch size 1 on RunPod: 1.5 it/s with xformers on but gradient checkpointing off, full DreamBooth training. With the same settings on a RunPod RTX 3090 I get 1 it/s, so that looks pretty accurate. Maybe I should make a video of this. Here is my tweet about it: https://twitter.com/GozukaraFurkan/status/1704457802997969119
You can't compare Windows and Linux, because Windows actually uses 3D acceleration to render the desktop. That's why we are comparing both the 3090 and the 4090 on Windows. As you wrote before, even the 3090 is faster on Linux than on Windows.
True, but I think that is mostly due to the Triton package, which we don't have on Windows :/ Also, the RTX 3090's speed difference on Linux is not as significant as the RTX 4090's.
I've been working with Linux professionally for more than 20 years. I'm a low-level developer of high-performance apps, I have also written kernel code, and performance monitoring and testing is my expertise. I can tell you that Linux can get more out of a CPU than Windows, because Windows is cluttered with background services you have no control over. The two are simply not comparable, even more so if we are talking about a CPU bottleneck, so I'm not at all surprised that the 4090 can be utilized better on Linux if the CPU is the limiting factor. Everything we've seen, all the comparisons and monitoring I've done and written up before, points to the fact that under Windows the 4090 is not fully utilized until you reach at least batch size 3, and the only thing that is fully utilized at those small batch sizes is the single CPU thread running the training Python process at 100%. That was the only thing running at 100%. I also confirmed this suspicion by comparing batch size 1 and batch size 2, which both achieved the same speed in seconds per iteration. If I had ever seen the process go above 100%, or if I hadn't seen it at 100% the whole time during batch size 1 and 2 training, I wouldn't be so convinced, but this definitely points a huge finger at the one thing that is limiting it. Even so, I know there is a lot I don't know, so I can't say it with 100% certainty. But if I were the dev of kohya_ss, the CPU bottleneck would definitely be the primary thing I would try to solve, since as soon as it stops being the limiting factor (batch size 3+), the GPU starts performing as expected. There is really nothing more I can add or test, as everything points to this single process using 100% of one CPU core. The only thing left is to go into the code and try to figure out whether the work can be split across multiple processes so it can use multiple cores during training, but that means first understanding exactly how it works and then figuring out how to split it effectively, and I'm not sure I will have the time and energy to do that... we will see.
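For anyone who wants to reproduce this kind of check, here is a rough monitoring sketch (my own, not the poster's tooling; it requires psutil and nvidia-ml-py, and the PID is a placeholder): if the training process pins one core near 100% while GPU utilization stays low, that is consistent with the CPU bottleneck described above.

```python
# Sample the training process's CPU usage (100% ~= one fully busy core)
# alongside GPU utilization, once per second for 30 samples.
import psutil
import pynvml

TRAIN_PID = 12345  # hypothetical PID of the training python process

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
proc = psutil.Process(TRAIN_PID)

for _ in range(30):
    cpu = proc.cpu_percent(interval=1.0)
    util = pynvml.nvmlDeviceGetUtilizationRates(gpu).gpu
    print(f"trainer CPU: {cpu:6.1f}%  GPU util: {util:3d}%")

pynvml.nvmlShutdown()
```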
I've just tested with exactly the same settings Furkan used for batch size 4, but with bucketing enabled (which should be slower, as far as I know), and the results are much better than I had before, probably because of the AdamW8bit optimizer (I had used Adafactor) and not using full bf16; the other settings are similar to what I've used. With these settings I'm getting 1.60 s/it at batch size 4, which is almost twice the throughput Furkan reported on the 3090. The GPU is fully utilized, running at 99%. But as soon as I lower the batch size to 1 or 2 it behaves the same as before: both have the exact same speed of about 1-1.1 s/it, and with batch size 1 the GPU is only 40% utilized while with batch size 2 it is about 80% utilized at most. I'm also thinking of downclocking my CPU to see whether the training speed drops accordingly, which would be another confirmation that it's CPU-bound, but currently I'm only connected remotely, so that will have to wait.
Have you tested how many cores batch size 4 uses versus batch size 1? I can't test at the moment. Also, this doesn't explain the batch-size-1 difference we see on Linux: there the RTX 4090 gets 1.5x the RTX 3090's speed, but not on Windows when batch size is 1.
Just as before, it never uses more than one core: batch sizes 1 and 2 use 100% of one core the whole time, while batch size 4 uses 70-90% of a single CPU core. As I tried to explain, there is no point comparing Windows and Linux; they are different OSes with different CPU scheduling, different interrupt handling, different mitigations against CPU vulnerabilities like Meltdown, Spectre, and Downfall (which may even be completely turned off on those Linux machines, meaning the CPU could do maybe twice the work it can with the mitigations present), and so on. The difference is simple to explain if it really is CPU bound.
In any case, if the single CPU core can push much more work to the GPU, then performance will of course be much better on the 4090, and the difference in performance gain between Windows and Linux on the 3090 versus the 4090 is easily explained by the fact that on Windows the 3090 is already being used much closer to its limit than the 4090.
I disagree with most (most, but not all) of what I have just read. Where I have an issue is that we are not comparing apples to apples. Grab a 3090. Grab a 4090. Same computer, same specs, the very same machine (in other words, RunPod is a no-no). Run a training session. Remove the 4090, replace it with the 3090, run the exact same dataset, and I'd bet a fiver to a doughnut you will see what is being glossed over. I have to agree that Python being limited to a single core is total bullshit these days, with 96-core and even 128-core CPUs coming to the home consumer next year. Seriously, that global interpreter lock nonsense has to go.
That is why I was comparing a Windows 4090 to a Windows 3090 and not mixing in Linux. The explanation of possible differences between Windows and Linux came up only because Furkan kept bringing it up, and as I expected it has derailed the discussion. I don't have a 3090, and just running RunPod and saying "this does X it/s and that does Y it/s" doesn't tell us why it happens without proper monitoring (real GPU utilization, CPU use per process and per thread, etc.). I believe I have proper monitoring. Everything since my very first post here still points to a CPU bottleneck. Even if we have different software and different systems, if their 3090 can do ~1 it/s at batch size 1 and almost 3 s/it at batch size 4, while my 4090 can do ~1 it/s at batch size 1 and 1.6 s/it at batch size 4 (with the same settings), that shows me the 4090 really can be utilized properly on Windows once the CPU stops being the limiting factor. Anyway, from my side there is nothing more we can discover just by testing runs and comparing. There have been absolutely zero metrics suggesting it isn't a CPU bottleneck, and everything since the beginning of my testing, including the new results, keeps confirming it. The next steps are to properly profile the software and figure out where the time is spent, or to do a proof-of-concept implementation that splits the training across multiple processes. Trust me, the Python GIL is not going anywhere. Thank you to everyone who actually tested their performance and shared their results. I don't know whether the devs of this repo can fix an issue like this, or whether this is really just a UI on top of the original kohya-ss sd-scripts; maybe this discussion is in completely the wrong place for anyone relevant to be able to help.
If your issue still isn't resolved, downgrade your GPU driver to version 531. From reading the patch notes, they broke training speed somewhere around 535.
Yes they did, and I mean by a ton. Try 200+ seconds per iteration and still climbing, but back on 531.xx it's back to normal. Driver issues are already proven just by that; then there's bottlenecking, then the kohya code, and who knows what else.
@MMaster This is why I feel for bmaltais: at least 90% of the tickets are not the fault of the GUI, and the slowness in this ticket is not his fault, or the GUI's, either. People don't understand that this is just a more human interface to the actual working program/scripts; it does nothing more than take all of our input and pass it on the command line for kohya's scripts to run with. Prior to this GUI, that was all done by hand. That is why I used the scripts directly in my test that showed the slowdowns exist in the actual program. TBH, this is a nice discussion, but it isn't really the fault of the GUI and shouldn't even be here; it belongs over on the sd-scripts side. I know it is confusing as all get-out to me too.
True, all the back end comes from the scripts; the GUI just turns actions into commands for the scripts.
I'm running a 4090 with a 7950X processor. Currently I'm getting 1.13 it/s using the settings from the Aitrepreneur SDXL LoRA training video: https://youtu.be/N_zhQSx2Q3c?si=pjGpVkzbzERdnfdS Should rolling back the drivers give a boost?
If you're on Windows, the answer is a resounding yes. His settings are normal settings, and for me, on Windows, an 1800-step kohya training (I have since switched to OneTrainer, which doesn't use steps) takes about 12-16 minutes; for 3k steps it is less than 25 minutes. I can't give an accurate number now as I forget precisely, but roll back to 531.79 if you're on Windows and see for yourself. Worst case, it takes 20 minutes out of your life to roll back and test; if it doesn't help, just upgrade the drivers back to where they were.
Thanks for the tip. I rolled back to 531.79 and it moved me to 1.21 s/it instead of 1.13. That put me at 53 minutes instead of 56 minutes for 2,600 steps, which is meh. Now it makes me wonder whether I have a problem with my kohya configuration (although I could have sworn I followed his settings), because less than 25 minutes for 3k steps is substantially better than what I'm seeing. If it makes a difference, I'm using these settings: train batch size = 1
OneTrainer looks promising, definitely going to try it out when I get home. Did you get good results? Also, care to share some of the settings you used?
It has presets built in, and SDXL 1.0 LoRA is one of them. It is mainly set up for fine-tuning but is branching out. They are adding a lot, and with the addition of Adafactor (coming in beta right now) it will be super sweet.
The higher the rank, the worse it is, and I have never done any LoRA/LoCon/whatever that needed rank 256. I use 32, and no, reducing the LoRA after the fact is not an option: I tried that on one of my releases and no matter what I did it destroyed the LoRA, as in it became nothing like what I trained. I learned my lesson. The rest of your settings seem fine, so you are at the maximum your system and drivers will allow.
Yeah, I just trained my first LoRA, haha. It clearly overtrained, but OK, I wasn't really sure what I was doing; the UI looks sweet though. Also, since you mentioned before that we are blaming bmaltais for no reason, you are definitely right. I am the author of this topic, so should I close it and open another one on sd-scripts?
Closed because we are blaming the wrong person for slow performance on the RTX 4090; I will open another issue on sd-scripts. Sorry, bmaltais.
Thank you.
Let's continue here.
As the title says, training a LoRA for SDXL on a 4090 is painfully slow. It needs at least 15-20 seconds to complete a single step, so it is impossible to train. I don't know whether I am doing something wrong, but here is a screenshot of my settings.
It is also using the full 24 GB of VRAM, but it is so slow that the GPU fans are not even spinning.