Training extremely slow after updating #961
For reference, I'm training a test model on 12 images of a character at 125 steps each, using batch size 2 and 512x512 resolution, and get a training time of over 2 hours. Is this expected from a GPU like the 1080 with 8GB of VRAM?
If you have installed Nvidia driver 535.98, roll back to the previous version. There's a bug with it.
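If you are not sure which driver is actually active, a quick way to check is to query nvidia-smi. A minimal diagnostic sketch (assumes nvidia-smi is on PATH, which the Nvidia driver installer normally handles):

```python
# Diagnostic sketch: print the installed Nvidia driver version and GPU name, so you
# can confirm whether you are on the problematic 535.98 driver before rolling back.
# Assumes nvidia-smi is on PATH (it ships with the Nvidia driver).
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # e.g. "531.68, NVIDIA GeForce RTX 3080"
```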
I did see there are issues with the newer NVIDIA drivers, so I'm currently using the 527.56 version.
OK, try reducing the batch size to 1 then. 🤒
Is this still an issue? I reverted back as well and it worked fine, but then I updated from Windows 10 to 11 and I think my drivers updated too. I got a fully new kohya zip and reinstalled it; it works, but it says something like 300 hours for 2500 steps lol (before it took like 5-10 minutes).
Make sure to install CUDA 11.8 drivers.
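It is also worth confirming that the torch wheel inside the venv was actually built for that CUDA version. A small check, run with the kohya_ss venv activated (standard PyTorch APIs, nothing kohya-specific):

```python
# Environment check: shows which CUDA runtime the installed torch wheel was built
# against (e.g. 11.8) and whether the GPU is visible at all. A CPU-only or
# mismatched wheel is a common cause of "it trains, but extremely slowly".
import torch

print("torch:", torch.__version__)             # e.g. 2.0.1+cu118
print("built for CUDA:", torch.version.cuda)   # e.g. 11.8
print("cuDNN:", torch.backends.cudnn.version())
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```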
Hello, same issue here! I followed @bmaltais's YouTube guide for LyCORIS training TO THE LETTER, and it took 4 hours of training instead of the 12-15 minutes it took him. First of all, I have a 3080 and I think I have everything set up (I'll put a screenshot of what the script detects when I launch it). I re-installed a few times and it's still slow. I thought it could be my dataset (images too large), so I resized them to 512x512 just in case... and it still took three and a half hours.
Try re-installing with torch 1... Everything looks good, so I have no idea why it would be so slow... How big is your dataset? How many repeats? Perhaps you are training a large dataset?
@bmaltais I followed your video on YouTube about LyCORIS (thank you for those videos btw, very helpful for newbies like me): same number of images, same steps, everything. I downloaded the images from Instagram so they are not very large, and I even resized them to 512x512 just in case. A friend of mine pointed out that the last few Nvidia drivers were faulty, so I am going to roll them back to see if that helps.
Same problem. Has anyone fixed it?
Someone mentioned trying to install bitsandbytes-windows... it might help... activate the venv and install it from there. Hope it helps.
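If you try that, a quick sanity check after installing it inside the venv is to take one 8-bit optimizer step on a dummy parameter. A sketch assuming a CUDA GPU and the standard bitsandbytes optimizer API:

```python
# Sanity check for a bitsandbytes / bitsandbytes-windows install in the kohya_ss
# venv: build a tiny CUDA parameter and take a single 8-bit Adam step. If the
# bitsandbytes CUDA binaries were not found, this typically fails loudly, which
# makes a broken install easy to spot.
import torch
import bitsandbytes as bnb

param = torch.nn.Parameter(torch.randn(4096, 64, device="cuda"))
optimizer = bnb.optim.Adam8bit([param], lr=1e-4)

param.grad = torch.randn_like(param)
optimizer.step()
print("bitsandbytes 8-bit Adam step OK")
```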
I also have the same problem. After updating kohya, it shows headless: false. It can be started and trained normally, but the training speed is extremely slow. Before the update, a LoRA of 5000 steps took about an hour; after the update it shows it needs 10 hours. Can anyone solve this problem? I use pytorch 2.0.
I am not sure what the cause of the slow speed is for some users. I have not observed this on my system. It is probably a combination of driver version and hardware. I suggest you go back to the release you used to run until this is fixed in a future release.
The headless line is normal; it is just reporting that it is NOT running in headless mode.
I'm having the same experience. I even tried rolling back to the last version that was working well for me, and I'm still getting a speed of 24 seconds per iteration on a 3070, which is nearly 100x slower than before. Since rolling back the kohya_ss version didn't fix it, the other thing that has changed is the Nvidia driver. I'm currently on Game Ready driver v536.23 and still getting the slow speeds. Is there a known working driver version?
So, I get a decent speed now. What I did was install with pytorch 1 instead of 2 and use a lower number of total steps. Training characters works fine now, although it takes at least an hour using DAdaptation to get good results.
It appears to be a common issue. RTX 3060 in my case. Same general issue as other posts: ridiculously long training. I'd love to hear about any other suggestions. In my case I upgraded to torch 2 to try to improve speeds and, in short, "nope". Thanks.
This is kind of frustrating... One thing I could suggest is to go back to a previous release that used to work, delete the venv folder, and redo the setup. If things are still slow, then it is related to a change outside kohya_ss.
Same issue on a 4090 also.
Fixed for me, so hopefully this helps others.
I think it was environment variables, where I still had traces of a previous CUDA 12.1 install. After following the above steps, everything works perfectly!
Awesome, I'll try it. I have zero patience to watch the technological paint dry lol. Can you elaborate on the environment variables you addressed? My experiments have angered something hahaha.
Sure, I can help you with that. Here are the steps on how to edit or delete environment variables in Windows:
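For anyone hunting for leftovers of an old CUDA toolkit, here is a small diagnostic sketch that just lists anything CUDA-related in the current environment; the actual editing or deleting is still done in Windows under System Properties > Environment Variables:

```python
# Diagnostic only: list environment variables whose name or value mentions CUDA,
# so leftovers from an old install (e.g. CUDA 12.1 entries in PATH or CUDA_PATH)
# are easy to spot before removing them through the Windows dialog.
import os

for name, value in sorted(os.environ.items()):
    if "cuda" in name.lower() or "cuda" in value.lower():
        print(f"{name} = {value}")
```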
I'm having this problem, but in a weird way, on Windows. I've reinstalled twice now, and after that I get a speed of 2.x it/s. I shut down the PC for the day, come back the next day, run a training (similar params), and it goes back down to a slow 1.5 s/it. Any ideas what Windows could be doing that makes the speed drop like that?
Thanks man. I'll give it a go :)
Update on my end: evanheckert's suggestion, including the removal of environment variables, fixed my issue. Thank you for the insights.
Posting this here as it is somewhat related. Same case as the OP; I tried the solution by evanheckert but sadly it only worked halfway: I had no environment variables left to delete after uninstalling CUDA 12.1, but downgrading to 11.8 helped me get back to normal speeds (10 it/s, 3080) for one training.
OK, I kept trying things out and I've discovered that, at least sometimes, this might be a video memory issue. I have a 3070 Ti, which has 8GB. Previously I wasn't able to run more than batch 2 with Adam8bit or I got an OOM error, because my limit was the physical 8GB. But lately with DAdaptAdam, on my previous setup (I changed drives and formatted), I was running batch 3 and it seemed to work fine. While I was looking at the Task Manager just now, I noticed that the LoRA I'm currently training is using shared GPU memory. Edit: I kept testing this and it seems that, in my case, as soon as the shared memory gets used the speed gets affected: the more is used, the slower it gets. But as I said, the behavior before was to give an error if the card memory wasn't enough; now it is allocating from both places. Anything I can do to change that?
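One way to keep an eye on this without Task Manager is to poll nvidia-smi while the training runs. A rough sketch (assumes nvidia-smi is on PATH; the numbers cover the whole GPU, not just the training process):

```python
# Poll overall GPU memory use every few seconds during a training run. Once usage
# approaches the card's physical VRAM, the newer drivers reportedly start spilling
# into shared system memory, which is when the it/s drops off a cliff.
import subprocess
import time

while True:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    first_gpu = out.splitlines()[0]            # one line per GPU
    used_mib, total_mib = (int(v) for v in first_gpu.split(", "))
    print(f"{used_mib} / {total_mib} MiB used ({used_mib / total_mib:.0%})")
    time.sleep(5)
```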
I tried this solution but it's not working :(
Was able to recreate this now. As soon as there is something in the shared GPU memory, it crashes/doesn't continue with training. I'm pretty sure I had like 1 or 2 GB in shared memory in earlier versions though... Edit: Using DDU to uninstall the Nvidia driver and going with 531.68 solved it for me. Even with shared memory it's fast again.
I also have the issue during LoRA training. I fixed it by:
After some training, the issue appeared again... I repeated the same process, and the speed came back. Edit: I noticed that the issue appears when I try my LoRA with InvokeAI.
Quick update for the failed LoHa training on my side: after I found this #1291, I switched to that specific kohya_ss version and, in combination with the 531.68 driver, training works again. I tried other older versions and they didn't work, so what I take from this is that there are apparently some bitsandbytes-related problems.
I tried going back to older card drivers and I can observe two changes. Memory allocation is way more efficient; I'm using less VRAM per training. And shared memory seems to have less of an influence on the speed: speed seems to go up during the training, whereas with the latest drivers it used to go down instead.
Tagging this.
Nvidia put out new Windows drivers yesterday (8/8, version 536.99). I tried them today and so far they are working MUCH better. However, I've noticed that the CUDA usage will stay at 90-100% for a while, then drop to around 60% for a while, then go back up to the 90s, and keep repeating that cycle. Is there anything you can do to make sure it's using the cores as efficiently as possible?
A lot of variables here, so this is an itemization of my journey to get from 12 s/it to 5.66 s/it. TL;DR: Windows 11, Nvidia RTX 3090, GeForce Game Ready Driver 535.98.
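With this many variables in play (driver, torch build, Windows version), a tiny standalone throughput test can help separate "the GPU itself got slower" from "my training settings changed". This is just an assumed sanity check for this kind of debugging, not something kohya_ss ships:

```python
# Standalone GPU throughput sketch: time a fixed block of matmuls so the number can
# be compared across driver / torch installs, independent of any kohya_ss settings.
import time
import torch

assert torch.cuda.is_available(), "CUDA build of torch not found"
x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")

torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    x = (x @ w).clamp_(-1.0, 1.0)  # clamp keeps values from blowing up
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{100 / elapsed:.1f} matmul iterations/sec")
```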
Cheers mate! Hi, this is me from the future, mid-August '23, wanting to say this is still saving lives and GPU time.
Why are people still happy with the speeds they are getting? Are you getting 35-57% faster speeds than a 3090? If not, then you need to be very angry, as I am.
@evanheckert Thumbs-down me all you like, but I own a 4090, I have done the testing, and the truth hurts. Thumbs-downing the messenger for telling you straight-up facts just shows you are biased, on a good day. What is your major malfunction with what I said? You have a 4090 too and are so upset you just had to thumbs-down the lone voice in the Jensen woods? smh.
Because I'm back to 1 sec/it... (I'm on a 3080 Ti).
I hear there is a specific issue with 4090s currently. It must be frustrating... but it will be sorted, I have no doubt.
I missed the context from your previous post. It sounded like someone just dissing on free open source software without trying to help identify and resolve the problem, which always grinds my gears. I'll retract my thumbs-down now that I see you've shared a lot of information about your issue.
Thank you, as I love OSS, and especially OSH, but as my experiment shows, something isn't right. We do know Nvidia has some issues to address in a future update concerning the memory management they now do. This slowness issue affects Ada across the board, from the RTX 6000 Ada to the 4090 and 4080; the other cards I am unsure about. What I showed is that something changed with the Kohya setup when the XL version was released. I am unsure what it could be, though, as I am just showing that something is indeed not right. When you pay 1200, 1700, or 5-6k dollars for an Ada card and they are all about the speed of, or slower than, the previous gen, something is seriously wrong. My hope is it works itself out, as paying 1700 dollars to end up slower than my friend's 3090 really is bad.
As I said, I hope it gets sorted out, but I was talking about the Ada card holders seemingly fine with being slower than last gen, not about last-gen cards. My friend never saw a slowdown on his 3090 at the same 2.1 size. If you increase to XL size then I expect it to be slower, so that isn't what I mean. Are you saying that at the same training size your 3080 Ti suddenly became slower? If that is the case then, yeah, the steps mentioned might speed you up. But if you mean you went from 1.5/2.1 size to XL size and became slower, just remember that going from 1.5 to XL is four times the amount of info being handled.
To be fair, @oranwav, I do get 2.4 it/sec on my 3070 (that's with 512x512 and a batch size of 2, FWIW), so I'm not sure it's an apples-to-apples rate comparison. Edit: Running one now:
I hadn't used kohya_ss in a couple of months.
I did a fresh install using the latest version, tried with both pytorch 1 and 2, and did the acceleration optimizations from the setup.bat script. I now find that training LoRAs with the same parameters and datasets I previously used has slowed down to a crawl; something that took 30 minutes previously now takes over 2 hours. I don't get any error messages (other than the missing 'triton' package if I use pytorch 2), so I'm quite lost as to what is going wrong in this situation.
Where could I have gone wrong? Should I manually install a new CUDA version? I have a GTX 1080, so I'm not using cuDNN.