Training extremely slow after updating #961
Closed
pulth opened this issue Jun 10, 2023 · 46 comments
@pulth commented Jun 10, 2023

I hadn't used kohya_ss in a couple of months.

I did a fresh install using the latest version, tried both PyTorch 1 and 2, and applied the acceleration optimizations from the setup.bat script. I now find that training LoRAs with the same parameters and datasets I previously used has slowed to a crawl: something that took 30 minutes before now takes over 2 hours. I don't get any error messages (other than the missing 'triton' package if I use PyTorch 2), so I'm quite lost as to what is going wrong in this situation.

Where could I have gone wrong? Should I manually install a new CUDA version? I have a GTX 1080, so I'm not using cuDNN.

@pulth (Author) commented Jun 10, 2023

For reference, I'm training a test model on 12 images of a character at 125 steps each using batch size 2 and 512,512 resolution and get a training time of over 2 hours. Is this expected from a GPU like the 1080 with 8GB of VRAM?
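
For rough scale, a quick back-of-the-envelope (a sketch; assumes a single epoch and no regularization images):

images, repeats, batch_size = 12, 125, 2
optimizer_steps = images * repeats // batch_size   # 750 steps per epoch
seconds_per_it = 2 * 60 * 60 / optimizer_steps     # a 2-hour run works out to ~9.6 s/it
print(optimizer_steps, round(seconds_per_it, 1))   # versus ~2.4 s/it back when the same run took 30 minutes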

@sumire608

If you have installed NVIDIA driver 535.98, roll back to the previous version. There's a bug with it.

@pulth (Author) commented Jun 10, 2023

I did see there are issues with the newer NVIDIA drivers, so I'm currently using version 527.56.

@sumire608

OK, try reducing the batch size to 1 then. 🤒

@Disorbs commented Jun 14, 2023

Is this still an issue? I reverted back as well and it worked fine, but I updated from Windows 10 to 11 and I think my drivers updated too. I got a fully new kohya zip and reinstalled it, and it works, but it now says something like 300 hours for 2500 steps, lol (before it took 5-10 minutes).

@bmaltais (Owner)

Make sure to install CUDA 11.8 drivers

@rehvka commented Jun 15, 2023

Hello, same issue here! I followed @bmaltais's YouTube guide for LyCORIS training TO THE LETTER, and it took 4 hours of training instead of the 12-15 minutes it took him. I have a 3080, and I think I have everything set up (I'll put a screenshot of what the script detects when I launch it). I've re-installed a few times and it's still slow. I thought it could be my dataset (images too large), so I resized them to 512x512 just in case... and it still took three and a half hours.
I am very sorry to bother with such a question; I am just very ignorant when it comes to coding. Is there a way to rule out problems? Thanks!
[screenshot of the launcher's environment detection]

@bmaltais (Owner)

Try re-installing with torch 1... Everything looks good, so I have no idea why it would be so slow... How big is your dataset? How many repeats? Perhaps you are training a large dataset?

@rehvka commented Jun 16, 2023

@bmaltais I followed your video on YouTube about LyCORIS (thank you for those videos by the way, very helpful for newbies like me): same number of images, same steps, everything. I downloaded the images from Instagram so they are not very large, and I even resized them to 512x512 just in case. A friend of mine pointed out that the last few NVIDIA drivers were faulty, so I am going to roll them back to see if that helps.

@Kr01iKs commented Jun 19, 2023

Same problem. Has anyone fixed it?

@bmaltais (Owner)

Someone mentioned trying to install bitsandbytes-windows... might help...

activate the venv (.\venv\scripts\activate.bat)
pip install bitsandbytes-windows

Hope it helps.
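
A quick sanity check from inside the activated venv that the GPU is visible to torch and that bitsandbytes imports at all (a minimal sketch; it does not measure training speed):

import torch
print(torch.__version__, torch.version.cuda)  # torch build and the CUDA version it was compiled against
print(torch.cuda.is_available())              # should be True; False points at a driver/CUDA problem
import bitsandbytes                           # raises an import/DLL error if the install is broken
print("bitsandbytes imported OK")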

@wenwenwennnnn

hello, same issue here! I followed @bmaltais youtube guide for lycoris training TO THE LETTER, and it took 4 hours of training! instead of the 12-15 minutes it took him. First of all I have a 3080, I think I have everything set up (I'll put a screenshot of what the script detects when I launch it), Re-installed a few times and still slow. I thought it could be my dataset (images too large) so I resized them to 512x512 just in case.. and still 3 hours and a half.
I am very sorry to bother with such a question; I am just very ignorant when it comes to coding. Is there a way to rule out problems? Thanks!

I also have the same problem. After updating kohya, it shows headless: false. It still starts and trains normally, but the training speed is extremely slow. Before the update, a LoRA of 5000 steps took about an hour; after the update it says it needs 10 hours. Can anyone solve this problem? I use PyTorch 2.0.

@bmaltais (Owner) commented Jun 20, 2023

I am not sure what the cause of the slow speed is for some users. I have not observed it on my system. It is probably a combination of driver version and hardware. I suggest you go back to the release you used to run until this is fixed in a future release.

git checkout <release name like v20.5.2>

The headless: false message is normal; it is just reporting that the GUI is NOT running in headless mode.

@evanheckert

I'm having the same experience - even tried rolling back to the last version that was working well for me. Still getting a speed of 24 seconds per item on a 3070, which is nearly 100x slower than before.

Since rolling back the kohya_ss version didn't fix it either, the other thing that has changed is the nvidia driver. I'm currently on game ready driver v536.23, and still getting the slow speeds. Is there a known working driver version?

@pulth (Author) commented Jun 29, 2023

So, I got a decent speed now. What I did was to install with pytorch 1 instead of 2 and use a lower number of total steps. Training for characters works fine now, although it takes at least one hour using DAdaptation to get good results.

@Warleroux

It appears to be a common issue. RTX 3060 in my case. Same general issue as other posts: ridiculously long training. I'd love to hear about any other suggestions. In my case I upgraded to torch 2 to try to improve speeds and, in short, "nope". Thanks

@bmaltais (Owner)

This is kind of frustrating... One thing I could suggest is to go back to a previous release that used to work, delete the venv folder, and redo the setup. If things are still slow, then it is related to a change outside kohya_ss.

@E2GO commented Jul 20, 2023

Same issue on a 4090 here as well.
UPDATE: I removed the venv, reinstalled choosing 'none' for precision (not fp16 or bf16), and followed the advice @bmaltais posted above:
activate the venv (.\venv\scripts\activate.bat)
pip install bitsandbytes-windows
After that the speed normalized. I don't know what exactly causes it, but after a couple of reinstalls it looks like it is the precision type. First I chose fp16 and later bf16... but I can't prove it for sure.
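
If you want to check which precision accelerate recorded during setup, the config file can be printed directly; a rough sketch, assuming it was written to the usual default location under the user's huggingface cache:

from pathlib import Path

cfg = Path.home() / ".cache" / "huggingface" / "accelerate" / "default_config.yaml"
if cfg.exists():
    print(cfg.read_text())  # look for the mixed_precision line: 'no', 'fp16' or 'bf16'
else:
    print("No accelerate default_config.yaml found at", cfg)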

@evanheckert commented Jul 20, 2023

Fixed for me, so hopefully this helps others.
(Note: Running Windows 11, RTX 3070)

  1. Uninstall Cuda toolkit and display drivers
  2. Remove environment variables related to cuda
  3. Make sure all Cuda related folders are deleted in program files
  4. Download Cuda toolkit (I chose 11.8) and install, including display drivers that come with the installer
  5. Delete venv folder and the pycache folder
  6. Run setup, using torch 2.

I think it was environmental variables, where I still had traces of a previous 12.1 install. After following the above steps, everything works perfectly!
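
After a reinstall like this, it may be worth confirming from inside the venv that the torch build actually matches the 11.8 toolkit; a minimal check along these lines:

import torch
print(torch.__version__)              # e.g. a +cu118 suffix for a CUDA 11.8 build
print(torch.version.cuda)             # CUDA version torch was compiled against
print(torch.cuda.is_available())      # False here usually means a driver problem
print(torch.cuda.get_device_name(0))  # confirms the expected GPU is visible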

@Warleroux

Awesome, I'll try it. I have zero patience to watch the technological paint dry, lol. Can you elaborate on the environment variables you addressed? My experiments have angered something, haha.

@evanheckert

Sure, I can help you with that. Here are the steps to edit or delete environment variables in Windows:

  1. Open the Control Panel.
  2. Go to System and Security > System.
  3. Click the Advanced system settings link.
  4. In the System Properties window, click the Environment Variables button. The window has two sections: User variables and System variables.
    • User variables apply to the current user only.
    • System variables apply to all users on the computer.
  5. To edit a variable, select it and click Edit. In the Edit window, you can change the name and value of the variable.
  6. To delete a variable, select it and click Delete.
  7. Click OK to save your changes.
@mbastias

I'm having this problem, but in a weird way, on Windows. I've reinstalled twice now, and after each reinstall I get a speed of 2.x it/s. I shut down the PC for the day, come back the next day, run a training (similar params), and it goes back to a slow 1.5 s/it. Any ideas what Windows could be doing that makes the speed go down like that?

@Warleroux

Thanks man. I'll give it a go :)

@Warleroux

Update on my end. Evanheckert's suggestion, including the removal of the environment variables, fixed my issue. Thank you for the insights.

@bmaltais bmaltais pinned this issue Jul 23, 2023
@tobi-sp commented Jul 24, 2023

Posting this here as it is somewhat related. Same case as OP; I tried the solution by evanheckert, but sadly it only worked halfway. I had no environment variables left to delete after uninstalling CUDA 12.1, but downgrading to 11.8 helped get back to normal speeds (10 it/s, 3080) for one training.
When starting another (LoHa) training, speeds are very slow (80 s/it) and I need to restart the PC (restarting only the GUI doesn't help). So I guess this also points to something outside the GUI being the problem?

@mbastias commented Jul 26, 2023

OK, I kept trying things and I've discovered that, at least sometimes, this might be a video memory issue. I have a 3070 Ti, which has 8GB. Previously I wasn't able to run more than batch size 2 with Adam8bit or I would get an OOM error, because my limit was the physical 8GB. But lately with DAdaptAdam, on my previous setup (I changed drives and formatted), I was running batch size 3 and it seemed to work fine. While looking at the Task Manager just now, I noticed that the LoRA I'm currently training was using shared GPU memory, i.e. system RAM, which was obviously bottlenecking the speed and even halting it. The solution in this case was lowering the batch size from 3 to 2. I don't know if that could be tobi-sp's second-run case too. But why is it spilling over there to start with? Is accelerate mixing the memories up?

Edit: I kept testing this and it seems that, in my case, as soon as the shared memory gets used the speed is affected; the more is used, the slower it gets. But as I said, the behavior before was to give an error if the card memory wasn't enough; now it is allocating from both places. Anything I can do to change that?
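
One rough way to watch for this while training is to compare what torch has reserved against the card's physical VRAM; spill into shared memory itself only shows up in Task Manager or nvidia-smi, but getting close to the total is the warning sign (a sketch, assuming a single GPU at index 0):

import torch

total = torch.cuda.get_device_properties(0).total_memory
reserved = torch.cuda.memory_reserved(0)
allocated = torch.cuda.memory_allocated(0)
print(f"allocated {allocated / 2**30:.1f} GiB, reserved {reserved / 2**30:.1f} GiB of {total / 2**30:.1f} GiB")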

@AIrtistry

Fixed for me, so hopefully this helps others. (Note: Running Windows 11, RTX 3070)

  1. Uninstall Cuda toolkit and display drivers
  2. Remove environment variables related to cuda
  3. Make sure all Cuda related folders are deleted in program files
  4. Download Cuda toolkit (I chose 11.8) and install, including display drivers that come with the installer
  5. Delete venv folder and the pycache folder
  6. Run setup, using torch 2.

I think it was environmental variables, where I still had traces of a previous 12.1 install. After following the above steps, everything works perfectly!

I tried this solution but it's not working :(

@tobi-sp commented Aug 1, 2023

Ok, I kept trying things out and I've discovered that at least sometimes, this might be a video memory issue. I have a 3070ti which has 8GB. So previously I wasn't able to run more than batch 2 at Adam8bit or I got a OOM error, because my limit was the physical 8GB. But lately with DadaptAdam, on my previous setup (I changed drives and formatted) I was running batch 3 and it seemed to work fine. So, while I was looking at the Task Manager just now, I noticed that the LoRA I'm currently training was using Shared GPU memory, so RAM, and it was obviously bottlenecking the speed and even halting it. The solution on this case was lowering the batch size from 3 to 2. I don't know if that could be tobi-sp second run case too. But why is it going there to start with, is accelerator mixing the memories up?

Edit: I kept testing this and seems to be that on my case, as soon as the Shared memory gets used the speed gets affected, the more is used, the slower it gets. But as I said, the behavior before was to give an error if the card memory wasn't enough, now is allocating from both places. Anything I can do to change that?

Was able to recreate this now. As soon as there is something in the shared GPU memory, it crashes/doesn't continue with training. I'm pretty sure I had 1 or 2 GB in shared memory in earlier versions though...

Edit: Using DDU to uninstall the NVIDIA driver and going with 531.68 solved it for me. Even with shared memory it's fast again.
Edit 2: But still, my newly trained LoHa has no effect. Previously trained LoHas are still working...

@Zorgonaute84 commented Aug 6, 2023

I also have the issue during LoRA training. I fixed it by:

  • Restarting the PC
  • Trying again -> same issue
  • Launching a textual inversion -> same issue
  • Launching a textual inversion (without buckets and only 1 epoch) -> fixed, it trains at full speed again
  • Stopping the textual inversion training
  • Launching the LoRA training -> perfect, speed is back

After some training, the issue appears again... I repeat the same process, and the speed comes back.

Edit: I noticed that the issue appears when I try my LoRA with InvokeAI

@tobi-sp commented Aug 6, 2023

Quick update on the failed LoHa training on my side: after I found #1291, I switched to that specific kohya_ss version, and in combination with the 531.68 driver, training works again. I tried other older versions and they didn't work, so my takeaway is that there are apparently some bitsandbytes-related problems.
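
If bitsandbytes is the suspect, printing the exact versions installed in the venv makes it easier to compare against a checkout that works (a minimal sketch; the package names are just the ones discussed in this thread):

from importlib.metadata import PackageNotFoundError, version

for pkg in ("bitsandbytes", "bitsandbytes-windows", "torch", "xformers"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")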

@mbastias commented Aug 6, 2023

Ok, I kept trying things out and I've discovered that at least sometimes, this might be a video memory issue. I have a 3070ti which has 8GB. So previously I wasn't able to run more than batch 2 at Adam8bit or I got a OOM error, because my limit was the physical 8GB. But lately with DadaptAdam, on my previous setup (I changed drives and formatted) I was running batch 3 and it seemed to work fine. So, while I was looking at the Task Manager just now, I noticed that the LoRA I'm currently training was using Shared GPU memory, so RAM, and it was obviously bottlenecking the speed and even halting it. The solution on this case was lowering the batch size from 3 to 2. I don't know if that could be tobi-sp second run case too. But why is it going there to start with, is accelerator mixing the memories up?
Edit: I kept testing this and seems to be that on my case, as soon as the Shared memory gets used the speed gets affected, the more is used, the slower it gets. But as I said, the behavior before was to give an error if the card memory wasn't enough, now is allocating from both places. Anything I can do to change that?

Was able to recreate this now. As soon as there is something in the shared GPU memory it crashes/doesnt continue with training. Im pretty sure I had like 1 or 2 GB in shared memory in earlier versions though...

Edit: Using DDU to uninstall nvidia driver and going with 531.68 solved it for me. Even with shared memory it's fast again. Edit2: But still, my trained LoHa has no effect. Previously trained lohas are still working...

I tried rolling back the card drivers and I can observe two changes. Memory allocation is way more efficient; I'm using less VRAM per training. And shared memory seems to have less of an influence on speed: the speed goes up during training, whereas with the latest drivers it used to go down instead.

@Thom293 commented Aug 7, 2023

tagging this.

@brianiup commented Aug 9, 2023

NVIDIA put out new Windows drivers yesterday (8/8, version 536.99). I tried them today and so far they are working MUCH better. However, I've noticed that CUDA usage will stay at 90-100% for a while, then drop to around 60% for a while, then go back up to the 90s, repeating that cycle. Is there anything you can do to make sure it's using the cores as efficiently as possible?

@atomantic commented Aug 9, 2023

A lot of variables here, so this is an itemization of my journey from 12 s/it to 5.66 s/it -- TL;DR is that Torch 1 might be a good first thing to try.

Windows 11, Nvidia 3090 RTX, GeForce Game Ready Driver 535.98

  • Seeing 12s/it on 12 images with SDXL lora training, batch size 1, learning rate .0004, Network Rank 256, etc all same configs from the guide.
  • Noticed I was using python 3.11.0, which had a subtle error message that I overlooked.
  • deleted venv
  • Installed python 3.10.9
  • install via ./setup.bat (answered with all defaults)
  • same result (12s/it)
  • updated Game Ready Driver to 536.99
  • same result (12s/it)
  • started over with Torch 1 (renamed venv to venv_torch2 and did a new run of setup.bat, choosing torch 1 but everything else defaults)
  • The GUI errored with module 'os' has no attribute 'statvfs', so I did pip install --upgrade aiofiles to fix it.
  • Ran training again and got 7 s/it, which over the next 14 minutes shrank to 5.66 s/it! -- still 3.5-4 hours estimated, which isn't great but it's better than 8 hours...

@DarkAlchy commented Aug 11, 2023

I have not tried the drivers just released, as I just rolled back to 531.61, and my 4090 is still very unhappy, as am I. As a test I tried 1 image with 1 repeat (edit: BS 1), only to see 1.25-1.5 s/it. Something, somewhere, is seriously wrong.
[screenshot of training speed]

@DarkAlchy

Here are some hard ass numbers. This is the same dataset, same everything, only this is local 4090 vs a Colab T4. WATAF?

4090: [screenshot of training speed]
Colab T4: [screenshot of training speed]

@DarkAlchy

Now, here is something for you, as I just tested.

1 - I grabbed my sd-scripts from my Colab.
2 - I used the same toml files.
3 - Ubuntu 22.04 on Colab, and locally.
4 - The images in my last response were using the GUI vs Colab.
5 - I had to install everything for my local scripts, so everything is the latest.

After I did 4, I decided to do a really fair test, given 3. I only changed the paths for my local run from the Colab paths. Mind you, these are Windows-mounted drives, so there is translation happening that will slow things down (by how much I am unsure, but it is there, as I have seen it in some benchmarks before).

Here is the result of this test. To keep it fair and really test this, it is on 2.1, using the exact same captions and images at 768x768 2.1 settings. I don't care about anything else but my speed, for now.

Here is the result of several tests.

Colab T4: [screenshot of training speed]

Local 4090: [screenshot of training speed]

I still think that is slow, but look at BS 4: [screenshot of training speed]

Considering I am using my card for all sorts of other stuff, using Windows-mounted drives, and Colab is headless, this isn't all that bad, and compared to the Windows GUI usage it blows it out of the water. By the way, it used a whole lot more of my card than anything else: it went to 70C at 31% fan and flipped back and forth between 64C and 70C. I never get that in the Windows GUI.

@oranwav commented Aug 28, 2023

Fixed for me, so hopefully this helps others. (Note: Running Windows 11, RTX 3070)

  1. Uninstall Cuda toolkit and display drivers
  2. Remove environment variables related to cuda
  3. Make sure all Cuda related folders are deleted in program files
  4. Download Cuda toolkit (I chose 11.8) and install, including display drivers that come with the installer
  5. Delete venv folder and the pycache folder
  6. Run setup, using torch 2.

I think it was environmental variables, where I still had traces of a previous 12.1 install. After following the above steps, everything works perfectly!

Cheers Mate! Hi, this is I from the future, mid august 23, wanted to say this is still saving lives and GPU time.
You are a chad.

@DarkAlchy

Fixed for me, so hopefully this helps others. (Note: Running Windows 11, RTX 3070)

  1. Uninstall Cuda toolkit and display drivers
  2. Remove environment variables related to cuda
  3. Make sure all Cuda related folders are deleted in program files
  4. Download Cuda toolkit (I chose 11.8) and install, including display drivers that come with the installer
  5. Delete venv folder and the pycache folder
  6. Run setup, using torch 2.

I think it was environmental variables, where I still had traces of a previous 12.1 install. After following the above steps, everything works perfectly!

Cheers Mate! Hi, this is I from the future, mid august 23, wanted to say this is still saving lives and GPU time. You are a chad.

Why are people still happy with the speeds they are getting? Are you getting 35-57% faster speeds than a 3090? If not then you need to be very angry as I am.

@DarkAlchy

@evanheckert Thumbs down me all you like, but I own a 4090, and I have done the testing and the truth hurts. Thumbs down the messenger for telling you straight up facts, else it shows you are biased on a good day.

What is your major malfunction with what I said? You have a 4090 too and are so upset you just had to thumbs down the lone voice in the Jensen woods? smh.

@oranwav commented Aug 29, 2023

Why are people still happy with the speeds they are getting? Are you getting 35-57% faster speeds than a 3090? If not then you need to be very angry as I am.

Because im back to 1sec/it... (im on 3080ti)
That is NOT slow my friend.

@oranwav commented Aug 29, 2023

@evanheckert Thumbs down me all you like, but I own a 4090, and I have done the testing and the truth hurts. Thumbs down the messenger for telling you straight up facts, else it shows you are biased on a good day.

What is your major malfunction with what I said? You have a 4090 too and are so upset you just had to thumbs down the lone voice in the Jensen woods? smh.

I hear there is a specific issue with 4090s currently. It must be frustrating... but it will be sorted, I have no doubt.

@evanheckert

@evanheckert Thumbs down me all you like, but I own a 4090, and I have done the testing and the truth hurts. Thumbs down the messenger for telling you straight up facts, else it shows you are biased on a good day.

What is your major malfunction with what I said? You have a 4090 too and are so upset you just had to thumbs down the lone voice in the Jensen woods? smh.

I missed the context from your previous post. It sounded like someone just dissing on free open source software without trying to help identify and resolve the problem, which always grinds my gears.

I'll retract my thumbs down, now that I see you've shared a lot of information about your issue.

@DarkAlchy commented Aug 29, 2023

@evanheckert Thumbs down me all you like, but I own a 4090, and I have done the testing and the truth hurts. Thumbs down the messenger for telling you straight up facts, else it shows you are biased on a good day.
What is your major malfunction with what I said? You have a 4090 too and are so upset you just had to thumbs down the lone voice in the Jensen woods? smh.

I missed the context from your previous post. It sounded like someone just dissing on free open source software without trying to help identify and resolve the problem, which always grinds my gears.

I'll retract my thumbs down, now that I see you've shared a lot of information about your issue.

Thank you. I love OSS, and especially OSH, but as my experiment shows, something isn't right. We do know NVIDIA has some issues to address in a future update concerning the memory management they now do. This slowness affects Ada across the board, from the RTX 6000 Ada to the 4090 and 4080; the other cards I am unsure about. What I showed is that something changed with the kohya setup when the XL version was released. I am unsure what it could be, though; I am just showing that something is indeed not right. When you pay $1,200, $1,700, or $5-6k for an Ada card and they are all about the speed of, or slower than, the previous gen, something is seriously wrong. My hope is that it works itself out, as paying $1,700 to end up slower than my friend's 3090 really is bad.

@DarkAlchy

Why are people still happy with the speeds they are getting? Are you getting 35-57% faster speeds than a 3090? If not then you need to be very angry as I am.

Because im back to 1sec/it... (im on 3080ti) That is NOT slow my friend.

As I said, I hope it gets sorted out, but I was talking about the Ada card holders seemingly fine with being slower than last gen, not about last-gen cards. My friend never saw a slowdown on his 3090 at the same 2.1 size. If you move up to XL size then I expect it to be slower, so that isn't what I mean.

Are you saying that with the same size of training your 3080 Ti suddenly became slower? If that is the case then, yeah, the steps mentioned might speed you up. But if you mean you went from 1.5/2.1 size to XL size and became slower, just remember that 1.5 to XL size is four times the amount of information being handled.

@evanheckert commented Aug 30, 2023

Why are people still happy with the speeds they are getting? Are you getting 35-57% faster speeds than a 3090? If not then you need to be very angry as I am.

Because im back to 1sec/it... (im on 3080ti) That is NOT slow my friend.

To be fair, @oranwav, I do get 2.4 it/s on my 3070; that's at 512x512 with a batch size of 2, FWIW, so I'm not sure it's an apples-to-apples rate comparison.

Edit: Running one now: | 1373/5976 [07:47<26:07, 2.94it/s, loss=0.126]

bmaltais pushed a commit that referenced this issue Dec 3, 2023