SD 1.5 Training Completely Broken + Unable to Use Old Kohya Commits #1291
Comments
Yes, kohya_ss has been completely broken for me too since v21.8.2. I used the same settings that gave me good results before 21.8.2, but now it's impossible to get decent results even after multiple modifications :(
One thing to check would be your version of bitsandbytes. From my experience with AdamW8bit on Windows, and even on WSL2 Ubuntu (though I'll have to look at it more to be sure), newer versions of bitsandbytes (required for full bf16) cause it to rapidly scale the weights up, quickly frying the model. With scale weight norms enabled, this manifests as high average key weights and a large number of keys being scaled; without scaling weight norms, loss increases quickly, eventually approaching 1 or NaN. Any version after 0.35.0 shows this particular issue on my setups. It may not be the cause of the originally reported issue, just something to check.
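A quick way to check which bitsandbytes build is actually active in the venv (a minimal sketch, assuming a standard pip-based install; activation paths vary by platform):

```bash
# Activate the kohya_ss virtual environment first:
#   source venv/bin/activate      # Linux / WSL2
#   .\venv\Scripts\activate       # Windows
pip show bitsandbytes            # name, version, and install location
python -c "from importlib.metadata import version; print(version('bitsandbytes'))"
```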
I did note that on Windows, with the latest version of this UI (21.8.5), at some point it looks to be installing a version newer than 0.35.0, possibly mixing multiple versions. Grabbing a known-working 0.35.0 build (one that has the Windows DLLs), then deleting and replacing the existing install, works regardless of what the setup says. This happens even with a completely fresh install of the UI, so there would seem to be some issue in the dependency install process. Edit: Figured it out. I had assumed the setup option to install bitsandbytes-windows was just setting it up as normal, but it actually installs a newer version of bitsandbytes, and newer versions cause problems on my setup, and seem to typically cause problems for others as well.
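A minimal sketch of that replacement, assuming a 0.35.0 wheel is available for your platform (on Windows, the CUDA DLLs may still need to be copied in separately, as older install guides describe):

```bash
# Remove any existing builds, then pin the known-working version
pip uninstall -y bitsandbytes bitsandbytes-windows
pip install bitsandbytes==0.35.0
pip show bitsandbytes            # confirm exactly one 0.35.0 install remains
```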
I have the same problem.
@67372a 20 images were used, without any retouching or resizing; the base image format was 512x512. The following parameters were used: BF16 / AdamW / constant / warmup 0 / network dim 128.

With batch 1, epoch 10, rep 5 (> 1000 steps), everything worked perfectly before the introduction of SDXL. An alternative setup with batch 2, epoch 10, rep 10 (> 1000 steps) worked just as well, at 2.5 to 3 it/s (iterations per second) with a loss between 0.5 and 0.7. Currently, with the same settings, it runs at around 3.5 to 4 it/s with a loss of 0.8 to 0.9. It's important to note that this is with the same checkpoint and all the same settings.

For the 20 images with identical parameters and no modifications, apart from the Triton error that appears at the end of each epoch (which doesn't seem to cause any issues), the LoRA results are completely distorted: faces look like they have undergone excessive plastic surgery. To address this, I ran tests at 3000 steps, gradually reducing to 2500/2000/1500/1000, etc., but the output remains consistently bad. I also tested checkpoints created when this problem first arose, including dedicated 2D checkpoints, but the LoRA results still don't resemble the model at all. Since older commits are blocked and not functioning currently, I'm unable to make any working LoRA on SD 1.5. The only workaround I found was to combine LoRAs in Vlad or A1111: to obtain approximately 95% of the chosen model's face without excessive distortion, I have to stack them like this and then apply ADetailer on top: lora:01Test001:0.5,Lora:01Test002:0.5. This situation is certainly abnormal and requires attention.
OK... so there is probably something bad going on with the bitsandbytes version. This is tricky because supporting SDXL requires newer versions of bitsandbytes for the new optimizers... but they no longer behave like they used to and require totally new LR tuning when training models. I can't really keep the old and the new alive at the same time without some significant overhead... So for now, the best option is probably to find the last known-good release and fix the gradio interface so it starts working again. @XT-404 Can you tell me what the last good version was that no longer works when you open the interface? I will test it on my side, see if I can apply a fix, and publish a version that will work again.
@bmaltais Thank you for your prompt response and attention to this matter. I appreciate your hard work and efforts for us, the users. Thank you, and you have my support.
OK, try this and see if it will fix the gradio issue for v21.7.16. Update the requirements.txt file and change the line

to

Save the file and run gui.sh again. Let me know if it fixes the GUI issue. If it does, I will publish a v21.7.16.1 minor update to fix it so users can use that release again.
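The exact pins were not preserved above; purely as a hypothetical illustration of the kind of one-line requirements.txt change meant here (the version number below is made up, not the real one):

```bash
# Hypothetical pin for illustration only; the real versions differed
sed -i 's/^gradio==.*/gradio==3.36.1/' requirements.txt
./gui.sh
```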
@bmaltais Thank you for the tip :) I will run some further tests and come back to you on whether things return to normal or not. It's worth noting that bitsandbytes_windows also gets installed in this version, even if not activated. Thank you again for your response, Mr @bmaltais :)
@bmaltais Batch 1, epoch 10, rep 5: the results are the same, the face is completely distorted. The issue with opening Gradio has been resolved, which is good news. However, the problem with the distorted faces persists, despite it/s (iterations per second) peaking at 2.5 to 2.7 and loss ranging from 0.0230 to a maximum of 0.0570.
I have just run a test after making the modifications on version v21.7.10. I also had to modify Gradio and run gui.sh two or three times (the first time I encountered errors), but after relaunching it twice there were no more errors. The installation was successful, done under PyTorch 1. I launched Kohya v21.7.10 and started the training. The training process went perfectly, and I can confirm that the results are outstanding. Without a doubt, I now get the faces and appearances of the trained models as before, without any issues. If bitsandbytes_windows is not referenced in the configuration, as in this version, I no longer get any Triton errors and no more visual anomalies.
So it's definitely a problem with bitsandbytes_windows?
@bmaltais First of all, thanks for your great work and effort! Download the 21.7.10 version from https://github.com/bmaltais/kohya_ss/releases?page=2. One important thing: if you have saved config files, recreate them from scratch! Thanks again to @bmaltais!
@AIrtistry
I had the same problem in the past (#1252); you need to reinstall kohya_ss via setup.bat but NOT install bitsandbytes-windows.
@maxencry Good morning
I got it working as you described, but made a second folder for an install with Torch v2, and it seems to be working for me. What was wrong with it?
I wonder if any of this may actually have its tentacles in the slow-speed issue as well?
@DarkAlchy No idea at the moment. What I can say, however, is that all of the old versions, for example 21.1b or 21.1.0, no longer work at all; the setup is broken as well.
I did a nice test on speed where I used the old scripts directly for 2.1, and the speed greatly increased. #961 (comment) The introduction of the SDXL code screwed up the speed for all versions (I don't know how, nor care to know how; I just know what I found in my test). Something is seriously wrong with the scripts. This is just a GUI bmaltais made; it's a friendlier interface for people to use the sd-scripts. It is nothing more, so the problem lies in the sd-scripts made by Kohya (again, as I proved).
@DarkAlchy
@XT-404 Thing is, did you see my experiment? I think the slowness on Ada-based cards is deliberate on Nvidia's part to protect their Hopper sales. Think of it as an LHR, but for this. Gamers will never see it, just as they didn't with those LHR cards. LHR was mostly defeated, so I bet if the people who broke LHR looked at the current drivers (if they had the incentive, because they certainly have the brains), they would find something. It might just be an innocent issue, but the 4090 will be one year old come October, and this isn't AMD, so there's no plausible excuse short of malice, imo. The testing I did showed Kohya has an issue too: I went from sub-T4 speeds on Colab to about twice as fast. Combine the driver nonsense with the Kohya (SDXL) nonsense and we have some serious issues that need fixing. Remember, the latest proprietary Linux driver I can get is 535.
It also felt like whatever I did still ended up frying the LoRA or network (when I attempted DreamBooth), even when starting from the sd15 branch. Reinstalling the venv and even clearing out .cache/huggingface in my home directory did not help. In case others like me end up reading this issue and reach their wits' end, this is what finally worked for me. The basic idea was to switch to upstream kohya-ss/sd-scripts running in WSL2 and use this project to help generate the command line. I also had to track down and hack around a few things to get xformers and bitsandbytes working...
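A rough sketch of that approach; the repository URL and script names are upstream's real ones, but the paths, pins, and flags below are placeholders rather than the original poster's exact commands:

```bash
# Inside a WSL2 Ubuntu shell: run upstream sd-scripts directly
git clone https://github.com/kohya-ss/sd-scripts.git
cd sd-scripts
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt        # upstream's own pins
pip install xformers bitsandbytes      # may need pinning to match your torch build

# Reuse the arguments the kohya_ss GUI prints to its console when training starts,
# for example something along these lines:
accelerate launch train_network.py \
  --pretrained_model_name_or_path=/path/to/model.safetensors \
  --train_data_dir=/path/to/images \
  --output_dir=/path/to/output \
  --network_module=networks.lora
```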
I followed these steps and noticed a significant speedup compared to the latest version: I went from 90 s/it to ~2.3 s/it when training an SD 1.5 LoRA. But I am not sure this is the speed I should be getting; I forgot what speeds I got from earlier versions. My setup is an RTX 3090, 768x768 resolution, batch size 6, 5 repeats, and 10 epochs, for a total of 1184 steps (1184 steps at ~2.3 s/it is roughly 2720 s, about 45 minutes, which matches the ~40 minutes it takes). Is this the expected speed for a 3090 with these parameters?
@YukiSakuma That is absolutely horrid for a 1.5 training. On an RTX 4090 at 1024x1024, BS 8, 20 repeats with 20 images and regularization, I get about the same as you on Linux. On Windows 10 I get about 1 s/it longer. Edit: I see you are using DyLoRA. Ditch it, as it is rubbish, slow, and doesn't work in ComfyUI, which most people are moving to (my iA3 is never going to be implemented in ComfyUI either, he told me).
I think I'm suffering from the same issue, except that Easy Training Lora gives approximately the same results as Kohya. Settings/image sets that used to give high-quality, stable LoRAs now give LoRAs that completely mangle outputs: extra appendages, bad eyes, weird outlines around characters, a film grain over images that won't go away, and a poorer understanding of the concept I am trying to train. I have tried Torch 1 and 2 with nearly identical results, in both ETL and Kohya. Based on my experience and what other people have reported, it sounds like the problem is probably in some dependency rather than Kohya itself. I am going to investigate further and possibly try what ecnat mentioned. I will update this comment or leave a new one if I find anything useful. Update: I had no luck. I didn't go through the effort of WSL, but I did try two older releases, including one from around the time I think I downloaded the version that worked. I also tried intentionally diverging from my old config in various ways, and it doesn't change much. Maybe it's Nvidia's fault somehow. For the time being, I am going to wait it out and see if more knowledgeable people find anything, because I'm out of my depth on this topic.
I should mention that while training under WSL gave me better results, it still wasn't very satisfactory. I hadn't trained LoRAs before this, so I don't know if it's just my settings or training set, though. In the end, what gave me good results was full DreamBooth on a RunPod with a bigger GPU...
I'm happy, but I'm afraid this might not be too useful to the discussion. I was able to get training working the same as it was before. Unfortunately, I had to use a Frankenstein approach, recreating my first setup from bits and pieces left over. I'm not sure which part of this did it, but if I had to guess, it's the old kohya_ss hash. Here's what I did:
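Purely as a hypothetical sketch of what pinning an old kohya_ss commit looks like (the placeholder hash below stands in for whichever one was actually used; these are not the author's recorded steps):

```bash
# <old-commit-hash> is a placeholder, not the author's actual commit
git clone https://github.com/bmaltais/kohya_ss.git kohya_ss_old
cd kohya_ss_old
git checkout <old-commit-hash>
./setup.sh            # or setup.bat on Windows
```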
It didn't generate the exact same LoRA (it can't reproduce images the old one made), but the outputs are equivalent in quality. At the risk of derailing this somewhat, can anyone explain what sshs_legacy_hash and sshs_model_hash mean? These, the timestamps, and the session ID are the only metadata differences between my new and old LoRA. I'm curious whether I could recreate the EXACT LoRA I made before if I make those match.
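For anyone wanting to compare these fields themselves: safetensors files keep their metadata in the file header, and the safetensors library can dump it. As far as I can tell, these fields are hashes that sd-scripts computes over the saved file itself, so they would only match for byte-identical outputs; treat that as an assumption, not a confirmed answer. A minimal sketch, with a placeholder file name:

```bash
pip install safetensors   # if not already in the venv
python - <<'PY'
from safetensors import safe_open

# "my_lora.safetensors" is a placeholder path
with safe_open("my_lora.safetensors", framework="pt") as f:
    md = f.metadata() or {}
    for key in ("sshs_model_hash", "sshs_legacy_hash", "ss_session_id"):
        print(key, "=", md.get(key))
PY
```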
removed unnecessary `torch` import on line 115
Hello,
I am posting a significant issue here:
Problem 1: Since the update to SDXL, all my Kohya_SS training for creating a checkpoint or LoRA has been completely broken. Before the major update implementing SDXL, I could easily create checkpoints or LoRAs with specific settings, based on extensive research carried out over several months. Now, no matter what values I use, all my checkpoints or LoRAs come out with a terrible, completely distorted appearance, whether after 1000 steps or 3000.
To verify whether the problem is related to my PC or originates from Kohya, I downloaded Easy Training Lora and ran my training there. Surprisingly, with Easy Training Lora I obtain perfect, clean LoRAs just like before, without faces resembling some kind of bizarre monster. However, when I use the exact same settings in Kohya, I get distorted faces. I asked a colleague who also trains LoRAs and checkpoints to run a series of tests (over 2 days), and she also got distorted faces.
Problem 2:
As a result, I decided to revert to an older version of Kohya_SS, from before the implementation of SDXL. However, during the installation process I encountered a strange issue. While the installation completed smoothly, upon launching the Kohya interface the window opens, but it's impossible to open any of the file dialogs that would let us pick the image source, the model checkpoint, or any other folder. It's as if the folder-selection buttons are disabled.
I am reporting this significant issue, and I want to mention that I am not experiencing any other errors, except for the Triton message indicating a missing Triton module, though I'm not sure what that means.
Apart from this, the training runs without any anomalies or CUDA/Python errors.
Also, no errors appear or are reported when installing an older version that then fails at the interface level.
Thank you in advance for addressing this matter.
Best regards,
Augus Wrath