
Can't get this to install correctly on latest A1111 #3

Open
oliverban opened this issue Sep 15, 2023 · 9 comments

Comments

@oliverban

It gives an error about not being able to install correctly. I've tried a few things, but it just doesn't work in the latest A1111 (1.6.0).

@6DammK9

6DammK9 commented Nov 5, 2023

Try my fork (https://github.com/6DammK9/auto-MBW-rt/). I got it working with some preset changes; however, you need to follow my installation guide to make it work.

@oliverban
Author

> Try my fork (https://github.com/6DammK9/auto-MBW-rt/). I got it working with some preset changes; however, you need to follow my installation guide to make it work.

Thanks, I'll check it out!

@Luke2642

@6DammK9 I tried your extension; it looks great, thank you. However, on the first test it slowly filled up my 32 GB of RAM and then choked. I'm on auto1111 1.7 on Ubuntu.

There are also a few dependency conflicts: image-reward requires a different version of huggingface-hub than gradio does.

@6DammK9

6DammK9 commented Dec 21, 2023

> @6DammK9 I tried your extension; it looks great, thank you. However, on the first test it slowly filled up my 32 GB of RAM and then choked. I'm on auto1111 1.7 on Ubuntu.
>
> There are also a few dependency conflicts: image-reward requires a different version of huggingface-hub than gradio does.

32 GB of system RAM is generally not enough for this use case. This kind of "RL" approach needs extra resources on top of SD + A1111: models have to be created in system RAM first and then moved to the GPU's VRAM.
I don't see a serious memory leak. ImageReward seems to load into the GPU successfully, while hyperactive / BayesianOptimizer uses some system RAM because numpy is CPU-only. My maximum system RAM usage is somewhere around 33.4 GB (34270 MB), and my PC has 128 GB of RAM (X99 motherboard, so the page file is unlikely to be activated). Choosing another reward model (chad) and a non-global optimizer (e.g. hill climbing) may dodge this problem.

As for the dependency problem, it comes from ImageReward's own dependencies, which become a headache if the environment is plain python + pip; even A1111's venv doesn't help. I suggest using conda (miniconda in my case) to set up the Python environment, or switching to another reward model. ImageReward has its own exclusive code base and is no longer actively developed.
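
If you want to see exactly which pins collide inside A1111's venv, a quick stdlib-only check like this should show it (the package names are just the ones mentioned above, so adjust as needed):

```python
# Quick check from inside A1111's venv; package names are just the ones
# mentioned above. Older Pythons may want the exact distribution name
# ("huggingface_hub" vs "huggingface-hub").
from importlib.metadata import version, requires, PackageNotFoundError

for pkg in ("huggingface-hub", "gradio", "image-reward"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")

# See whose huggingface_hub pin is stricter.
for pkg in ("gradio", "image-reward"):
    try:
        for req in requires(pkg) or []:
            if "huggingface" in req.lower():
                print(f"{pkg} -> {req}")
    except PackageNotFoundError:
        pass
```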

@Luke2642

@6DammK9 I'm confused... the first 10 steps run fine at about 5 GB of RAM usage, then it slowly creeps up, using more and more RAM, until around 85 steps it's choking at about 26 GB on my system!

I don't think this is related to the image-reward package; that is obviously working fine for the first 20 steps. Is it keeping many copies of the model variants in RAM for some reason? Why?

@6DammK9

6DammK9 commented Dec 23, 2023

> @6DammK9 I'm confused... the first 10 steps run fine at about 5 GB of RAM usage, then it slowly creeps up, using more and more RAM, until around 85 steps it's choking at about 26 GB on my system!
>
> I don't think this is related to the image-reward package; that is obviously working fine for the first 20 steps. Is it keeping many copies of the model variants in RAM for some reason? Why?

First, I assume the "steps" you mention are iterations of the optimization process rather than denoising steps.

Because of the messy software stack (this extension with multiple non-SD models > A1111 > SD WebUI > PyTorch > CUDA > driver, etc.), a memory leak on Linux (or Windows) can have many causes. I need more information (e.g. leak rate and pattern, "how much RAM over how long") to investigate, especially since I only have my Windows PC. Any line of Python script may fail to reference the same resource and create a new instance instead, hence the leak, and that can be OS-dependent.
You may check out this related issue for more ideas:
AUTOMATIC1111/stable-diffusion-webui#6722
As of 2023-12-23, it may point to a "new instance of VAE" created when A1111 unloads the SD model to the CPU (heard from the dev community), but I still need to verify that (or just tolerate it, because I have more RAM).
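
If you can capture something like the following once per iteration, that would give me the leak rate and pattern I'm asking for (rough psutil sketch, not part of the extension; the log file name is arbitrary):

```python
# Rough sketch, not part of the extension: log process RSS once per
# optimization iteration so the leak rate / pattern can be reported.
import time
import psutil

_proc = psutil.Process()

def log_rss(iteration, logfile="rss_log.csv"):
    """Append 'timestamp,iteration,rss_in_MB' to a CSV file."""
    rss_mb = _proc.memory_info().rss / (1024 ** 2)
    with open(logfile, "a") as f:
        f.write(f"{time.time():.0f},{iteration},{rss_mb:.1f}\n")

# Call log_rss(i) at the end of each iteration; a steady per-iteration
# climb (instead of a plateau) would point to a leak rather than caching.
```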

BTW, I will switch to bayesian-merger soon, but I have heard that it leaks even more badly and may consume up to 256 GB of RAM:
s1dlx/sd-webui-bayesian-merger#110

@Luke2642

Luke2642 commented Dec 24, 2023

Thanks for your help, I really appreciate it. I know memory leaks are the worst to track down.

I've been working on my own one too. I've refactored Runtime Block Merge and added ImageReward and hyperactive, but I've had to introduce threading and waits for hyperactive, because the Bayesian optimizer wants 'to be in control', so to speak: it calls the objective function, which then waits for each generate + score. I was just using 'generate forever' and catching the generations.
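
The handoff is roughly this shape (simplified sketch; request_generation() and the other names are placeholders, not my actual code):

```python
# Simplified sketch of the handoff; request_generation() and the names are
# placeholders, not the actual extension code.
import threading

result_ready = threading.Event()
latest_score = {"value": None}

def request_generation(params):
    """Placeholder: hand the merge weights to the WebUI generation loop."""
    pass

def on_generation_finished(score):
    """Called from the generation side once an image has been scored."""
    latest_score["value"] = score
    result_ready.set()

def objective(params):
    """Called by the Bayesian optimizer; blocks until generate + score finish."""
    request_generation(params)
    result_ready.wait()
    result_ready.clear()
    return latest_score["value"]
```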

I also stripped the code back: fp16 only, GPU memory only (so much faster), and safetensors only. That's just my preference for a clean UI, speed and size; I've never seen an fp32 checkpoint make a significant difference.
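
The loading path is then basically just this (sketch; the file path is a placeholder):

```python
# Sketch of the fp16-only / GPU-only loading path; the path is a placeholder.
from safetensors.torch import load_file

state_dict = load_file("model.safetensors")                       # loaded on CPU
state_dict = {k: v.half().to("cuda") for k, v in state_dict.items()}
# All tensors are now fp16 on the GPU, roughly half the fp32 footprint.
```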

Anyway, it's hacky as hell. I'm also getting a memory leak each time I cycle it but haven't investigated further. It could even be runtime block merge.

I've also looked at the actual model structure, and I think the 12-in / 12-out split is suboptimal. IN0, for example, is tiny, while the layers with emb/norm/proj/transformer are massive and could do with splitting up (anything with a .0 or a .1). So now it's clicked why some layers have much more effect than others, and why in and out behave so asymmetrically; they're completely different, and out is so much bigger!
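
You can see the asymmetry by summing parameter counts per block prefix, something like this (sketch, assuming the usual model.diffusion_model.* key layout of an SD1 checkpoint):

```python
# Sketch: sum parameter counts per UNet block to see the asymmetry described
# above. Assumes the usual SD1 key layout in a loaded state dict, e.g.
# "model.diffusion_model.input_blocks.0.0.weight".
from collections import defaultdict

def params_per_block(state_dict):
    counts = defaultdict(int)
    for key, tensor in state_dict.items():
        parts = key.split(".")
        if len(parts) > 3 and parts[1] == "diffusion_model":
            if parts[2] in ("input_blocks", "output_blocks"):
                block = f"{parts[2]}.{parts[3]}"      # e.g. input_blocks.0
            else:
                block = parts[2]                      # middle_block, out, ...
            counts[block] += tensor.numel()
    return dict(counts)

# input_blocks.0 comes out tiny; blocks containing transformer / attention
# weights are far larger.
```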

However, increasing the UI layer count beyond 27 would be a disaster for usability. I think the ideal UI for manual merging would have "only" five dials, plus five advanced dials, coded smartly enough to, for example, toggle the affected UNet layers symmetrically with skips. Or things like global gradV, gradA, cosine and smoothstep could each get a dial for strength and smoothness at block level. Then one dial would affect the high-up transformer layers, another the smaller, deeper ones, and so on.

In other words, the original Runtime Block Merge was perfect for learning, and now that we know what everything does, we can make a better interface.

Then, if we can pin down smarter block manipulation with fewer variables (27 down to, say, 10), the Bayesian optimiser should converge something like 10x faster. It's highly nonlinear!
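
Purely as an illustration of the dial idea (made-up names and mapping, not a real merge recipe), the expansion could look like this:

```python
# Made-up illustration of the "fewer dials" idea: expand a handful of controls
# into the usual 25 block weights (IN00-IN11, M00, OUT00-OUT11). The dial names
# and the smoothstep mapping are invented, not a real merge recipe.

def smoothstep(t: float) -> float:
    return t * t * (3.0 - 2.0 * t)

def dials_to_block_weights(in_strength: float, out_strength: float,
                           depth_bias: float) -> list[float]:
    weights = []
    for i in range(12):                                  # IN00..IN11, shallow -> deep
        t = smoothstep(i / 11.0)
        weights.append(in_strength * (1.0 - depth_bias * (1.0 - t)))
    weights.append((in_strength + out_strength) / 2.0)   # M00
    for i in range(12):                                  # OUT00..OUT11, deep -> shallow
        t = smoothstep(1.0 - i / 11.0)
        weights.append(out_strength * (1.0 - depth_bias * (1.0 - t)))
    return weights                                       # 25 weights from 3 dials
```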

Anyway sorry for the wall of text and random ideas!

@6DammK9

6DammK9 commented Dec 24, 2023

Thanks for your response.

No, the idea of shared.UNBMSettingsInjector is awful. The idea of shared.sd_model.sd_model_checkpoint is also awful. This kind of "plugin" is far from a properly written trainer like kohya-ss. When I picked up the repo alone, I didn't even know what was inside; the concept of "BO" only came up once I had fully recovered this dead repo into a usable state. If it points to sd-webui-runtime-block-merge, maybe we can discuss it there; I have no idea either. My machine has gone 48 iterations (still in progress) with no leak. I also tried using "Runtime Block Merge" on its own, and nothing special showed up.

I'm not sure whether anyone is trying to integrate AutoMBW / bayesian-merger with InvokeAI / ComfyUI; you may want to check. However, such work is usually not visible on GitHub and tends to be hidden in Discord servers.

For the fp16 / fp32 / pruned / EMA question, the "community standard" is fp16 pruned. There is no solid technical proof, just common knowledge shared across communities, but the resource usage is literally halved, which makes it work on most systems. fp8 may also work, but you would need to look at how A1111 handles it.

Then, for the algorithm / UI part (most of these are my own original findings):

  • How and why MBW seems chaotic.
  • How and why AutoMBW is also chaotic.
  • We cannot naively assume that model performance, especially in an artistic sense, correlates with "layers", whether bottom neural layers or the higher, more abstract "MBW" layers. Although the whole model stays deterministic (with ODE solvers like DDIM), it is far from a ResNet / VAE / GAN, which can be evaluated simply, because the attention layers (*.attn) make the neuron linkage "dynamic". There is also no academic justification for "27", or "24 + something", or any other arbitrary number of "layers" to merge on; the most apparent reason for "27 layers" is simply that programmers naively picked the layer prefixes. SDXL has fewer layers but larger dimensions, which is trickier.
  • I've made the closest approximation that makes such a merge apparently work (I want to prove it with SD2, not SD1 or SDXL), but running through the payloads is really time-consuming, and dropping the BO parameters won't help much; it is roughly a week (payloads) vs half an hour (BO). Dropping BO parameters can, however, raise the merge precision a bit (I think the np.meshgrid is flawed, but I don't fully understand the BO / LIPO algorithm, which should only be somewhere around O(N^3); given 128 GB of RAM it shouldn't crash or take too long to compute). Finally, if it is a full RL approach, why do we need another model at all? Letting the model adjust itself, as FreeU does, would be good enough.

Thanks for your interest. Hope it inspires you.

@Luke2642

Luke2642 commented Dec 24, 2023

It's all helpful, thanks! I must admit I find your writing style a real challenge to interpret, but you do seem to know your stuff :-D

Yes, putting all the other code into one extension was the only way for me: it wipes out the technical debt of dependencies, and it also helped me understand it all better. I haven't got the webui running with real-time debugging yet (breakpoints etc.); do you have any tips on that?

Torchview, in your article, is a great tip, thanks! And your UNet view notebook is great, thanks!
