Can't get this to install correctly on latest A1111 #3
Comments
Try my fork (https://github.com/6DammK9/auto-MBW-rt/). I have made it working with some preset changes. |
Thanks, I'll check it out! |
@6DammK9 I tried your extension, it looks great, thank you. However, on the first test it slowly filled up my 32 GB of RAM and then choked. I'm on auto1111 1.7 on Ubuntu. There are also a few circular dependency problems: image-reward requires a different version of huggingface-hub than gradio does. |
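A version clash like the one reported above can be confirmed by asking each installed package what it requires of the shared dependency. The following is only a sketch using the standard library's `importlib.metadata`; the distribution names are taken from the comment and are assumed to match what is installed in the webui's virtual environment.

```python
import re
from importlib import metadata


def _normalize(name: str) -> str:
    """Normalize a distribution name so that '-', '_' and '.' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement: str) -> str:
    """Extract the normalized project name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return _normalize(match.group(1)) if match else ""


def constraints_on(target: str, dists: list[str]) -> dict[str, list[str]]:
    """For each distribution in `dists`, list its declared requirements on `target`."""
    wanted = _normalize(target)
    found: dict[str, list[str]] = {}
    for dist in dists:
        try:
            requires = metadata.requires(dist) or []
        except metadata.PackageNotFoundError:
            found[dist] = ["<not installed>"]
            continue
        found[dist] = [r for r in requires if requirement_name(r) == wanted]
    return found


if __name__ == "__main__":
    # Names assumed from the comment above; adjust to your environment.
    report = constraints_on("huggingface-hub", ["image-reward", "gradio"])
    for dist, reqs in report.items():
        print(dist, "->", reqs or "no direct requirement")
```

If the two printed version ranges do not overlap, no single installed version can satisfy both packages, which would explain the install error.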
32 GB of system RAM is generally not enough for this usage. Such an "RL" approach requires additional resources on top of SD + A1111. Models are generally created in system RAM first and then moved to the GPU's VRAM. For the dependency problem, it points to |
@6DammK9 I'm confused... the first 10 steps run fine at 5gb ram usage, then it slowly creeps up more and more ram until about 85 steps where it's choking at about 26gb on my system! I don't think this is related to image-reward package, that is obviously working fine for the first 20 steps. Is it keeping many copies of the model variants in ram for some reason? Why? |
First, I assume the "steps" you mention are iterations of the optimization process rather than denoising steps. Because of the messy software stack (this extension with multiple non-SD models > A1111 > SD WebUI > PyTorch > Torch > CUDA > driver etc.), a memory leak on Linux (or Windows) can have many causes. I need further information (e.g. leak rate, patterns, "how long for how much RAM") to investigate, especially since I only have a Windows PC. Any line of Python script (not code, CS stuff) may fail to reference the same resource and create a new instance instead, hence the leak, and this can be OS dependent. BTW I will switch to bayesian-merger soon, but I have heard that it leaks even more badly and may consume up to 256 GB of RAM: |
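The "leak rate" requested above can be estimated by sampling memory once per optimization iteration and fitting a straight line through the samples. This is a minimal sketch, not the extension's code: `profile` and `leak_rate` are hypothetical helper names, and `tracemalloc` only sees Python-level allocations, so for native PyTorch memory one would swap in a process-level reading such as psutil's `Process().memory_info().rss`.

```python
import tracemalloc
from typing import Callable, Sequence


def leak_rate(samples: Sequence[tuple[int, float]]) -> float:
    """Least-squares slope of memory (MB) against iteration number."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must cover more than one iteration")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


def profile(step: Callable[[int], None], iterations: int) -> list[tuple[int, float]]:
    """Run step(i) repeatedly, recording traced Python memory (MB) after each call."""
    tracemalloc.start()
    samples: list[tuple[int, float]] = []
    try:
        for i in range(iterations):
            step(i)
            current, _peak = tracemalloc.get_traced_memory()
            samples.append((i, current / 1e6))
    finally:
        tracemalloc.stop()
    return samples


if __name__ == "__main__":
    hoard = []  # stand-in for a leak: references that are never released
    samples = profile(lambda i: hoard.append(bytearray(1_000_000)), 20)
    print(f"estimated growth: {leak_rate(samples):.2f} MB per iteration")
```

Applied to the two data points reported earlier in the thread (about 5 GB at 10 iterations and about 26 GB at 85), `leak_rate([(10, 5000.0), (85, 26000.0)])` works out to roughly 280 MB per iteration, which is the kind of figure that helps narrow down what is being duplicated.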
Thanks for your help, I really appreciate it. I know memory leaks are the worst to track down. I've been working on my own one too. I've refactored runtime block merger and added image reward and hyperactive, but I've had to introduce threading and waits for hyperactive, as the Bayesian optimizer wants 'to be in control', so to speak, calling the objective function, which then waits for each generate + score. I was just using 'generate forever' and catching the generations. I also stripped the code back: fp16 only, GPU memory only (so much faster), safetensors only, but that's just my preference for clean UI, speed and size. I've never seen an fp32 checkpoint make a significant difference. Anyway, it's hacky as hell. I'm also getting a memory leak each time I cycle it but haven't investigated further. It could even be runtime block merge.
I've also looked at the actual model structure, and I think the 12 in and 12 out is suboptimal. IN0, for example, is tiny, while the layers with emb, norm, proj and transformer are massive and could do with splitting up. Anything with a .0 or a .1. So now it's clicked why some layers have much more effect than others, and why in and out are so asymmetric: they're completely different. Out is so much bigger! However, increasing the UI layer count to even more than 27 would be a disaster for usability.
I think the ideal UI for manual merging would have "only" five dials, and five advanced dials, coded smart enough to toggle affecting unet layers with skips symmetrically, for example. Or, things like global gradV, gradA, cosine and smooth step could have a dial each for strength, and smoother, at block level. And then one dial would affect the high-up transformer layers, another would affect the smaller, deeper ones, etc. In other words, the original runtime block merge was perfect for learning, and now that we know what stuff does, we can make a better interface.
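One way to picture the "few dials" idea is a function that expands a handful of depth dials into per-block weights, giving each input block the same weight as the output block assumed to share its skip connection. This is purely an illustrative sketch: the block names, the pairing of IN i with OUT 11-i, and the choice of four depth dials plus one middle dial are assumptions, and only the 12 in, 1 middle and 12 out blocks are covered (the remaining weights that make up the 27 are left out).

```python
IN_BLOCKS = [f"IN{i:02d}" for i in range(12)]
OUT_BLOCKS = [f"OUT{i:02d}" for i in range(12)]


def expand_dials(dials: list[float], mid: float) -> dict[str, float]:
    """Expand a few depth dials plus one middle dial into 25 block weights.

    Each dial covers an equal band of input blocks; the paired output block
    (assumed here to be OUT[11 - i] for IN[i]) receives the same weight.
    """
    n = len(dials)
    if n == 0 or 12 % n != 0:
        raise ValueError("number of dials must divide 12")
    if not all(0.0 <= d <= 1.0 for d in [*dials, mid]):
        raise ValueError("dial values must lie in [0, 1]")
    band = 12 // n
    weights = {"M00": mid}
    for i in range(12):
        w = dials[i // band]
        weights[IN_BLOCKS[i]] = w
        weights[OUT_BLOCKS[11 - i]] = w
    return weights


if __name__ == "__main__":
    # Four depth dials (shallow to deep) and one middle dial: five values in total.
    for name, w in sorted(expand_dials([0.1, 0.2, 0.3, 0.4], mid=0.5).items()):
        print(name, w)
```

With a mapping like this, an optimizer would search over five numbers instead of one number per block.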
Then, if we can pin down our smarter block manipulation with reduced variables (27 down to e.g. 10), the Bayesian optimiser will converge 10x faster or something. It's highly nonlinear! Anyway, sorry for the wall of text and random ideas! |
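The "threading and waits" arrangement described in this comment, where the optimizer insists on calling the objective function itself while generation runs elsewhere, can be sketched with two queues. This is a hedged illustration using only the standard library: `ScoreBridge` and `generate_and_score` are hypothetical names, the scorer is a toy function, and no real Hyperactive or webui API is used.

```python
import queue
import threading
from typing import Any, Callable, Optional


class ScoreBridge:
    """Hands parameters to a worker thread and blocks until a score comes back."""

    def __init__(self) -> None:
        self._requests: queue.Queue = queue.Queue()
        self._results: queue.Queue = queue.Queue()

    def objective(self, params: Any, timeout: Optional[float] = None) -> float:
        """Called from the optimizer's thread; waits for the worker's score."""
        self._requests.put(params)
        return self._results.get(timeout=timeout)

    def serve(self, generate_and_score: Callable[[Any], float]) -> None:
        """Worker loop: score each request until a None sentinel arrives."""
        while True:
            params = self._requests.get()
            if params is None:
                break
            self._results.put(generate_and_score(params))

    def stop(self) -> None:
        """Ask the worker loop to exit."""
        self._requests.put(None)


if __name__ == "__main__":
    bridge = ScoreBridge()
    # Toy stand-in for "merge, generate an image, score it".
    worker = threading.Thread(
        target=bridge.serve, args=(lambda p: -(p["x"] - 0.3) ** 2,), daemon=True
    )
    worker.start()
    # Toy stand-in for the optimizer: it owns the loop and calls objective().
    candidates = [0.0, 0.1, 0.2, 0.3, 0.4]
    best = max(candidates, key=lambda x: bridge.objective({"x": x}, timeout=5))
    bridge.stop()
    worker.join(timeout=5)
    print("best x:", best)
```

Passing a `timeout` keeps the optimizer from hanging forever if the generation side crashes; `queue.Empty` is raised instead.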
Thanks for your response. No, the idea of I'm not sure if there is anyone trying to integrate AutoMBW / bayesian-merger with InvokeAI / ComfyUI; you may check it out. However, such work is not visible on GitHub and is usually hidden in Discord servers. For fp16 / fp32 / pruned / EMA, the "community standard" is fp16 pruned. There is no solid technical proof, just "common general knowledge" across communities. However, the resource usage is literally halved, which makes most systems work. fp8 may also work, but you may need to look at how A1111 handles it. Then for the algorithm / UI part (most are my original and unique findings):
Thanks for your interest. Hope it inspires you. |
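The "literally halved" remark about fp16 follows from bytes per parameter: 4 for fp32, 2 for fp16, 1 for fp8. A small worked calculation, with the parameter count chosen as a round hypothetical figure rather than that of any particular checkpoint:

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}


def weights_size_gb(num_params: int, dtype: str) -> float:
    """Size of the raw weights in GB (10**9 bytes), ignoring file overhead."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    num_params = 1_000_000_000  # hypothetical round number, not a real model
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weights_size_gb(num_params, dtype):.1f} GB")
```

Since a merge keeps more than one model in memory at once, the saving applies to every copy held.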
It's all helpful, thanks! I must admit, I do find your writing style a real challenge to interpret, but you do seem to know your stuff :-D Yes, putting the other code all in one extension was the only way for me: to wipe out the technical debt of dependencies, and also so I could understand it all better. I haven't got webui running with real-time debugging yet (breakpoints etc.); do you have any tips on that? Torchview, in your article, is a great tip, thanks! And your unet view notebook is great, thanks! |
Gives error about not being able to install correctly. Tried some stuff but it just doesn't work in latest A1111 (1.6.0)