
[Feature Request]: new sampler "lcm" for lcm-loras #13952

Closed
1 task done
light-and-ray opened this issue Nov 11, 2023 · 46 comments
Labels
enhancement New feature or request

Comments

@light-and-ray
Contributor

light-and-ray commented Nov 11, 2023

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

What would your feature do?

This feature doesn't require much: no new model class support, only a new sampler.

LCM had a major update and now we can use it like a regular LoRA:
https://huggingface.co/latent-consistency/lcm-lora-sdv1-5
https://huggingface.co/latent-consistency/lcm-lora-ssd-1b
https://huggingface.co/latent-consistency/lcm-lora-sdxl

We only need to rename them and put them into the lora models directory. Set Sampling steps to 4, CFG Scale to 1.0, and the Sampling method to "DPM2" or "Euler a". This gives decent results, but for better results it requires a special sampler, which is similar to the others but with a small change. You can see how it works in ComfyUI: comfyanonymous/ComfyUI@002aefa
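
For reference, the same settings in a plain diffusers pipeline look roughly like this (a minimal sketch, assuming the diffusers LCMScheduler and load_lora_weights APIs and a standard SD 1.5 checkpoint id; the webui workflow above doesn't need any of this):

import torch
from diffusers import DiffusionPipeline, LCMScheduler

# Load a regular SD 1.5 checkpoint, then attach the LCM-LoRA on top of it.
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

# The same settings as above: 4 steps, CFG Scale 1.0.
image = pipe("a photo of a cat", num_inference_steps=4, guidance_scale=1.0).images[0]
image.save("lcm_test.png")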

Proposed workflow

  1. Go to Sampling method
  2. Select LCM sampler

Additional information

How it works now:
[screenshot]

ComfyUI's recent update:
[screenshot]

@light-and-ray light-and-ray added the enhancement New feature or request label Nov 11, 2023
@BahzBeih

I'm awaiting the integration of the LCM sampler into AUTOMATIC1111. While AUTOMATIC1111 is an excellent program, the implementation of new features, such as the LCM sampler and the consistency VAE, appears to be sluggish. The consistency VAE is currently only accessible in beta, while ComfyUI already added both consistency VAE and LCM sampler support in no time :(

@camoody1

I'm awaiting the integration of the LCM sampler into AUTOMATIC1111. While AUTOMATIC1111 is an excellent program, the implementation of new features, such as the LCM sampler and the consistency VAE, appears to be sluggish. The consistency VAE is currently only accessible in beta, while ComfyUI already added both consistency VAE and LCM sampler support in no time :(

Let's not forget that TensorRT for SDXL is sitting on the Dev branch, too. I desperately need all three of these moved up into the Main branch.

@fgtm2023

I'm awaiting the integration of the LCM sampler into AUTOMATIC1111. While AUTOMATIC1111 is an excellent program, the implementation of new features, such as the LCM sampler and the consistency VAE, appears to be sluggish. The consistency VAE is currently only accessible in beta, while ComfyUI already added both consistency VAE and LCM sampler support in no time :(

Let's not forget that TensorRT for SDXL is sitting on the Dev branch, too. I desperately need all three of these moved up into the Main branch.

let's hope this will be soon 🤞🤞🤞🤞

@unphased

Someone reported nearly 5 gens per second on a 3090 with both TensorRT and LCM Lora. We're living in a pretty good timeline over here. Can't wait to see all this stuff get in so we can use it all together. Never ever thought I'd see a factor of 15+ speed up this quick.

@iChristGit

Someone reported nearly 5 gens per second on a 3090 with both TensorRT and LCM Lora. We're living in a pretty good timeline over here. Can't wait to see all this stuff get in so we can use it all together. Never ever thought I'd see a factor of 15+ speed up this quick.

Makes me feel happy with my 3090 Ti!
Can you elaborate more on TensorRT? Is that something that can be enabled in AUTOMATIC1111? Aren't sdp and xformers the only options?

@light-and-ray
Contributor Author

I could only make TensorRT work with LCM after merging the LCM LoRA into the model (using "train difference" in SuperMerger)

@szriru

szriru commented Nov 14, 2023

I'm looking forward to this feature too.
I want to use ControlNets with it.

@iChristGit

Someone reported nearly 5 gens per second on a 3090 with both TensorRT and LCM Lora. We're living in a pretty good timeline over here. Can't wait to see all this stuff get in so we can use it all together. Never ever thought I'd see a factor of 15+ speed up this quick.

I just installed TensorRT and yeah, it's amazing (3090 Ti).
SDXL 1024x1024 went from 11 seconds to 3.5 seconds per image! It's a must!

@camoody1

@iChristGit I assume you ran this on the Dev branch?

@iChristGit

@iChristGit I assume you ran this on the Dev branch?

Yeah git checkout dev

@light-and-ray
Contributor Author

light-and-ray commented Nov 16, 2023

Someone from Reddit implemented it 4 days ago:
https://www.reddit.com/r/StableDiffusion/comments/17ti2zo/you_can_add_the_lcm_sampler_to_a1111_with_a/

You can add the LCM sampler to A1111 with a little trick (Tutorial | Guide)

So I was trying out the new LCM LoRA and found out the sampler is missing in A1111. As a long shot, I just copied the code from Comfy, and to my surprise it seems to work. I think.

You have to make two small edits with a text editor. Here's how you do it:

Edit the file sampling.py found at this path:

...\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py

Add the following at the end of the file:

@torch.no_grad()
def sample_lcm(model, x, sigmas, extra_args=None, callback=None, disable=None, noise_sampler=None):
    extra_args = {} if extra_args is None else extra_args
    noise_sampler = default_noise_sampler(x) if noise_sampler is None else noise_sampler
    s_in = x.new_ones([x.shape[0]])
    for i in trange(len(sigmas) - 1, disable=disable):
        # LCM predicts the fully denoised image directly at each sigma.
        denoised = model(x, sigmas[i] * s_in, **extra_args)
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})

        # Take the prediction as-is, then re-noise it to the next sigma (skipped on the final step).
        x = denoised
        if sigmas[i + 1] > 0:
            x += sigmas[i + 1] * noise_sampler(sigmas[i], sigmas[i + 1])
    return x

The second change is done in the file sd_samplers_kdiffusion.py found here:

...\stable-diffusion-webui-new\modules\sd_samplers_kdiffusion.py

On line 39 add this:

('LCM Test', 'sample_lcm', ['lcm'], {}),

That should give you a new sampler option called 'LCM Test'.

[image]

@light-and-ray
Contributor Author

Also, a bug was found in the LCM scheduler, and the algorithm will be updated:
huggingface/diffusers#5815

@light-and-ray
Contributor Author

There is almost no difference between "Euler a" and "LCM Test" for SD1, but for SDXL it solves all the problems.

[comparison image: SDXL]
[comparison image: SD 1.5]

@light-and-ray
Contributor Author

light-and-ray commented Nov 16, 2023

I've modified the patch from Reddit so it doesn't edit the external repository:

diff --git a/modules/sd_samplers_extra.py b/modules/sd_samplers_extra.py
index 1b981ca8..d154a2b6 100644
--- a/modules/sd_samplers_extra.py
+++ b/modules/sd_samplers_extra.py
@@ -72,3 +72,20 @@ def restart_sampler(model, x, sigmas, extra_args=None, callback=None, disable=No
         last_sigma = new_sigma

     return x
+
+
+@torch.no_grad()
+def sample_lcm(model, x, sigmas, extra_args=None, callback=None, disable=None, noise_sampler=None):
+    extra_args = {} if extra_args is None else extra_args
+    noise_sampler = k_diffusion.sampling.default_noise_sampler(x) if noise_sampler is None else noise_sampler
+    s_in = x.new_ones([x.shape[0]])
+    for i in tqdm.auto.trange(len(sigmas) - 1, disable=disable):
+        denoised = model(x, sigmas[i] * s_in, **extra_args)
+        if callback is not None:
+            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})
+
+        x = denoised
+        if sigmas[i + 1] > 0:
+            x += sigmas[i + 1] * noise_sampler(sigmas[i], sigmas[i + 1])
+    return x
+
diff --git a/modules/sd_samplers_kdiffusion.py b/modules/sd_samplers_kdiffusion.py
index 8a8c87e0..b6c3dc44 100644
--- a/modules/sd_samplers_kdiffusion.py
+++ b/modules/sd_samplers_kdiffusion.py
@@ -36,6 +36,7 @@ samplers_k_diffusion = [
     ('DPM2 a Karras', 'sample_dpm_2_ancestral', ['k_dpm_2_a_ka'], {'scheduler': 'karras', 'discard_next_to_last_sigma': True, "uses_ensd": True, "second_order": True}),
     ('DPM++ 2S a Karras', 'sample_dpmpp_2s_ancestral', ['k_dpmpp_2s_a_ka'], {'scheduler': 'karras', "uses_ensd": True, "second_order": True}),
     ('Restart', sd_samplers_extra.restart_sampler, ['restart'], {'scheduler': 'karras', "second_order": True}),
+    ('LCM Test', sd_samplers_extra.sample_lcm, ['lcm'], {}),
 ]

@light-and-ray
Contributor Author

I have wrapped this patch in an extension! 🎉
https://github.com/light-and-ray/sd-webui-lcm-sampler

@wcde

wcde commented Nov 16, 2023

We also need the Skipping-Step technique from the paper. Without it, we still have an incomplete implementation.

@light-and-ray
Contributor Author

We hope @AUTOMATIC1111 will add this sampler natively and correctly, because I know nothing about diffusion theory

@charliemagee

There's this: https://github.com/0xbitches/sd-webui-lcm

@light-and-ray
Contributor Author

There's this: https://github.com/0xbitches/sd-webui-lcm

No. That is a converted Gradio demo of the LCM model from before the LoRAs were released.

It is a separate tab. It's no longer relevant.

@charliemagee

charliemagee commented Nov 16, 2023

There's this: https://github.com/0xbitches/sd-webui-lcm

No. That is a converted Gradio demo of the LCM model from before the LoRAs were released.

It is a separate tab. It's no longer relevant.

Thanks. So is the light-and-ray thing an actual solution? I'm hoping for a version that will work in Deforum.

@light-and-ray
Contributor Author

Thanks. So is the light-and-ray thing an actual solution? I'm hoping for a version that will work in Deforum.

Yes, you're right

@continue-revolution
Contributor

continue-revolution commented Nov 17, 2023

Use my extension to use the LCM sampler in exactly the way you describe: https://github.com/continue-revolution/sd-webui-animatediff#lcm

@recoilme

But for SDXL it solves all the problems

Thanks for your research! Any idea why the other samplers don't work for SDXL?

@joyoungzhang

joyoungzhang commented Nov 17, 2023

We hope @AUTOMATIC1111 will add this sampler natively and correctly, because I know nothing about diffusion theory

Does @AUTOMATIC1111 have time to support the LCM sampler?

leejet added a commit to leejet/stable-diffusion.cpp that referenced this issue Nov 17, 2023:
This referenced an issue discussion of the stable-diffusion-webui at AUTOMATIC1111/stable-diffusion-webui#13952, which may not be too perfect.
@camoody1

If you install the latest update of the AnimateDiff extension, it installs the LCM sampler for you. No need for this workaround now.
Link: https://github.com/continue-revolution/sd-webui-animatediff

@ThisIsNetsu

Use my extension to use the LCM sampler in exactly the way you describe: https://github.com/continue-revolution/sd-webui-animatediff#lcm

Dude, what a BLESSING! Installing AnimateDiff just added the LCM sampler as well. Also, you can totally increase the step count to 10-15 and it seems to give even better quality. Thank you!

@iChristGit

Use my extension to use the LCM sampler in exactly the way you describe: https://github.com/continue-revolution/sd-webui-animatediff#lcm

Dude, what a BLESSING! Installing AnimateDiff just added the LCM sampler as well. Also, you can totally increase the step count to 10-15 and it seems to give even better quality. Thank you!

I am getting an image per second on a 3090 Ti; with TensorRT (no LCM) it is around 3.5 seconds per image.
But the results are so bad it's not worth using. I have tried 8-15 steps and a CFG scale of 2; the results come out almost identical to each other in every way, without deep colors. Do you get decent results?

@camoody1

I am getting an image per second on a 3090 Ti; with TensorRT (no LCM) it is around 3.5 seconds per image. But the results are so bad it's not worth using. I have tried 8-15 steps and a CFG scale of 2; the results come out almost identical to each other in every way, without deep colors. Do you get decent results?

I'm using an RTX 3060 12GB and I'm getting surprisingly good results. My settings are lora weight of 0.75, LCM sampler, 6-8 steps and CFG of 2.0. Try these settings and see if things improve at all for you.

@aifartist

I'm awaiting the integration of the LCM sampler into AUTOMATIC1111. While AUTOMATIC1111 is an excellent program, the implementation of new features, such as the LCM sampler and the consistency VAE, appears to be sluggish. The consistency VAE is currently only accessible in beta, while ComfyUI already added both consistency VAE and LCM sampler support in no time :(

And with LCM you want TinyVAE. It turns 70ms 512x512 4-step generations into 45ms on my 4090.
For some reason, the way A1111 fuses the LoRA must be different. It only gives about a 2.5x speedup instead of the 10x I get in a simple diffusers pipeline.

@continue-revolution
Contributor

continue-revolution commented Nov 20, 2023

You need to understand several things

  1. Most likely comfyanonymous is paid by SAI so that he can work full time on ComfyUI, but we A1111 developers are paid nothing.
  2. Although A1111 is much more convenient to use than ComfyUI, its internals are much harder to modify. If you read my AnimateDiff source code, you will understand this much more deeply. ComfyUI, however, is made of a bunch of nodes which almost don't affect each other.
  3. A1111 takes cautious steps when adding new features - most features are not universal, so it's better for them to come as extensions. This is actually better for us programmers - our extensions will not be abruptly broken.
  4. It is not easy to write an elegant implementation that converts diffusers-based research into A1111. It is actually easy to write a tab that forces you to use diffusers, but I don't want to do that.
  5. Theoretically A1111 should not be slower than anything else. One known reason that A1111 is “slower” than diffusers is that some samplers require two unet forwards per step. In diffusers, a step means one unet forward; in A1111, a step means one sampler forward. There might be other reasons.
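
As a rough illustration of point 5 (a simplified sketch, not the actual k-diffusion code): a second-order, Heun-style sampler calls the unet twice per sampler step, so 20 "steps" in A1111 correspond to roughly 40 unet forwards, which diffusers would count as 40 steps.

import torch

@torch.no_grad()
def heun_style_step(model, x, sigma, sigma_next):
    # One sampler step, but two unet forwards (assumes sigma_next > 0).
    denoised = model(x, sigma)                       # 1st unet forward
    d = (x - denoised) / sigma                       # derivative at sigma
    x_euler = x + d * (sigma_next - sigma)           # Euler prediction
    denoised_2 = model(x_euler, sigma_next)          # 2nd unet forward
    d_2 = (x_euler - denoised_2) / sigma_next        # derivative at sigma_next
    return x + (d + d_2) / 2 * (sigma_next - sigma)  # averaged (Heun) update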

@camoody1

You need to understand several things

  1. Most likely comfyanonymous is paid by SAI so that he can work full time on ComfyUI, but we A1111 developers are paid nothing.
  2. Although A1111 is much more convenient to use than ComfyUI, its internals are much harder to modify. If you read my AnimateDiff source code, you will understand this much more deeply. ComfyUI, however, is made of a bunch of nodes which almost don't affect each other.
  3. A1111 takes cautious steps when adding new features - most features are not universal, so it's better for them to come as extensions. This is actually better for us programmers - our extensions will not be abruptly broken.
  4. It is not easy to write an elegant implementation that converts diffusers-based research into A1111. It is actually easy to write a tab that forces you to use diffusers, but I don't want to do that.
  5. Theoretically A1111 should not be slower than anything else. One known reason that A1111 is “slower” than diffusers is that some samplers require two unet forwards per step. In diffusers, a step means one unet forward; in A1111, a step means one sampler forward. There might be other reasons.

Do you feel like Automatic1111 is a dead-end in the long run? Should we all be putting more effort into learning ComfyUI?

@continue-revolution
Contributor

continue-revolution commented Nov 20, 2023

Do you feel like Automatic1111 is a dead-end in the long run? Should we all be putting more effort into learning ComfyUI?

I've never for a second had any thoughts like this, and I will almost certainly stick to A1111. I believe that creating a great user experience is our (the programmers') mission, and you, the users, should focus more on how to use it, and teach us how our software is actually used. A lot of people are much better at using my extension than I am. Designing user-friendly software is never easy in any UI, but I enjoy it.

A1111 has done his best. A1111 WebUI is easy to use, fast, and memory efficient. We should not criticize A1111 without clear evidence. Despite the fact that it is tricky to hook some functions, some other functions are designed so well that they easily fit my needs.

That said, we programmers do need money. Working for love is never sustainable, unless we are as rich as Mark Zuckerberg. Mark was already extremely rich through Facebook, so he open-sourced almost everything about ML at Meta. Sam is not as rich as Mark, so OpenAI became CloseAI.

@light-and-ray
Contributor Author

They have updated the sampler algorithm:
https://github.com/huggingface/diffusers/pull/5836/files

@shitianfang

@light-and-ray How do I use this update? I tested LCM and there are some problems with ControlNet inpaint; it doesn't recognize red anymore.

@LIQUIDMIND111

Use my extension to use the LCM sampler in exactly the way you describe: https://github.com/continue-revolution/sd-webui-animatediff#lcm

Dude, what a BLESSING! Installing AnimateDiff just added the LCM sampler as well. Also, you can totally increase the step count to 10-15 and it seems to give even better quality. Thank you!

I am getting an image per second on a 3090 Ti; with TensorRT (no LCM) it is around 3.5 seconds per image. But the results are so bad it's not worth using. I have tried 8-15 steps and a CFG scale of 2; the results come out almost identical to each other in every way, without deep colors. Do you get decent results?

Please, can you elaborate on how to install TensorRT? I have an RTX 2060. Do I need to install a driver from NVIDIA, or just the webui TensorRT extension?

@light-and-ray
Contributor Author

@continue-revolution could you apply this update in your extension?

They have updated the sampler algorithm:
https://github.com/huggingface/diffusers/pull/5836/files

@continue-revolution
Contributor

Post a feature request in my repo and tag the original author of LCM. He will make the decision.

@Aime-ry

Aime-ry commented Nov 22, 2023

After trying it out, it's really fantastic. For the Sampling method, select Euler a, Euler, or LCM. Set the Sampling steps to 6-8, and set the lora to either <lora:lcm-lora-sdv1-5:0.6> or <lora:lcm-lora-sdv1-5:0.7>. For the CFG Scale, select either 1.2 or 1.5.
[xyz grid comparison images]

@LIQUIDMIND111

After trying it out, it's really fantastic. For the Sampling method, select Euler a, Euler, or LCM. Set the Sampling steps to 6-8, and set the lora to either <lora:lcm-lora-sdv1-5:0.6> or <lora:lcm-lora-sdv1-5:0.7>. For the CFG Scale, select either 1.2 or 1.5.

Yes, but for high-quality animation you can do even 15 steps... try it with LCM just like you did... but ONLY for AnimateDiff animations, not for photos...

@camoody1

@LIQUIDMIND111 Why do you say NOT for photos? The images aren't perfect, but after an Ultimate Upscale application, the results look really, really nice. 🤷🏼‍♂️

@LIQUIDMIND111

LIQUIDMIND111 commented Nov 24, 2023 via email

@joyoungzhang

@continue-revolution
It seems that LCM in AnimateDiff is not compatible with regional-prompter. I found in my test that with regional-prompter installed (but not active), it takes 5 seconds to draw, but after uninstalling regional-prompter, the same parameters only take 1 second to draw.

@andupotorac

@continue-revolution It seems that LCM in AnimateDiff is not compatible with regional-prompter. I found in my test that with regional-prompter installed (but not active), it takes 5 seconds to draw, but after uninstalling regional-prompter, the same parameters only take 1 second to draw.

StreamMultiDiffusion solved the issue for regional prompting with LCM - https://github.com/ironjr/StreamMultiDiffusion.

@andupotorac

I'm awaiting the integration of the LCM sampler into AUTOMATIC1111. While AUTOMATIC1111 is an excellent program, the implementation of new features, such as the LCM sampler and the consistency VAE, appears to be sluggish. The consistency VAE is currently only accessible in beta, while ComfyUI already added both consistency VAE and LCM sampler support in no time :(

And with LCM you want TinyVAE. It turns 70ms 512x512 4-step generations into 45ms on my 4090. For some reason, the way A1111 fuses the LoRA must be different. It only gives about a 2.5x speedup instead of the 10x I get in a simple diffusers pipeline.

Care to share a link to TinyVAE?

@0xdevalias

Care to share a link to TinyVAE?

Google'ing TinyVAE doesn't come up with a direct result, but I wonder if they intended to refer to this one:

  • https://huggingface.co/docs/diffusers/en/api/models/autoencoder_tiny
    • Tiny AutoEncoder
      Tiny AutoEncoder for Stable Diffusion (TAESD) was introduced in madebyollin/taesd by Ollin Boer Bohan. It is a tiny distilled version of Stable Diffusion’s VAE that can quickly decode the latents in a StableDiffusionPipeline or StableDiffusionXLPipeline almost instantly.

    • https://github.com/madebyollin/taesd
      • Tiny AutoEncoder for Stable Diffusion
        TAESD is very tiny autoencoder which uses the same "latent API" as Stable Diffusion's VAE*. TAESD can decode Stable Diffusion's latents into full-size images at (nearly) zero cost.
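
If that is the one, swapping it into a diffusers pipeline looks roughly like this (a minimal sketch, assuming the diffusers AutoencoderTiny class and the madebyollin/taesd weights from the links above, combined with the LCM-LoRA setup discussed earlier in this thread; a standard SD 1.5 checkpoint id is used for illustration):

import torch
from diffusers import AutoencoderTiny, DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

# Swap the full-size VAE for the tiny distilled one (TAESD) to speed up latent decoding.
pipe.vae = AutoencoderTiny.from_pretrained(
    "madebyollin/taesd", torch_dtype=torch.float16
).to("cuda")

image = pipe("a photo of a cat", num_inference_steps=4, guidance_scale=1.0).images[0]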

@andupotorac

Care to share a link to TinyVAE?

Google'ing TinyVAE doesn't come up with a direct result, but I wonder if they intended to refer to this one:

  • https://huggingface.co/docs/diffusers/en/api/models/autoencoder_tiny

    • Tiny AutoEncoder
      Tiny AutoEncoder for Stable Diffusion (TAESD) was introduced in madebyollin/taesd by Ollin Boer Bohan. It is a tiny distilled version of Stable Diffusion’s VAE that can quickly decode the latents in a StableDiffusionPipeline or StableDiffusionXLPipeline almost instantly.

    • https://github.com/madebyollin/taesd

      • Tiny AutoEncoder for Stable Diffusion
        TAESD is very tiny autoencoder which uses the same "latent API" as Stable Diffusion's VAE*. TAESD can decode Stable Diffusion's latents into full-size images at (nearly) zero cost.

Yeah, this is what I came across also, though oddly it doesn't use the same name. :)
