
Add bf16 support for VAE as a fallback #9295

Closed
wants to merge 9 commits into from

Conversation

Sakura-Luna
Collaborator

@Sakura-Luna Sakura-Luna commented Apr 2, 2023

Describe what this pull request is trying to achieve.

According to the description here, bf16 can solve the problem of the VAE generating black images when working in half precision, so I made this commit.

Additional notes and description of your changes

bf16 is great to use as a fallback: when the webui detects an empty image generation, it converts the VAE and retries on supported devices; this works fine in my test case. Note that to use this feature you need a GPU that supports bf16 and the webui running on PyTorch 2.1. On unsupported devices, --no-half-vae remains the only option.

Edit: In theory, AMD GPUs are also supported.
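
A minimal sketch of the fallback idea, with hypothetical helper names (this is not the PR's actual diff; it assumes an all-NaN decode is what ends up saved as a black image, as the existing nan-check does):

import torch

def decode_with_bf16_fallback(vae, latents):
    # Normal decode first (fp16 VAE).
    image = vae.decode(latents)

    # An all-NaN result is what gets saved as a black image.
    if torch.isnan(image).any() and torch.cuda.is_bf16_supported():
        # Convert the VAE once and retry; it stays bf16 afterwards.
        vae.to(torch.bfloat16)
        image = vae.decode(latents.to(torch.bfloat16))

    return image.float()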

Environment this was tested in

  • OS: Win
  • Browser: chrome
  • Graphics card: NVIDIA Ampere GPU

@playlogitech

the only VAE that produces black squares is the VAE from nai/any/k8, just saying

@catboxanon
Collaborator

catboxanon commented Apr 3, 2023

The VAE should be converted to bf16 beforehand. I don't think the current implementation is the correct way to go about this, because it wastes time decoding the latent image twice.

Also this shouldn't be merged yet since only pytorch 2.1 nightly supports this iirc, and the only other PR for upgrading torch is for the 2.0 release. Should be marked as a draft for now.

@Sakura-Luna
Collaborator Author

Sakura-Luna commented Apr 3, 2023

Also this shouldn't be merged yet since only pytorch 2.1 nightly supports this iirc, and the only other PR for upgrading torch is for the 2.0 release. Should be marked as a draft for now.

In theory that's the case. By preference I did not add a version check, but in practical use it will have no effect unless a NaN is triggered (or the NaN check is disabled), so merging is still feasible. If someone finds it necessary, exception handling can also be added to explain that the PyTorch version does not support bf16.

The VAE should be converted to bf16 beforehand. I don't think the current implementation is the correct way to go about this, because it wastes time decoding the latent image twice.

I don't agree with this opinion. Ideally this conversion is done only once, so it takes very little time. Conversely, bf16 performs worse than fp16 in terms of speed, VRAM consumption, and accuracy, so overall bf16 is not ideal. Considering that even a problematic VAE does not necessarily generate a black image, lazy conversion is a good solution.
If you use a pre-converted VAE, you need to add model-type detection; global bf16 cannot be accepted, as it would affect the performance of other, normal models.

@Cyberbeing
Contributor

Cyberbeing commented Apr 3, 2023

Conversely, bf16 performs worse than fp16 in terms of speed, VRAM consumption, and accuracy, so overall bf16 is not ideal.
If you use a pre-converted VAE, you need to add model-type detection; global bf16 cannot be accepted, as it would affect the performance of other, normal models.

What kind of performance and VRAM impact are you seeing to make this sort of statement? Do you have numbers to back this up? Personally I've not seen this after using a BF16 VAE full-time during the past 4 months, and performance improved after PyTorch merged BF16 interpolate support into nightly.

I just did a quick test with --opt-sdp-no-mem-attention to double-check myself, and full-time BF16 VAE was performing 1-2% faster (or otherwise within the margin of error) than FP16 VAE, with 100% identical VRAM usage, on an RTX A4000 using CUDNN 8.8.1 with TF32 enabled and Live Previews disabled.

This is to be expected, since TF32<->BF16 should be a faster and higher-quality conversion than TF32<->FP16 on NVIDIA. Both FP16 and BF16 have identical 16-bit data sizes, so there should be no additional VRAM usage unless an unneeded conversion with duplication of data is occurring somewhere in WebUI.

Similarly, as for accuracy, the general expectation is that BF16 VAE bias should produce output closer to FP32 VAE bias than FP16 VAE bias does when converting from an FP32 VAE. The VAE, unlike the rest of the model components, seems to care more about dynamic range than significant-digit precision, though I suspect this aspect may need more widespread testing to verify.


The VAE should be converted to bf16 beforehand. I don't think the current implementation is the correct way to go about this, because it wastes time decoding the latent image twice.

Also this shouldn't be merged yet since only pytorch 2.1 nightly supports this iirc, and the only other PR for upgrading torch is for the 2.0 release. Should be marked as a draft for now.

I'm also of the opinion that it would make more sense to implement this similar to --no-half-vae as a command line argument once PyTorch 2.1 GA is released, since it serves a near-identical purpose while having performance and VRAM usage similar to FP16.

@Sakura-Luna
Collaborator Author

What kind of performance and VRAM impact are you seeing to make this sort of statement? Do you have numbers to back this up? Personally I've not seen this after using a BF16 VAE full-time during the past 4 months, and performance improved after PyTorch merged BF16 interpolate support into nightly.

You can refer to the data at the end of this issue, but there may be some differences for VAE. I am most concerned about the accuracy, but as you said, it may be necessary to do a comparison to test which of bf16 and fp16 has better results.

I'm also of the opinion that it would make more sense to implement this similar to --no-half-vae as a command line argument once PyTorch 2.1 GA is released, since it serves a near-identical purpose while having performance and VRAM usage similar to FP16.

I do the bf16 conversion as a triggered operation, so it can be considered part of the nan-check, with no additional parameters. If the implementation changes, then it will need to be reconsidered.

@Cyberbeing
Contributor

Cyberbeing commented Apr 3, 2023

You can refer to the data at the end of this issue

It seems like that old, resolved issue was about a perf regression in Lightning only (compared to PyTorch itself) when using the CUDNN V7 API, which was resolved with the CUDNN V8 API. So I don't think it will affect us, as Torch switched to using the CUDNN V8 API for BF16 convolution support a long time ago.

The statement at the bottom of the issue is also currently true, since BF16 mixed-precision can indeed have poor performance under certain workflows. This can cause situations where working in TF32 is faster because it eliminates slow casts that would otherwise produce a net perf loss, but that doesn't seem to apply to webui SD inference.

I've tested BF16 autocast before in webui and it does indeed have a significant memory and performance impact on inference, since Torch doesn't autocast many ops to BF16 like it does for FP16, resulting in it mostly working in TF32 when BF16 autocasting is enabled. It is still faster than TF32-only for our use case, but there is very little benefit to doing so for inference, unlike training.
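
(For reference, "BF16 autocast" here means wrapping inference in torch.autocast with dtype=torch.bfloat16; an illustrative toy example, not webui code:)

import torch
import torch.nn as nn

conv = nn.Conv2d(4, 4, 3, padding=1).cuda()  # weights stay fp32 in memory
latents = torch.randn(1, 4, 64, 64, device="cuda")

# Under bf16 autocast, convolutions/matmuls run in bf16, while ops without
# a bf16 autocast rule keep running in fp32/TF32, which is why the speedup
# for inference is limited compared to a fully-converted fp16 model.
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = conv(latents)

print(out.dtype)  # torch.bfloat16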

Similarly, converting other components of a model such as the UNet or CLIP to BF16 has a rather large impact on seed output, so that should likely be avoided as well, unless someone trained a BF16-precision model directly.

but there may be some differences for VAE.

Yes, but rather than specifically VAE, I believe it has more to do with not using BF16 mixed-precision (autocast) here. We are only doing a single manual cast to BF16 VAE bias, while using FP16 mixed-precision.

@Sakura-Luna
Collaborator Author

@Cyberbeing I briefly tested the VAE running under different dtypes, and I'm posting two pictures here to show that there is no obvious content difference across these test samples.
[attached images: e1, e2]

Tested on these samples, there is no significant difference in speed between fp16 and bf16. The following shows the content difference of fp16 and bf16 compared with tf32 on the test samples; bf16 shows more deviation on all samples.
[attached images: example 0, example 1, example 2, example 3]

In previous tests I found an example of global bf16 causing significant content differences (I didn't keep it), which is why I insist that bf16 is only suitable as a fallback; there is no advantage to bf16 in the current implementation.

@Cyberbeing
Contributor

Cyberbeing commented Apr 4, 2023

It would appear your testing may have been done with the non-default webui options which I mentioned recently in discussions. Can you double check?

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False

I did a quick test myself and I can only reproduce results similar to yours when those options are set. Though it does seem to be true that BF16 always degrades output more than FP16 does, the difference is basically invisible to the human eye unless you pixel-peep.

I think what happened was that months ago, when I last tested this, I was still using xformers (non-deterministic), and I had only set the following; I only recently discovered the massive degradation when using an FP16 VAE without the fp16 option also set to False.

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False

Which is why I got the impression that the BF16 VAE was closer to the FP32 VAE than the FP16 VAE was, since with those settings it actually was. Yet my testing wasn't apples to apples.

This reminds me that we really should create a pull request to disable the reduced_precision_reduction options, which is a huge quality boost for FP16 & BF16 and seems to bring seed reproduction very close to FP32 VAE. At least on my GPU, setting both to False had no impact on inference performance, but someone should really test the impact on training.


FP32.png vs TF32.png (reduced_precision_reduction disabled)
Manhattan norm: 7015.666674117092 / per pixel: 0.008920881492763636
Zero norm: 20698.0 / per pixel: 0.026318868001302082

FP32.png vs FP16.png (reduced_precision_reduction disabled)
Manhattan norm: 29685.33371661324 / per pixel: 0.03774685378597672
Zero norm: 83786.0 / per pixel: 0.10653940836588542

FP32.png vs BF16.png (reduced_precision_reduction disabled)
Manhattan norm: 205934.0009592753 / per pixel: 0.26185862345285454
Zero norm: 403128.0 / per pixel: 0.512603759765625

With webui/pytorch defaults they had nearly identical seed output, with image output noticeably different from FP32; technically FP16 was still 0.03% better, but that is invisible to the human eye down at the noise floor:

FP32.png vs FP16-pytorch-defaults.png
Manhattan norm: 15672142.778314568 / per pixel: 19.92816006764039
Zero norm: 775492.0 / per pixel: 0.9860890706380209

FP32.png vs BF16-pytorch-defaults.png
Manhattan norm: 15677395.778366838 / per pixel: 19.934839602618965
Zero norm: 775501.0 / per pixel: 0.9861005147298177

BF16-pytorch-defaults.png vs FP16-pytorch-defaults.png
Manhattan norm: 191109.00093583157 / per pixel: 0.24300766110208075
Zero norm: 381207.0 / per pixel: 0.4847297668457031
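
(The metrics above can be reproduced with a simple image diff; a sketch assuming NumPy and Pillow, not necessarily the exact script used here:)

import numpy as np
from PIL import Image

def compare(path_a, path_b):
    a = np.asarray(Image.open(path_a), dtype=np.float64)
    b = np.asarray(Image.open(path_b), dtype=np.float64)
    diff = a - b
    manhattan = np.abs(diff).sum()   # sum of absolute per-channel differences
    zero = np.count_nonzero(diff)    # number of differing values
    print(f"Manhattan norm: {manhattan} / per pixel: {manhattan / a.size}")
    print(f"Zero norm: {zero} / per pixel: {zero / a.size}")

compare("FP32.png", "FP16.png")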


I'm beginning to see the merit of your approach, but I'd still prefer this to be a command line argument which is optionally enabled. Even better would be implementing both methods and have --bfloat16-vae auto and --bfloat16-vae always to give user choice.

The main problems with the auto-on-NaN method are that the first time you would indeed be repeating processing, but also that the VAE then stays permanently in bfloat16 until you load a new model or VAE, which could lead to inconsistent seed output. In other words, your output could change depending on the order and type of generations you perform.

@Sakura-Luna
Collaborator Author

It would appear your testing may have been done with the non-default webui options which I mentioned recently in discussions. Can you double check?

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False

I didn't set it in the code. If the PyTorch documentation is not wrong, the default setting is fp16 reduced-precision reduction on and bf16 off, and enabling the bf16 one is not recommended.

@Sakura-Luna
Collaborator Author

@Cyberbeing This is the result of disabling allow_fp16 (reduced-precision reduction). Similar to the previous test, fp16 is still more similar to fp32, and there is no obvious performance difference between the two. These samples are significantly different from the ones generated without disabling allow_fp16.
[attached images: ee1, ee2]

This reminds me that we really should create a pull request to disable the reduced_precision_reduction options, which is a huge quality boost for FP16 & BF16 and seems to bring seed reproduction very close to FP32 VAE. At least on my GPU, setting both to False had no impact on inference performance, but someone should really test the impact on training.

I don't see any examples where disabling it improves the quality; fp16 shows a high similarity to fp32 whether or not allow_fp16 is disabled. Predictably, disabling it slows down training.

I'm beginning to see the merit of your approach, but I'd still prefer this to be a command line argument which is optionally enabled. Even better would be implementing both methods and have --bfloat16-vae auto and --bfloat16-vae always to give user choice.

The main problems with the auto-on-NaN method are that the first time you would indeed be repeating processing, but also that the VAE then stays permanently in bfloat16 until you load a new model or VAE, which could lead to inconsistent seed output. In other words, your output could change depending on the order and type of generations you perform.

Adding an enable parameter is trivial, but I want it to work out of the box so that users no longer have to suffer VAE problems on supported devices. I linked it to the nan-check because the check by itself is relatively useless (I don't think it makes sense merely to prevent a black image from being saved; you have already spent the time, it just saves you pressing delete once), and because doing so does not increase the complexity of use.

Since I don't see any advantage to bf16, I won't support global bf16; it just reduces accuracy for nothing (maybe I can try another VAE). I think another suitable option is to add VAE type recognition to go with a pre-converted bf16 VAE; it can be reproduced stably, but it increases the cost of use. I know that this lazy conversion may cause inconsistent output on a specific VAE. This is a matter of trade-offs, but thanks to the similarity between fp32 and fp16, this inconsistency can be regarded as a difference from fp32.

@Cyberbeing
Contributor

Cyberbeing commented Apr 5, 2023

[Edit: I realized a potential oversight: I may have forgotten to test the FP32 VAE with reduced precision reduction enabled, which was a missing data point, so I removed most of this post. It's irrelevant to the PR anyway, so not worth further discussion here.]

If the PyTorch documentation is not wrong, the default setting is fp16 reduced-precision reduction on and bf16 off, and enabling the bf16 one is not recommended.

The PyTorch documentation is indeed wrong, which surprised me as well when I first discovered this months ago.

As you can see, the commit message states they've disabled it, but if you look at the code itself you'll see both fp16 & bf16 reduced_precision_reduction are enabled by default. Yet both degrade inference quality unless set to False, without any real performance benefit for SD inference. It does significantly change seeds though, so it would need to be made optional.

pytorch/pytorch@909a989
The PR made this change at the last moment, since the reviewer suggested the defaults remain as-is.
pytorch/pytorch@8b617f8
It was then merged into the main branch with both set to True in the code, but with an incorrect commit message.

This is still the case in the PyTorch 2.1 master branch.

You can easily double-check the PyTorch defaults by just importing torch and reading them.
[screenshot: interpreter output showing the defaults]
As you can see, both are True.
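
(For example, the same check from a Python shell; the values shown are the defaults reported above for these PyTorch builds:)

>>> import torch
>>> torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction
True
>>> torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction
True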

Either way this is getting a bit off-topic, since setting those options to False should likely be a separate PR, which will then likely need to be added to the compatibility options so people can reproduce their old reduced-precision seeds. I only brought this up since it seems to make my results closer to yours with them set to False.

I'll step out of this PR for now and just let AUTOMATIC1111 decide the course.

@Sakura-Luna
Collaborator Author

Sakura-Luna commented Apr 5, 2023

@Cyberbeing I checked your example; it also clearly shows that fp16 maintains better accuracy under the same settings. As for the backend setting, it is not discussed here. You're right about one thing: we need an opinion from @AUTOMATIC1111.

@YHD233

YHD233 commented Apr 7, 2023

[attached screenshots of the errors: 12, 13]
I'm getting these errors, how do I fix them?

@Sakura-Luna
Collaborator Author

@YHD233 What version of PyTorch and what type of GPU are you using?

@YHD233

YHD233 commented Apr 7, 2023

@YHD233 What version of PyTorch and what type of GPU are you using?

python: 3.10.6, torch: 2.1.0.dev20230407+rocm5.4.2, GPU: RX6800, system: Kubuntu 22.04

But when I reopen the console the error doesn't appear again.

@Sakura-Luna
Collaborator Author

@YHD233 I think I need to know your XYZ Plot parameters.

@YHD233

YHD233 commented Apr 7, 2023

@YHD233 I think I need to know your XYZ Plot parameters.

X type: CFG Scale, X values: 8,9,10,11,12,13,14,15,16

Once I hit this error, it also occurs when I close XYZ Plot and generate directly. It won't work until I close the console and open it again.

@Sakura-Luna
Collaborator Author

Sakura-Luna commented Apr 7, 2023

@YHD233 My mistake, fixed.

@YHD233

YHD233 commented Apr 8, 2023

I found that after stopping the generation when using XYZ plot and then starting to generate again, this error is output when the progress bar is full.

I will try to fix it.

@Sakura-Luna Sakura-Luna linked an issue May 8, 2023 that may be closed by this pull request
@Sakura-Luna
Collaborator Author

@AUTOMATIC1111 What do you think about this PR?

@AUTOMATIC1111
Owner

Since we are on torch 2.0 and this appears to need torch 2.1, I have not considered it yet. I don't like adding a command-line flag - if it works and is supported by the GPU, I think it should be enabled without asking the user to enable it.

Also, the most important question is: does it really help with black square images in the VAE?

@Sakura-Luna
Collaborator Author

I don't like adding a command-line flag - if it works and is supported by the GPU, I think it should be enabled without asking the user to enable it.

I originally did it as part of the nan-check without adding parameters, but I found that was not feasible. We can't check AMD GPU support via PyTorch, so we either introduce a new dependency or add a startup parameter.

Also, the most important question is: does it really help with black square images in the VAE?

In my test case it is clearly effective, and in theory it can solve the same problem as --no-half-vae while consuming less VRAM.

@Sakura-Luna
Collaborator Author

PyTorch lacks an explicit method to check for bf16 support on AMD GPUs.

@AUTOMATIC1111
Owner

Can't you just create a one-element bf16 tensor and do something with it, like multiplying it by 0.5, to test if bf16 is supported?
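
(A sketch of such a probe, assuming nothing beyond stock PyTorch; not the PR's actual code:)

import torch

def bf16_works(device: str = "cuda") -> bool:
    try:
        # Create a one-element bf16 tensor, multiply it by 0.5, and make sure
        # the result comes back as expected; unsupported backends should raise.
        x = torch.ones(1, dtype=torch.bfloat16, device=device)
        return bool(((x * 0.5) == 0.5).all())
    except Exception:
        return False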

@Sakura-Luna
Collaborator Author

Can't you just create a one-element bf16 tensor and do something with it, like multiplying it by 0.5, to test if bf16 is supported?

I know that works; it's just not aesthetically pleasing, and I don't have the hardware to test it.

@Sakura-Luna
Collaborator Author

I found a method in PyTorch, so this feature can be enabled by default.
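
(Presumably torch.cuda.is_bf16_supported(); that this is the method meant here is an assumption, but it would reduce the check to a one-liner:)

import torch

# True on bf16-capable GPUs (e.g. NVIDIA Ampere and newer, recent ROCm builds).
use_bf16_fallback = torch.cuda.is_available() and torch.cuda.is_bf16_supported()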

@catboxanon
Collaborator

catboxanon commented May 9, 2023

The title for this PR isn't really clear either, imo. It isn't adding support for bf16 VAEs; that's already supported out of the box with Torch 2.1. All this adds is the fallback for when an fp16 VAE produces NaNs.

@Sakura-Luna
Collaborator Author

The title for this PR isn't really clear either, imo. It isn't adding support for bf16 VAEs; that's already supported out of the box with Torch 2.1. All this adds is the fallback for when an fp16 VAE produces NaNs.

WebUI doesn't have code to handle bf16, so it looks like it will switch to fp16 even though PyTorch supports bf16. But it doesn't matter; the name of the PR has no effect.

@Sakura-Luna Sakura-Luna changed the title Add bf16 support for VAE Add bf16 support for VAE as a fallback May 9, 2023
@Sakura-Luna Sakura-Luna removed a link to an issue May 24, 2023
@catboxanon
Collaborator

I think 23c947a supersedes this?

@Sakura-Luna
Collaborator Author

I think 23c947a supersedes this?

You are wrong; both fp16 and bf16 are designed to save VRAM. If users can accept fp32 at any time, it is better to run the VAE in fp32 globally, and falling back to fp32 is even more pointless.

@lone-wolf-akela

Now that PyTorch 2.1 has been released, any news on this?

@AUTOMATIC1111
Owner

Since this PR was made, the webui got a similar mechanism (and I used the idea from this PR) to deal with SDXL VAE errors, but converting to FP32 instead of BF16. So this PR would have to be integrated into the existing system, which I did in ac0ecf3.

@Sakura-Luna Sakura-Luna deleted the master branch January 3, 2024 14:11
@w-e-w w-e-w mentioned this pull request Feb 17, 2024
@pawel665j pawel665j mentioned this pull request Apr 16, 2024
ruchej pushed a commit to ruchej/stable-diffusion-webui that referenced this pull request Sep 30, 2024