Fix issue with 16xx cards #4407

Merged · 3 commits merged into master from patch-1 on Dec 3, 2022

Conversation

yoinked-h
Contributor

16xx cards don't handle half precision (FP16) properly out of the box, but with this simple workaround they work without --precision full and --no-half.

@C43H66N12O12S2
Collaborator

This is a fix I've seen floated around on threads for a while now, and it's a curious one. Enabling cuDNN shouldn't have any effect, as cuDNN is enabled by default in all cases if available.

So, logically, only benchmark should be fixing this issue (and that seems more like a bug with PyTorch, tbh). Could anybody with a 16xx card test enabling only benchmarking?

Anyhow, you should gate benchmark enablement behind an SM check for 16xx cards, as enabling benchmarking has highly variable results. It degraded performance on my 3080, for example.
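
(For reference, a minimal sketch of what "enabling only benchmarking" would look like; this is illustrative and not the exact diff in this PR.)

```python
import torch

if torch.cuda.is_available():
    # Let cuDNN benchmark and cache the fastest convolution algorithms.
    torch.backends.cudnn.benchmark = True
```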

@TKoestlerx

Just added the two lines on a GTX 1660 Super (6 GB).

And indeed, I can start without command-line parameters and the image I get is fine (not black), but the performance absolutely collapses.

From 1.5 iterations/sec to 2.5 sec/iteration.

With the same two lines active but started with --no-half and --precision full, the performance is back to normal.

@drax-xard

I had a 1650 until recently and it worked fine with just "--medvram"; I didn't need to use --no-half or the like. (I'm on Linux.)

@yoinked-h
Contributor Author

Maybe we could check which GPU is enabled, if that's even possible, to filter which cards should get this.

@yoinked-h
Contributor Author

yoinked-h commented Nov 8, 2022

> This is a fix I've seen floated around on threads for a while now, and it's a curious one. Enabling cuDNN shouldn't have any effect, as cuDNN is enabled by default in all cases if available.
>
> So, logically, only benchmark should be fixing this issue (and that seems more like a bug with PyTorch, tbh). Could anybody with a 16xx card test enabling only benchmarking?
>
> Anyhow, you should gate benchmark enablement behind an SM check for 16xx cards, as enabling benchmarking has highly variable results. It degraded performance on my 3080, for example.

I am a 1660 user and I use this fix to run it. And yeah, if there is a way to check whether the GPU is a 16xx card, I'll try to implement it; I haven't found one yet.

@yoinked-h yoinked-h marked this pull request as draft November 8, 2022 01:16
@yoinked-h yoinked-h marked this pull request as ready for review November 8, 2022 02:10
@yoinked-h
Contributor Author

This might take more time on startup, since it loops over every card and over a list of Turing card names to check against, but it's better for long-run performance.

@C43H66N12O12S2
Collaborator

`torch.cuda.get_device_capability(device) == (7, 5)`
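
(For reference, a hedged sketch of how that capability check could gate the two settings; this is illustrative and may not match the exact diff in this PR.)

```python
import torch

def has_sm75_device() -> bool:
    # Compute capability (7, 5) is Turing, which includes the GTX 16xx series.
    return any(
        torch.cuda.get_device_capability(d) == (7, 5)
        for d in range(torch.cuda.device_count())
    )

if torch.cuda.is_available() and has_sm75_device():
    torch.backends.cudnn.enabled = True    # already the default when cuDNN is available
    torch.backends.cudnn.benchmark = True  # the setting that appears to fix the black images
```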

@XiteSDF

XiteSDF commented Nov 8, 2022

Why are the 20xx cards in the list though? They work fine now, and judging by other replies this change would just tank performance for no reason.

@yoinked-h
Contributor Author

> Why are the 20xx cards in the list though? They work fine now, and judging by other replies this change would just tank performance for no reason.

Some 20xx cards are Turing too. That said, as mentioned by C43H66N12O12S2, I'll implement the better solution.

@yoinked-h yoinked-h marked this pull request as draft November 8, 2022 23:17
thanks C43H66N12O12S2
@yoinked-h yoinked-h marked this pull request as ready for review November 8, 2022 23:19
@JackCopland

I can confirm this fix works for me on a 1660 SUPER. Until now I've had to use the args "--precision full" and "--no-half", otherwise I get black images. With this change made, I no longer see black images even without those args. (In both cases I am also using "--medvram" and "--xformers".)

It looks like @C43H66N12O12S2 was correct that it is the benchmarking change that is fixing this. I commented out "torch.backends.cudnn.enabled = True" and still saw this fix work. I guess that line can be removed from this change unless it has some other effect.

@AUTOMATIC1111 AUTOMATIC1111 merged commit 681c000 into AUTOMATIC1111:master Dec 3, 2022
@MrCheeze
Contributor

MrCheeze commented Dec 3, 2022

> This is a fix I've seen floated around on threads for a while now, and it's a curious one. Enabling cuDNN shouldn't have any effect, as cuDNN is enabled by default in all cases if available.
>
> So, logically, only benchmark should be fixing this issue (and that seems more like a bug with PyTorch, tbh). Could anybody with a 16xx card test enabling only benchmarking?
>
> Anyhow, you should gate benchmark enablement behind an SM check for 16xx cards, as enabling benchmarking has highly variable results. It degraded performance on my 3080, for example.

`benchmark=True` is the only thing that has an effect, yes. And as far as I know, it improves performance if anything, at least from the second generation onwards, once the benchmarking has already been done.

By the way, calculations with 16-bit floats are extremely slow on 16xx cards, so even with this fix you should always be using --no-half anyway unless you're truly desperate for vram. Might be worth updating the documentation accordingly. (Although I don't know exactly which set of cards has fast 16-bit and which set doesn't.)

@yoinked-h yoinked-h deleted the patch-1 branch December 6, 2022 05:48
Vermiliond added a commit to Vermiliond/stable-diffusion-webui that referenced this pull request Dec 9, 2022
Vermiliond added a commit to Vermiliond/stable-diffusion-webui that referenced this pull request Dec 9, 2022
…an ones related to the PR"
This reverts commit 2651267.
Vermiliond added a commit to Vermiliond/stable-diffusion-webui that referenced this pull request Dec 9, 2022
@pinyangcong

@yoinked-h @C43H66N12O12S2 Maybe I need `torch.backends.cudnn.benchmark_limit = 0`, because the default number of convolution algorithms benchmarked is small, which can still lead to the issue occurring on my 1650 card.
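
(A sketch of that combination, assuming a PyTorch version recent enough to expose `torch.backends.cudnn.benchmark_limit`; illustrative only.)

```python
import torch

if torch.cuda.is_available():
    torch.backends.cudnn.benchmark = True
    # 0 removes the cap on how many cuDNN convolution algorithms get benchmarked
    # (the default only tries a handful), per the suggestion above.
    torch.backends.cudnn.benchmark_limit = 0
```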

@yoinked-h
Contributor Author

I'll try it out with torch 2.

@pinyangcong

> I'll try it out with torch 2.

According to some tutorial websites, it seems that only the 16 series has this issue, so
`if any(["GeForce GTX 16" in torch.cuda.get_device_name(devid) for devid in range(0, torch.cuda.device_count())]):`
may be better than
`if any([torch.cuda.get_device_capability(devid) == (7, 5) for devid in range(0, torch.cuda.device_count())]):`

@yoinked-h
Contributor Author

Yep; tensor cores are the main reason the 20xx series handles FP16 fine, the 16xx series doesn't get that comfort.
