[Bug]: Device capability check incorrectly sets cuDNN benchmark on cards it should not or on multi-GPU systems, causes non-deterministic results #12879

Closed
1 task done
catboxanon opened this issue Aug 31, 2023 · 0 comments · Fixed by #12924
Labels
bug Report of a confirmed bug

Comments

@catboxanon
Collaborator

catboxanon commented Aug 31, 2023

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

What happened?

# enabling benchmark option seems to enable a range of cards to do fp16 when they otherwise can't
# see https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/4407
if any(torch.cuda.get_device_capability(devid) == (7, 5) for devid in range(0, torch.cuda.device_count())):
    torch.backends.cudnn.benchmark = True

The check above enables cuDNN benchmark on cards it should not, as the title describes. The linked PR was intended to apply only to 16XX series cards, but a CUDA compute capability of 7.5 covers more than just the 16XX series. https://developer.nvidia.com/cuda-gpus

As noted in the PyTorch docs, enabling this also makes results non-deterministic. https://pytorch.org/docs/stable/notes/randomness.html#cuda-convolution-benchmarking

This is particularly problematic because 1) I have a 2080TI, which supports fp16 operations just fine, so the option gets enabled when it shouldn't be, and 2) I don't actually run the webui on my 2080TI but on my 3090, as I have two GPUs installed (this occurs because the check looks for any() GPU that has this compute capability).
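A minimal sketch of the multi-GPU case, with the device list simulated as plain tuples instead of being queried from torch. The `capabilities` list, the device ordering, and `active_device` are assumptions made for illustration; the values used are the compute capabilities of an RTX 3090 (8.6) and an RTX 2080 Ti (7.5).

```python
# Simulated compute capabilities, indexed by device id (ordering is assumed).
capabilities = [(8, 6), (7, 5)]  # 0: RTX 3090, 1: RTX 2080 Ti
active_device = 0  # the card the webui actually runs on

# Shape of the current check: true if ANY installed card reports 7.5.
current_check = any(cap == (7, 5) for cap in capabilities)

# Looking only at the card in use gives a different answer.
active_only_check = capabilities[active_device] == (7, 5)

print(current_check, active_only_check)
```

With the 3090 as the active card, the any() form still flips the benchmark flag because the idle 2080 Ti matches.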

Steps to reproduce the problem

Run the webui with a non-16XX card of compute capability 7.5 installed. Results will never match 1:1 between webui restarts because of the non-determinism.

What should have happened?

This benchmark option should only be enabled for 16XX cards, and only if they are actually in use by the webui (either as the default or the one determined by --device-id).
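One possible shape for such a check, as a sketch only and not necessarily what the linked fix does. `should_enable_benchmark` and `apply_benchmark_setting` are hypothetical helper names, and matching "GTX 16" in the device name is an assumed heuristic for identifying 16XX cards, since compute capability alone cannot distinguish them.

```python
def should_enable_benchmark(capability, name):
    """Return True only for cards that look like the GTX 16XX series.

    Assumption: 16XX cards report a device name containing "GTX 16".
    """
    return tuple(capability) == (7, 5) and "GTX 16" in name


def apply_benchmark_setting(device_id=None):
    """Hypothetical helper: inspect only the device that is in use."""
    import torch  # imported lazily so the predicate above has no dependencies

    if not torch.cuda.is_available():
        return False
    if device_id is None:
        # Fall back to the current default device when none was selected.
        device_id = torch.cuda.current_device()
    enable = should_enable_benchmark(
        torch.cuda.get_device_capability(device_id),
        torch.cuda.get_device_name(device_id),
    )
    if enable:
        torch.backends.cudnn.benchmark = True
    return enable
```

The caller would pass the index chosen via --device-id, if any, so that an idle second GPU never influences the setting.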

Sysinfo

sysinfo.txt

What browsers do you use to access the UI ?

Mozilla Firefox

Console logs

n/a

Additional information

I worked with voldy on debugging this on Discord to find the root cause but I'm opening an issue here for more visibility and to better track it.

Also, funnily enough, I didn't see it until now, but #9359 is related to this. Since the current code doesn't fulfill the PR's original purpose of applying only to 16XX cards, and only when they're actually in use (this is what made it so hard to track down), this really should be considered a bug and not a feature request.

@catboxanon catboxanon added bug-report Report of a bug, yet to be confirmed bug Report of a confirmed bug and removed bug-report Report of a bug, yet to be confirmed labels Aug 31, 2023
@catboxanon catboxanon changed the title [Bug]: Device capability check incorrectly sets cuDNN benchmark on cards it should not [Bug]: Device capability check incorrectly sets cuDNN benchmark on cards it should not or on multi-GPU systems, causes non-deterministic results Aug 31, 2023
@catboxanon catboxanon linked a pull request Sep 2, 2023 that will close this issue
4 tasks