[Bug]: Device capability check incorrectly sets cuDNN benchmark on cards it should not or on multi-GPU systems, causes non-deterministic results #12879

catboxanon · 2023-08-31T05:17:14Z

Is there an existing issue for this?

I have searched the existing issues and checked the recent builds/commits

What happened?

stable-diffusion-webui/modules/devices.py

Lines 61 to 64 in 5ef669d

    
    # enabling benchmark option seems to enable a range of cards to do fp16 when they otherwise can't
    # see https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/4407
    if any(torch.cuda.get_device_capability(devid) == (7, 5) for devid in range(0, torch.cuda.device_count())):
        torch.backends.cudnn.benchmark = True

The check above enables cuDNN benchmark on cards it should not, as the title describes. The linked PR was intended to apply only to 16XX series cards, but a CUDA compute capability of 7.5 covers more than just the 16XX series. https://developer.nvidia.com/cuda-gpus

As noted in the PyTorch docs, enabling this also makes results non-deterministic. https://pytorch.org/docs/stable/notes/randomness.html#cuda-convolution-benchmarking

This is particularly problematic because 1) I have a 2080 Ti, which supports fp16 operations just fine, so the option gets enabled when it shouldn't be, and 2) I don't actually run the webui on my 2080 Ti but on my 3090, as I have two GPUs installed (this happens because the check looks for any() GPU that has this compute capability).
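To make the multi-GPU case concrete, here is a small sketch that reproduces the logic of the check without needing a GPU. The `capabilities` dict is a hypothetical stand-in for what `torch.cuda.get_device_capability(devid)` would report on a machine like mine; the device ordering (3090 as device 0, 2080 Ti as device 1) is an assumption for illustration, and both helper names are made up.

```python
# Hypothetical stand-in for torch.cuda.get_device_capability(devid) on a two-GPU system.
# Assumed ordering: device 0 = RTX 3090 (8.6), device 1 = RTX 2080 Ti (7.5).
capabilities = {0: (8, 6), 1: (7, 5)}


def any_device_check(caps):
    """Mirrors the current check: true if ANY installed GPU reports 7.5."""
    return any(cap == (7, 5) for cap in caps.values())


def selected_device_check(caps, device_id):
    """Looks only at the GPU that is actually in use."""
    return caps[device_id] == (7, 5)


# The webui runs on device 0 (the 3090), yet the any() form still fires
# because the idle 2080 Ti reports compute capability 7.5.
benchmark_with_any = any_device_check(capabilities)
benchmark_with_selected = selected_device_check(capabilities, 0)
```

With these assumed values the any() form turns the option on, while a check limited to the device in use does not. Note that a per-device capability check alone would still match a 2080 Ti if it were the active card.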

Steps to reproduce the problem

Run the webui with any non-16XX card installed that has a compute capability of 7.5. Results between webui restarts will never be 1:1, because cuDNN benchmarking makes them non-deterministic.

What should have happened?

This benchmark option should only be enabled for 16XX cards, and only if they are actually in use by the webui (either as the default or the one determined by --device-id).
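A rough sketch of what a narrower check could look like, not necessarily what the eventual fix does. It assumes 16XX cards can be recognised by the substring "GTX 16" in the name returned by `torch.cuda.get_device_name`, which I have not verified across drivers and platforms. The helper names and the `device_id` parameter are hypothetical; in the webui the id would come from --device-id.

```python
def should_enable_cudnn_benchmark(capability, device_name):
    """Return True only for a compute capability 7.5 card whose name looks like a 16XX.

    Assumption: 16XX cards report a name containing "GTX 16".
    """
    return tuple(capability) == (7, 5) and "GTX 16" in device_name


def apply_to_active_device(device_id=None):
    """Enable cuDNN benchmark only if the GPU actually in use is a 16XX card."""
    import torch  # imported here so the pure check above works without torch installed

    if not torch.cuda.is_available():
        return False
    if device_id is None:
        # Fall back to the device torch is currently using.
        device_id = torch.cuda.current_device()
    capability = torch.cuda.get_device_capability(device_id)
    name = torch.cuda.get_device_name(device_id)
    if should_enable_cudnn_benchmark(capability, name):
        torch.backends.cudnn.benchmark = True
        return True
    return False
```

Splitting the decision into a pure function keeps it testable without a GPU. For example, `should_enable_cudnn_benchmark((7, 5), "NVIDIA GeForce RTX 2080 Ti")` is False, while the same capability with a "GTX 1660" name is True.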

Sysinfo

sysinfo.txt

What browsers do you use to access the UI ?

Mozilla Firefox

Console logs

n/a

Additional information

I worked with voldy on debugging this on Discord to find the root cause but I'm opening an issue here for more visibility and to better track it.

Also, funnily enough, I didn't see it until now, but #9359 is related to this. Since the current code doesn't fulfill the PR's original purpose of applying to only 16XX cards and when they're actually in use (this is what made it so hard to track down) this really should be considered a bug and not a feature request.


catboxanon added the bug-report label (report of a bug, yet to be confirmed), then added the bug label (report of a confirmed bug) and removed the bug-report label Aug 31, 2023

catboxanon changed the title ~~[Bug]: Device capability check incorrectly sets cuDNN benchmark on cards it should not~~ [Bug]: Device capability check incorrectly sets cuDNN benchmark on cards it should not or on multi-GPU systems, causes non-deterministic results Aug 31, 2023

This was referenced Aug 31, 2023

Commandline arg for disabling "torch.backends.cudnn.benchmark" #9359

Closed

More accurate check for enabling cuDNN benchmark on 16XX cards #12924

Merged

catboxanon linked a pull request Sep 2, 2023 that will close this issue: More accurate check for enabling cuDNN benchmark on 16XX cards #12924 (Merged)

catboxanon closed this as completed Sep 9, 2023
