
Error with CUDA version: "the provided PTX was compiled with an unsupported toolchain" #517

Open
olivbrau opened this issue Dec 8, 2024 · 5 comments



olivbrau commented Dec 8, 2024

Hi everyone.
After successfully using the CPU build of sd,
I tried the CUDA build for Windows: sd-master-9578fdc-bin-win-cuda12-x64.zip.
I unzipped the DLLs from cudart-sd-bin-win-cu12-x64.zip into the same directory.
When I run sd, I get this error:

ggml_cuda_compute_forward: GET_ROWS failed
CUDA error: the provided PTX was compiled with an unsupported toolchain.
current device: 0, in function ggml_cuda_compute_forward at D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2174
err
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:70: CUDA error

I have not installed the full CUDA driver from NVIDIA because I don't have admin rights on my computer (and I fear I never will).
Could this explain the error?
Does cudart-sd-bin-win-cu12-x64.zip provide all the DLLs required to run sd in my case?
Is there a way to run sd on the GPU on a computer without admin rights?
(I know I can run some LLM tools on my GPU without installing the full drivers: for example, LM Studio can do it, though I don't know how.)

For information, here is the full log:

sd -m .\sd-v1-4.ckpt -p "a cute birman cat" --steps 30 -H 1024 -W 1024
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX A1000 Laptop GPU, compute capability 8.6, VMM: yes
[INFO ] stable-diffusion.cpp:191 - loading model from '.\sd-v1-4.ckpt'
[INFO ] model.cpp:891 - load .\sd-v1-4.ckpt using checkpoint format
ZIP 0, name = archive/data.pkl, dir = archive/
[INFO ] stable-diffusion.cpp:238 - Version: SD 1.x
[INFO ] stable-diffusion.cpp:271 - Weight type: f32
[INFO ] stable-diffusion.cpp:272 - Conditioner weight type: f32
[INFO ] stable-diffusion.cpp:273 - Diffusion model weight type: f32
[INFO ] stable-diffusion.cpp:274 - VAE weight type: f32
[INFO ] stable-diffusion.cpp:512 - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:516 - loading model from '.\sd-v1-4.ckpt' completed, taking 8.97s
[INFO ] stable-diffusion.cpp:546 - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:673 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1199 - apply_loras completed, taking 0.00s
ggml_cuda_compute_forward: GET_ROWS failed
CUDA error: the provided PTX was compiled with an unsupported toolchain.
current device: 0, in function ggml_cuda_compute_forward at D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2174
err
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:70: CUDA error

Any help would be much appreciated!
Olivier

Contributor

stduhpf commented Dec 8, 2024

I have not installed the full CUDA driver from NVIDIA because I don't have admin rights on my computer (and I fear I never will).
Could this explain the error?

Most likely, yes. If you can't run CUDA, you should try Vulkan or SYCL instead.
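
The Vulkan build from the releases page takes the exact same command line, so switching is just a matter of downloading a different zip (assuming your existing GPU driver exposes Vulkan; you may also need the Vulkan runtime DLLs if they aren't already on the machine):

sd -m .\sd-v1-4.ckpt -p "a cute birman cat" --steps 30 -H 1024 -W 1024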

@Green-Sky
Contributor

Looking at the output, it should work though.

Device 0: NVIDIA RTX A1000 Laptop GPU, compute capability 8.6, VMM: yes

It definitely finds the device. Maybe we compile with too new a version of CUDA. The CI-built binaries use 12.2.
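
You can check what your installed driver supports without admin rights. This is just a quick sanity check, assuming nvidia-smi shipped with your driver (it usually lives in C:\Windows\System32):

nvidia-smi

The "CUDA Version" printed in the header of its output is the newest CUDA runtime the driver can handle; if it reports something below 12.2, PTX built by the CI toolchain will fail to load with exactly this "unsupported toolchain" error.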

In the meantime, follow @stduhpf's advice :).

Author

olivbrau commented Dec 9, 2024

Thanks a lot for your (quick) help!
I downloaded the "runtime zip" called VulkanRT-1.3.296.0-Components.zip, which doesn't need admin rights.
Now the error is a lack of memory.
My graphics card only has 4 GB of VRAM.
However, when using sd on the CPU, I noticed that it needed only ~2.5 GB of RAM, so I thought it would work on my graphics card.
The error below says that I need 8.7 GB of VRAM to run the generation.
I've tried some options to run part of the model on the CPU, but it still needs 8.7 GB of VRAM.
Is there a way to reduce this requirement?
Thanks in advance.

sd -m .\sd-v1-4.ckpt -p "a lovely cat" --steps 30 -H 1024 -W 1024 --vae-on-cpu --clip-on-cpu --diffusion-fa
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA RTX A1000 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32
ggml_vulkan: Compiling shaders..............................Done!
[INFO ] stable-diffusion.cpp:191 - loading model from '.\sd-v1-4.ckpt'
[INFO ] model.cpp:891 - load .\sd-v1-4.ckpt using checkpoint format
ZIP 0, name = archive/data.pkl, dir = archive/
[INFO ] stable-diffusion.cpp:238 - Version: SD 1.x
[INFO ] stable-diffusion.cpp:271 - Weight type: f32
[INFO ] stable-diffusion.cpp:272 - Conditioner weight type: f32
[INFO ] stable-diffusion.cpp:273 - Diffusion model weight type: f32
[INFO ] stable-diffusion.cpp:274 - VAE weight type: f32
[INFO ] stable-diffusion.cpp:318 - CLIP: Using CPU backend
[INFO ] stable-diffusion.cpp:322 - Using flash attention in the diffusion model
[INFO ] stable-diffusion.cpp:350 - VAE Autoencoder: Using CPU backend
[INFO ] stable-diffusion.cpp:512 - total params memory size = 2719.24MB (VRAM 2155.33MB, RAM 563.91MB): clip 469.44MB(RAM), unet 2155.33MB(VRAM), vae 94.47MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:516 - loading model from '.\sd-v1-4.ckpt' completed, taking 14.84s
[INFO ] stable-diffusion.cpp:546 - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:673 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1199 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1332 - get_learned_condition completed, taking 246 ms
[INFO ] stable-diffusion.cpp:1355 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1359 - generating image: 1/1 - seed 42
ggml_vulkan: Device memory allocation of size 8767072784 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 8767072784
[ERROR] ggml_extend.hpp:1016 - unet: failed to allocate the compute buffer

Author

olivbrau commented Dec 9, 2024

Hmm, in fact I was wrong: with -H 1024 -W 1024 it really does need that much memory, even on the CPU...
By reducing the image size, it finally works on my GPU!
However, I'm still eager to know how I could reduce the memory requirement in some way other than just limiting the image size.
For example, all of these options resulted in the same requirement of 8767072784 bytes of VRAM in the error message:
--vae-on-cpu --clip-on-cpu --diffusion-fa --vae-tiling
-> with or without these options, the VRAM requirement seems to be the same

@Green-Sky
Contributor

[INFO ] stable-diffusion.cpp:272 - Conditioner weight type: f32
[INFO ] stable-diffusion.cpp:273 - Diffusion model weight type: f32
[INFO ] stable-diffusion.cpp:274 - VAE weight type: f32

You likely don't need the extra precision that f32 gives you. You can use different quantizations to lower memory usage. Try --type f16 for starters.
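
For example (a sketch of your earlier command; --type f16 converts the f32 checkpoint weights to f16 at load time, roughly halving the weight memory):

sd -m .\sd-v1-4.ckpt -p "a lovely cat" --steps 30 -H 1024 -W 1024 --type f16

Note that the 8767072784-byte allocation that failed is the unet compute buffer, which scales with image resolution, so a lower weight type may not be enough on its own at 1024x1024. SD 1.x was trained at 512x512, so -H 512 -W 512 is the natural size for this model anyway.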
