
Error with CUDA version: "the provided PTX was compiled with an unsupported toolchain" #517

Open
olivbrau opened this issue Dec 8, 2024 · 5 comments



olivbrau commented Dec 8, 2024

Hi everyone.
After successfully using the CPU build of sd,
I tried the CUDA build for Windows: sd-master-9578fdc-bin-win-cuda12-x64.zip.
I unzipped the DLLs from cudart-sd-bin-win-cu12-x64.zip into the same directory.
When I run sd, I get this error:

ggml_cuda_compute_forward: GET_ROWS failed
CUDA error: the provided PTX was compiled with an unsupported toolchain.
current device: 0, in function ggml_cuda_compute_forward at D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2174
err
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:70: CUDA error

I have not installed the full CUDA driver from NVIDIA because I don't have admin rights on my computer (and I fear I never will).
Could this explain the error?
Does cudart-sd-bin-win-cu12-x64.zip provide all the DLLs required to run sd in my case?
Is there a way to run sd on the GPU on a computer without admin rights?
(I know I can run some LLM tools on my GPU without installing the full drivers: for example, LM Studio can do it, though I don't know how.)

For information, here is the full log:

sd -m .\sd-v1-4.ckpt -p "a cute birman cat" --steps 30 -H 1024 -W 1024
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX A1000 Laptop GPU, compute capability 8.6, VMM: yes
[INFO ] stable-diffusion.cpp:191 - loading model from '.\sd-v1-4.ckpt'
[INFO ] model.cpp:891 - load .\sd-v1-4.ckpt using checkpoint format
ZIP 0, name = archive/data.pkl, dir = archive/
[INFO ] stable-diffusion.cpp:238 - Version: SD 1.x
[INFO ] stable-diffusion.cpp:271 - Weight type: f32
[INFO ] stable-diffusion.cpp:272 - Conditioner weight type: f32
[INFO ] stable-diffusion.cpp:273 - Diffusion model weight type: f32
[INFO ] stable-diffusion.cpp:274 - VAE weight type: f32
[INFO ] stable-diffusion.cpp:512 - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:516 - loading model from '.\sd-v1-4.ckpt' completed, taking 8.97s
[INFO ] stable-diffusion.cpp:546 - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:673 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1199 - apply_loras completed, taking 0.00s
ggml_cuda_compute_forward: GET_ROWS failed
CUDA error: the provided PTX was compiled with an unsupported toolchain.
current device: 0, in function ggml_cuda_compute_forward at D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2174
err
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:70: CUDA error

Any help would be much appreciated!
Olivier

Contributor

stduhpf commented Dec 8, 2024

I have not installed the full CUDA driver from NVIDIA because I don't have admin rights on my computer (and I fear I never will).
Could this explain the error?

Most likely, yes. If you can't run CUDA, you should try Vulkan or SYCL instead.
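
The Vulkan build from the releases page takes the exact same command line, so switching is just a matter of downloading a different zip (assuming your existing GPU driver exposes Vulkan; you may also need the Vulkan runtime DLLs if they aren't already on the machine):

sd -m .\sd-v1-4.ckpt -p "a cute birman cat" --steps 30 -H 1024 -W 1024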

@Green-Sky
Contributor

Looking at the output, it should work though.

Device 0: NVIDIA RTX A1000 Laptop GPU, compute capability 8.6, VMM: yes

It definitely finds the device. Maybe we compile with too new a version of CUDA. The CI-built binaries use 12.2.
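
You can check what your installed driver supports without admin rights. This is just a quick sanity check, assuming nvidia-smi shipped with your driver (it usually lives in C:\Windows\System32):

nvidia-smi

The "CUDA Version" printed in the header of its output is the newest CUDA runtime the driver can handle; if it reports something below 12.2, PTX built by the CI toolchain will fail to load with exactly this "unsupported toolchain" error.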

In the meantime, follow @stduhpf's advice :).

Author

olivbrau commented Dec 9, 2024

Thanks a lot for your (quick) help!
I downloaded the "runtime zip" called VulkanRT-1.3.296.0-Components.zip, which doesn't need admin rights.
Now the error is a lack of memory.
My graphics card only has 4 GB of VRAM.
However, when using sd on the CPU, I noticed that it needed only ~2.5 GB of RAM, so I thought it would work on my graphics card.
The error below says that I need 8.7 GB of VRAM to run the generation.
I've tried some options to run part of the model on the CPU, but it still needs 8.7 GB of VRAM.
Is there a way to reduce this requirement?
Thanks in advance.

sd -m .\sd-v1-4.ckpt -p "a lovely cat" --steps 30 -H 1024 -W 1024 --vae-on-cpu --clip-on-cpu --diffusion-fa
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA RTX A1000 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32
ggml_vulkan: Compiling shaders..............................Done!
[INFO ] stable-diffusion.cpp:191 - loading model from '.\sd-v1-4.ckpt'
[INFO ] model.cpp:891 - load .\sd-v1-4.ckpt using checkpoint format
ZIP 0, name = archive/data.pkl, dir = archive/
[INFO ] stable-diffusion.cpp:238 - Version: SD 1.x
[INFO ] stable-diffusion.cpp:271 - Weight type: f32
[INFO ] stable-diffusion.cpp:272 - Conditioner weight type: f32
[INFO ] stable-diffusion.cpp:273 - Diffusion model weight type: f32
[INFO ] stable-diffusion.cpp:274 - VAE weight type: f32
[INFO ] stable-diffusion.cpp:318 - CLIP: Using CPU backend
[INFO ] stable-diffusion.cpp:322 - Using flash attention in the diffusion model
[INFO ] stable-diffusion.cpp:350 - VAE Autoencoder: Using CPU backend
[INFO ] stable-diffusion.cpp:512 - total params memory size = 2719.24MB (VRAM 2155.33MB, RAM 563.91MB): clip 469.44MB(RAM), unet 2155.33MB(VRAM), vae 94.47MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:516 - loading model from '.\sd-v1-4.ckpt' completed, taking 14.84s
[INFO ] stable-diffusion.cpp:546 - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:673 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1199 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1332 - get_learned_condition completed, taking 246 ms
[INFO ] stable-diffusion.cpp:1355 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1359 - generating image: 1/1 - seed 42
ggml_vulkan: Device memory allocation of size 8767072784 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 8767072784
[ERROR] ggml_extend.hpp:1016 - unet: failed to allocate the compute buffer

Author

olivbrau commented Dec 9, 2024

Hmm, in fact I was wrong: with -H 1024 -W 1024 it really does need that much memory, even on the CPU...
By reducing the image size, it finally works on my GPU!
However, I'm still eager to know how I could reduce the memory requirement in some way other than just limiting the image size.
For example, all of these options resulted in the same requirement of 8767072784 bytes of VRAM in the error message:
--vae-on-cpu --clip-on-cpu --diffusion-fa --vae-tiling
-> with or without these options, the VRAM requirement seems to be the same

@Green-Sky
Contributor

[INFO ] stable-diffusion.cpp:272 - Conditioner weight type: f32
[INFO ] stable-diffusion.cpp:273 - Diffusion model weight type: f32
[INFO ] stable-diffusion.cpp:274 - VAE weight type: f32

You likely don't need the extra precision that f32 gives you. You can use different quantizations to lower memory usage. Try --type f16 for starters.
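
For example (a sketch of your earlier command; --type f16 converts the f32 checkpoint weights to f16 at load time, roughly halving the weight memory):

sd -m .\sd-v1-4.ckpt -p "a lovely cat" --steps 30 -H 1024 -W 1024 --type f16

Note that the 8767072784-byte allocation that failed is the unet compute buffer, which scales with image resolution, so a lower weight type may not be enough on its own at 1024x1024. SD 1.x was trained at 512x512, so -H 512 -W 512 is the natural size for this model anyway.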
