Add NVIDIA cuBLAS support #1044
Conversation
I would bring up CLBlast, as it's been implemented over at https://github.com/LostRuins/koboldcpp/ and isn't Nvidia-exclusive, but in my experience the speed-ups are minor, or it just ends up being slower than OpenBLAS in cases where the dGPU isn't that good or the CPU is simply better. The speed-up here with cuBLAS seems much more pronounced.
Great - and I guess ppl results are similar between non-cuBLAS and cuBLAS?
I haven't completed a full run yet, but with 7B q4_0 the perplexity of the first iterations is identical to OpenBLAS. It will probably be higher for f16xf32 mat muls, because instead of converting to f32xf32, I convert to f16xf16.
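For context, a minimal sketch (not the PR's actual code) of what an f16 x f16 matrix multiply through cuBLAS with an f32 result can look like; the function name, buffer names, and dimensions are placeholders, and it assumes column-major data already resident on the device:

// Hypothetical sketch: C (m x n) = A (m x k) * B (k x n), with f16 inputs
// and an f32 result, accumulated in f32. Error checking omitted for brevity.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void gemm_f16_f16_to_f32(cublasHandle_t handle,
                         const __half *d_A, const __half *d_B, float *d_C,
                         int m, int n, int k) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    // Mixed-precision GEMM: the inputs stay in f16, so only half as much data
    // has to be converted and copied compared to promoting everything to f32.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 d_A, CUDA_R_16F, m,
                 d_B, CUDA_R_16F, k,
                 &beta,
                 d_C, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F,   // cuBLAS 11+ compute type
                 CUBLAS_GEMM_DEFAULT);
}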
Perplexity with 7B q4_0 is 6.2838
./perplexity -m models/7B/ggml-model-q4_0.bin -f wikitext-2-raw/wiki.test.raw -t 8
main: seed = 1681837585
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 59.11 KB
llama_model_load_internal: mem required = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0
llama_print_timings: load time = 11045.83 ms
Is FindCUDAToolkit a good reason to bump the CMake version to 3.17?
This is the expected value
Yes
Tested successfully under Windows (built with cuBLAS enabled). Though I would appreciate a review on the CMake changes, I have no idea how any of that works.
hmm, cmake on ubuntu 20.04 ships 3.16 by default, but even the gh action runner uses 3.26
Is it possible to make the CMake version requirement depend on whether cuBLAS is enabled?
@@ -97,6 +97,10 @@ ifdef LLAMA_OPENBLAS
	CFLAGS  += -DGGML_USE_OPENBLAS -I/usr/local/include/openblas
	LDFLAGS += -lopenblas
endif
ifdef LLAMA_CUBLAS
	CFLAGS  += -DGGML_USE_CUBLAS -I/usr/local/cuda/include
	LDFLAGS += -lcublas_static -lculibos -lcudart_static -lcublasLt_static -lpthread -ldl -L/usr/local/cuda/lib64
pthread is added above, depending on the OS.
wait, do we actually ever link against pthread? why is it only a compile flag?
From what I understand, it is a dependency of CUDA, so it is required to build with cuBLAS.
That seems to work, updated.
yup, perfect
Very exciting. Can't wait to try it out 🤩
Just wondering, for all those who have tried: how much speedup do you get in the batched prompt eval timings vs OpenBLAS (not the perplexity calculations)? It would be good to benchmark against a fixed context size, say 1024 tokens.
@rabidcopy our newest CLBlast implementation does the dequantization on the GPU as well, which actually provides much better speeds, since a major bottleneck was transferring the data on and off the GPU after the mat mul. That's why I am curious how this might compare.
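For anyone unfamiliar with the approach, here is a rough sketch of what "dequantization on the GPU" can look like for ggml's q4_0 layout (32 weights per block: one f32 scale plus 16 bytes of packed nibbles), so only the small quantized blocks (20 bytes per 32 weights instead of 128) have to cross the bus before the mat mul. The kernel name and launch setup are illustrative, not koboldcpp's or this PR's actual code:

#include <stdint.h>

#define QK 32

typedef struct {
    float   d;            // block scale
    uint8_t qs[QK / 2];   // 4-bit quants, two per byte
} block_q4_0;

// One thread per q4_0 block: unpack the nibbles, recenter around zero, scale.
__global__ void dequantize_q4_0(const block_q4_0 *x, float *y, int nb) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nb) return;

    const float d = x[i].d;
    for (int j = 0; j < QK / 2; ++j) {
        const uint8_t b = x[i].qs[j];
        y[i * QK + 2 * j + 0] = ((int)(b & 0x0F) - 8) * d;
        y[i * QK + 2 * j + 1] = ((int)(b >> 4)   - 8) * d;
    }
}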
Found a comparison someone did between llama.cpp with cuBLAS and koboldcpp with CLBlast. Maybe it would be worth implementing CLBlast over here as well? (Sorry, I wasn't aware there were further improvements to CLBlast in koboldcpp since I last compared on my own hardware.)
@LostRuins I have a thread going in the discussions where people are trying out the Kobold CLBlast implementation. On my integrated Intel HD 530, CLBlast prompt ingestion was twice as slow as OpenBLAS, but someone with an Nvidia 3060 reported a 50% improvement on his end.
Here are benchmarks for my system. Note: this is with the non-quantized 13B 16-bit model.

With cuBLAS:
make clean && LLAMA_CUBLAS=1 make -j && ./main --mlock -t 8 -b 512 -m ./models/13B/ggml-model-f16.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt
llama_print_timings: load time = 20691.75 ms
llama_print_timings: sample time = 16.89 ms / 50 runs ( 0.34 ms per run)
llama_print_timings: prompt eval time = 18748.63 ms / 373 tokens ( 50.26 ms per token)
llama_print_timings: eval time = 24565.83 ms / 49 runs ( 501.34 ms per run)
llama_print_timings: total time = 45275.08 ms

With OpenBLAS:
make clean && LLAMA_OPENBLAS=1 make -j && ./main --mlock -t 8 -b 512 -m ./models/13B/ggml-model-f16.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt
llama_print_timings: load time = 43043.43 ms
llama_print_timings: sample time = 17.31 ms / 50 runs ( 0.35 ms per run)
llama_print_timings: prompt eval time = 27472.01 ms / 373 tokens ( 73.65 ms per token)
llama_print_timings: eval time = 24480.05 ms / 49 runs ( 499.59 ms per run)
llama_print_timings: total time = 67541.45 ms

So that's a ~48% total time speedup, super nice!
cc @ravenscroftj https://github.com/ggerganov/llama.cpp/blob/master/Makefile#L107-L115 Will be available in the
oh that is awesome, thanks for the tag @ggerganov - will definitely be looking at adding this, as making suggestions much faster will make turbopilot much more usable!
CUDA_CHECK(cudaFree(d_X));
CUDA_CHECK(cudaFree(d_Y));
CUDA_CHECK(cudaFree(d_D));
#endif
Why not add cuda quantize row below as well?
It's not used in cuBLAS.
Yes, my bad, we do not need to quantize the output tensor nor the weight matrix.
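For readers following the diff, here is a rough sketch of the host-side flow that those frees close out, assuming d_X and d_Y hold the already converted f32 inputs and d_D receives the f32 result; since the result comes back as f32 and feeds the next ggml op directly, no quantize-row step is needed afterwards. This is an illustration of the general pattern, not the PR's exact code, and error checks are omitted.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// d = x (m x k) * y (k x n), all f32, via cuBLAS.
static void mul_mat_f32_cublas(cublasHandle_t handle,
                               const float *x, const float *y, float *d,
                               int m, int n, int k) {
    float *d_X, *d_Y, *d_D;
    const float alpha = 1.0f, beta = 0.0f;

    // Allocate device buffers and upload the inputs.
    cudaMalloc((void **) &d_X, sizeof(float) * m * k);
    cudaMalloc((void **) &d_Y, sizeof(float) * k * n);
    cudaMalloc((void **) &d_D, sizeof(float) * m * n);
    cudaMemcpy(d_X, x, sizeof(float) * m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(d_Y, y, sizeof(float) * k * n, cudaMemcpyHostToDevice);

    // Single-precision mat mul on the GPU.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, d_X, m, d_Y, k, &beta, d_D, m);

    // Copy the f32 result back and release the buffers (as in the diff above).
    cudaMemcpy(d, d_D, sizeof(float) * m * n, cudaMemcpyDeviceToHost);
    cudaFree(d_X);
    cudaFree(d_Y);
    cudaFree(d_D);
}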
Adds support for NVIDIA cuBLAS for batched operations. On my system this is significantly faster than OpenBLAS.

Build with LLAMA_CUBLAS.

Perplexity seconds per pass (i9 9900k, RTX 3080 10GB):