Improve cuBLAS performance by dequantizing on the GPU #1065
Conversation
@@ -150,6 +150,10 @@ if (LLAMA_CUBLAS)
    if (CUDAToolkit_FOUND)
        message(STATUS "cuBLAS found")

        enable_language(CUDA)

        set(GGML_CUDA_SOURCES ggml-cuda.cu ggml-cuda.h)
There was a discussion somewhere recently about splitting out the accel-specific code into dedicated .c files. What was the state on that?
I tried to keep all the CUDA code in ggml-cuda.cu to avoid having to compile ggml with nvcc, but otherwise nothing changed.
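Concretely, the split looks roughly like this (just a sketch; ggml_cuda_mul_mat_example is a made-up name, not the actual ggml-cuda.h interface): the header exposes a plain C declaration that ggml.c can call, and only ggml-cuda.cu goes through nvcc.

// ggml-cuda.h (included from ggml.c, which is compiled by the plain C compiler)
#ifdef __cplusplus
extern "C" {
#endif

void ggml_cuda_mul_mat_example(const void * src0, const float * src1, float * dst, int m, int n, int k);

#ifdef __cplusplus
}
#endif

// ggml-cuda.cu (compiled by nvcc; all kernels and cuBLAS calls live here)
extern "C" void ggml_cuda_mul_mat_example(const void * src0, const float * src1, float * dst, int m, int n, int k) {
    // ... dequantize src0 on the device, run the cuBLAS GEMM, copy dst back ...
    (void) src0; (void) src1; (void) dst; (void) m; (void) n; (void) k;
}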
Good stuff!
Just keep in mind that my long-term plan for GPU support is different, so it is likely that these changes will be dropped at some point.
Can you also provide sample times for prompt ingestion with and without cuBLAS? Maybe with one of the chat examples that we have in the repo.
I just added some prompt eval times for 7B q4_0.
Wow, this is a game changer! Interestingly, 16 threads and 8 threads seem to be the same speed now. It only uses ~600 MB of GPU RAM (RTX 3080), at around 65% GPU utilization. Amazing work! All tests run with:
w/ cuBLAS (this PR):
Without CUDA (8 threads):
Even more incredible: this allows me to run the full 65B model on a machine with 32 GB of RAM quite quickly!
Btw, for comparison, from last month 7B was at:
Nice! Don't use this to run perplexity computations just yet, though; I found a synchronization issue that may cause inaccurate results. It should be fixed in the last commit. I am running a full perplexity test, and if it looks good this will be ready to merge.
Wait... how does that work? Aren't you supposed to need ~60 GB of RAM for 65B?
I'm using mmap mode, so it has to go to disk to read in parts of the model as it goes. That was brutally slow previously, but the overlap with running things on the GPU seems to make it feasible now.
Actually, even on CPU it's much better than it used to be. Everyone is doing amazing work here :).
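For anyone wondering how the mmap path makes this possible, here is a minimal sketch of mmap-backed loading (a hypothetical helper, not the actual llama.cpp loader): the file is mapped read-only and the OS pages tensor data in from disk on demand, so only the parts of the 65B model that are actually touched need to be resident in RAM at any moment.

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a model file read-only; the kernel pages data in lazily on first access.
static void * map_model_file(const char * path, size_t * size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return NULL;
    }
    struct stat st;
    fstat(fd, &st);
    void * data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after the descriptor is closed
    *size_out = (size_t) st.st_size;
    return data == MAP_FAILED ? NULL : data;
}

While one chunk of weights is being multiplied on the GPU, page faults for the next chunk can be serviced from disk in parallel, which is presumably where the overlap mentioned above comes from.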
We could probably get another 10% or so speedup by pre-allocating the CUDA memory, but I am not sure how to do that without littering the ggml code with more CUDA-specific stuff.
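One low-touch option (only a sketch under that constraint; the names below are made up and not part of ggml) would be a grow-only scratch allocator kept entirely inside ggml-cuda.cu, so the rest of ggml never sees a CUDA call:

#include <cuda_runtime.h>
#include <stddef.h>

static void * g_cuda_scratch      = NULL;
static size_t g_cuda_scratch_size = 0;

// Reuse one device buffer across calls instead of cudaMalloc/cudaFree per matmul.
static void * ggml_cuda_scratch_get(size_t size) {
    if (size > g_cuda_scratch_size) {
        if (g_cuda_scratch != NULL) {
            cudaFree(g_cuda_scratch);
        }
        cudaMalloc(&g_cuda_scratch, size);
        g_cuda_scratch_size = size;
    }
    return g_cuda_scratch;
}

Since the allocator lives next to the kernels, the only change visible from ggml.c would be that the matmul path gets its device buffers from this helper instead of allocating them each time.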
On a side note, should we increase the default batch size when ggml is built with BLAS support? It would make it easier to use.
A problem while building for Windows using Visual Studio:
FAILED: CMakeFiles/ggml.dir/ggml-cuda.cu.obj
I believe ggml doesn't even use BLAS if the batch size isn't large enough, despite system_info reporting BLAS=1. You need a larger batch size to cover the overhead of using the library. Personally, I haven't seen any performance difference between BLAS runs with, say, a batch size of 512 vs 2048.
That's right, the default batch size is 8, but the minimum to use BLAS is 32.
Currently the maximum batch size is 512; if you try to use a larger one, it will be clamped to 512.
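For reference, the gating works roughly like the check sketched below (illustrative only; the exact condition and thresholds in ggml may differ): below the minimum batch size the plain CPU path is used regardless of what system_info reports.

#include <stdbool.h>

// Sketch of the kind of size check that decides whether the BLAS path is worth it:
// small batches are faster on the plain CPU path because the BLAS call has fixed
// overhead (and, with cuBLAS, host-to-device copies) to amortize.
static bool use_blas_for_batch(int n_batch) {
    return n_batch >= 32; // assumed threshold, matching the "minimum to use BLAS is 32" above
}

So raising the default batch size above 32 is what actually turns the BLAS path on for prompt ingestion.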
@avada-z I think it should be fixed now.
Hello there. I'm trying to build it using the make LLAMA_CUBLAS=1 command on Windows with WIN64DevKit. However, even though I have the CUDA Toolkit installed and changed the paths for -L and -I in the Makefile accordingly, it still misses the following libraries:
Where can I get them? I would appreciate some help getting this to work. Thank you!
cudaMemcpyAsync(d_Q, (char *) src0->data + i03*nb03 + i02*nb02,
            GGML_TYPE_SIZE[type] * x_ne / GGML_BLCK_SIZE[type], cudaMemcpyHostToDevice, cudaStream));

dequantize_row_q_cuda(d_Q, d_X, ne01 * ne00, cudaStream);
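For context, a q4_0 dequantization kernel along these lines is what dequantize_row_q_cuda ends up launching. The version below is a simplified sketch (the block layout shown, one float scale plus 16 packed nibbles per 32 weights, matches the q4_0 format of this era, but the actual kernel in ggml-cuda.cu may be organized differently):

#include <cuda_runtime.h>
#include <stdint.h>

#define QK 32

typedef struct {
    float   d;          // block scale
    uint8_t qs[QK / 2]; // 4-bit quants, two per byte
} block_q4_0;

static __global__ void dequantize_block_q4_0(const block_q4_0 * x, float * y) {
    const int i = blockIdx.x;   // one CUDA block per quantized block
    const int l = threadIdx.x;  // 0..15, one thread per packed byte

    const float   d = x[i].d;
    const uint8_t v = x[i].qs[l];

    // Low nibble -> even element, high nibble -> odd element, centered at 8.
    y[i*QK + 2*l + 0] = ((v & 0xf) - 8) * d;
    y[i*QK + 2*l + 1] = ((v >>  4) - 8) * d;
}

// Host-side wrapper, mirroring the dequantize_row_q_cuda(d_Q, d_X, ...) call above.
static void dequantize_row_q4_0_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
    const int nb = k / QK;
    dequantize_block_q4_0<<<nb, QK/2, 0, stream>>>((const block_q4_0 *) vx, y);
}

Because both the async copy and the kernel are issued on the same stream, the dequantized matrix d_X is ready by the time a subsequent cuBLAS GEMM on that stream runs.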
I believe we might be able to get better perf via CUDA graphs by stitching together dequantize, sgemm, and quantize. Thoughts? See #1192.
If the weights are stored in device HBM/DRAM, I suspect we can get much better perf than copying the weights each time.
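For the CUDA graphs idea, a stream-capture sketch would look something like the following (the work enqueued inside the capture region is only indicated by a comment, and the instantiate call shown uses the CUDA 11.x signature):

#include <cuda_runtime.h>

static cudaGraph_t     g_graph      = NULL;
static cudaGraphExec_t g_graph_exec = NULL;

static void capture_and_launch(cudaStream_t stream) {
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    // ... enqueue the dequantize kernel, the cublasSgemm call (bound to the
    //     same stream), and any async copies here; nothing executes during
    //     capture, the work is only recorded into the graph ...

    cudaStreamEndCapture(stream, &g_graph);

    // CUDA 12 shortened this to cudaGraphInstantiate(&exec, graph, flags).
    cudaGraphInstantiate(&g_graph_exec, g_graph, NULL, NULL, 0);

    // Replaying the recorded dequantize -> sgemm -> copy sequence is now a
    // single low-overhead launch that can be repeated every batch.
    cudaGraphLaunch(g_graph_exec, stream);
}

The bigger win suggested above, keeping the quantized weights resident in device memory instead of copying them on every matmul, is orthogonal to graphs and would remove the host-to-device copy from the loop entirely.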
For me this makes cuBLAS about twice as fast with quantized models.
Perplexity seconds per pass
Prompt eval time with 7B q4_0 (bs=512)
13B q4_0