Performance issues with cuBLAS and a bug #1129
I don't think we should expect to be able to do everything with cuBLAS; if the user has enough VRAM for that, they should be using a GPU implementation of llama instead. We could investigate the possibility of always keeping the largest matrices in the GPU, up to the available VRAM. I am not sure if that would be enough to use cuBLAS for generation, but it would improve the performance with batch processing. Most of the time the GPU is sitting idle waiting for data; that is expected for the way this is done. We could likely improve utilization by allowing ggml to work on multiple mat muls simultaneously. There was some discussion about this in #1094, but I am not familiar with the graph processing code in ggml and I am not sure what it would take to do that. As for the bug: the loop does nothing in llama.cpp.
I have tried implementing a cache for the quantized matrices (i.e. the weights), and even in the best case where the entire model fits in VRAM, for me this is only ~30% faster in the perplexity test. Considering how much more VRAM this requires, I am not sure that it is worth it. Anyway, if anybody wants to give it a try, it is at slaren@5a606d5.
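For illustration, here is a minimal sketch of the kind of cache described above, assuming the quantized weight data can be keyed by its host pointer and never changes; the names (`get_cached_device_ptr`, the fixed-size table) are made up for this sketch and are not taken from the linked branch:

```c
// Hypothetical device-side cache for constant quantized weights, keyed by the
// host pointer of the tensor data. Not the code from slaren@5a606d5.
#include <cuda_runtime.h>
#include <stddef.h>

#define MAX_CACHED_WEIGHTS 1024

static const void * g_cache_keys[MAX_CACHED_WEIGHTS]; // host pointers
static void       * g_cache_vals[MAX_CACHED_WEIGHTS]; // device pointers
static int          g_cache_count = 0;

// Returns a device copy of host_data, uploading it only the first time.
// Assumes host_data is constant for the lifetime of the program (i.e. weights).
static void * get_cached_device_ptr(const void * host_data, size_t size, cudaStream_t stream) {
    for (int i = 0; i < g_cache_count; i++) {
        if (g_cache_keys[i] == host_data) {
            return g_cache_vals[i]; // already resident in VRAM, no copy needed
        }
    }
    void * dev_ptr = NULL;
    if (g_cache_count == MAX_CACHED_WEIGHTS || cudaMalloc(&dev_ptr, size) != cudaSuccess) {
        return NULL; // cache full or out of VRAM: caller falls back to copying every time
    }
    cudaMemcpyAsync(dev_ptr, host_data, size, cudaMemcpyHostToDevice, stream);
    g_cache_keys[g_cache_count] = host_data;
    g_cache_vals[g_cache_count] = dev_ptr;
    g_cache_count++;
    return dev_ptr;
}
```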
It's true that for F32 x F32 it doesn't change the thread number, but the function checks the thread index:

```c
static void ggml_compute_forward_mul_mat_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
        struct ggml_tensor * dst) {
    // ...
#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS) || defined(GGML_USE_CUBLAS)
    if (ggml_compute_forward_mul_mat_use_blas(src0, src1, dst)) {
        if (params->ith != 0) {
            return;
        }
        // ...
```

The `ith` field is the thread index, so only the first thread takes the BLAS path and the other threads return immediately.

The GEMM computation is not a very big part of the time taken. It really depends on the hardware, but slow memory, slow video memory or a slow PCI bus can totally ruin the performance.

@slaren, 30% improvement is pretty big in my opinion! On slower hardware it could be even more significant.
If there is interest in that, I can implement it properly and open a PR. However, to do it in a safe way in ggml we would need #1093 to be able to tag the matrices that are constant and can be cached in the GPU safely. Currently, I am just assuming that when multiplying q x f32, the quantized matrix is the weights and therefore constant and safe to cache, but that may not always be the case for other uses of ggml.
I think that a smart mix of GPU and CPU can be a very interesting library extension. Any improvements in that area will be awesome. The #1093 suggestion was kind of approved by GG: that is, to extend the tensor node with more information.

Strange performance: of course a tiny multiplication is not worth moving to the GPU if the graph is mostly CPU based, but it shouldn't take 4-5 times longer when doing it that way. Maybe someone has insight into that already?
I think that's all there is to it. The cuBLAS matrix multiplication is very fast, but the latency is high due to having to move the data between main RAM and the GPU. At some point the matrix is big enough that it takes longer to multiply it on the CPU than to copy the data to and from the GPU, but that only happens with big enough matrices; in our case that is when processing large batches. @SlyEcho showed that here. It is the same issue with OpenBLAS: it is only worth using it if the cost of dequantizing the matrices is lower than the time gained from the faster matrix multiplication. 4-5 times slower doesn't seem wrong, and that's why the minimum batch size to use BLAS is 32.
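For reference, the decision of when to take the BLAS path boils down to a simple size check on the operands. A rough sketch of what such a check looks like (the exact conditions and thresholds in ggml may differ):

```c
// Rough sketch of the kind of size check that gates the BLAS path in ggml;
// the thresholds are illustrative.
#include <stdbool.h>
#include "ggml.h"

static bool mul_mat_worth_offloading(const struct ggml_tensor * src0,
                                     const struct ggml_tensor * src1,
                                     const struct ggml_tensor * dst) {
    const int64_t ne10 = src1->ne[0]; // shared dimension
    const int64_t ne0  = dst->ne[0];  // rows of the result
    const int64_t ne1  = dst->ne[1];  // columns of the result (the batch size)

    // Only worth it when all dimensions are reasonably large, i.e. when
    // processing a batch of at least 32 tokens.
    return ggml_is_contiguous(src0) && ggml_is_contiguous(src1) &&
           ne0 >= 32 && ne1 >= 32 && ne10 >= 32;
}
```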
Though it's really a huge delay; the PCIe bus is faster than RAM was a couple of years ago. I'll try to benchmark that area (though the async part makes it a bit harder). One function I was working on (which needs the flags mentioned earlier to be usable) is an actual smart decision maker.
EDIT2: Even though I had the performance power governor enabled, the PCIe buses were power-saving by switching to 2.5 GT/s. All results are with 8 GT/s lanes, across 4 old GTX 1080 Ti GPUs with varying numbers of PCIe lanes. Here are some data points which may be of interest.
It may be useful to see some kind of profile of what the GPU is doing in llama.cpp. For AMD, I can use the ROCm profiling tools.
For high-level tracing of what's going on in the system, NSight Systems is the way to go; it will give you a comparable trace. With lots of kernels, looking at the trace can get overwhelming, so you might want to add some markers/ranges to pinpoint the problematic calls.
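As an illustration of the markers/ranges idea (this is not code from llama.cpp), NVTX ranges can be pushed around the suspect calls; they show up as named spans in the NSight Systems timeline:

```c
// Hypothetical example of wrapping the dequantize + GEMM + copy sequence in
// NVTX ranges so it is easy to find in the NSight Systems timeline.
// Link with -lnvToolsExt.
#include <nvToolsExt.h>

void mul_mat_with_markers(void) {
    nvtxRangePushA("dequantize_weights");
    // ... launch dequantization kernel ...
    nvtxRangePop();

    nvtxRangePushA("cublasSgemm");
    // ... cublasSgemm(...) ...
    nvtxRangePop();

    nvtxRangePushA("copy_result_to_host");
    // ... cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, ...) ...
    nvtxRangePop();
}
```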
@slaren I tried your branch on HIP; it seems to be running the dequant and right after that the GEMM. Speed is only a little better, from 59 minutes (5.49 s/p) to 44 (4.10 s/p). EDIT: I think there might be something to optimize in the kernel launch arguments, you know, with group and local sizes, etc.
I tried that when I made the initial implementation of dequantization on the GPU. Edit: tested it again and still couldn't notice a difference.
Why doesn't CUDA do the device-to-host copy overlapped with the sgemm kernel? Isn't cublasSgemm intelligent enough to determine when some rows are complete and can hence be copied?
No, I don't think it is.
Here is what GPT-3.5 suggests to do:

```c
// Set up CUDA streams
cudaStream_t compStream, copyStream;
cudaStreamCreate(&compStream);
cudaStreamCreate(&copyStream);
// Set up matrices A, B, and C using cuBLAS functions
// ...
// Set up variables for subset computation and copy
int subsetSize = 100;            // Size of each subset of rows
int numSubsets = m / subsetSize; // Total number of subsets
int subsetIdx = 0;               // Index of current subset being computed
int offset = 0;                  // Offset to current subset of rows in matrix C
// Loop over all subsets of rows
for (int i = 0; i < numSubsets; i++) {
    // Set the computation stream for cublasSgemm
    cublasSetStream(handle, compStream);
    // Compute subset of rows in matrix C using cublasSgemm
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, subsetSize, n, k, &alpha,
                &d_A[offset], lda, &d_B[0], ldb, &beta, &d_C[offset], ldc);
    // Set the copy stream for cudaMemcpyAsync
    cudaStreamAttachMemAsync(copyStream, &d_C[offset], 0, cudaMemAttachSingle);
    // Copy computed subset of rows from device to host asynchronously
    cudaMemcpyAsync(&h_C[offset], &d_C[offset], subsetSize * n * sizeof(float),
                    cudaMemcpyDeviceToHost, copyStream);
    // Update subset index and offset
    subsetIdx++;
    offset = subsetIdx * subsetSize * n;
}
// Wait for all asynchronous operations to complete
cudaStreamSynchronize(compStream);
cudaStreamSynchronize(copyStream);
```
From your profiling, it looks like another 25-30% improvement could be had if this were implemented.
If you want to make a proof of concept and show the results, go ahead; a snippet from ChatGPT does not help.
I'm skeptical; these GEMM routines are already heavily optimized for the hardware depending on the exact problem sizes. Besides, the device-to-host copy is small and doesn't take much time.
I'm currently on holiday and won't have access to a CUDA machine for the next month. The idea is clear, which is to split up the sgemm kernel to operate on row subsets rather than the entire matrix, so that the completed subsets can be synced to the host asynchronously.

The profiling graph shows the exact opposite - that it takes 30% of the time. Unless the problem size profiled is much smaller than realistic? @slaren

This improvement is only intended to overlap compute and communication, not to improve compute.
I need to look at how exactly we do that multiplication. Here's how FasterTransformer does it, for example. Note that I'm linking to a very old version of FT for clarity, since I don't think the current versions illustrate this concept very clearly.
I'm guessing from the kernel names that you have an Ampere GPU. Those should have FP16 tensor cores and absolutely no problems with f16 x f16 matmuls (in fact, it's f32 x f32 matmuls that are second-class citizens). Besides, the actual matmul kernels are so tiny in your screenshot I can't even see their names. You should probably check what the CPU was doing between scheduling the copy and the matmul kernels (I think that if you click on a GPU activity, NSight Systems will show you the corresponding CUDA API call).
It may turn out to be not that hard. I believe that you only need a softmax kernel, a layer norm kernel, and whatever activation function is used between the FFN layers in the model you choose. But even in a "what-if" scenario, I personally don't think it would be much fun. My impression is that you will basically end up with a very stripped down version of FasterTransformer/ONNX Runtime, and at that point there is no good reason not to just use one of those. I might be wrong, but so far my understanding is that the philosophy of the project is different.

If we look at things this way, this actually sounds pretty convincing:
If we allow ourselves to use CUTLASS (which is a big if, since it's a large dependency), this is pretty doable (e.g., see this discussion) and should give a nice speedup.
Wait, what? Could you please clarify how exactly?
Yes, I think this was wrong; perhaps this will work:

```c
// Set up CUDA streams and events
cudaStream_t *compStreams, *copyStreams;
cudaEvent_t *events;
// Set up matrices A, B, and C using cuBLAS functions
// ...
// Set up variables for subset computation and copy
int subsetSize = 100;            // Size of each subset of rows
int numSubsets = m / subsetSize; // Total number of subsets
int offset = 0;                  // Offset to current subset of rows in matrix C
compStreams = new cudaStream_t[numSubsets];
copyStreams = new cudaStream_t[numSubsets];
events = new cudaEvent_t[numSubsets];
for (int i = 0; i < numSubsets; i++) {
    cudaStreamCreate(&compStreams[i]);
    cudaStreamCreate(&copyStreams[i]);
    cudaEventCreate(&events[i]);
}
// Loop over all subsets of rows
for (int i = 0; i < numSubsets; i++) {
    // Set the computation stream for cublasSgemm
    cublasSetStream(handle, compStreams[i]);
    // Compute subset of rows in matrix C using cublasSgemm
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, subsetSize, n, k, &alpha,
                &d_A[offset], lda, &d_B[0], ldb, &beta, &d_C[offset], ldc);
    // Record event after the computation of this subset is enqueued
    cudaEventRecord(events[i], compStreams[i]);
    // Make the copy stream wait for this subset's GEMM to finish
    cudaStreamWaitEvent(copyStreams[i], events[i], 0);
    // Copy computed subset of rows from device to host asynchronously
    // (h_C should be pinned memory for the copy to be truly asynchronous)
    cudaMemcpyAsync(&h_C[offset], &d_C[offset], subsetSize * n * sizeof(float),
                    cudaMemcpyDeviceToHost, copyStreams[i]);
    // Update subset offset
    offset += subsetSize * n;
}
// Wait for all asynchronous operations to complete
cudaDeviceSynchronize();
// Clean up the streams and events
for (int i = 0; i < numSubsets; i++) {
    cudaStreamDestroy(compStreams[i]);
    cudaStreamDestroy(copyStreams[i]);
    cudaEventDestroy(events[i]);
}
delete[] compStreams;
delete[] copyStreams;
delete[] events;
```

It explicitly synchronizes the copy of each subset of rows with a per-subset cudaEvent_t.
I thought about doing the f16→f32 conversion on the GPU with a simple kernel; after that a normal Sgemm could be done.

The conversion on the CPU was done because sgemm doesn't support f16 inputs, and for OpenBLAS it would have meant copying the matrix to temporary memory. But for the GPU we are copying anyway, so I think it could be worth it. Regarding the Strided/Batched Gemm for tensors, Nvidia mentions this in the docs.
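As a hedged sketch (not the docs excerpt, and not existing llama.cpp code), this is roughly what batching the per-head attention multiplications into a single call could look like with cublasSgemmStridedBatched; the dimensions and memory layout are illustrative assumptions:

```c
// Hypothetical use of cublasSgemmStridedBatched to multiply all attention
// heads in one call instead of looping over them. Assumes the per-head
// matrices are stored contiguously, one after another, in device memory.
#include <cublas_v2.h>

cublasStatus_t batched_head_matmul(
        cublasHandle_t handle,
        const float * d_Q,   // [n_heads][m x k]
        const float * d_K,   // [n_heads][k x n]
        float       * d_QK,  // [n_heads][m x n]
        int m, int n, int k, int n_heads) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    // One GEMM per head, launched as a single batched call.
    return cublasSgemmStridedBatched(
        handle, CUBLAS_OP_N, CUBLAS_OP_N,
        m, n, k,
        &alpha,
        d_Q,  m, (long long)m * k,   // stride between consecutive heads of Q
        d_K,  k, (long long)k * n,   // stride between consecutive heads of K
        &beta,
        d_QK, m, (long long)m * n,   // stride between consecutive heads of the result
        n_heads);
}
```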
I don't know how many streams we would need. Creating a stream is expensive; when I did my hipBLAS code a month ago I didn't know this, and I gave up when it was too slow because I was creating and destroying a stream for every multiplication.
I tried using a different stream for each of the 32 mat muls, but didn't notice any improvement in the performance. That's with pre-initialized streams, so it shouldn't have the overhead of creating and destroying the streams in each mat mul.

However, I did observe a minor improvement in performance by doing the mat muls in f32 instead of f16. My GPU is indeed an Ampere RTX 3080, but there may be other factors. I have not looked at the difference in nsight yet (it doesn't work under WSL2), but it may simply be that converting f16 to f32 on the CPU is faster than converting f32 to f16. The f16 matrix is smaller too, so there is less data to convert that way. This conversion is necessary because the original data types for the mat mul are f16 x f32, and cuBLAS cannot do that directly.

Overall, after all the changes, I am seeing a speedup of 60-70% in the perplexity times. I am going to start opening PRs with each of the changes individually to make it easier to review. Some of the changes may cause some friction; in particular, implementing host memory pinning and the weight cache may require changes to ggml that are dangerously close to adding GPU specific details to the ggml interface, which is not what we want (see this comment).
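For context on the host memory pinning mentioned above, here is a minimal hedged sketch of the idea (not the actual PR): registering an existing host buffer as pinned so that cudaMemcpyAsync transfers to and from it can actually run asynchronously and at full PCIe speed.

```c
// Hypothetical sketch: pin an already-allocated host buffer (for example a
// malloc'd or mmap'd weight/activation buffer) so async copies do not fall
// back to synchronous, staged transfers.
#include <cuda_runtime.h>
#include <stdio.h>

static int pin_host_buffer(void * ptr, size_t size) {
    cudaError_t err = cudaHostRegister(ptr, size, cudaHostRegisterDefault);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostRegister failed: %s\n", cudaGetErrorString(err));
        return -1; // transfers still work, just slower and synchronous
    }
    return 0;
}

static void unpin_host_buffer(void * ptr) {
    cudaHostUnregister(ptr);
}
```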
What about converting f16→f32 on the GPU? It would be less data to copy.
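A conversion kernel like that is only a few lines. Here is a rough sketch, assuming the f16 data is already in device memory; the names are illustrative:

```c
// Hypothetical element-wise f16 -> f32 conversion kernel, so only the smaller
// f16 buffer has to cross the PCIe bus and a regular Sgemm can be used.
#include <cuda_fp16.h>

__global__ void convert_f16_to_f32(const __half * x, float * y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = __half2float(x[i]);
    }
}

// Launch example: one thread per element, on the same stream as the GEMM.
// convert_f16_to_f32<<<(n + 255) / 256, 256, 0, stream>>>(d_x_f16, d_x_f32, n);
```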
@slaren Since you and @SlyEcho have the cuBLAS stuff properly covered, I decided to take a look at the quantization kernels in #1221. It turns out they can be optimized quite a bit without introducing complexity (the code arguably becomes even simpler). If you are interested, I can make a real PR out of the draft I posted. (At this point, the discussion in this issue is entirely unrelated to the initial issue description, but I'm not sure where else we should talk about cuBLAS/CUDA-related stuff.)
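For readers unfamiliar with what those kernels do, here is a rough sketch of a q4_0 dequantization kernel, assuming the block layout of a float scale followed by 32 packed 4-bit values; it is for illustration only and is not the code from the linked draft:

```c
// Illustrative q4_0 dequantization: one thread handles one block of 32 values.
#include <stdint.h>

#define QK4_0 32

typedef struct {
    float   d;              // scale
    uint8_t qs[QK4_0 / 2];  // 32 quantized values, two 4-bit nibbles per byte
} block_q4_0;

__global__ void dequantize_q4_0(const block_q4_0 * x, float * y, int nb) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= nb) {
        return;
    }
    const float d = x[b].d;
    for (int j = 0; j < QK4_0 / 2; j++) {
        const uint8_t q = x[b].qs[j];
        y[b * QK4_0 + 2 * j + 0] = ((q & 0x0F) - 8) * d; // low nibble
        y[b * QK4_0 + 2 * j + 1] = ((q >> 4)   - 8) * d; // high nibble
    }
}
```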
This issue was closed because it has been inactive for 14 days since being marked as stale.
Performance with cuBLAS isn't there yet; it is more of a burden than a speedup with llama eval in my tests.
In a simple benchmark case it is absolutely amazing: multiplying 10 million elements in F32 goes from 1+ seconds down to 20 milliseconds. So the improvement is a blast!
But in the llama case the overhead seems to be enormous: when enabling it generically, the average computation time shoots from 300 ms up to 1500 ms using cuBLAS.
I feel like the memory should have been prepared beforehand, and I don't think the thousands of CUDA cores are being used.
Is that loop really the best way to do it?
ggml_compute_forward_mul_mat_use_blas() is evaluated too often; I think it would be better to make a generic "check" function that uses the node operation type as input.
I'm experimenting with another approach for that case, not finished yet, but I think we need to calculate the FLOPs required.
Then the function is called too often (init, compute, finalize); I'm just modifying it to set a flag in the tensor instead, but that's just experimental atm.
There is a bug in ggml.c which causes the matrix multiplication to be executed once by every thread in CUDA when we have a 32-bit input and output.
I think it should rather be like this: n_tasks should be reduced to 1 in that case (see the sketch below).
If n_tasks is not reduced to 1, the threads other than the one associated with CUDA are still dispatched; it currently does not stop each thread from using CUDA (and surprisingly there are no crashes).
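A hedged sketch of the kind of change being described, assuming the task planner in ggml.c sets node->n_tasks per operation; the helper name is made up and the exact surrounding code may differ:

```c
// Sketch meant to live inside ggml.c: schedule only one task for a mat mul
// node if the BLAS/cuBLAS path will be taken, since the other threads would
// return immediately anyway.
static int mul_mat_n_tasks(const struct ggml_tensor * node, int n_threads) {
#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS) || defined(GGML_USE_CUBLAS)
    if (ggml_compute_forward_mul_mat_use_blas(node->src0, node->src1, node)) {
        return 1; // a single thread drives the BLAS/GPU call
    }
#endif
    return n_threads;
}
```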
Finally: deciding based on the required FLOPs, taking the memory transfer time into account via a quick local benchmark at startup, might be a good generic solution to decide whether CUDA should be enabled (a rough sketch follows).
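A hedged sketch of such a decision helper, with made-up names and fields; the throughput and bandwidth numbers would come from a small benchmark run at startup rather than being hard-coded:

```c
// Hypothetical decision helper: offload a mat mul only if the estimated GPU
// time (transfer + compute) beats the estimated CPU time. All rates are
// measured once at startup by a small local benchmark.
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

struct perf_model {
    double cpu_gflops;            // measured CPU mat mul throughput
    double gpu_gflops;            // measured GPU (cuBLAS) throughput
    double pcie_gb_per_s;         // measured host<->device copy bandwidth
    double gpu_launch_overhead_s; // fixed per-call overhead
};

static bool should_use_cuda(const struct perf_model * p,
                            int64_t m, int64_t n, int64_t k, size_t bytes_to_copy) {
    const double flops    = 2.0 * (double)m * (double)n * (double)k;
    const double cpu_time = flops / (p->cpu_gflops * 1e9);
    const double gpu_time = p->gpu_launch_overhead_s
                          + (double)bytes_to_copy / (p->pcie_gb_per_s * 1e9)
                          + flops / (p->gpu_gflops * 1e9);
    return gpu_time < cpu_time;
}
```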
But for our use case we know beforehand if we need it; every layer is known and the sizes are known.
So that flag in the tensor struct should be set manually (I currently use -1, 1 and 0: -1 = generic, 1 = force CUDA if possible, 0 = disable CUDA).
There should be a way to make full use of the CUDA cores; I have no experience with it (yet) but I'm sure we can get the multiplication through a lot faster.
It looks like the 30+ attention heads are calculated in sequence; CUDA can do that in one shot.
Also, when CUDA is enabled we need to make sure that the big matrix tensors are laid out properly in memory for CUDA, or they will be skipped. I didn't get to that point yet, still fighting ;)