I noticed that using cuBLAS with the F16 model does not give any benefit compared to the non-BLAS CPU-only mode. In contrast, when using a quantized model, the cuBLAS run is significantly faster.
Is this expected?
I was hoping to have some performance improvement for F16 as well.
Maybe the data transfer is very slow for F16 and it defeats the purpose of offloading to the GPU?
I noticed this after porting the latest ggml to whisper.cpp, where we use F16 precision, and was surprised that cuBLAS does not bring any improvement.
For example, some time ago I tried using NVBLAS in whisper.cpp and it did bring some decent improvements: ggerganov/whisper.cpp#220 (comment)
The NVBLAS code change was very trivial: ggerganov/whisper.cpp#239
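For reference, NVBLAS is driven by a small config file (pointed to by the NVBLAS_CONFIG_FILE environment variable) and preloaded in front of a CPU BLAS, so no code changes are needed beyond linking. Something along these lines, with placeholder paths (the keys are from NVIDIA's NVBLAS documentation):

```
# nvblas.conf -- paths below are placeholders for illustration
NVBLAS_LOGFILE nvblas.log
# CPU BLAS to fall back on for small problems
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
# route eligible BLAS-3 calls to all visible GPUs
NVBLAS_GPU_LIST ALL
# pin host buffers up to this size for faster transfers
NVBLAS_AUTOPIN_MEM_SIZE 1000000
```

The library is then activated at run time, e.g. via LD_PRELOAD of libnvblas.so, and intercepts large sgemm/dgemm calls while leaving everything else on the CPU.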
What could NVBLAS be doing better in this case?
F16 used to be the fastest before dequantization on the GPU was implemented: #1044
With the current master, it is still faster than it was originally, so I don't think that there has been a regression: 3.50 seconds per pass - ETA 38 minutes
I don't know why this isn't the case with your GTX 1660. From what I could find, it is a Turing chip that can do FP16.
Thanks - it seems the problem is somehow specific to the GeForce GTX 1660.
Ran the same test on GeForce RTX 4080 and there is significant improvement.
Also, whisper.cpp is much faster with cuBLAS.
I think the NVBLAS test that I did before was on a GeForce RTX 2060.