Investigate alternative ggml_compute_forward_mul_mat_q_f32() implementation #909
Comments
I guess you could simply have the This is at least not catastrophically worse in terms of performance for text generation. But perplexity ETA went from 12 hours to 48 hours on my slowpoke Intel Core i3.
Are you using all the cores, and is the code fully vectorized? I'm familiar with AVX but not NEON. With AVX, at least, it would likely be possible to unpack the quantized values to floats in a vectorized way using broadcast and shuffle operations. Depending on whether the bottleneck is compute or memory bandwidth, it could be worth doing the compute in int16 space, etc.
@sw and all: I just added a new potential mode (E) for investigation - I have a very good feeling about it, but I am not sure if I will get to playing with this idea soon. Feel free to investigate if you have the time.
Trying it out, seems like a no-brainer (except for the higher memory use for master...sw:llama.cpp:mulmat-q8 (AVX2/AVX/scalar only) If I squint with both eyes, I can even see a tiny speedup for text generation... Perplexity:
Yes, (E) is the way 🦙 ! I just implemented an ARM_NEON SIMD version and the initial perplexity gains are similar to your observations. See #951. Edit: interestingly, the perplexity calculation does become faster for some reason:
Btw, now I am very curious how the 2-bit quantized model will behave using the 8-bit intermediate data.
Wow, that's a huge win. Appears to capture nearly all the benefit of running with f32/BLAS! Very exciting :)
`ggml_compute_forward_mul_mat_q_f32()` is the most computationally significant call in the entire transformer evaluation, so we have to be sure that it is running optimally.
It computes the matrix multiplication: `z = x * y`
- `x` is quantized
- `y` is F32
- `z` is F32
Currently, it runs in 2 modes, depending on the tensor shapes:
- (A) `x` is dequantized to F32 and we use `sgemm` to perform the matrix multiplication
- (B) `y` is quantized to 4-bits on-the-fly and we use integer-based dot products to perform the matrix multiplication
The former method is much more accurate than the latter. This can be clearly observed during perplexity computations.
However, during text generation (i.e. batch = 1), it is not feasible to use it - my experience is that there is significant overhead of calling BLAS for smaller tensor shapes, typical for single-token inference calls.
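To make the precision loss of 4-bit on-the-fly quantization concrete, here is a minimal sketch of block-wise 4-bit quantization. It is an illustration under stated assumptions, not the actual `ggml` code: the block size `QK`, the struct `block_q4` and the helper names are hypothetical, and the scheme (one F32 scale per block, `d = max|v| / 7`, quants offset by 8) is assumed for the example.

```c
#include <assert.h>
#include <stdint.h>

#define QK 32  /* values per block (assumed for this sketch) */

/* Hypothetical 4-bit block: one F32 scale plus QK packed nibbles. */
typedef struct {
    float   d;           /* per-block scale */
    uint8_t qs[QK / 2];  /* two 4-bit quants per byte */
} block_q4;

/* Round half away from zero without depending on libm. */
static int round_to_int(float x) {
    return (int) (x < 0.0f ? x - 0.5f : x + 0.5f);
}

/* Quantize QK floats: q = round(v / d) + 8, with d = max|v| / 7. */
static void quantize_block_q4(const float *v, block_q4 *out) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        const float a = v[i] < 0.0f ? -v[i] : v[i];
        if (a > amax) amax = a;
    }
    const float d  = amax / 7.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    out->d = d;
    for (int i = 0; i < QK; i += 2) {
        const int q0 = round_to_int(v[i]     * id) + 8;  /* lands in [1, 15] */
        const int q1 = round_to_int(v[i + 1] * id) + 8;
        assert(q0 >= 0 && q0 < 16 && q1 >= 0 && q1 < 16);
        out->qs[i / 2] = (uint8_t) (q0 | (q1 << 4));
    }
}

/* Recover approximate floats: v = (q - 8) * d. */
static void dequantize_block_q4(const block_q4 *in, float *v) {
    for (int i = 0; i < QK; i += 2) {
        const uint8_t b = in->qs[i / 2];
        v[i]     = (float) ((int) (b & 0x0F) - 8) * in->d;
        v[i + 1] = (float) ((int) (b >> 4)   - 8) * in->d;
    }
}
```

In this scheme each value can be off by up to `d / 2`, i.e. roughly 7% of the largest magnitude in the block. In mode (B) that error is applied to `y` on every call, on top of the error already present in the quantized `x`.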
There are at least two alternative modes of operation that can be explored:
- (C) `x` is dequantized to F32 and we use `ggml_vec_dot_f32()` to perform the multiplication
- (D) `x` is dequantized to F16, `y` is converted to F16 and we use `ggml_vec_dot_f16()` to perform the multiplication
- (E) `y` is quantized on-the-fly to 8-bits and we use a new `ggml` dot-product call that operates on `4-bit x` and `8-bit y`. This call will still unpack `x` into 8-bits as usual and perform the 8-bit dot-product as in the existing routines, but in contrast to (B), `y` will already be unpacked to 8-bits and the precision loss will be significantly lower
To me it is not immediately clear if (C) or (D) would be significantly slower compared to (B), but they should be much more accurate compared to (B) and probably as accurate as (A).
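Mode (E) can be sketched in scalar C as follows. This is only an illustration under assumptions, not the proposed `ggml` routine: the block size `QK`, the structs `block_q4` and `block_q8`, the scale choices (`max|v| / 7` and `max|v| / 127`) and all function names are hypothetical. The point is that the inner loop stays integer-only, while `y` keeps 8 bits of precision instead of 4.

```c
#include <assert.h>
#include <stdint.h>

#define QK 32  /* values per block (assumed for this sketch) */

typedef struct { float d; uint8_t qs[QK / 2]; } block_q4;  /* 4-bit x */
typedef struct { float d; int8_t  qs[QK];     } block_q8;  /* 8-bit y */

/* Round half away from zero without depending on libm. */
static int round_to_int(float x) {
    return (int) (x < 0.0f ? x - 0.5f : x + 0.5f);
}

static float block_amax(const float *v) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        const float a = v[i] < 0.0f ? -v[i] : v[i];
        if (a > amax) amax = a;
    }
    return amax;
}

/* 4-bit quantization of x: q = round(v / d) + 8, d = max|v| / 7. */
static void quantize_block_q4(const float *v, block_q4 *out) {
    const float d  = block_amax(v) / 7.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    out->d = d;
    for (int i = 0; i < QK; i += 2) {
        const int q0 = round_to_int(v[i]     * id) + 8;
        const int q1 = round_to_int(v[i + 1] * id) + 8;
        out->qs[i / 2] = (uint8_t) (q0 | (q1 << 4));
    }
}

/* On-the-fly 8-bit quantization of y: q = round(v / d), d = max|v| / 127. */
static void quantize_block_q8(const float *v, block_q8 *out) {
    const float d  = block_amax(v) / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    out->d = d;
    for (int i = 0; i < QK; i++) {
        out->qs[i] = (int8_t) round_to_int(v[i] * id);
    }
}

/* Dot product of n values: integer multiply-adds within each block,
 * one floating-point scale multiplication per block. */
static float vec_dot_q4_q8(int n, const block_q4 *x, const block_q8 *y) {
    assert(n % QK == 0);
    float sum = 0.0f;
    for (int b = 0; b < n / QK; b++) {
        int32_t isum = 0;
        for (int i = 0; i < QK; i += 2) {
            const uint8_t p = x[b].qs[i / 2];
            isum += ((int) (p & 0x0F) - 8) * y[b].qs[i];      /* low nibble  */
            isum += ((int) (p >> 4)   - 8) * y[b].qs[i + 1];  /* high nibble */
        }
        sum += x[b].d * y[b].d * (float) isum;
    }
    return sum;
}
```

Compared with a 4-bit by 4-bit kernel, the only extra cost in this sketch is the larger intermediate buffer for `y` (one byte per value instead of half a byte); the number of integer multiply-adds is unchanged.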
I think one has to be careful and choose the respective mode based on the tensor shapes, trying to find a good balance between speed and accuracy. Ideally, I am hoping that after this investigation we will achieve a noticeable perplexity gain without using BLAS, at the cost of a slightly slower single-token (i.e. batch = 1) computation.
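A shape-based choice could look roughly like the sketch below. Everything in it is hypothetical: the enum, the function name and the `min_dim` threshold are made up for illustration, and the threshold is left as a parameter because the right value would have to be found by benchmarking.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    MUL_MAT_DEQUANT_SGEMM,  /* mode (A): dequantize x, call BLAS sgemm    */
    MUL_MAT_QUANT_DOT       /* modes (B)/(E): quantize y, integer dot-products */
} mul_mat_mode;

/* Pick a mode from the tensor shapes. BLAS is only chosen when every
 * dimension is at least min_dim, so that the call overhead is amortized;
 * single-token inference (n_batch == 1) always takes the dot-product path
 * for any min_dim > 1. */
static mul_mat_mode choose_mul_mat_mode(int rows_x, int cols, int n_batch,
                                        int min_dim, bool have_blas) {
    assert(rows_x > 0 && cols > 0 && n_batch > 0 && min_dim > 0);
    const bool large = rows_x >= min_dim && cols >= min_dim && n_batch >= min_dim;
    return (have_blas && large) ? MUL_MAT_DEQUANT_SGEMM : MUL_MAT_QUANT_DOT;
}
```

For example, `choose_mul_mat_mode(4096, 4096, 1, 32, true)` selects the dot-product path, while the same call with `n_batch = 512` selects the BLAS path.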
Edit: after the analysis and discussion in #896 I added a new mode (E) which I think is very important to be explored. Unless I am missing something, I believe this mode can be exactly as efficient as (B), but with significantly higher accuracy. Much higher than what can be achieved via improving the quantization RMS.
So I believe we have to investigate this with very high priority.