ggml : Q4_2 ARM #1046
Conversation
One nitpick: the debug log contains:
llama_model_load_internal: ftype = 5 (unknown, may not work)
Fix:
diff --git a/llama.cpp b/llama.cpp
index ef8ee20..dd970f7 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -838,6 +838,7 @@ static const char *llama_ftype_name(enum llama_ftype ftype) {
case LLAMA_FTYPE_MOSTLY_F16: return "mostly F16";
case LLAMA_FTYPE_MOSTLY_Q4_0: return "mostly Q4_0";
case LLAMA_FTYPE_MOSTLY_Q4_1: return "mostly Q4_1";
+ case LLAMA_FTYPE_MOSTLY_Q4_2: return "mostly Q4_2";
case LLAMA_FTYPE_MOSTLY_Q4_1_SOME_F16:
return "mostly Q4_1, some F16";
default: return "unknown, may not work";
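For context, the ftype = 5 in the log line maps to the new Q4_2 entry. A sketch of the assumed enum ordering (the authoritative definition lives in llama.h):

```c
// Assumed ordering of llama_ftype values at the time of this PR; the
// authoritative definition is in llama.h. ftype = 5 in the debug log
// maps to LLAMA_FTYPE_MOSTLY_Q4_2, which llama_ftype_name() did not yet
// handle, hence "unknown, may not work".
enum llama_ftype {
    LLAMA_FTYPE_ALL_F32              = 0,
    LLAMA_FTYPE_MOSTLY_F16           = 1,
    LLAMA_FTYPE_MOSTLY_Q4_0          = 2,
    LLAMA_FTYPE_MOSTLY_Q4_1          = 3,
    LLAMA_FTYPE_MOSTLY_Q4_1_SOME_F16 = 4,
    LLAMA_FTYPE_MOSTLY_Q4_2          = 5, // added by this PR
};
```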
Benchmark on MacBook M1 16 GB: 7B q4_0: 75 ms/token
I guess this is with 4 threads?
This should probably use […]. Otherwise, looking great! I get 1 s/token :-(
Yes, 8 threads is around 2x slower on M1 (which has 4 performance cores and 4 efficiency cores).

More tests on M1 with the speed-up in f30dbf9:

7B q4_0, 4 threads: 75 ms/token

This is great!
- 4 threads: ~100 ms -> ~90 ms
- 8 threads: ~55 ms -> ~50 ms
Try again with "ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32". I just found that […] I have a feeling that the next […]
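For context, vmulq_n_f32 multiplies a whole float32x4_t by a scalar, and vmlaq_n_f32 fuses that multiply with an accumulate (acc + v * s), avoiding an explicit vdupq_n_f32 broadcast. A minimal standalone sketch of what these intrinsics do; the values and variable names here are invented for illustration, not the PR's actual kernel:

```c
#include <arm_neon.h>
#include <stdio.h>

int main(void) {
    // Pretend these are partial sums produced for one quantized block,
    // and d is that block's scale.
    const float data[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    const float32x4_t sums = vld1q_f32(data);
    const float d = 0.5f;

    // Scale the whole vector by the scalar: scaled = sums * d.
    float32x4_t scaled = vmulq_n_f32(sums, d);

    // Fused form: acc = acc + sums * d, with the scalar broadcast
    // by the instruction itself (no separate vdupq_n_f32 needed).
    float32x4_t acc = vdupq_n_f32(0.0f);
    acc = vmlaq_n_f32(acc, sums, d);

    float a[4], s[4];
    vst1q_f32(a, acc);
    vst1q_f32(s, scaled);
    printf("scaled: %.2f %.2f %.2f %.2f\n", s[0], s[1], s[2], s[3]);
    printf("acc:    %.2f %.2f %.2f %.2f\n", a[0], a[1], a[2], a[3]);
    return 0;
}
```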
Post-merge tests (from master 77a7340):

I see no significant change from earlier tests for 7B q4_2, 4 threads: 89 ms/token

But I see a small improvement for 7B q4_2, 8 threads: 180 -> 173 ms/token
ref #959

This is a reimplementation of #1026, introducing the new quantization type Q4_2.

This PR implements only the ARM NEON path. The plan is to merge this soon and add the rest of the SIMD implementations. For now there is no need for SIMD quantize/dequantize; these will be added later when needed. A rough sketch of the idea follows below.
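As a rough illustration of a block-quantized 4-bit format, here is a scalar sketch. The block size, field types, and rounding details are assumptions (the authoritative definitions live in ggml.c), and a plain float is used for the scale where ggml stores fp16:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define QK4_2 16 // assumed block size; the real constant is defined in ggml.c

// Sketch of a Q4_2-style block: a per-block scale plus 16 4-bit quants,
// packed two per byte.
typedef struct {
    float   d;              // block scale (ggml uses fp16 here)
    uint8_t qs[QK4_2 / 2];  // nibbles: two 4-bit quants per byte
} block_q4_2_sketch;

// Scalar reference quantization: map each weight to [0, 15] so that the
// block's absolute maximum lands at the edge of the 4-bit range.
static void quantize_block(const float *x, block_q4_2_sketch *y) {
    float amax = 0.0f;
    for (int i = 0; i < QK4_2; i++) {
        const float v = fabsf(x[i]);
        if (v > amax) amax = v;
    }
    const float d  = amax / 8.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    y->d = d;
    for (int i = 0; i < QK4_2 / 2; i++) {
        int q0 = (int)roundf(x[2*i + 0] * id) + 8; // shift into [0, 16]
        int q1 = (int)roundf(x[2*i + 1] * id) + 8;
        if (q0 < 0) q0 = 0; if (q0 > 15) q0 = 15;  // clamp the +amax corner case
        if (q1 < 0) q1 = 0; if (q1 > 15) q1 = 15;
        y->qs[i] = (uint8_t)(q0 | (q1 << 4));
    }
}

// Scalar reference dequantization: undo the shift and rescale.
static void dequantize_block(const block_q4_2_sketch *y, float *x) {
    for (int i = 0; i < QK4_2 / 2; i++) {
        x[2*i + 0] = y->d * ((int)(y->qs[i] & 0x0F) - 8);
        x[2*i + 1] = y->d * ((int)(y->qs[i] >> 4)   - 8);
    }
}

int main(void) {
    float in[QK4_2], out[QK4_2];
    for (int i = 0; i < QK4_2; i++) in[i] = sinf((float)i); // toy data

    block_q4_2_sketch blk;
    quantize_block(in, &blk);
    dequantize_block(&blk, out);

    for (int i = 0; i < QK4_2; i++) {
        printf("%8.4f -> %8.4f\n", in[i], out[i]);
    }
    return 0;
}
```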