
ggml : Q4_2 ARM #1046

Merged
merged 5 commits into from Apr 18, 2023
Conversation

ggerganov
Member

@ggerganov ggerganov commented Apr 18, 2023

ref #959

This is a reimplementation of #1026, introducing a new quantization type: Q4_2

This PR implements only ARM NEON. The plan is to merge this soon and add the rest of the SIMD implementations.
For now there is no need for SIMD quantize / dequantize - it will be added later when needed.
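
For reference, here is a minimal sketch of what a Q4_2 block and its scalar reference quantization could look like, assuming a Q4_0-like scheme with a 16-weight block and an F16 scale. The names (QK4_2, block_q4_2, quantize_row_q4_2_ref, fp32_to_fp16) and the exact rounding are illustrative assumptions, not necessarily the code in this PR.

// Sketch only (assumed layout and names): Q4_2 as a Q4_0-like format with a
// 16-weight block and an F16 scale factor.
#include <math.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t ggml_fp16_t;        // F16 stored as raw bits, as in ggml
#define QK4_2 16

typedef struct {
    ggml_fp16_t d;                   // scale factor (F16)
    uint8_t     qs[QK4_2 / 2];       // 16 x 4-bit quants, two per byte
} block_q4_2;

// float -> F16 via the ARM __fp16 type; ggml has its own conversion helpers
static ggml_fp16_t fp32_to_fp16(float f) {
    __fp16 h = (__fp16) f;
    ggml_fp16_t bits;
    memcpy(&bits, &h, sizeof(bits));
    return bits;
}

// Scalar reference quantization of a single block (no SIMD, per the note above)
static void quantize_row_q4_2_ref(const float * x, block_q4_2 * y) {
    float amax = 0.0f;
    for (int i = 0; i < QK4_2; i++) {
        amax = fmaxf(amax, fabsf(x[i]));
    }

    const float d  = amax / 7.0f;                  // map [-amax, amax] to roughly [-7, 7]
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    y->d = fp32_to_fp16(d);

    for (int i = 0; i < QK4_2 / 2; i++) {
        const uint8_t v0 = (uint8_t)(roundf(x[2*i + 0]*id) + 8.0f);  // offset to an unsigned nibble
        const uint8_t v1 = (uint8_t)(roundf(x[2*i + 1]*id) + 8.0f);
        y->qs[i] = v0 | (v1 << 4);
    }
}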

@ggerganov ggerganov marked this pull request as ready for review April 18, 2023 18:25
@ggerganov ggerganov added the generation quality Quality of model output label Apr 18, 2023
@ggerganov ggerganov requested a review from sw April 18, 2023 18:26
Collaborator

@prusnak prusnak left a comment

One nitpick: the debug output contains:

llama_model_load_internal: ftype      = 5 (unknown, may not work)

Fix:

diff --git a/llama.cpp b/llama.cpp
index ef8ee20..dd970f7 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -838,6 +838,7 @@ static const char *llama_ftype_name(enum llama_ftype ftype) {
         case LLAMA_FTYPE_MOSTLY_F16:  return "mostly F16";
         case LLAMA_FTYPE_MOSTLY_Q4_0: return "mostly Q4_0";
         case LLAMA_FTYPE_MOSTLY_Q4_1: return "mostly Q4_1";
+        case LLAMA_FTYPE_MOSTLY_Q4_2: return "mostly Q4_2";
         case LLAMA_FTYPE_MOSTLY_Q4_1_SOME_F16:
                                       return "mostly Q4_1, some F16";
         default:                      return "unknown, may not work";

@prusnak
Collaborator

prusnak commented Apr 18, 2023

Benchmark on Macbook M1 16 GB:

7B q4_0: 75 ms/token
7B q4_2: 105 ms/token

@ggerganov
Member Author

Benchmark on Macbook M1 16 GB:

7B q4_0: 75 ms/token
7B q4_2: 105 ms/token

I guess this is with 4 threads?
I have been using only 8 threads and didn't realize that the slowdown is bigger with fewer threads.

@sw
Contributor

sw commented Apr 18, 2023

This should probably use ggml_is_quantized as well:
https://github.com/ggerganov/llama.cpp/blob/99092f2f21809fa4a2b68f0c16b0607410ed4bb1/ggml.c#L10720

Otherwise, looking great! I get 1s/token :-(
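
For illustration, here is a minimal sketch of the pattern being suggested; the wrapper function names are hypothetical, and only ggml_is_quantized() and the type enums come from ggml:

#include <stdbool.h>
#include "ggml.h"   // enum ggml_type, ggml_is_quantized()

// Hypothetical helper for illustration: enumerating the quantized types at
// every branch point has to be updated for each new format ...
static bool is_quantized_explicit(enum ggml_type type) {
    return type == GGML_TYPE_Q4_0 ||
           type == GGML_TYPE_Q4_1 ||
           type == GGML_TYPE_Q4_2;   // easy to forget when the next type is added
}

// ... whereas ggml_is_quantized() centralizes that knowledge in one place:
static bool is_quantized_central(enum ggml_type type) {
    return ggml_is_quantized(type);
}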

@prusnak
Collaborator

prusnak commented Apr 18, 2023

I guess this is with 4 threads?

Yes, 8 threads are about twice as slow on M1 (which has 4 performance cores and 4 efficiency cores).

More tests on M1 with the speed up in f30dbf9:

7B q4_0 4 threads: 75 ms/token
7B q4_0 8 threads: 135 ms/token
7B q4_1 4 threads: 122 ms/token
7B q4_1 8 threads: 240 ms/token
7B q4_2 4 threads: 89 ms/token
7B q4_2 8 threads: 180 ms/token

This is great!

@ggerganov
Member Author

ggerganov commented Apr 18, 2023

I guess this is with 4 threads?

Yes, 8 threads are about twice as slow on M1 (which has 4 performance cores and 4 efficiency cores).

More tests on M1 with the speed up in f30dbf9:

7B q4_0 4 threads: 75 ms/token
7B q4_0 8 threads: 135 ms/token
7B q4_1 4 threads: 122 ms/token
7B q4_1 8 threads: 240 ms/token
7B q4_2 4 threads: 89 ms/token
7B q4_2 8 threads: 180 ms/token

This is great!

Try again with the latest commit (ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32).

I just found that vmlaq_n_f32 can be used to speed things up significantly.
Also optimized Q4_1 to ~56 ms/token on M1 Pro (branch: q4_1xq8_0), which is pretty good and only slightly slower than Q4_0 at ~48 ms/token.

I have a feeling that the next Q4_3 quantization (i.e. Q4_1 but with F16 factors) will be able to evaluate at ~60 ms/token, and hopefully the perplexity will be very close to full F16 (i.e. ~6.00 for 7B), which would be just perfect.
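
For illustration, a minimal sketch of what the intrinsic buys in a kernel like this (hypothetical helper names, not the exact ggml.c code): the per-block scale is a scalar, so the multiply-by-scalar and the accumulation can be fused into one instruction.

#include <arm_neon.h>

// Before: multiply by the scalar scale, then add - two NEON ops per block
static inline float32x4_t acc_scaled_mul_add(float32x4_t acc, float32x4_t prod, float scale) {
    return vaddq_f32(acc, vmulq_n_f32(prod, scale));
}

// After: fused multiply-accumulate with a scalar operand - one op per block
static inline float32x4_t acc_scaled_mla(float32x4_t acc, float32x4_t prod, float scale) {
    return vmlaq_n_f32(acc, prod, scale);   // acc + prod * scale
}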

@ggerganov ggerganov merged commit 77a7340 into master Apr 18, 2023
@ggerganov ggerganov deleted the q4_2-arm branch April 18, 2023 20:55
@prusnak
Collaborator

prusnak commented Apr 18, 2023

Post-merge tests (from master 77a7340):

I see no significant change from earlier tests for 7B q4_2 4 threads

7B q4_2 4 threads: 89 ms/token

But I see a small improvement for 7B q4_2 8 threads:

7B q4_2 8 threads: 180 -> 173 ms/token

@sw sw mentioned this pull request Apr 19, 2023
@ggerganov ggerganov self-assigned this Apr 22, 2023