
ggml : Q4_2 ARM #1046

Merged
merged 5 commits into from Apr 18, 2023
Conversation

ggerganov
Member

@ggerganov ggerganov commented Apr 18, 2023

ref #959

This is a reimplementation of #1026, introducing a new quantization type: Q4_2

This PR implements only ARM NEON. The plan is to merge this soon and add the rest of the SIMD implementations.
For now there is no need for SIMD quantize / dequantize - it will be added later when needed.
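
For reference, here is a minimal sketch of what a Q4_2 block and its scalar reference quantization could look like, assuming a Q4_0-like scheme with a 16-weight block and an F16 scale. The names (QK4_2, block_q4_2, quantize_row_q4_2_ref, fp32_to_fp16) and the exact rounding are illustrative assumptions, not necessarily the code in this PR.

// Sketch only (assumed layout and names): Q4_2 as a Q4_0-like format with a
// 16-weight block and an F16 scale factor.
#include <math.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t ggml_fp16_t;        // F16 stored as raw bits, as in ggml
#define QK4_2 16

typedef struct {
    ggml_fp16_t d;                   // scale factor (F16)
    uint8_t     qs[QK4_2 / 2];       // 16 x 4-bit quants, two per byte
} block_q4_2;

// float -> F16 via the ARM __fp16 type; ggml has its own conversion helpers
static ggml_fp16_t fp32_to_fp16(float f) {
    __fp16 h = (__fp16) f;
    ggml_fp16_t bits;
    memcpy(&bits, &h, sizeof(bits));
    return bits;
}

// Scalar reference quantization of a single block (no SIMD, per the note above)
static void quantize_row_q4_2_ref(const float * x, block_q4_2 * y) {
    float amax = 0.0f;
    for (int i = 0; i < QK4_2; i++) {
        amax = fmaxf(amax, fabsf(x[i]));
    }

    const float d  = amax / 7.0f;                  // map [-amax, amax] to roughly [-7, 7]
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    y->d = fp32_to_fp16(d);

    for (int i = 0; i < QK4_2 / 2; i++) {
        const uint8_t v0 = (uint8_t)(roundf(x[2*i + 0]*id) + 8.0f);  // offset to an unsigned nibble
        const uint8_t v1 = (uint8_t)(roundf(x[2*i + 1]*id) + 8.0f);
        y->qs[i] = v0 | (v1 << 4);
    }
}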

@ggerganov ggerganov marked this pull request as ready for review April 18, 2023 18:25
@ggerganov ggerganov added the generation quality Quality of model output label Apr 18, 2023
@ggerganov ggerganov requested a review from sw April 18, 2023 18:26
Collaborator

@prusnak prusnak left a comment

One nitpick: the debug output contains:

llama_model_load_internal: ftype      = 5 (unknown, may not work)

Fix:

diff --git a/llama.cpp b/llama.cpp
index ef8ee20..dd970f7 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -838,6 +838,7 @@ static const char *llama_ftype_name(enum llama_ftype ftype) {
         case LLAMA_FTYPE_MOSTLY_F16:  return "mostly F16";
         case LLAMA_FTYPE_MOSTLY_Q4_0: return "mostly Q4_0";
         case LLAMA_FTYPE_MOSTLY_Q4_1: return "mostly Q4_1";
+        case LLAMA_FTYPE_MOSTLY_Q4_2: return "mostly Q4_2";
         case LLAMA_FTYPE_MOSTLY_Q4_1_SOME_F16:
                                       return "mostly Q4_1, some F16";
         default:                      return "unknown, may not work";

@prusnak
Collaborator

prusnak commented Apr 18, 2023

Benchmark on Macbook M1 16 GB:

7B q4_0: 75 ms/token
7B q4_2: 105 ms/token

@ggerganov
Member Author

Benchmark on Macbook M1 16 GB:

7B q4_0: 75 ms/token
7B q4_2: 105 ms/token

I guess this is with 4 threads?
I have been using only 8 threads and didn't realize that the slowdown is bigger with fewer threads.

@sw
Contributor

sw commented Apr 18, 2023

This should probably use ggml_is_quantized as well:
https://github.com/ggerganov/llama.cpp/blob/99092f2f21809fa4a2b68f0c16b0607410ed4bb1/ggml.c#L10720

Otherwise, looking great! I get 1s/token :-(
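
For illustration, here is a minimal sketch of the pattern being suggested; the wrapper function names are hypothetical, and only ggml_is_quantized() and the type enums come from ggml:

#include <stdbool.h>
#include "ggml.h"   // enum ggml_type, ggml_is_quantized()

// Hypothetical helper for illustration: enumerating the quantized types at
// every branch point has to be updated for each new format ...
static bool is_quantized_explicit(enum ggml_type type) {
    return type == GGML_TYPE_Q4_0 ||
           type == GGML_TYPE_Q4_1 ||
           type == GGML_TYPE_Q4_2;   // easy to forget when the next type is added
}

// ... whereas ggml_is_quantized() centralizes that knowledge in one place:
static bool is_quantized_central(enum ggml_type type) {
    return ggml_is_quantized(type);
}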

@prusnak
Collaborator

prusnak commented Apr 18, 2023

I guess this is with 4 threads?

Yes, 8 threads are about twice as slow on M1 (which has 4 performance cores and 4 efficiency cores).

More tests on M1 with the speed up in f30dbf9:

7B q4_0 4 threads: 75 ms/token
7B q4_0 8 threads: 135 ms/token
7B q4_1 4 threads: 122 ms/token
7B q4_1 8 threads: 240 ms/token
7B q4_2 4 threads: 89 ms/token
7B q4_2 8 threads: 180 ms/token

This is great!

@ggerganov
Member Author

ggerganov commented Apr 18, 2023

I guess this is with 4 threads?

Yes, 8 threads are about twice as slow on M1 (which has 4 performance cores and 4 efficiency cores).

More tests on M1 with the speed up in f30dbf9:

7B q4_0 4 threads: 75 ms/token
7B q4_0 8 threads: 135 ms/token
7B q4_1 4 threads: 122 ms/token
7B q4_1 8 threads: 240 ms/token
7B q4_2 4 threads: 89 ms/token
7B q4_2 8 threads: 180 ms/token

This is great!

Try again with the latest commit (ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32).

I just found that vmlaq_n_f32 can be used to speed things up significantly.
Also optimized Q4_1 to ~56 ms/token on M1 Pro (branch: q4_1xq8_0), which is pretty good and only slightly slower than Q4_0 at ~48 ms/token.

I have a feeling that the next Q4_3 quantization (i.e. Q4_1 but with F16 factors) will be able to evaluate at ~60 ms/token, and hopefully the perplexity will be very close to full F16 (i.e. ~6.00 for 7B), which would be just perfect.
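
For illustration, a minimal sketch of what the intrinsic buys in a kernel like this (hypothetical helper names, not the exact ggml.c code): the per-block scale is a scalar, so the multiply-by-scalar and the accumulation can be fused into one instruction.

#include <arm_neon.h>

// Before: multiply by the scalar scale, then add - two NEON ops per block
static inline float32x4_t acc_scaled_mul_add(float32x4_t acc, float32x4_t prod, float scale) {
    return vaddq_f32(acc, vmulq_n_f32(prod, scale));
}

// After: fused multiply-accumulate with a scalar operand - one op per block
static inline float32x4_t acc_scaled_mla(float32x4_t acc, float32x4_t prod, float scale) {
    return vmlaq_n_f32(acc, prod, scale);   // acc + prod * scale
}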

@ggerganov ggerganov merged commit 77a7340 into master Apr 18, 2023
@ggerganov ggerganov deleted the q4_2-arm branch April 18, 2023 20:55
@prusnak
Collaborator

prusnak commented Apr 18, 2023

Post-merge tests (from master 77a7340):

I see no significant change from earlier tests for 7B q4_2 4 threads

7B q4_2 4 threads: 89 ms/token

But I see a small improvement for 7B q4_2 8 threads:

7B q4_2 8 threads: 180 -> 173 ms/token

@sw sw mentioned this pull request Apr 19, 2023
@ggerganov ggerganov self-assigned this Apr 22, 2023