Continuous layouts for quantization q4_0c #1073

unbounded · 2023-04-19T21:58:03Z

Adds a q4_0c type that corresponds to the q4_0 layout but with a different memory layout.

Draft status, currently only accelerated for AVX-512, will add a PoC of Neon acceleration but wanted to put this out there since there is some experimentation with quantization formats going on now.

The layout consists of all the quantized values first in blocks of 128 nibbles, followed by all the scales.
The nibbles within a block are laid out consecutively in the lower nibbles, and then consecutively in the higher nibbles.
For dot products we use a q8_0c format, with all the qs bytes followed by all the scales.

The big win is for architectures with larger registers like AVX-512, that can now get two continuous blocks of qs by doing roughly

   xqs01  = xqs0123      & 0x0f0f...
   xqs23  = xqs0123 >> 4 & 0x0f0f...

The dot product implementation here borrows from @dfyz's implementation in #933, but becomes simpler because we don't need to do tricks with the byte layout.
Besides the simplified implementation there is also a small improvement in performance:
llama_print_timings: prompt eval time = 665.66 ms / 6 tokens ( 110.94 ms per token)
llama_print_timings: total time = 15398.10 ms
vs
llama_print_timings: prompt eval time = 449.19 ms / 6 tokens ( 74.86 ms per token)
llama_print_timings: total time = 13557.80 ms

The SIMD implementation with 128-bit registers like Neon should look very similar to the current implementations, with similar speeds. Possibly some benefit from doing only aligned loads.
The scalar implementations are slightly more complex but I do not see any degraded performance.

Perplexity should be exactly the same as q4_0.

Example timings (7B)

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

Current master q4_0:

llama_print_timings:        load time =   797.54 ms
llama_print_timings:      sample time =    69.77 ms /   100 runs   (    0.70 ms per run)
llama_print_timings: prompt eval time =   665.66 ms /     6 tokens (  110.94 ms per token)
llama_print_timings:        eval time = 14528.96 ms /    99 runs   (  146.76 ms per run)
llama_print_timings:       total time = 15398.10 ms

q4_0 with #933

llama_print_timings:        load time =   751.74 ms
llama_print_timings:      sample time =    89.38 ms /   100 runs   (    0.89 ms per run)
llama_print_timings: prompt eval time =   620.75 ms /     6 tokens (  103.46 ms per token)
llama_print_timings:        eval time = 13475.35 ms /    99 runs   (  136.11 ms per run)
llama_print_timings:       total time = 14318.72 ms

continuous layout q4_0c:

llama_print_timings:        load time =   596.51 ms
llama_print_timings:      sample time =    73.39 ms /   100 runs   (    0.73 ms per run)
llama_print_timings: prompt eval time =   449.19 ms /     6 tokens (   74.86 ms per token)
llama_print_timings:        eval time = 12885.82 ms /    99 runs   (  130.16 ms per run)
llama_print_timings:       total time = 13557.80 ms

Todos:

AVX-512 acceleration
PoC ARM NEON acceleration
Investigate performance on M1
test out fully aligned loads by aligning the quantized file to 64 bytes

Future improvements:

PoC for q4_2-like quantization
PoC 3-bit quantization
Support row lengths not divisible by 128. I don't see any big obstacles, will require some extra code for the leftover block, and padding of rows to keep alignment.

pubby · 2023-04-20T05:07:32Z

What do you think about having two separate arrays, one for qs and one for scales?

ggerganov · 2023-04-21T19:05:26Z

@unbounded
Thanks for the efforts. Last few days we were speed running various quantization approaches with focus on ARM NEON, but I think and hope that we are finally converging to the correct solution - Q4_2 and Q4_3.

I hope in the next couple of days we confirm that we will proceed with these quantization strategies and merge the ARM NEOM implementations to master. At this point, we can investigate alternative memory layouts that are suitable for other arches and also do not degrade the ARM NEON implementation (potentially like the proposed one in this PR)

unbounded · 2023-04-21T21:16:23Z

What do you think about having two separate arrays, one for qs and one for scales?

Not sure what you mean here, you mean keeping them as two separate allocations? That would somewhat simplify alignment, but I think it would be hard to do that generally for different formats, e.g. q4_1 might use three "arrays"

unbounded · 2023-04-21T22:18:47Z

ggml.c

+        const float dy0 = yds[dst0];
+        const float dy1 = yds[dst1];
+
+        // NOTE: having these as plain int triggers a bug with AVX512 on GCC 12.2


https://godbolt.org/z/zj4x6Kz8o
Let me know if you see something wrong with this code, but this looks like a compiler bug to me...
GCC spots that it can vectorize with vpdpbusd (impressive), but it looks like it forgot that the first vector has to be unsigned, (just like I tend to).

unbounded · 2023-04-22T01:14:36Z

I'll hold a bit until it stabilizes, but it should be straightforward to test the same approach for the q4_2 format.

Added SIMD for Arm Neon - it's almost identical to q4_0 except we don't need the vuzp1q_s8 instructions anymore.

I don't have an M1 to test on, but got some timings on an Ampere Altra VM:

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
q4_0 7B:

llama_print_timings:        load time =  1560.01 ms
llama_print_timings:      sample time =    99.38 ms /   100 runs   (    0.99 ms per run)
llama_print_timings: prompt eval time =   949.90 ms /     6 tokens (  158.32 ms per token)
llama_print_timings:        eval time = 17879.96 ms /    99 runs   (  180.61 ms per run)
llama_print_timings:       total time = 19541.73 ms

q4_0c 7B:

llama_print_timings:        load time =  1460.82 ms
llama_print_timings:      sample time =    99.60 ms /   100 runs   (    1.00 ms per run)
llama_print_timings: prompt eval time =   865.02 ms /     6 tokens (  144.17 ms per token)
llama_print_timings:        eval time = 19523.40 ms /    99 runs   (  197.21 ms per run)
llama_print_timings:       total time = 21086.20 ms

unbounded · 2023-04-24T17:05:17Z

Prefetching seems to be important with this layout, probably the extra memory accesses confuse the hardware prefetcher.

AVX-512:
q4_0c 7B, without prefetch:

llama_print_timings: prompt eval time =   449.19 ms /     6 tokens (   74.86 ms per token)
llama_print_timings:        eval time = 12885.82 ms /    99 runs   (  130.16 ms per run)

q4_0c 7B, with prefetch:

llama_print_timings: prompt eval time =   399.98 ms /     6 tokens (   66.66 ms per token)
llama_print_timings:        eval time =  9684.23 ms /    99 runs   (   97.82 ms per run)

Arm Neon:
q4_0c 7B, without prefetch:

llama_print_timings: prompt eval time =   865.02 ms /     6 tokens (  144.17 ms per token)
llama_print_timings:        eval time = 19523.40 ms /    99 runs   (  197.21 ms per run)

q4_0c 7B, with prefetch:

llama_print_timings: prompt eval time =   777.49 ms /     6 tokens (  129.58 ms per token)
llama_print_timings:        eval time = 13856.25 ms /    99 runs   (  139.96 ms per run)

ggerganov · 2023-04-24T17:22:46Z

On M1 Pro, Q4_0c is currently more than 20% slower than Q4_0 using 8 threads (and even more for 4 threads):

 ggerganov  Georgis-MBP  ~/development/github/llama.cpp 
 20:15:59  ⚓ master-50cb666-9-g58e10f2  8⎘  $  make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -c 2048 -n 512 --ignore-eos -s 5 -n 64 -t 4
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: seed = 5
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB

system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 2048, n_batch = 8, n_predict = 64, n_keep = 0


 I believe the meaning of life is to learn, live and love.
There are three keys to living a happy life: Love what you do, Work with passionate people, Serve others who need your help.
Doing the things that matter most in life.
Being able to see beauty in everything around you.
The ability to
llama_print_timings:        load time =   507.55 ms
llama_print_timings:      sample time =    47.07 ms /    64 runs   (    0.74 ms per run)
llama_print_timings: prompt eval time =   498.11 ms /     8 tokens (   62.26 ms per token)
llama_print_timings:        eval time =  3561.34 ms /    63 runs   (   56.53 ms per run)
llama_print_timings:       total time =  4116.99 ms
 ggerganov  Georgis-MBP  ~/development/github/llama.cpp 
 20:16:13  ⚓ master-50cb666-9-g58e10f2  8⎘  $ make -j && ./main -m ./models/7B/ggml-model-q4_0c.bin -p "I believe the meaning of life is" -c 2048 -n 512 --ignore-eos -s 5 -n 64 -t 4
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: seed = 5
llama.cpp: loading model from ./models/7B/ggml-model-q4_0c.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q4_0C)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB

system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 2048, n_batch = 8, n_predict = 64, n_keep = 0


 I believe the meaning of life is to serve others with love and compassion. To share my time, energy and money to make this world a better place. That's where my passion comes from!
My parents gave me so much love, that I want to give it back to people who need it most. There are so many children around
llama_print_timings:        load time =   668.47 ms
llama_print_timings:      sample time =    47.20 ms /    64 runs   (    0.74 ms per run)
llama_print_timings: prompt eval time =   659.63 ms /     8 tokens (   82.45 ms per token)
llama_print_timings:        eval time =  5920.76 ms /    63 runs   (   93.98 ms per run)
llama_print_timings:       total time =  6637.47 ms
 ggerganov  Georgis-MBP  ~/development/github/llama.cpp 
 20:16:25  ⚓ master-50cb666-9-g58e10f2  8⎘  $ make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -c 2048 -n 512 --ignore-eos -s 5 -n 64 -t 8
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: seed = 5
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 2048, n_batch = 8, n_predict = 64, n_keep = 0


 I believe the meaning of life is to learn, live and love.
There are three keys to living a happy life: Love what you do, Work with passionate people, Serve others who need your help.
Doing the things that matter most in life.
Being able to see beauty in everything around you.
The ability to
llama_print_timings:        load time =   335.87 ms
llama_print_timings:      sample time =    47.24 ms /    64 runs   (    0.74 ms per run)
llama_print_timings: prompt eval time =   326.36 ms /     8 tokens (   40.79 ms per token)
llama_print_timings:        eval time =  3228.42 ms /    63 runs   (   51.24 ms per run)
llama_print_timings:       total time =  3612.50 ms
 ggerganov  Georgis-MBP  ~/development/github/llama.cpp 
 20:16:35  ⚓ master-50cb666-9-g58e10f2  8⎘  $ make -j && ./main -m ./models/7B/ggml-model-q4_0c.bin -p "I believe the meaning of life is" -c 2048 -n 512 --ignore-eos -s 5 -n 64 -t 8
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: seed = 5
llama.cpp: loading model from ./models/7B/ggml-model-q4_0c.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q4_0C)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 2048, n_batch = 8, n_predict = 64, n_keep = 0


 I believe the meaning of life is to serve others with love and compassion. To share my time, energy and money to make this world a better place. That's where my passion comes from!
My parents gave me so much love, that I want to give it back to people who need it most. There are so many children around
llama_print_timings:        load time =   452.52 ms
llama_print_timings:      sample time =    46.99 ms /    64 runs   (    0.73 ms per run)
llama_print_timings: prompt eval time =   443.12 ms /     8 tokens (   55.39 ms per token)
llama_print_timings:        eval time =  4073.24 ms /    63 runs   (   64.65 ms per run)
llama_print_timings:       total time =  4573.76 ms
 ggerganov  Georgis-MBP  ~/development/github/llama.cpp 
 20:16:46  ⚓ master-50cb666-9-g58e10f2  8⎘  $

Btw, without prefetching (i.e. previous commit) Q4_0c is about the same as Q4_0:

 20:20:57  ⚓ master-50cb666-8-g64a6a29  8⎘  $  make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -c 2048 -n 512 --ignore-eos -s 5 -n 64 -t 4
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: seed = 5
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB

system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 2048, n_batch = 8, n_predict = 64, n_keep = 0


 I believe the meaning of life is to learn, live and love.
There are three keys to living a happy life: Love what you do, Work with passionate people, Serve others who need your help.
Doing the things that matter most in life.
Being able to see beauty in everything around you.
The ability to
llama_print_timings:        load time =   491.92 ms
llama_print_timings:      sample time =    46.61 ms /    64 runs   (    0.73 ms per run)
llama_print_timings: prompt eval time =   482.84 ms /     8 tokens (   60.36 ms per token)
llama_print_timings:        eval time =  3435.16 ms /    63 runs   (   54.53 ms per run)
llama_print_timings:       total time =  3974.66 ms
 ggerganov  Georgis-MBP  ~/development/github/llama.cpp 
 20:21:06  ⚓ master-50cb666-8-g64a6a29  8⎘  $ make -j && ./main -m ./models/7B/ggml-model-q4_0c.bin -p "I believe the meaning of life is" -c 2048 -n 512 --ignore-eos -s 5 -n 64 -t 4
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: seed = 5
llama.cpp: loading model from ./models/7B/ggml-model-q4_0c.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q4_0C)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB

system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 2048, n_batch = 8, n_predict = 64, n_keep = 0


 I believe the meaning of life is to serve others with love and compassion. To share my time, energy and money to make this world a better place. That's where my passion comes from!
My parents gave me so much love, that I want to give it back to people who need it most. There are so many children around
llama_print_timings:        load time =   552.26 ms
llama_print_timings:      sample time =    46.86 ms /    64 runs   (    0.73 ms per run)
llama_print_timings: prompt eval time =   542.64 ms /     8 tokens (   67.83 ms per token)
llama_print_timings:        eval time =  3856.36 ms /    63 runs   (   61.21 ms per run)
llama_print_timings:       total time =  4456.43 ms
 ggerganov  Georgis-MBP  ~/development/github/llama.cpp 
 20:21:18  ⚓ master-50cb666-8-g64a6a29  8⎘  $ make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -c 2048 -n 512 --ignore-eos -s 5 -n 64 -t 8
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: seed = 5
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 2048, n_batch = 8, n_predict = 64, n_keep = 0


 I believe the meaning of life is to learn, live and love.
There are three keys to living a happy life: Love what you do, Work with passionate people, Serve others who need your help.
Doing the things that matter most in life.
Being able to see beauty in everything around you.
The ability to
llama_print_timings:        load time =   350.05 ms
llama_print_timings:      sample time =    46.93 ms /    64 runs   (    0.73 ms per run)
llama_print_timings: prompt eval time =   340.74 ms /     8 tokens (   42.59 ms per token)
llama_print_timings:        eval time =  2984.46 ms /    63 runs   (   47.37 ms per run)
llama_print_timings:       total time =  3382.31 ms
 ggerganov  Georgis-MBP  ~/development/github/llama.cpp 
 20:21:31  ⚓ master-50cb666-8-g64a6a29  8⎘  $  make -j && ./main -m ./models/7B/ggml-model-q4_0c.bin -p "I believe the meaning of life is" -c 2048 -n 512 --ignore-eos -s 5 -n 64 -t 8
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: seed = 5
llama.cpp: loading model from ./models/7B/ggml-model-q4_0c.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q4_0C)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 2048, n_batch = 8, n_predict = 64, n_keep = 0


 I believe the meaning of life is to serve others with love and compassion. To share my time, energy and money to make this world a better place. That's where my passion comes from!
My parents gave me so much love, that I want to give it back to people who need it most. There are so many children around
llama_print_timings:        load time =   374.11 ms
llama_print_timings:      sample time =    47.05 ms /    64 runs   (    0.74 ms per run)
llama_print_timings: prompt eval time =   364.90 ms /     8 tokens (   45.61 ms per token)
llama_print_timings:        eval time =  2990.87 ms /    63 runs   (   47.47 ms per run)
llama_print_timings:       total time =  3412.83 ms
 ggerganov  Georgis-MBP  ~/development/github/llama.cpp 
 20:21:40  ⚓ master-50cb666-8-g64a6a29  8⎘  $

unbounded · 2023-04-24T19:06:34Z

Hm, sounds like it is sensitive to what prefetch distance to use then, that's unfortunate. Thanks for checking!
It executes strictly less instructions, so I think performance should always be at least on par, but possibly the M1 just doesn't benefit from prefetching hints.

unbounded · 2023-04-24T22:00:41Z

For reference, here are the times I saw on the Ampere Altra for different values of ahead
For eval:

And for prompt evaluation:

sw · 2023-04-25T14:57:30Z

Somewhat related to this is the fact that Q8_0 as it is after #1083, #1109 now has two floats that go to waste for Q4_0 and Q4_2, at least for the AVX2 implementation. This makes quantization slower due to calculating unused values, and the vector dot product slower, as it has to churn through more memory.

We could define a new format, but this again makes the source code longer:

#define QK8_0 32
typedef struct {
    float   d;          // delta
    int8_t  qs[QK8_0];  // quants
} block_q8_0;

#define QK8_1 32
typedef struct {
    float   d;          // delta
    float   s0;         // d * sum(qs[i]) low
    float   s1;         // d * sum(qs[i]) high
    int8_t  qs[QK8_1];  // quants
} block_q8_1;

Edit: this was done in #1179

Introduce alternative quantized formats q4_0c and q8_0c, corresponding exactly to q4_0 and q8_0, except that quantized values and scales are laid out continuously in memory, and the nibbles in q4_0 are rearranged. This should simplify SIMD implementations, at the expense of slighly more complex scalar implementations.

Mostly copied from the q4_0 implementation

Seems significant especially for evaluation time

Not supported on some GCC versions

unbounded · 2023-05-04T07:58:01Z

I started a branch for the same approach with the q4_2 data format:
https://github.com/unbounded/llama.cpp/tree/continuous-q4_2c

For AMD64 w AVX512:
q4_2:

llama_print_timings: prompt eval time =   899.29 ms /     6 tokens (  149.88 ms per token)
llama_print_timings:        eval time = 19050.60 ms /    99 runs   (  192.43 ms per run)

q4_2c:

llama_print_timings: prompt eval time =   564.52 ms /     6 tokens (   94.09 ms per token)
llama_print_timings:        eval time = 10912.85 ms /    99 runs   (  110.23 ms per run)

ggerganov · 2023-05-04T15:25:40Z

@unbounded

I am a bit torn about the proposed big block size of 4*32 - it will limit the application since not all models have rows divisible by 128.

I was wondering, if we proceed with 32 block size as in #1305, would it make sense for the AVX512 dot product implementation to memcpy 4 blocks into a local array in the proper 512-bit layout and fallback to the implementation here. The memcpy would obviously be an overhead, but maybe it wouldn't be too big and we still get most of the benefit from this implementation.

Would this work?

unbounded · 2023-05-04T20:54:23Z

@ggerganov That is a thing I can test the performance impact of, but it will probably be a little while before I get around to it.

I don't see any reason we couldn't add a "rest" handler for non-block sizes - main disadvantage would be the padding and the bit of extra code.
For now I'm looking to establish a baseline of what's achievable - possibly most of the wins can be achieved with less drastic changes like #1305.
Next step here is probably testing some more variations on an M1 as well and see if there is performance improvements to be had there.

unbounded · 2023-05-04T21:32:18Z

q4_2 timings on Ampere Altra, 7B

q4_2:

llama_print_timings: prompt eval time =  1481.07 ms /     6 tokens (  246.85 ms per token)
llama_print_timings:        eval time = 24994.51 ms /    99 runs   (  252.47 ms per run)

q4_2c:

llama_print_timings: prompt eval time =  1212.94 ms /     6 tokens (  202.16 ms per token)
llama_print_timings:        eval time = 21673.20 ms /    99 runs   (  218.92 ms per run)

unbounded · 2023-05-06T08:17:52Z

Did some performance testing on M1 and updated the q4_2c branch:
Now I see a performance win here as well:

q4_2:

llama_print_timings: prompt eval time =   592.93 ms /     6 tokens (   98.82 ms per token)
llama_print_timings:        eval time =  4458.55 ms /    50 runs   (   89.17 ms per run)

q4_2c:

llama_print_timings: prompt eval time =   486.42 ms /     6 tokens (   81.07 ms per token)
llama_print_timings:        eval time =  7533.07 ms /    99 runs   (   76.09 ms per run)

unbounded · 2023-05-08T20:55:50Z

Some miscellaneous performance observations:

I saw no performance difference using 64-byte aligned loads with AVX-512.

Prefetching instructions give no benefit at all on M1 processors.
On other CPUs they helped narrow the difference between prompt eval and prediction. Prediction time being much slower than prompt eval may indicate that prefetch instructions can improve performance.

Some very interesting work in #1256, the "super blocks" mentioned there are probably large enough to capture most of the benefit of this layout, if they are properly arranged. Also uses drastically fewer Float16 numbers, so there is less benefit in doing 2 or 4 F16->F32 conversions at once like we can here.

unbounded force-pushed the quantization-continuous-layouts branch from 48c84f7 to 4f149c2 Compare April 19, 2023 21:59

unbounded added performance Speed related topics research 🔬 labels Apr 21, 2023

unbounded commented Apr 21, 2023

View reviewed changes

unbounded force-pushed the quantization-continuous-layouts branch from 4f149c2 to 64a6a29 Compare April 22, 2023 00:46

ggerganov assigned unbounded Apr 22, 2023

unbounded added 7 commits May 4, 2023 09:02

test-quantize: fix for q8_0 intermediates

2219467

q4_0c: quantize support

4bd781c

q4_0c: AVX512 vec_dot and quantize impl

ab543dc

q4_0c: Arm Neon acceleration

1b49d26

Mostly copied from the q4_0 implementation

q4_0c: prefetch on AVX-512 and ARM

2949725

Seems significant especially for evaluation time

q4_0c: disable prefetching on M1

d53f767

unbounded force-pushed the quantization-continuous-layouts branch from 58e10f2 to d53f767 Compare May 4, 2023 07:20

unbounded mentioned this pull request May 4, 2023

Use different bit arrangement for quants (nibbles) #1241

Closed

q4_0c: avoid _mm512_loadu_epi64 instruction

76692c9

Not supported on some GCC versions

slaren mentioned this pull request May 6, 2023

[DRAFT] Speedup dequantize kernels #1221

Closed

This was referenced May 13, 2023

LLaMA NUMA could be better #1437

Closed

ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 #1508

Merged

ggerganov mentioned this pull request May 30, 2023

llama : Metal inference #1642

Merged

20 tasks

ggerganov mentioned this pull request Jul 1, 2023

dequantize + matrix multiplication CUDA kernels #2043

Closed

unbounded closed this Oct 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Continuous layouts for quantization q4_0c #1073

Continuous layouts for quantization q4_0c #1073

unbounded commented Apr 19, 2023 •

edited

Loading

pubby commented Apr 20, 2023

ggerganov commented Apr 21, 2023

unbounded commented Apr 21, 2023

unbounded Apr 21, 2023

unbounded commented Apr 22, 2023

unbounded commented Apr 24, 2023

ggerganov commented Apr 24, 2023

unbounded commented Apr 24, 2023

unbounded commented Apr 24, 2023

sw commented Apr 25, 2023 •

edited

Loading

unbounded commented May 4, 2023

ggerganov commented May 4, 2023

unbounded commented May 4, 2023 •

edited

Loading

unbounded commented May 4, 2023

unbounded commented May 6, 2023

unbounded commented May 8, 2023

Continuous layouts for quantization q4_0c #1073

Continuous layouts for quantization q4_0c #1073

Conversation

unbounded commented Apr 19, 2023 • edited Loading

pubby commented Apr 20, 2023

ggerganov commented Apr 21, 2023

unbounded commented Apr 21, 2023

unbounded Apr 21, 2023

Choose a reason for hiding this comment

unbounded commented Apr 22, 2023

unbounded commented Apr 24, 2023

ggerganov commented Apr 24, 2023

unbounded commented Apr 24, 2023

unbounded commented Apr 24, 2023

sw commented Apr 25, 2023 • edited Loading

unbounded commented May 4, 2023

ggerganov commented May 4, 2023

unbounded commented May 4, 2023 • edited Loading

unbounded commented May 4, 2023

unbounded commented May 6, 2023

unbounded commented May 8, 2023

unbounded commented Apr 19, 2023 •

edited

Loading

sw commented Apr 25, 2023 •

edited

Loading

unbounded commented May 4, 2023 •

edited

Loading