KV cache quantized to q8_0 #2969

JohannesGaessler · 2023-09-02T15:19:40Z

This PR aims to implement quantizing the KV cache as q8_0. As of right now it is a proof of concept and I would like to discuss the best way of doing a proper implementation ahead of time.

Potential benefits:

Roughly 50% less RAM/VRAM usage for the KV cache, particularly relevant for long contexts and batched evaluation, as may become more important due to speculative : PoC for speeding-up inference via speculative sampling #2926 .
Faster evaluation due to integer SIMD instructions and less I/O.
Less reliance on external BLAS libraries that are at this point mostly needed for floating point GEMM; as of right now the KV cache is the only remaining use case for quantized models.

The current state of this PR is that only the CPU implementation works (with some hacks). These are the performance results on my system:

Model	CUDA?	ngl	test	t/s master	t/s PR	Speedup
7B Q4_0	No	0	pp 512	25.05 ± 0.04	25.12 ± 0.20	1.00
7B Q4_0	Yes	0	pp 512	309.63 ± 1.71	289.90 ± 0.60	0.94
7B Q4_0	Yes	33	pp 512	585.59 ± 2.64	519.90 ± 1.97	0.89
7B Q4_0	No	0	tg 128	9.99 ± 0.03	10.03 ± 0.03	1.00
7B Q4_0	Yes	0	tg 128	9.98 ± 0.04	9.97 ± 0.05	1.00
7B Q4_0	Yes	33	tg 128	89.74 ± 0.13	84.01 ± 0.02	0.94

The GPU is an RTX 3090. The use of CUDA up to 33 layers is possible because only at 34 layers the KV cache starts being offloaded. In its current form there seems to be a small performance regression. The Perplexity on wikitext-2 changed from 5.9679 +/- 0.03352 to 5.9668 +/- 0.03349 which is actually a very small improvement, likely due to randomness (calculated with 33 layers offloaded).

The implementation for the K component of the KV cache was fairly straightforward: the values copied to the K component always have a row size of n_embd_head which is a multiple of 32 (q8_0 block size). So you can essentially just change the datatype of the K component to q8_0 and the preexisting implementations for copying and matrix multiplication will work out-of-the-box.

The implementation for the V component was more tricky because the row size is equal to the number of processed tokens which then need to be copied to an arbitrary location in a q8_0 block. I did the following things to make it work:

Extend ggml_compute_forward_dup_f32 by an extra case that so far was not implemented: copying non-contiguous f32 values to non-contiguous q8_0 values. My solution was to fill a zero-initialized temporary buffer behind the q8_0 block with the unquantized values. Then re-create the q8_0 block from those when either the buffer contains enough values for an entire q8_0 block or when there are no more new unquantized values to copy for a row. This requires an additional 128 bytes per V component row (hacked by just increasing the size by 128 elements).
The aforementioned extra case needs to know the starting position inside the q8_0 block in order to determine the location of the temporary buffer and when it should be converted to q8_0. I hacked this by writing this information to the op_params of the corresponding tensor in llm_build_llama.
The matrix multiplication code for q8_0 expects the row size to be a multiple of 32 (q8_0 block size). For the V component of the KV cache this can be done by just taking a larger slice from it. The hidden state needs to be padded with extra values. I hacked this by creating a new tensor that has the padded size, then creating a view of that tensor, and then copying the hidden state to said view.

I plan to do a proper implementation of this PR in the following way:

Modify ggml_view so that the position inside a block is written to op_params if the underlying tensor is quantized.
Add a GGML_OP like GGML_OP_PAD that pads a tensor with a constant value. Alternatively the code for quantizing the hidden state could be modified which would probably have better performance. But this would also be more complicated and require changes across a much wider range of hardware (to some of which I don't have access for testing). For the CUDA implementation only very minimal changes would be necessary to avoid padding the hidden state in an extra GGML_OP.
I will implement q8_0 KV cache quantization for CPU and CUDA.
I will replace the --memory-f32 CLI argument with a new one that lets you choose between q8_0, f16, and f32 for the KV cache type.

If there are issues with the way I plan to do the implementation, please let me know ahead of time.

JohannesGaessler · 2023-09-02T17:01:54Z

Add a GGML_OP like GGML_OP_PAD that pads a tensor with a constant value. Alternatively the code for quantizing the hidden state could be modified which would probably have better performance.

On second thought, maybe not. The order of tensors would be matrix multiplication -> scale -> mask -> softmax -> pad -> matrix multiplication. For optimal performance you should be fusing those tensors anyways and adding one more unary tensor to the pipeline would not make a difference.

ggerganov

I don't see anything immediately wrong with this approach, so interesting if we can make it work all the way.

Not sure if we would win much from a dedicated PAD op. We can have it as a combination of view + cpy as you have proposed. Either way would be fine I guess

Why do you add 128 again?
Also, probably better to utilize the wdata buffer that each op uses for writing work data

JohannesGaessler · 2023-09-02T18:16:25Z

Why do you add 128 again?

I'm adding 128 so each row has some extra space to temporarily store the unquantized values as the context fills up.
It's to avoid precision loss when only part of a q8_0 block has been filled.

KerfuffleV2 · 2023-09-02T19:58:04Z

This seems to work on ROCM (RX 6600) but weirdly uses considerably more memory than master for the same context size. Generation speed was about the same (perplexity not tested). This was a 7B LLaMA1 Q4_K model with -ngl 33 -c 2048. For master, it used ~4,500 MiB VRAM. This pull ~6,500 MiB.

JohannesGaessler · 2023-09-02T22:27:38Z

Also, probably better to utilize the wdata buffer that each op uses for writing work data

From what I can tell wdata would not be suitable because I need persistence across multiple eval calls.

This seems to work on ROCM (RX 6600) but weirdly uses considerably more memory than master for the same context size.

I think this is due to the hack that I'm currently using to pad one of the tensors. It should not be an issue in the final version.

kiratp · 2023-09-02T22:32:54Z

Less reliance on external BLAS libraries that are at this point mostly needed for floating point GEMM; as of right now the KV cache is the only remaining use case for quantized models.

Unless the plan is to call the new AMX (and VNNI) instructions manually, not using MKL would be slower on any newer (post Sapphire Rapids) Intel CPUs. These new instructions perform tiled Int8 multiplies in HW.

JohannesGaessler · 2023-09-04T08:42:45Z

I added new functions like ggml_view_blck_1d that allow you to set the starting position in a block (I think this is preferable over changing the interface of e.g. ggml_view_1d and adding an extra parameter).

I replaced the padding hack with an extension to the matrix multiplication ggml code. The solution that I came up with is to quantize the last block of the hidden state without any SIMD instruction if it is incomplete. The rest of the tensor is still quantiized using SIMD instructions. This should keep implementation/maintenance overhead low with negligible impact on performance.

I pushed CUDA implementations that produce correct results but the performance is bad.

examples/server/server.cpp

cebtenzzre · 2023-09-05T19:42:21Z

test-opt and test-grad0 are failing an assertion:

GGML_ASSERT: ggml.c:11303: from_float_to_vec_dot != NULL

edit: type_traits[GGML_TYPE_F32].from_float is NULL in ggml_compute_forward_mul_mat.

JohannesGaessler · 2023-09-05T19:43:52Z

Alright, this PR should now be feature complete. I added a new CLI argument --kv-type that lets users set the type as which the KV cache is stored. By default q8_0 is used since the memory savings are quite substantial: for 4096 context with 7b q4_0 the VRAM usage goes down from 6478 MiB to 5536 MiB (942 MiB reduction). The token generation performance seems to be slightly slower, prompt processing on average a little faster:

model	GPU	test	t/s master	t/s PR	Speedup
7b q4_0	RTX 3090	pp 512	2187.37 ± 34.18	2303.08 ± 24.19	1.05
7b q4_0	RTX 3090	tg 128	141.11 ± 0.30	140.19 ± 0.31	0.99
7b q4_0	P40	pp 512	875.56 ± 4.72	870.33 ± 2.09	0.99
7b q4_0	P40	tg 128	59.45 ± 0.07	57.23 ± 0.01	0.96
7b q4_0	RX 6800	pp 512	1061.05 ± 6.51	1159.28 ± 7.16	1.09
7b q4_0	RX 6800	tg 128	77.89 ± 0.02	72.95 ± 0.01	0.94

The perplexity changes as follows:

model	PPL master	PPL PR	PPL diff
7b f16	5.7962	5.7989	2.7e-3
7b q4_0	5.9670	5.9658	-1.2e-3
7b q4_1	6.0035	6.0038	3e-4
7b q5_0	5.8292	5.8327	3.5e-3
7b q5_1	5.8538	5.8560	2.2e-3
7b q8_0	5.8014	5.8018	4e-4
7b q2_k	6.4466	6.4514	4.8e-3
7b q3_k_s	6.2950	6.2955	5e-4
7b q3_k_m	6.0264	6.0297	3.3e-3
7b q3_k_l	5.9854	5.9911	5.7e-3
7b q4_k_s	5.8891	5.8934	4.3e-3
7b q4_k_m	5.8798	5.8840	4.2e-3
7b q5_k_s	5.8215	5.8260	4.5e-3
7b q5_k_m	5.8292	5.8332	4e-3
7b q6_k	5.8098	5.8141	4.3e-3

The increase in perplexity is on the same order as the perplexity differences between f16 and q8_0 or q8_0 and q6_K.

cebtenzzre · 2023-09-05T19:52:54Z

make LLAMA_CUBLAS=1 is failing on my computer:

ggml-cuda.cu(2107): error: the size of an array must be greater than zero
      __attribute__((shared)) int tile_x_qs[mmq_y * (32) + mmq_y];
                                            ^
          detected during:
            instantiation of "void allocate_tiles_q4_0<mmq_y>(int **, half2 **, int **, int **) [with mmq_y=-1]" at line 3483
            instantiation of "void mul_mat_q4_0<need_check>(const void *, const void *, float *, int, int, int, int, int, int, int, int) [with need_check=false]" at line 4662

I have a GTX 970 installed (compute 5.2), maybe that's the problem? You can test with make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_52.

JohannesGaessler · 2023-09-05T20:19:49Z

The failing asserts were added by me and I removed them again since the assumptions that went into them seem to have been incorrect. The issue with CUDA compilation was caused by bad fallback values (which get used by cmake for the CC 5.2 PTX code) and should be fixed now.

KerfuffleV2 · 2023-09-05T20:36:32Z

Running perplexity with ROCM, I just get NANs when -kvt q8_0. It doesn't seem to make a difference whether -ngl 0 or offloading the whole model.

ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6600, compute capability 10.3
[...]
llm_load_tensors: ggml ctx size =    0.08 MB
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: mem required  =  103.84 MB (+   97.12 MB per state)
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: VRAM used: 3466 MB
[...]
llama_new_context_with_model: kv self size  =   97.12 MB
llama_new_context_with_model: compute buffer total size =   70.22 MB
llama_new_context_with_model: VRAM scratch buffer: 68.75 MB
system_info: n_threads = 5 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 759.39 ms
perplexity: calculating perplexity over 616 chunks, batch_size=512
perplexity: 0.59 seconds per pass - ETA 6.10 minutes
[1]nan,[2]nan,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan,[9]nan,[10]nan,[11]nan,[12]nan,[13]nan

Token generation seemed to work okay.

JohannesGaessler · 2023-09-05T20:45:11Z

Do you still get NaN with -nommq?

KerfuffleV2 · 2023-09-05T20:50:13Z

Do you still get NaN with -nommq?

Yes, I do.

perplexity: 1.27 seconds per pass - ETA 13.03 minutes
[1]nan,[2]nan,[3]nan

I get them slower! (Not that this information is likely to be too helpful.)

cebtenzzre · 2023-09-05T20:52:14Z

I get an assertion failure if I set my GTX 970 as the main GPU with -mg 1.
Full command: ./main -ngl 100 -t 1 -m llama-7b.q4_0.gguf -n 128 --ignore-eos -mg 1

GGML_ASSERT: ggml-cuda.cu:5782: buffers_contiguous || ne02 == 1

Backtrace:

#3  0x00007fffcea264b8 in __GI_abort () at abort.c:79
#4  0x000055555568bb21 in ggml_cuda_op_mul_mat_vec<false> (src0=0x20ea29be0, src1=0x20ea29a90, dst=0x20ea29d30, 
    src0_ddq_i=0x93a000000 "", src0_ddf_i=0x0, src1_ddf_i=0x93e408040, dst_ddf_i=0x93e404040, i02=0, i01_low=0, i01_high=1,
    i1=0, cudaStream_main=@0x7fffffff77e0: 0x555555b9d430) at /home/cebtenzzre/src/forks/llama.cpp/ggml-cuda.cu:5782
#5  0x000055555567918b in ggml_cuda_mul_mat_vec_p021 (src0=0x20ea29be0, src1=0x20ea29a90, dst=0x20ea29d30)
    at /home/cebtenzzre/src/forks/llama.cpp/ggml-cuda.cu:6492
#6  0x0000555555679668 in ggml_cuda_mul_mat (src0=0x20ea29be0, src1=0x20ea29a90, dst=0x20ea29d30)
    at /home/cebtenzzre/src/forks/llama.cpp/ggml-cuda.cu:6543
#7  0x000055555567b89e in ggml_cuda_compute_forward (params=0x7fffffff7980, tensor=0x20ea29d30)
    at /home/cebtenzzre/src/forks/llama.cpp/ggml-cuda.cu:7062
#8  0x00005555555ad84f in ggml_compute_forward (params=0x7fffffff7980, tensor=0x20ea29d30) at ggml.c:15760
#9  0x00005555555b1e01 in ggml_graph_compute_thread (data=0x7fffffff79c0) at ggml.c:17335
#10 0x00005555555b3432 in ggml_graph_compute (cgraph=0x20ea00020, cplan=0x7fffffff7ad0) at ggml.c:17845
#11 0x00005555555be540 in ggml_graph_compute_helper (buf=std::vector of length 16384, capacity 32768 = {...},
    graph=0x20ea00020, n_threads=1) at llama.cpp:440
#12 0x00005555555ca37a in llama_eval_internal (lctx=..., tokens=0x555558214510, embd=0x0, n_tokens=1, n_past=0, n_threads=1,
    cgraph_fname=0x0) at llama.cpp:3014
#13 0x00005555555d6109 in llama_eval (ctx=0x555557305580, tokens=0x555558214510, n_tokens=1, n_past=0, n_threads=1)
    at llama.cpp:6164
#14 0x00005555555690aa in main (argc=12, argv=0x7fffffffd998) at examples/main/main.cpp:617

JohannesGaessler · 2023-09-05T23:11:02Z

I've pushed a fix for incorrect results with compute capabilities < 6.1. The results should now be correct but the performance will be bad.

KerfuffleV2 · 2023-09-05T23:18:54Z

Tested at e0d5c0b - no difference as far as I can see. With -ngl 0, with -nommq, both. All NaNs. Perhaps your compute level checking isn't working correctly for ROCM? The speed seems the same as before.

JohannesGaessler · 2023-09-05T23:26:06Z

It's not just a check for compute capability, it's primarily that the old dequantize_mul_mat_vec implementation that was used as a fallback could not correctly handle non-contiguous tensors. I implemented support for non-contiguous tensors but it seems that this was a separate issue from yours.

KerfuffleV2 · 2023-09-05T23:27:56Z

Alrighty. Just let me know if/when you want me to try a new version, different settings, etc. I can test out whatever.

JohannesGaessler · 2023-09-05T23:28:11Z

@KerfuffleV2 which GPU do you have again?

KerfuffleV2 · 2023-09-05T23:29:44Z

ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6600, compute capability 10.3

Arch Linux, the ROCM/ROCBLAS stuff seems to be at 5.6.0 if that makes a difference.

JohannesGaessler · 2023-09-05T23:53:05Z

You could maybe try editing line 6593 of ggml_cuda.cu. There should be a boolean nc_okay that indicates whether directly handling non-contiguous data is going to give you correct results. I would be interested in whether you still get incorrect results if you manually set nc_okay = false which should then trigger the fallback implementation where the data is first made contiguous.

KerfuffleV2 · 2023-09-05T23:57:43Z

diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index ebe267f..98847cf 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -6590,7 +6590,7 @@ void ggml_cuda_mul_mat(const ggml_tensor * src0, const ggml_tensor * src1, ggml_
     }
 
     // no quantized non-contiguous support for lower CC kernels implemented
-    const bool nc_okay = src0->type == GGML_TYPE_F16 || g_compute_capabilities[g_main_device] >= MIN_CC_DP4A;
+    const bool nc_okay = false;
 
     if (all_on_device && nc_okay && ggml_is_permuted(src0) && ggml_is_permuted(src1) && src1->ne[1] == 1) {
         ggml_cuda_mul_mat_vec_p021(src0, src1, dst);

No effect. Also tested with -nommq.

edit: Kind of just randomly changing stuff but I also tried:

-        if (src1->ne[1] == 1 && src0->ne[0] % GGML_CUDA_DMMV_X == 0) {
+        if (false && src1->ne[1] == 1 && src0->ne[0] % GGML_CUDA_DMMV_X == 0) {

Also no difference.

edit 2: Continuing that approach, I forced it to end up at line 6612:

ggml_cuda_op(src0, src1, dst, ggml_cuda_op_mul_mat_cublas, true, false);

Still no difference. (And I verified it was getting there with with a GGML_ASSERT(false), though I don't know if it was for the KV stuff specifically.)

JohannesGaessler · 2023-09-13T20:31:23Z

This model: https://huggingface.co/pankajmathur/orca_mini_3b - converted/quantized to q8_0

Okay, that's the piece that I was missing. Using the same model I can now reproduce the issue. The problem seems to be that 3b has n_embd_head == 100 which is not a multiple of 32 so the K component of the KV cache does not work properly. The fix is going to be annoying though; padding K to 128 is probably the least bad option.

Dampfinchen · 2023-09-14T08:32:16Z

I pushed a fix for partial offloading when the input or batch size is not a multiple of 32.

Can confirm this fixes it. Awesome work!

Green-Sky · 2023-09-16T00:16:28Z

EDIT3: seems to be a bug in the setup/build process, tracked by: #3202
Got another problematic model for you: tinyllama (1.1B)
(commit 91523fb)

ppl for f16:

$ perplexity -m models/TinyLlama-1.1B-step-50K-105b/ggml-model-f32.gguf -f wikitext-2-raw/wiki.test.raw -ngl 0 --chunks 300 -kvt f16
[1]212.4852,[2]299.6576,[3]302.9353,[4]341.5807,^C

ppl for q8_0:

$ perplexity -m models/TinyLlama-1.1B-step-50K-105b/ggml-model-f32.gguf -f wikitext-2-raw/wiki.test.raw -ngl 0 --chunks 300 -kvt q8_0
[1]12633.6645,[2]16262.5879,[3]17373.0927,[4]16899.1684,^C

ppl for q8_0 nommq:

$ perplexity -m models/TinyLlama-1.1B-step-50K-105b/ggml-model-f32.gguf -f wikitext-2-raw/wiki.test.raw -ngl 0 --chunks 300 -kvt q8_0 -nommq
[1]212.1701,[2]299.5440,[3]302.8249,[4]341.6177,[5]333.2243,[6]330.2414,^C

edit: also with e0d0a0f

edit2: i realize that this is more of a generic mmq issue

JohannesGaessler · 2023-09-16T14:18:59Z

I pushed a fix for small models with n_embd_head % 32 != 0. I fixed it by adding op_params to ggml_cpy. If defined the rows of the source tensor are padded to whole blocks of the type of the destination tensor (with zeros).

cebtenzzre · 2023-09-18T01:21:46Z

Could you resolve the merge conflicts? I wanted to try this out on my fork, but there have been a lot of changes to ggml-cuda.cu on master.

JohannesGaessler · 2023-09-18T07:42:20Z

Already on it, I'll push it later today.

JohannesGaessler · 2023-09-18T13:57:22Z

I pushed the rebased version. From what I can tell it's working correctly. If you do encounter issues, please compare against both master and the branch kv-q8_0-6 which is the version immediately prior to rebasing so I will know whether the issue was caused by the rebase.

JohannesGaessler · 2023-09-18T14:29:49Z

ggml.c

                char * dst_ptr = (char *) dst->data;
+                float src0_padded[ne00_padded];


I don't understand why the CI runs are failing. From what I can tell the issue is that for some reason it tries to allocate an array of size 0 here. This would imply ne00_padded == 0 which in turn would imply ne00 == 0. Is there a situation in which calling this function with that configuration actually makes sense?

MSVC doesn't support VLA, the size of the array must be a compile-time constant.

KerfuffleV2 · 2023-09-19T07:43:27Z

Tested with my weird 3B Orca model and a LLaMA2-70B for token generation and perplexity. Both seem to work fine now on ROCM.

Nexesenex · 2023-09-19T17:59:18Z

CodeLlama 34b (Airoboros, Spicyboros, WizardLM) outputs rubbish, tested both in KoboldCPP chat and with LLama perplexity test (KV_Q_8_0 branch, around 50,000 perplexity at every context length).

Example : Output: _\\\\:\ \Item ( =,,
,
,,,
,,, =OfN of of <\xl(alias T,
,
,
,, = \AST of the Års
, = \point \bib no, = "\prog
,

No problem with CodeLlama 7b & 13b (although I didn't test all quants).

Also, the Q_8_0 branch shows again the VRAM "leak", while the Q_8_0_6 branches doesn't suffer from the problem, and shows a flat VRAM occupation as the context grows (as does show a flat VRAM occupation the current KoboldCPP Experimental build with the usual "KV16" implementation since my previous occurrence of the "VRAM leak" problem was resolved a week ago).

Sorry for my lack of precise technical terms, I'm just a user testing above my pay-grade !

BarfingLemurs · 2023-09-20T10:57:19Z

On the latest build, got the same errors as before on android in termux:
I don't know if the behavior is reproducible across Mac, or an SBC, but I tested with a pixel and another phone with snapdragon SoC.

llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.08 MB
llm_load_tensors: mem required  = 1887.56 MB (+  109.21 MB per state)
..............................................................................................
llama_new_context_with_model: kv self size  =  109.21 MB
llama_new_context_with_model: compute buffer total size =   70.22 MB
GGML_ASSERT: /data/data/com.termux/files/home/yomama/2/llama.cpp/ggml.c:3523: nb % 2 == 0
GGML_ASSERT: /data/data/com.termux/files/home/yomama/2/llama.cpp/ggml.c:3523: nb % 2 == 0
GGML_ASSERT: /data/data/com.termux/files/home/yomama/2/llama.cpp/ggml.c:3523: nb % 2 == 0
GGML_ASSERT: /data/data/com.termux/files/home/yomama/2/llama.cpp/ggml.c:3523: nb % 2 == 0
Aborted
~/.../2/llama.cpp $

tested various models: 3b q4_1, q4_0 7b, q2k 7b, q4_k_m 13b

cebtenzzre · 2023-09-30T05:12:36Z

I have been using q8_0 kv cache on my local fork without issues. But I have to disable it for now because there are merge conflicts after #3301.

Dampfinchen · 2023-10-01T19:25:09Z

Personally, I would definately like to have this merged as a seperate option. I didn't encounter any issues on my end and it does help a lot.

ann-wzhao · 2023-10-03T10:49:02Z

Alright, this PR should now be feature complete. I added a new CLI argument --kv-type that lets users set the type as which the KV cache is stored. By default q8_0 is used since the memory savings are quite substantial: for 4096 context with 7b q4_0 the VRAM usage goes down from 6478 MiB to 5536 MiB (942 MiB reduction). The token generation performance seems to be slightly slower, prompt processing on average a little faster:

model GPU test t/s master t/s PR Speedup
7b q4_0 RTX 3090 pp 512 2187.37 ± 34.18 2303.08 ± 24.19 1.05
7b q4_0 RTX 3090 tg 128 141.11 ± 0.30 140.19 ± 0.31 0.99
7b q4_0 P40 pp 512 875.56 ± 4.72 870.33 ± 2.09 0.99
7b q4_0 P40 tg 128 59.45 ± 0.07 57.23 ± 0.01 0.96
7b q4_0 RX 6800 pp 512 1061.05 ± 6.51 1159.28 ± 7.16 1.09
7b q4_0 RX 6800 tg 128 77.89 ± 0.02 72.95 ± 0.01 0.94
The perplexity changes as follows:

model PPL master PPL PR PPL diff
7b f16 5.7962 5.7989 2.7e-3
7b q4_0 5.9670 5.9658 -1.2e-3
7b q4_1 6.0035 6.0038 3e-4
7b q5_0 5.8292 5.8327 3.5e-3
7b q5_1 5.8538 5.8560 2.2e-3
7b q8_0 5.8014 5.8018 4e-4
7b q2_k 6.4466 6.4514 4.8e-3
7b q3_k_s 6.2950 6.2955 5e-4
7b q3_k_m 6.0264 6.0297 3.3e-3
7b q3_k_l 5.9854 5.9911 5.7e-3
7b q4_k_s 5.8891 5.8934 4.3e-3
7b q4_k_m 5.8798 5.8840 4.2e-3
7b q5_k_s 5.8215 5.8260 4.5e-3
7b q5_k_m 5.8292 5.8332 4e-3
7b q6_k 5.8098 5.8141 4.3e-3
The increase in perplexity is on the same order as the perplexity differences between f16 and q8_0 or q8_0 and q6_K.

Hi, I use this pr to quantize KV cache and I found that when I offload 32 repeated layers to GPU, the answer of the model is almost the same as the fp16 kv cache, but when I load one or some repeated layers to CPU, the answer of this kv int8 model for long prompt is total wrong(for example: always output "!!!!!!!!!!!!!!!!!!!!!!!!!"), so I wonder how to fix this bug? Thanks!

I debug further and found some intermediate tensor's data is nan value if use kv cache q8_0.

 Assertion failed: !isnan(sp[i]), file D:\llm_inference\gguf\llama.cpp-kv-q8_0-8\ggml.c, line 12412

So how should I debug next?

cebtenzzre · 2023-10-09T04:22:40Z

I merged this onto 45855b3, among some other local changes, and hit this assertion while using it via oobabooga's UI:

GGML_ASSERT: /home/cebtenzzre/src/forks/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:5550: i_blck_0 + ne00 < QK8_0

This must be related to this PR, because nothing else in the model I'm testing with should be quantized to 8-bit. I had just edited the prompt, so it rewound the KV cache (eval was called with a smaller n_past than before).

JohannesGaessler · 2023-10-09T08:21:31Z

I'm assuming the issue is caused by the assert in ggml_cpy_f32_q8_0_cuda. I think the following patch for cpy_f32_q8_0 would work:

    float val;
    if (first_incomplete) {
        memcpy(&val, src, sizeof(float));
    } else {
        val = *((float *) src);
    }

    if (first_incomplete && last_incomplete) {
        __syncthreads();
    }

    if (save_unquantized && last_incomplete && i0 / QK8_0 == (i_blck_0 + ne00) / QK8_0) {
        memcpy(&dst[1 + iqs/8].qs[sizeof(float) * (iqs % 8)], src, sizeof(float));
    }

Note that I did not test this code and that I currently do not have the time to work on llama.cpp, especially since there is no indication that this PR would be merged.

Nexesenex · 2023-10-09T10:21:53Z

@ggerganov This PR really deserves to be promised a merge, and his author to be encouraged. KV-Q8_0 is really a boner for LlamaCPP and its users compared to the competition.

FNsi · 2023-10-09T15:04:46Z

If anyone wants to open a long enough but easy to achieve context window, let's say the long llamaLora merge, 16k q8_0 kv could save 10g+ memory usage.

KerfuffleV2 · 2023-10-09T17:19:39Z

This PR really deserves to be promised a merge, and his author to be encouraged.

Note that this is just one random person's opinion: I'd say it's not quite that simple. How using less memory is useful for the KV cache is very easy to see but it's not the only consideration.

One thing to note is the changes are pretty complicated and there have been a lot of issues that have cropped up in testing. It seems like we haven't even reached the end of finding new ones. This isn't an easy thing to get right, even for someone as skilled as JG.

It also isn't just a self-contained set of changes that only affects people that decide to opt into using the Q8_0 quantized KV: the changes touch a lot of stuff that can have an effect whether or not quantized KV is actually being used.

Also another consideration is that supporting quantized KV can limit the other design choices that are possible. Just an example, KV cache fragmentation is currently an active issue which can come up when doing parallel generation. If the KV cache is quantized, it becomes a lot harder (maybe even not possible) to just move the entries around.

Of course, right now it has conflicts and can't be merged. (Also, resolving those conflicts and incorporating other changes can cause new bugs to crop up.)

Anyway, someone maintaining a project can't really think like "I have to merge this pull or the author will be unhappy/discouraged". They have to look at the broader picture, or they'll kill their own project by making bad choices. Personally, I know I couldn't handle it. Of course, it also feels bad to put a lot of work into a pull and have it rejected or not know if will ever be accepted.

cebtenzzre · 2023-10-09T23:49:51Z

I'm assuming the issue is caused by the assert in ggml_cpy_f32_q8_0_cuda. I think the following patch for cpy_f32_q8_0 would work:

Unfortunately, I still hit the same assertion with the patch applied.

JohannesGaessler · 2023-10-10T07:54:00Z

What I meant is that with the patch it should be fine to remove the assert.

FNsi · 2023-10-10T12:55:47Z

decide to opt into using the Q8_0 quantized KV: the changes touch a lot of stuff that can have an effect whether or not quantized KV is actually being used.

Forgive me again but i asked myself why llama.cpp? Why a lot of people use it in the very beginning?

I think it's because,

Because we all are GPU poor people.

Because we all want something freaking great.

😂

cebtenzzre · 2023-10-12T01:43:18Z

What I meant is that with the patch it should be fine to remove the assert.

I have the patch applied and the assertion removed, but now in similar situations the KV cache gets corrupted and the model just starts generating hash marks no matter how many times I regenerate the response or even remove tokens from the prompt. Reloading the model fixes it.

cebtenzzre · 2023-10-12T04:52:08Z

And with or without replacing the assertion with the syncthreads call, using the 'main' example with spicyboros-c34b-2.2.Q5_0.gguf produces garbage output for me on this PR.

cebtenzzre · 2023-12-10T18:45:35Z

obsoleted by #4312

ggerganov reviewed Sep 2, 2023

View reviewed changes

JohannesGaessler mentioned this pull request Sep 5, 2023

llama : understand why GPU results are different for different batch sizes #3014

Closed

JohannesGaessler force-pushed the kv-q8_0 branch from 9331895 to e569a9f Compare September 5, 2023 18:10

cebtenzzre reviewed Sep 5, 2023

View reviewed changes

examples/server/server.cpp Outdated Show resolved Hide resolved

JohannesGaessler force-pushed the kv-q8_0 branch from e569a9f to 35b7f08 Compare September 5, 2023 19:32

JohannesGaessler marked this pull request as ready for review September 5, 2023 19:33

JohannesGaessler force-pushed the kv-q8_0 branch 2 times, most recently from 554d4bc to 82b34d8 Compare September 5, 2023 20:18

KV cache quantized to q8_0

c620f4d

JohannesGaessler force-pushed the kv-q8_0 branch from 980604d to c620f4d Compare September 18, 2023 13:54

JohannesGaessler commented Sep 18, 2023

View reviewed changes

Olexorus mentioned this pull request Oct 2, 2023

llama : store non-RoPEd K cache #3234

Draft

Green-Sky mentioned this pull request Dec 3, 2023

llama : support quantum K cache #4312

Merged

12 tasks

cebtenzzre closed this Dec 10, 2023

		char * dst_ptr = (char *) dst->data;
		float src0_padded[ne00_padded];

KV cache quantized to q8_0 #2969

KV cache quantized to q8_0 #2969

Conversation

JohannesGaessler commented Sep 2, 2023

JohannesGaessler commented Sep 2, 2023

ggerganov left a comment • edited Loading

Choose a reason for hiding this comment

JohannesGaessler commented Sep 2, 2023

KerfuffleV2 commented Sep 2, 2023 • edited Loading

JohannesGaessler commented Sep 2, 2023

kiratp commented Sep 2, 2023 • edited Loading

JohannesGaessler commented Sep 4, 2023

cebtenzzre commented Sep 5, 2023 • edited Loading

JohannesGaessler commented Sep 5, 2023 • edited Loading

cebtenzzre commented Sep 5, 2023 • edited Loading

JohannesGaessler commented Sep 5, 2023

KerfuffleV2 commented Sep 5, 2023

JohannesGaessler commented Sep 5, 2023

KerfuffleV2 commented Sep 5, 2023

cebtenzzre commented Sep 5, 2023 • edited Loading

JohannesGaessler commented Sep 5, 2023

KerfuffleV2 commented Sep 5, 2023

JohannesGaessler commented Sep 5, 2023

KerfuffleV2 commented Sep 5, 2023

JohannesGaessler commented Sep 5, 2023

KerfuffleV2 commented Sep 5, 2023

JohannesGaessler commented Sep 5, 2023

KerfuffleV2 commented Sep 5, 2023 • edited Loading

JohannesGaessler commented Sep 13, 2023

Dampfinchen commented Sep 14, 2023 • edited Loading

Green-Sky commented Sep 16, 2023 • edited Loading

JohannesGaessler commented Sep 16, 2023

cebtenzzre commented Sep 18, 2023

JohannesGaessler commented Sep 18, 2023

JohannesGaessler commented Sep 18, 2023

JohannesGaessler Sep 18, 2023

Choose a reason for hiding this comment

slaren Sep 18, 2023

Choose a reason for hiding this comment

KerfuffleV2 commented Sep 19, 2023

Nexesenex commented Sep 19, 2023 • edited Loading

BarfingLemurs commented Sep 20, 2023 • edited Loading

cebtenzzre commented Sep 30, 2023

Dampfinchen commented Oct 1, 2023 • edited Loading

ann-wzhao commented Oct 3, 2023 • edited Loading

cebtenzzre commented Oct 9, 2023

JohannesGaessler commented Oct 9, 2023

Nexesenex commented Oct 9, 2023 • edited Loading

FNsi commented Oct 9, 2023

KerfuffleV2 commented Oct 9, 2023

cebtenzzre commented Oct 9, 2023

JohannesGaessler commented Oct 10, 2023

FNsi commented Oct 10, 2023

cebtenzzre commented Oct 12, 2023 • edited Loading

cebtenzzre commented Oct 12, 2023 • edited Loading

cebtenzzre commented Dec 10, 2023

ggerganov left a comment •

edited

Loading

KerfuffleV2 commented Sep 2, 2023 •

edited

Loading

kiratp commented Sep 2, 2023 •

edited

Loading

cebtenzzre commented Sep 5, 2023 •

edited

Loading

JohannesGaessler commented Sep 5, 2023 •

edited

Loading

cebtenzzre commented Sep 5, 2023 •

edited

Loading

cebtenzzre commented Sep 5, 2023 •

edited

Loading

KerfuffleV2 commented Sep 5, 2023 •

edited

Loading

Dampfinchen commented Sep 14, 2023 •

edited

Loading

Green-Sky commented Sep 16, 2023 •

edited

Loading

Nexesenex commented Sep 19, 2023 •

edited

Loading

BarfingLemurs commented Sep 20, 2023 •

edited

Loading

Dampfinchen commented Oct 1, 2023 •

edited

Loading

ann-wzhao commented Oct 3, 2023 •

edited

Loading

Nexesenex commented Oct 9, 2023 •

edited

Loading

cebtenzzre commented Oct 12, 2023 •

edited

Loading

cebtenzzre commented Oct 12, 2023 •

edited

Loading