
Add reference implementation for q4k and q5k #586

Merged
merged 7 commits into huggingface:main on Aug 26, 2023

Conversation

LLukas22
Contributor

With this, k-quant models should be able to run. I tried some q4_K_M models, but the results were gibberish. I then extended the unit tests to check against the results produced by the ggml matmul unit tests, and everything seems fine, although the current implementation is off by one at the sixth or seventh decimal place, which probably shouldn't influence the model results.
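
To illustrate the kind of check involved, here is a minimal, self-contained sketch of a tolerance-based comparison (the helper name, the sample values, and the 1e-5 threshold are made up for illustration; this is not the actual test code in the PR):

```rust
// Illustrative only: compares two result slices element-wise within a tolerance,
// mirroring the "off by one at the sixth or seventh decimal place" observation.
fn assert_close(reference: &[f32], candidate: &[f32], tol: f32) {
    assert_eq!(reference.len(), candidate.len(), "length mismatch");
    for (i, (r, c)) in reference.iter().zip(candidate.iter()).enumerate() {
        let diff = (r - c).abs();
        assert!(
            diff <= tol,
            "element {i}: reference {r} vs candidate {c} differs by {diff} (> {tol})"
        );
    }
}

fn main() {
    // Values that agree up to roughly the sixth decimal place pass with a 1e-5 tolerance.
    let reference = [0.123456_f32, -0.654321, 1.0];
    let candidate = [0.123457_f32, -0.654322, 1.0];
    assert_close(&reference, &candidate, 1e-5);
    println!("results match within tolerance");
}
```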

@LaurentMazare
Collaborator

Thanks, it would probably be good to understand why the text generation does not work.
Which model file did you try out? If you don't have the time to try to line it up, I can probably take a stab at it at some point.

@LLukas22
Contributor Author

I believe my vec-dot implementations are funky, but I don't know why, as they pass ggml's unit tests and the random-tensor mat-muls seem reasonable. The problem I'm facing is that the llama 2 k-quant models contain multiple different quant types per file, meaning I can't test them in isolation.

Here are some results from TheBloke's llama-2 models:

| model | working | contains quants |
| --- | --- | --- |
| llama-2-7b.ggmlv3.q2_K.bin | X | Q2K, Q4K, Q6K |
| llama-2-7b.ggmlv3.q3_K_M.bin | X | Q3K, Q4K, Q6K |
| llama-2-7b.ggmlv3.q4_K_M.bin | X | Q4K, Q6K |
| llama-2-7b.ggmlv3.q5_K_M.bin | X | Q5K, Q6K |
| llama-2-7b.ggmlv3.q6_K.bin | ✔️ | Q6K |

From these tests I would conclude that I probably got something wrong in the q4k and q5k implementations, but I don't know what 🤔

@LaurentMazare
Collaborator

LaurentMazare commented Aug 25, 2023

I looked a bit at this and here is a patch that seems to fix things for llama-2-7b.ggmlv3.q4_K_M.bin.
Overall I would be in favor of never using transmute or transmute_copy, and instead using a combination of LittleEndian::{read_u32,read_u32_into,write_u32,write_u32_into}. All of these are safe ops and much easier to reason about.

Let me know if you have the time to do this on your patch, or if you're already too deep into Armored Core; if the latter, I'll take a stab at fixing these and merge.

edit: I've merged #607, which converts most of the existing transmutes to byteorder versions, so it might give some good examples. I haven't done anything on the ones this PR introduces.

diff --git a/candle-core/src/quantized/k_quants.rs b/candle-core/src/quantized/k_quants.rs
index c5ce97f..3852b59 100644
--- a/candle-core/src/quantized/k_quants.rs
+++ b/candle-core/src/quantized/k_quants.rs
@@ -4,6 +4,7 @@ use super::utils::{
 };
 use super::GgmlDType;
 use crate::Result;
+use byteorder::{ByteOrder, LittleEndian};
 use half::f16;
 use rayon::prelude::*;
 
@@ -926,11 +927,7 @@ impl GgmlType for BlockQ4K {
                 q4 = &q4[32..];
             }
 
-            let utmp_raw = unsafe {
-                std::mem::transmute::<&mut [u8; 12], &mut [u32; 3]>(&mut x.scales.clone())
-            };
-
-            utmp[0..3].copy_from_slice(utmp_raw);
+            LittleEndian::read_u32_into(&x.scales, &mut utmp[0..3]);
 
             utmp[3] = ((utmp[2] >> 4) & KMASK2) | (((utmp[1] >> 6) & KMASK3) << 4);
             let uaux = utmp[1] & KMASK1;
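
For reference, here is a minimal, standalone sketch of the byteorder pattern the patch relies on, outside of the BlockQ4K context (the 12-byte buffer and its values are made up; it only needs the byteorder crate that the patch already imports):

```rust
use byteorder::{ByteOrder, LittleEndian};

fn main() {
    // A 12-byte buffer standing in for the BlockQ4K `scales` field
    // (the values here are arbitrary).
    let scales: [u8; 12] = [1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0];
    let mut utmp = [0u32; 4];

    // Safe little-endian read: fills utmp[0..3] from the 12 bytes, replacing
    // the previous `transmute::<&mut [u8; 12], &mut [u32; 3]>`.
    LittleEndian::read_u32_into(&scales, &mut utmp[0..3]);
    assert_eq!(&utmp[0..3], &[1u32, 2, 3]);

    // The write direction is just as direct.
    let mut bytes = [0u8; 12];
    LittleEndian::write_u32_into(&utmp[0..3], &mut bytes);
    assert_eq!(bytes, scales);
}
```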

@LLukas22
Contributor Author

> I looked a bit at this and here is a patch that seems to fix things for llama-2-7b.ggmlv3.q4_K_M.bin.
> Overall I would be in favor of never using transmute or transmute_copy, and instead using a combination of LittleEndian::{read_u32,read_u32_into,write_u32,write_u32_into}. All of these are safe ops and much easier to reason about.

This change seems quite sensible. I initially used transmute mainly due to my limited familiarity with Rust 😅. Feel free to commit the patch directly to this PR since you should have write access. Alternatively, I can replace the transmutes on my end and test it with my collection of GGML models tomorrow. Just let me know what works best for you.

@LLukas22
Contributor Author

Alright, I replaced the transmutes and now everything seems to work just fine. 👍
I tested it again against my collection of models and the output looks reasonable, but I haven't compared it against llama.cpp's output.

@LaurentMazare
Collaborator

Amazing, I'll have a quick look at the PR later today and hopefully merge this. I'll also try it against llama.cpp just to check that everything is more or less in sync (from my experience so far with q4_0/q8_0, it's likely to be a bug on our side when it's not the case :) ).

@LaurentMazare merged commit c72eb3d into huggingface:main on Aug 26, 2023
10 of 12 checks passed
@LaurentMazare
Collaborator

Merged, thanks a lot for all the hard work!

@LaurentMazare
Collaborator

Also just merged #609, which gets rid of the remaining transmutes and of using intermediary values, which you may find helpful. My feeling after this is that the main drawback of transmute is that it does both aliasing (possibly mutable) and typecasting, whereas in this case we mostly care about the casting bit. Anyway, great to have all these quantizations available now!
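
As a rough, standalone illustration of that distinction (not code from any of these PRs; the values are made up):

```rust
use byteorder::{ByteOrder, LittleEndian};

fn main() {
    let value: u32 = 0x1234_5678;

    // transmute does two things at once: it aliases the existing memory and
    // re-types it, and the resulting byte view depends on the machine's
    // endianness. This is the pattern the recent PRs removed.
    let aliased: &[u8; 4] = unsafe { std::mem::transmute::<&u32, &[u8; 4]>(&value) };

    // byteorder only does the "casting" part: it copies the bytes out with an
    // explicit endianness, with no aliasing and no unsafe code.
    let mut copied = [0u8; 4];
    LittleEndian::write_u32(&mut copied, value);

    // On a little-endian machine the two agree; only the byteorder version is
    // portable and safe by construction.
    println!("aliased: {aliased:?}, copied: {copied:?}");
    assert_eq!(copied, [0x78, 0x56, 0x34, 0x12]);
}
```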
