Add Q3_K_XS #5060

ikawrakow · 2024-01-21T07:58:20Z

TL;DR See #5055

Before the recent two-bit quantization and importance matrix related changes, there were two low-bit quantization types available in llama.cpp: Q2_K and Q3_K_S. Q2_K was basically a 3-bit quantization with just the attn_k and attn_q tensors quantized with 2 bit. The table shows their model sizes and perplexities (wiki.test.raw, n_ctx = 512) for LLaMA-v2-70B:

Quantization	Model size (GiB)	Perplexity
Old Q2_K	27.11	3.8164
Old Q3_K_S	27.70	3.7800

After the recent changes, Q2_K has become an actual 2-bit quantization (less than 3 bits-per-weight), has a LLaMA-v-70B model size of 23.71 GiB, and a perplexity of 4.0039 (using an importance matrix derived from wiki.train.raw). Q3_K_S has increased very slightly to 27.86 GiB, but has a better perplexity of 3.6603. Based on #5005 there is a need to have an intermediate step in terms of model size between the new Q2_K and Q3_K_S. This PR adds such a quantization type as Q3_K_XS. The following table summarizes the new situation for LLaMA-v2-70B

Quantization	Model size (GiB)	Perplexity
Q2_K	23.71	4.0039
Q3_K_XS	26.31	3.7808
Q3_K_S	27.86	3.6603

The table on a graph:

Together with an importance matrix, this brings perplexity for LLaMA-v2-70B below the perplexity of the former Q2_K with a 800 MB smaller quantized model size.

Nexesenex · 2024-01-21T19:37:21Z

Just a reminder of the table obtained after some optimizations you made on Q2_K and Q3_K_S in late August 2023. #2807

That Q2_K was the one I spoke about to rename in Q3_K_XS, because it already exists and is proofed for a long time, its perplexity bump (<1%) was more than twice inferior to its size shrinking (>2%), and there's a gain of 1k context at stake in KV f16 just with that change.

But it'd ofc be great to have an intermediate quant below, the Q3_K_XS that you PRed, and which looks like a Q3_K_XXS to me!

ikawrakow · 2024-01-22T07:39:50Z

Just a reminder of the table obtained after some optimizations you made on Q2_K and Q3_K_S in late August 2023. #2807

I was taking the values from my notes, and I guess I forgot to update the notes when I made PR #2807. So, what we see in the above tables/graph is what we had before PR #2807. Here is an updated graph with the values post #2807 (i.e., current master)

Artefact2 · 2024-01-24T12:34:00Z

Q3_K_XS seems to give broken results for mixtral-type models. Generation just ends immediately or prints a few symbols then stops.

I've tested and hit the bug with Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss, TenyxChat-8x7B and bagel-dpo-8x7b-v0.2.

ikawrakow · 2024-01-24T15:24:37Z

Q3_K_XS seems to give broken results for mixtral-type models. Generation just ends immediately or prints a few symbols then stops.

I've tested and hit the bug with Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss, TenyxChat-8x7B and bagel-dpo-8x7b-v0.2.

Thank you for noticing. It should be fixed via PR #5113

Artefact2 · 2024-01-24T15:40:10Z

I've tested the patch, it works. Thanks!

* Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S * Q3_K_XS: quanize first 1/8 of ffn_down layers with Q4_K Together with an importance matrix, this brings perplexity for LLaMA-v2-70B below the perplexity of the former Q2_K with a 800 MB smaller quantized model size. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Kawrakow added 2 commits January 21, 2024 08:48

Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S

ec4b801

Q3_K_XS: quanize first 1/8 of ffn_down layers with Q4_K

29c41d4

Together with an importance matrix, this brings perplexity for LLaMA-v2-70B below the perplexity of the former Q2_K with a 800 MB smaller quantized model size.

ggerganov approved these changes Jan 22, 2024

View reviewed changes

ggerganov merged commit 66d575c into master Jan 22, 2024
41 of 47 checks passed

PallHaraldsson mentioned this pull request Jan 22, 2024

Add QuickGELU (lookup-table based) FluxML/NNlib.jl#561

Open

ikawrakow deleted the ik/q3_k_xs branch January 24, 2024 15:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Q3_K_XS #5060

Add Q3_K_XS #5060

ikawrakow commented Jan 21, 2024

Nexesenex commented Jan 21, 2024 •

edited

Loading

ikawrakow commented Jan 22, 2024

Artefact2 commented Jan 24, 2024

ikawrakow commented Jan 24, 2024

Artefact2 commented Jan 24, 2024

Add Q3_K_XS #5060

Add Q3_K_XS #5060

Conversation

ikawrakow commented Jan 21, 2024

Nexesenex commented Jan 21, 2024 • edited Loading

ikawrakow commented Jan 22, 2024

Artefact2 commented Jan 24, 2024

ikawrakow commented Jan 24, 2024

Artefact2 commented Jan 24, 2024

Nexesenex commented Jan 21, 2024 •

edited

Loading