Importance matrix support for legacy quants #4969

ikawrakow · 2024-01-16T09:11:41Z

TL;DR See title and PR #4861, #4930 for more details.

Opinions on adding importance matrix support for legacy quants were divided (see #4932), but given @ggerganov's comment there I decided to go ahead and prepare this PR.

I observe quite significant improvement in perplexity for all models I have tested. In addition, Q4_1 and Q5_1 no longer have the erratic behavior of having a higher perplexity than Q4_0/Q5_0 for some models despite using more bits.

The following tables give a few representative perplexity examples. The QError columns are defined as PPL(Q)/PPL(fp16)-1. Perplexity is for a context of 512 tokens.

Q4_0

Model	PPL(Master)	PPL (PR)	QError (Master)	QError (PR)	QError ratio PR/Master
LLaMA-v1-7B	6.1162	6.0276	3.55%	2.05%	0.577
LLaMA-v2-7B	5.9635	5.9107	2.86%	1.95%	0.682
Mistral-7B	5.8189	5.7993	2.22%	1.88%	0.847
LLaMA-v1-13B	5.3639	5.3104	2.07%	1.05%	0.507
LLaMA-v2-13B	5.1994	5.1875	1.95%	1.71%	0.877

Q4_1

Model	PPL(Master)	PPL (PR)	QError (Master)	QError (PR)	QError ratio PR/Master
LLaMA-v1-7B	6.0653	5.9725	2.69%	1.12%	0.416
LLaMA-v2-7B	6.0008	5.8605	3.50%	1.08%	0.309
Mistral-7B	5.8244	5.7458	2.32%	0.94%	0.405
LLaMA-v1-13B	5.3416	5.2997	1.65%	0.85%	0.515
LLaMA-v2-13B	5.2151	5.1635	2.25%	1.24%	0.551

Q5_0

Model	PPL(Master)	PPL (PR)	QError (Master)	QError (PR)	QError ratio PR/Master
LLaMA-v1-7B	5.9803	5.9298	1.25%	0.39%	0.312
LLaMA-v2-7B	5.8282	5.8138	0.53%	0.28%	0.528
Mistral-7B	5.7180	5.7113	0.45%	0.33%	0.733
LLaMA-v1-13B	5.2844	5.2648	0.56%	0.18%	0.321
LLaMA-v2-13B	5.1412	5.1368	0.81%	0.72%	0.889

Q5_1

Model	PPL(Master)	PPL (PR)	QError (Master)	QError (PR)	QError ratio PR/Master
LLaMA-v1-7B	5.9418	5.9116	0.60%	0.08%	0.134
LLaMA-v2-7B	5.8468	5.8104	0.85%	0.22%	0.259
Mistral-7B	5.7128	5.7057	0.36%	0.23%	0.639
LLaMA-v1-13B	5.2682	5.2634	0.25%	0.16%	0.640
LLaMA-v2-13B	5.1448	5.1340	0.88%	0.66%	0.750

kalomaze · 2024-01-16T11:48:16Z

I notice with the Importance Matrix calculations that, if you go past the original model's supported context length (32k tokens is when it starts, your recommendation is to use 50k tokens), the PPL of the batches collected seems to start rising. This is odd because I specify a short context length which matches my batch size exactly [-c 2048 -b 2048], so each chunk should be roughly the same, and there should be no slow regression of average PPL over time as such. Do you know why this happens?

ikawrakow · 2024-01-16T12:04:45Z

I notice with the Importance Matrix calculations that, if you go past the original model's supported context length (32k tokens is when it starts, your recommendation is to use 50k tokens), the PPL of the batches collected seems to start rising. This is odd because I specify a short context length which matches my batch size exactly [-c 2048 -b 2048], so each chunk should be roughly the same, and there should be no slow regression of average PPL over time as such. Do you know why this happens?

Don't really understand the question (or better, don't understand what is being done and what is the observation)

kalomaze · 2024-01-16T12:28:46Z

I notice with the Importance Matrix calculations that, if you go past the original model's supported context length (32k tokens is when it starts, your recommendation is to use 50k tokens), the PPL of the batches collected seems to start rising. This is odd because I specify a short context length which matches my batch size exactly [-c 2048 -b 2048], so each chunk should be roughly the same, and there should be no slow regression of average PPL over time as such. Do you know why this happens?

Don't really understand the question (or better, don't understand what is being done and what is the observation)

I am trying to make an Importance Matrix from a dataset, using a Mistral 7b model.
The calibration dataset is larger than the supported context size of the model; Mistral 7b models usually support up to 8192 context.
It seems it is inferencing the full dataset within the same context window:

4th chunk is 8192 tokens point - has the lowest ppl - afterwards, it rises again.

This worries me, because in my experience, whenever you inference a model past the native context length, incoherence quickly takes over, and as such, it stops being a meaningful measure of the activations.

ikawrakow · 2024-01-16T12:39:06Z

It seems it is inferencing the full dataset within the same context window:

No. It splits into chunks of n_ctx. n_ctx is 512 by default, and can be overwritten with the -c argument.

You gain nothing by calculating the importance matrix with a large context. My experience is that a context of 512 works best, at least according to perplexity. I.e., if I prepare an importance matrix with a context of 512 and then use it to quantize and run perplexity for a context of 8192, the PPL is slightly lower compared to using a context of 8192 for the importance matrix and running perplexity for a context of 8192.

kalomaze · 2024-01-16T12:43:23Z

It seems it is inferencing the full dataset within the same context window:

No. It splits into chunks of n_ctx. n_ctx is 512 by default, and can be overwritten with the -c argument.

You gain nothing by calculating the importance matrix with a large context. My experience is that a context of 512 works best, at least according to perplexity. I.e., if I prepare an importance matrix with a context of 512 and then use it to quantize and run perplexity for a context of 8192, the PPL is slightly lower compared to using a context of 8192 for the importance matrix and running perplexity for a context of 8192.

No. It splits into chunks of n_ctx. n_ctx is 512 by default, and can be overwritten with the -c argument.

Then what is the explanation for why there seems to be an observable trend of ppl declining as if the context size was the size of the dataset for all the batches? I saw this for a 32k context model as well after the ~32k mark

ikawrakow · 2024-01-16T12:49:36Z

Then what is the explanation for why there seems to be an observable trend of ppl declining as if the context size was the size of the dataset for all the batches? I saw this for a 32k context model as well after the ~32k mark

Perplexity goes up and down, no? It depends on the text being processed. Some part of the test set is predicted better and the perplexity goes down. Some other part is predicted worse, perplexity goes up. That's why we run all ~330k tokens from wiki.test.raw. A few thousand tokens can never give you a good estimate.

kalomaze · 2024-01-16T14:02:01Z

Then what is the explanation for why there seems to be an observable trend of ppl declining as if the context size was the size of the dataset for all the batches? I saw this for a 32k context model as well after the ~32k mark

Perplexity goes up and down, no? It depends on the text being processed. Some part of the test set is predicted better and the perplexity goes down. Some other part is predicted worse, perplexity goes up. That's why we run all ~330k tokens from wiki.test.raw. A few thousand tokens can never give you a good estimate.

Hmm, I think it was probably just a coincidence then, that it happened to look like the average kept going down consistently on both models at around the context length point.
EDIT: I misinterpreted it. It's the average perplexity and not the ppl of the current batch.

JianbangZ · 2024-01-16T15:11:51Z

@ikawrakow did you recently test the hellaswag scores? I have been getting very low scores (even for fp16 models).
More details here #4980

sorasoras · 2024-01-16T18:59:54Z

FP16=Final estimate: PPL = 548.0413 +/- 11.80300

old without legacy quants.
.\sakura0.9_Q2_K_imx.gguf -f .\calib_jp_to_zh.raw PPL = 613.2313 +/- 13.09685
.\sakura0.9_Q3_K_M_imx.gguf -f .\calib_jp_to_zh.raw PPL = 607.2495 +/- 13.37079
.\sakura0.9_Q4_K_M_imx.gguf -f .\calib_jp_to_zh.raw PPL = 608.0662 +/- 13.42431

New with legacy quants
.\sakura0.9_Q2_K_imx.gguf -f .\calib_jp_to_zh.raw PPL = 590.4338 +/- 12.69024
.\sakura0.9_Q3_K_M_imx.gguf -f .\calib_jp_to_zh.raw PPL = 599.0831 +/- 13.13132
.\sakura0.9_Q3_K_L_imx.gguf -f .\calib_jp_to_zh.raw PPL = 553.8032 +/- 11.90981
.\sakura0.9_Q4_K_M_imx.gguf -f .\calib_jp_to_zh.raw PPL = 589.7105 +/- 12.96020

This is unusal so this is just for fun.,but legacy imatrix does help a lot for those fallback quants.

* imatrix: adding support for legacy quants * imatrix: guard Q4_0/Q5_0 against ffn_down craziness --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Kawrakow added 2 commits January 16, 2024 08:37

imatrix: adding support for legacy quants

6f9ec42

imatrix: guard Q4_0/Q5_0 against ffn_down craziness

bb9abb5

ggerganov added the high priority Very important issue label Jan 16, 2024

ikawrakow mentioned this pull request Jan 16, 2024

llama : ggml-backend integration #4766

Merged

7 tasks

ggerganov approved these changes Jan 16, 2024

View reviewed changes

ggerganov merged commit 334a835 into master Jan 16, 2024
46 checks passed

ikawrakow mentioned this pull request Jan 17, 2024

Add importance matrix support for legacy quants? #4932

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Importance matrix support for legacy quants #4969

Importance matrix support for legacy quants #4969

ikawrakow commented Jan 16, 2024

kalomaze commented Jan 16, 2024 •

edited

Loading

ikawrakow commented Jan 16, 2024

kalomaze commented Jan 16, 2024 •

edited

Loading

ikawrakow commented Jan 16, 2024

kalomaze commented Jan 16, 2024 •

edited

Loading

ikawrakow commented Jan 16, 2024 •

edited

Loading

kalomaze commented Jan 16, 2024 •

edited

Loading

JianbangZ commented Jan 16, 2024

sorasoras commented Jan 16, 2024

Importance matrix support for legacy quants #4969

Importance matrix support for legacy quants #4969

Conversation

ikawrakow commented Jan 16, 2024

Q4_0

Q4_1

Q5_0

Q5_1

kalomaze commented Jan 16, 2024 • edited Loading

ikawrakow commented Jan 16, 2024

kalomaze commented Jan 16, 2024 • edited Loading

ikawrakow commented Jan 16, 2024

kalomaze commented Jan 16, 2024 • edited Loading

ikawrakow commented Jan 16, 2024 • edited Loading

kalomaze commented Jan 16, 2024 • edited Loading

JianbangZ commented Jan 16, 2024

sorasoras commented Jan 16, 2024

kalomaze commented Jan 16, 2024 •

edited

Loading

kalomaze commented Jan 16, 2024 •

edited

Loading

kalomaze commented Jan 16, 2024 •

edited

Loading

ikawrakow commented Jan 16, 2024 •

edited

Loading

kalomaze commented Jan 16, 2024 •

edited

Loading