
Importance matrix support for legacy quants #4969

Merged: 2 commits into master on Jan 16, 2024

Conversation

ikawrakow
Contributor

TL;DR: See the title, and PRs #4861 and #4930 for more details.

Opinions on adding importance matrix support for legacy quants were divided (see #4932), but given @ggerganov's comment there I decided to go ahead and prepare this PR.

I observe a quite significant improvement in perplexity for all models I have tested. In addition, Q4_1 and Q5_1 no longer show the erratic behavior of having a higher perplexity than Q4_0/Q5_0 for some models despite using more bits.

The following tables give a few representative perplexity examples. The QError columns are defined as PPL(Q)/PPL(fp16) - 1, i.e. the relative increase in perplexity caused by quantization. Perplexity is computed for a context of 512 tokens.

Q4_0

| Model | PPL (Master) | PPL (PR) | QError (Master) | QError (PR) | QError ratio PR/Master |
|---|---|---|---|---|---|
| LLaMA-v1-7B | 6.1162 | 6.0276 | 3.55% | 2.05% | 0.577 |
| LLaMA-v2-7B | 5.9635 | 5.9107 | 2.86% | 1.95% | 0.682 |
| Mistral-7B | 5.8189 | 5.7993 | 2.22% | 1.88% | 0.847 |
| LLaMA-v1-13B | 5.3639 | 5.3104 | 2.07% | 1.05% | 0.507 |
| LLaMA-v2-13B | 5.1994 | 5.1875 | 1.95% | 1.71% | 0.877 |

Q4_1

| Model | PPL (Master) | PPL (PR) | QError (Master) | QError (PR) | QError ratio PR/Master |
|---|---|---|---|---|---|
| LLaMA-v1-7B | 6.0653 | 5.9725 | 2.69% | 1.12% | 0.416 |
| LLaMA-v2-7B | 6.0008 | 5.8605 | 3.50% | 1.08% | 0.309 |
| Mistral-7B | 5.8244 | 5.7458 | 2.32% | 0.94% | 0.405 |
| LLaMA-v1-13B | 5.3416 | 5.2997 | 1.65% | 0.85% | 0.515 |
| LLaMA-v2-13B | 5.2151 | 5.1635 | 2.25% | 1.24% | 0.551 |

Q5_0

| Model | PPL (Master) | PPL (PR) | QError (Master) | QError (PR) | QError ratio PR/Master |
|---|---|---|---|---|---|
| LLaMA-v1-7B | 5.9803 | 5.9298 | 1.25% | 0.39% | 0.312 |
| LLaMA-v2-7B | 5.8282 | 5.8138 | 0.53% | 0.28% | 0.528 |
| Mistral-7B | 5.7180 | 5.7113 | 0.45% | 0.33% | 0.733 |
| LLaMA-v1-13B | 5.2844 | 5.2648 | 0.56% | 0.18% | 0.321 |
| LLaMA-v2-13B | 5.1412 | 5.1368 | 0.81% | 0.72% | 0.889 |

Q5_1

| Model | PPL (Master) | PPL (PR) | QError (Master) | QError (PR) | QError ratio PR/Master |
|---|---|---|---|---|---|
| LLaMA-v1-7B | 5.9418 | 5.9116 | 0.60% | 0.08% | 0.134 |
| LLaMA-v2-7B | 5.8468 | 5.8104 | 0.85% | 0.22% | 0.259 |
| Mistral-7B | 5.7128 | 5.7057 | 0.36% | 0.23% | 0.639 |
| LLaMA-v1-13B | 5.2682 | 5.2634 | 0.25% | 0.16% | 0.640 |
| LLaMA-v2-13B | 5.1448 | 5.1340 | 0.88% | 0.66% | 0.750 |
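
For reference, QError in the tables above is simply the relative increase of the quantized model's perplexity over the fp16 baseline. A minimal sketch of the computation (the numeric values are placeholders, not taken from the tables):

```cpp
#include <cstdio>

// QError as used in the tables above: PPL(Q)/PPL(fp16) - 1, i.e. the relative
// perplexity increase caused by quantization.
static double q_error(double ppl_q, double ppl_fp16) {
    return ppl_q / ppl_fp16 - 1.0;
}

int main() {
    // Placeholder values for illustration only.
    const double ppl_fp16 = 5.9000;
    const double ppl_q    = 6.0200;
    printf("QError = %.2f%%\n", 100.0 * q_error(ppl_q, ppl_fp16));
    return 0;
}
```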

@kalomaze
Contributor

kalomaze commented Jan 16, 2024

I notice with the importance matrix calculations that, if you go past the original model's supported context length (the rise starts at around 32k tokens; your recommendation is to use 50k tokens), the PPL of the collected batches seems to start rising. This is odd because I specify a short context length that exactly matches my batch size [-c 2048 -b 2048], so each chunk should be roughly the same, and there should be no slow regression of the average PPL over time. Do you know why this happens?

@ikawrakow
Contributor Author

> I notice with the importance matrix calculations that, if you go past the original model's supported context length (the rise starts at around 32k tokens; your recommendation is to use 50k tokens), the PPL of the collected batches seems to start rising. This is odd because I specify a short context length that exactly matches my batch size [-c 2048 -b 2048], so each chunk should be roughly the same, and there should be no slow regression of the average PPL over time. Do you know why this happens?

I don't really understand the question (or rather, I don't understand what is being done and what the observation is).

@kalomaze
Contributor

kalomaze commented Jan 16, 2024

> I don't really understand the question (or rather, I don't understand what is being done and what the observation is).

- I am trying to make an importance matrix from a dataset, using a Mistral 7B model.
- The calibration dataset is larger than the supported context size of the model; Mistral 7B models usually support up to 8192 context.
- It seems it is inferencing the full dataset within the same context window:

[screenshot: per-chunk perplexity output from the imatrix run]

The 4th chunk is at the 8192-token point and has the lowest PPL; afterwards, it rises again.

This worries me, because in my experience, whenever you inference a model past the native context length, incoherence quickly takes over, and as such, it stops being a meaningful measure of the activations.

@ikawrakow
Contributor Author

> It seems it is inferencing the full dataset within the same context window:

No. It splits into chunks of n_ctx. n_ctx is 512 by default, and can be overwritten with the -c argument.

You gain nothing by calculating the importance matrix with a large context. My experience is that a context of 512 works best, at least according to perplexity. I.e., if I prepare an importance matrix with a context of 512 and then use it to quantize and run perplexity for a context of 8192, the PPL is slightly lower compared to using a context of 8192 for the importance matrix and running perplexity for a context of 8192.
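
To illustrate the chunking described above, here is a minimal sketch (not the actual imatrix code; the `llama_token` alias and the helper name are assumptions for illustration):

```cpp
#include <cstdint>
#include <vector>

using llama_token = int32_t; // assumption: stands in for llama.cpp's token type

// Split the tokenized calibration data into independent chunks of n_ctx tokens.
// Each chunk is evaluated from a fresh context, so no chunk ever attends to
// more than n_ctx tokens, no matter how large the calibration file is.
static std::vector<std::vector<llama_token>> split_into_chunks(
        const std::vector<llama_token> & tokens, size_t n_ctx) {
    std::vector<std::vector<llama_token>> chunks;
    for (size_t i = 0; i + n_ctx <= tokens.size(); i += n_ctx) {
        chunks.emplace_back(tokens.begin() + i, tokens.begin() + i + n_ctx);
    }
    return chunks;
}
```

Under this scheme, running with `-c 2048 -b 2048` means each chunk is an independent 2048-token window, so the model never runs past its native context length regardless of the size of the calibration file.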

@kalomaze
Contributor

kalomaze commented Jan 16, 2024

> No. It splits into chunks of n_ctx. n_ctx is 512 by default, and can be overwritten with the -c argument.

Then what explains the apparent trend of the PPL declining as if the context size were the size of the whole dataset, across all the batches? I saw this with a 32k-context model as well, after the ~32k mark.

@ikawrakow
Contributor Author

ikawrakow commented Jan 16, 2024

> Then what explains the apparent trend of the PPL declining as if the context size were the size of the whole dataset, across all the batches? I saw this with a 32k-context model as well, after the ~32k mark.

Perplexity goes up and down, no? It depends on the text being processed. Some part of the test set is predicted better and the perplexity goes down. Some other part is predicted worse, perplexity goes up. That's why we run all ~330k tokens from wiki.test.raw. A few thousand tokens can never give you a good estimate.

@ggerganov added the high priority (Very important issue) label on Jan 16, 2024
@kalomaze
Contributor

kalomaze commented Jan 16, 2024

> Perplexity goes up and down, no? It depends on the text being processed. Some part of the test set is predicted better and the perplexity goes down. Some other part is predicted worse, perplexity goes up. That's why we run all ~330k tokens from wiki.test.raw. A few thousand tokens can never give you a good estimate.

Hmm, I think it was probably just a coincidence then that it happened to look like the average kept going down consistently on both models around the context-length point.
EDIT: I misinterpreted the output. It's the average perplexity over all chunks so far, not the PPL of the current batch.
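
For reference, the per-chunk numbers printed during a perplexity or imatrix run are a cumulative estimate over all chunks processed so far, not the PPL of the current chunk alone. A minimal sketch of that running estimate (hypothetical per-chunk values, and assuming every chunk contains the same number of tokens):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical mean negative log-likelihood of each chunk, for illustration only.
    const std::vector<double> chunk_nll = {1.65, 1.58, 1.71, 1.62};

    double sum_nll = 0.0;
    for (size_t i = 0; i < chunk_nll.size(); ++i) {
        sum_nll += chunk_nll[i];
        // The printed value is exp(mean NLL over *all* chunks so far), so it
        // drifts slowly instead of tracking the current chunk.
        printf("[%zu] running PPL = %.4f\n", i + 1, std::exp(sum_nll / (i + 1)));
    }
    return 0;
}
```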

@JianbangZ

@ikawrakow did you test the HellaSwag scores recently? I have been getting very low scores (even for fp16 models).
More details in #4980.

@ikawrakow mentioned this pull request on Jan 16, 2024
@ggerganov merged commit 334a835 into master on Jan 16, 2024 (46 checks passed)
@sorasoras

FP16: Final estimate: PPL = 548.0413 +/- 11.80300

Old (without imatrix support for legacy quants):
.\sakura0.9_Q2_K_imx.gguf -f .\calib_jp_to_zh.raw PPL = 613.2313 +/- 13.09685
.\sakura0.9_Q3_K_M_imx.gguf -f .\calib_jp_to_zh.raw PPL = 607.2495 +/- 13.37079
.\sakura0.9_Q4_K_M_imx.gguf -f .\calib_jp_to_zh.raw PPL = 608.0662 +/- 13.42431

New (with imatrix support for legacy quants):
.\sakura0.9_Q2_K_imx.gguf -f .\calib_jp_to_zh.raw PPL = 590.4338 +/- 12.69024
.\sakura0.9_Q3_K_M_imx.gguf -f .\calib_jp_to_zh.raw PPL = 599.0831 +/- 13.13132
.\sakura0.9_Q3_K_L_imx.gguf -f .\calib_jp_to_zh.raw PPL = 553.8032 +/- 11.90981
.\sakura0.9_Q4_K_M_imx.gguf -f .\calib_jp_to_zh.raw PPL = 589.7105 +/- 12.96020

This is unusual, so this is just for fun, but imatrix support for legacy quants does help a lot for those fallback quants.

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* imatrix: adding support for legacy quants

* imatrix: guard Q4_0/Q5_0 against ffn_down craziness

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* imatrix: adding support for legacy quants

* imatrix: guard Q4_0/Q5_0 against ffn_down craziness

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>