1.5 bit: we can do even better #5999

Merged: 6 commits, Mar 11, 2024

ikawrakow
Contributor

Sorry for this series of backwards-incompatible changes to IQ1_S, but the gains are too significant to ignore.

In the previous version (PR #5971) there was a 4-bit scale for every 32 weights. Spending 4 bits on a scale in a sub-2-bit quantization is wasteful, but I didn't have a good idea for what to do with a spare bit. Going to 3-bit scales would have made the bit arrangement very awkward to work with, so I accepted the waste of 1 bit per 32 weights (0.03125 bpw).

But after merging #5971 I thought about using the spare bit for a quant shift in the block of 32. I.e., instead of the quants being {-1, 0, 1}, use {-1+delta, delta, 1+delta}, where delta is ± some_value, and we use the spare bit to encode the sign. It turns out that this improves PPL quite a bit with some_value = 0.125.
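
As a rough illustration of the idea (not the actual llama.cpp kernel), the shifted grid can be sketched in Python. The function names, the mapping of the spare bit's value to the sign of delta, and the assumption that the block scale multiplies the shifted value are all mine; only the ±0.125 shift and the 1-bit-per-32-weights figure come from the description above.

```python
# Illustrative sketch only: the real IQ1_S bit layout is not described here.
DELTA = 0.125  # "some_value" from the PR description


def block_delta(sign_bit: int) -> float:
    """Shift for one block of 32 weights.

    Assumption: spare bit 0 means +0.125 and 1 means -0.125.
    """
    return -DELTA if sign_bit else DELTA


def shifted_grid(sign_bit: int) -> tuple:
    """The three quant levels {-1+delta, delta, 1+delta} for a block."""
    d = block_delta(sign_bit)
    return (-1.0 + d, d, 1.0 + d)


def dequantize_block(ternary, scale: float, sign_bit: int) -> list:
    """Map ternary codes in {-1, 0, 1} to floats.

    Assumption: the block scale multiplies the shifted quant, scale * (q + delta).
    """
    d = block_delta(sign_bit)
    return [scale * (q + d) for q in ternary]


# Cost of the previously wasted bit: 1 bit per 32 weights.
WASTED_BPW = 1 / 32

if __name__ == "__main__":
    print(shifted_grid(0), shifted_grid(1), WASTED_BPW)
```

The point of the sketch is that the shift costs no extra storage: it reuses the bit that the 4-bit scale was not putting to good use.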

The table shows a PPL comparison between IQ1_S on master (after PR #5971) and this PR. Context is 2048 tokens for LLaMA-v1 and 4096 for all other models. The last column shows the rms_norm_epsilon used to generate the PR results (I did not re-tune rms_norm_epsilon here but just re-used the values from #5971, so there may be some small additional improvements possible).

| Model | PPL (PR #5971) | PPL (this PR) | rms_norm_epsilon |
|---|---|---|---|
| LLaMA-v1-7B | 14.20 | 12.83 | 5e-5 |
| LLaMA-v1-13B | 8.941 | 8.338 | 4e-5 |
| LLaMA-v1-30B | 6.999 | 6.722 | 2.5e-5 |
| LLaMA-v2-7B | 13.51 | 11.86 | 1.875e-5 |
| LLaMA-v2-13B | 8.134 | 7.741 | 2e-5 |
| LLaMA-v2-70B | 5.343 | 5.211 | 3e-5 |
| Mistral-7B | 11.21 | 10.42 | default |
| Mixtral8x7B | 6.354 | 6.168 | default |
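
To make the size of the gains easier to compare across models, the relative PPL reduction can be computed from the table. This is a small helper of my own, not part of the PR; the numbers are copied verbatim from the table above.

```python
# (model, PPL after PR #5971, PPL with this PR), copied from the table above.
RESULTS = [
    ("LLaMA-v1-7B", 14.20, 12.83),
    ("LLaMA-v1-13B", 8.941, 8.338),
    ("LLaMA-v1-30B", 6.999, 6.722),
    ("LLaMA-v2-7B", 13.51, 11.86),
    ("LLaMA-v2-13B", 8.134, 7.741),
    ("LLaMA-v2-70B", 5.343, 5.211),
    ("Mistral-7B", 11.21, 10.42),
    ("Mixtral8x7B", 6.354, 6.168),
]


def relative_drop(before: float, after: float) -> float:
    """Percentage by which perplexity fell (positive means improvement)."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for model, before, after in RESULTS:
        print(f"{model:14s} {relative_drop(before, after):5.1f}%")
```

Every row improves, and the relative gain shrinks as the models get larger.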

Spent one of the 4 scale bits on the sign of a 0.125 shift.
I.e., quants are now -1 + delta, delta, 1 + delta, where delta
is +/- 0.125.

CUDA works, same performance as before.
PPL(LLaMA-v2-7B) is now 11.85!
Neon: ~10% drop in performance, so will need some more work.
ikawrakow added the "breaking change" label (changes that break ABIs, APIs, file formats, or other forms of backwards compatibility) on Mar 11, 2024
ggerganov merged commit 44ca159 into master on Mar 11, 2024
44 of 63 checks passed
@Artefact2
Collaborator

Artefact2 commented Mar 11, 2024

Just as an aside, I'd really like some kind of versioning in the gguf metadata (not asking for backward compatibility, just a simple "fail if version doesn't match" check). Otherwise, if changes like this keep happening, it's going to create a lot of confusion for users down the road.
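
A "fail if version doesn't match" check of the kind suggested here could look roughly like the sketch below. Everything in it is hypothetical: the metadata key name, the version number, and the idea of passing metadata as a plain dict are assumptions for illustration, not an existing GGUF or llama.cpp API.

```python
class QuantVersionMismatch(RuntimeError):
    """Raised when a file's quant format version is not the one we support."""


# Hypothetical: the version of each quant format this build understands.
EXPECTED_VERSIONS = {"iq1_s": 3}


def check_quant_version(metadata: dict, quant_type: str) -> None:
    """Refuse to load a file whose quant version differs from ours.

    `metadata` stands in for the file's key-value metadata; the key
    name below is made up for this example.
    """
    key = f"quantize.{quant_type}.version"  # hypothetical key name
    expected = EXPECTED_VERSIONS[quant_type]
    found = metadata.get(key)
    if found != expected:
        raise QuantVersionMismatch(
            f"{quant_type}: file has version {found!r}, expected {expected}; "
            "please re-quantize the model"
        )
```

No backward compatibility is attempted: a missing or different version simply fails early with a message telling the user to re-quantize.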

@okpatil4u

Is there a walkthrough on how to reproduce these results starting from the base model?

@ikawrakow
Contributor Author

> Is there a walkthrough on how to reproduce these results starting from the base model?

Yes:

  1. Create the imatrix. E.g., `./bin/imatrix -m base_model -f wiki.train.raw --chunks 1000 -o imatrix_name -t 1 -ngl 100`. If the model does not fit in your GPU (or you are not using a GPU), adjust `-t` and `-ngl` accordingly.
  2. Quantize. E.g., `./bin/quantize --imatrix imatrix_name base_model quantized_model iq1_s`.
  3. Run perplexity. E.g., `./bin/perplexity -m quantized_model -f wiki.test.raw -t 1 -ngl 100 -c 4096`. As in step 1, adjust `-t` and `-ngl` for your hardware. Change `-c 4096` to `-c 2048` for LLaMA-v1 models.
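
For anyone scripting the three steps, a minimal Python sketch that only assembles the command lines might look like this. The function name and its parameters are my own; the binaries and flags are exactly those quoted in the steps above, and the file names are placeholders.

```python
import shlex


def reproduction_commands(base_model, imatrix_name, quantized_model,
                          context=4096, threads=1, gpu_layers=100):
    """Build the imatrix, quantize and perplexity commands as argument lists.

    Use context=2048 for LLaMA-v1 models; adjust threads and gpu_layers
    if the model does not fit on the GPU.
    """
    imatrix = ["./bin/imatrix", "-m", base_model, "-f", "wiki.train.raw",
               "--chunks", "1000", "-o", imatrix_name,
               "-t", str(threads), "-ngl", str(gpu_layers)]
    quantize = ["./bin/quantize", "--imatrix", imatrix_name,
                base_model, quantized_model, "iq1_s"]
    perplexity = ["./bin/perplexity", "-m", quantized_model,
                  "-f", "wiki.test.raw", "-t", str(threads),
                  "-ngl", str(gpu_layers), "-c", str(context)]
    return [imatrix, quantize, perplexity]


if __name__ == "__main__":
    # Print the commands instead of running them.
    for cmd in reproduction_commands("base_model", "imatrix_name",
                                     "quantized_model"):
        print(shlex.join(cmd))
```

To execute rather than print, each list can be passed to `subprocess.run(cmd, check=True)` in order, since every step consumes the previous step's output file.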

NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 12, 2024
* iq1_s: we can do even better

Spent one of the 4 scale bits on the sign of a 0.125 shift.
I.e., quants are now -1 + delta, delta, 1 + delta, where delta
is +/- 0.125.

CUDA works, same performance as before.
PPL(LLaMA-v2-7B) is now 11.85!

* iq1_s: make scalar and AVX2 work with the new version

* iq1_s: make Neon work with new version.

~10% drop in performance, so will need some more work.

* iq1_s: make Metal work with new version

* iq1_s: very slightly faster dequantize on Metal

* iq1_s: fix dequantize on the CPU

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024