1.5 bit: we can do even better #5999
Conversation
Spent one of the 4 scale bits on the sign of a 0.125 shift. I.e., quants are now -1 + delta, delta, 1 + delta, where delta is +/- 0.125. CUDA works, same performance as before. PPL(LLaMA-v2-7B) is now 11.85!
On Neon there is a ~10% drop in performance, so it will need some more work.
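To spell out the new grid (this is just the arithmetic implied by the description above), the three representable levels in a block, before the block scale is applied, are

$$\{-1+\delta,\ \delta,\ 1+\delta\} = \begin{cases}\{-0.875,\ \phantom{-}0.125,\ 1.125\}, & \delta = +0.125,\\ \{-1.125,\ -0.125,\ 0.875\}, & \delta = -0.125.\end{cases}$$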
Just as an aside, I'd really like some kind of versioning in the gguf metadata (not asking for backward compatibility, just a simple "fail if version doesn't match" check). Otherwise, if changes like this keep happening, it's going to create a lot of confusion for users down the road.
Is there a walkthrough on how to reproduce these results starting from the base model?
Yes:
* iq1_s: we can do even better. Spent one of the 4 scale bits on the sign of a 0.125 shift. I.e., quants are now -1 + delta, delta, 1 + delta, where delta is +/- 0.125. CUDA works, same performance as before. PPL(LLaMA-v2-7B) is now 11.85!
* iq1_s: make scalar and AVX2 work with the new version
* iq1_s: make Neon work with new version. ~10% drop in performance, so will need some more work.
* iq1_s: make Metal work with new version
* iq1_s: very slightly faster dequantize on Metal
* iq1_s: fix dequantize on the CPU
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Sorry for this series of backwards incompatible changes to `IQ1_S`, but the gains are too significant to ignore.

In the previous version (PR #5971) there was a 4-bit scale for every 32 weights. Spending 4 bits for a scale in a sub-2-bit quantization is wasteful, but I didn't have a good idea what to do with a spare bit. Going to 3-bit scales would have made the bit arrangement very awkward to work with, so I accepted the waste of 1 bit per 32 weights (0.03125 bpw).
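For reference, the bit accounting is simple arithmetic on the figures above:

$$\frac{4\ \text{bits (scale)}}{32\ \text{weights}} = 0.125\ \text{bpw}, \qquad \frac{1\ \text{bit (spare)}}{32\ \text{weights}} = 0.03125\ \text{bpw}.$$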
But after merging #5971 I thought about using the spare bit for a quant shift in the block of 32. I.e., instead of the quants being `{-1, 0, 1}`, use `{-1+delta, delta, 1+delta}`, where `delta` is `± some_value`, and we use the spare bit to encode the sign. It turns out that this improves PPL quite a bit with `some_value = 0.125`.
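In kernel terms the idea amounts to something like the sketch below. This is only an illustrative sketch, not the actual ggml `IQ1_S` code; the function name, argument layout, and sign convention for the shift bit are made up for clarity.

```c
// Illustrative sketch of the per-value dequantization idea (not the real IQ1_S kernel).
// q          : ternary quant in {-1, 0, +1}
// shift_sign : the former spare scale bit, now encoding the sign of the block's shift
// d          : the block scale, as before
static inline float dequant_one(int q, int shift_sign, float d) {
    const float delta = shift_sign ? -0.125f : 0.125f;  // shift shared by the whole block of 32
    return d * ((float)q + delta);                       // levels become {-1+delta, delta, 1+delta}
}
```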
The table shows a PPL comparison between `IQ1_S` on master (after PR #5971) and this PR. Context is 2048 tokens for LLaMA-v1 and 4096 for all other models. The last column shows the `rms_norm_epsilon` used to generate the PR results (I did not re-tune `rms_norm_epsilon` here but just re-used the values from #5971, so there may be some small additional improvements possible).