IQ4_XS: a 4.25 bpw quantization #5747
As usual, Metal / Apple Silicon don't like my quants.
PPL vs size is good, but CPU performance suffers: on M2 Max TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when using IQ3_S vs 133 t/s with pure IQ4_XS.
Great work! Why not make an IQ4_XXS by using IQ3_S for the attn_k and attn_q?
get_k_quant_type : tensor cols 13696 x 5120 are not divisible by 256, required for iq4_xs
llama_model_quantize: failed to quantize:
Because I need to fix |
@sorasoras Thanks! I keep forgetting this check. It should be fixed now.
It's working.
That's great!
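For readers following the divisibility error above, here is a minimal sketch of the kind of check involved. It is not the code from this PR: the function name `pick_quant_type`, the string type names, and the fallback choices are all hypothetical, and the only facts taken from the thread are the 256-wide super-block requirement and the 13696 x 5120 tensor shape.

```python
QK_K = 256      # super-block size required by iq4_xs (from the error message)
BLOCK_NL = 32   # block size assumed for the hypothetical fallback type

def pick_quant_type(n_cols: int, requested: str = "iq4_xs") -> str:
    """Return `requested` if the row length fits its block size, else a fallback.

    The fallback order here is an assumption made for illustration only.
    """
    if requested == "iq4_xs" and n_cols % QK_K != 0:
        if n_cols % BLOCK_NL == 0:
            return "iq4_nl"   # hypothetical fallback with 32-wide blocks
        return "f16"          # hypothetical last resort: leave unquantized
    return requested

# 13696 = 53 * 256 + 128, so it is not a multiple of 256, but it is a multiple of 32.
print(pick_quant_type(13696))
print(pick_quant_type(5120))
```

With the shape from the log, the 13696-wide dimension fails the 256 check while 5120 passes it, which is why only some tensors of that model trip the error.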
@ikawrakow Thanks a lot for your hard work! It is very much appreciated. Do you think that we can fix the slower Metal speeds with better kernels or does it require a whole new quantisation type? I am wondering why there is such a difference. Is it because of the additional overhead/calculations that are required for the new IQ quant methods?
KL-divergence data for Mistral-7B
Very nice, seems to be a solid replacement for Q4KS, which was my default recommendation.
The quantization in this PR is non-linear, hence it requires a table lookup. If you compare to |
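To make the "non-linear, hence table lookup" point concrete, here is a small sketch contrasting the two dequantization styles. The codebook values and both function names are assumptions chosen for illustration, not taken from this thread; the linear variant is a generic 4-bit scheme with a fixed offset of 8.

```python
# Illustrative 16-entry non-uniform codebook (assumed values, for demonstration only).
KVALUES = [-127, -104, -83, -65, -49, -35, -22, -10,
           1, 13, 25, 38, 53, 69, 89, 113]

def dequant_nonlinear(indices, scale):
    """Non-linear: each 4-bit index selects a codebook entry (one extra lookup)."""
    return [scale * KVALUES[i] for i in indices]

def dequant_linear(indices, scale):
    """Linear: the value is computed arithmetically from the 4-bit index."""
    return [scale * (i - 8) for i in indices]

print(dequant_nonlinear([0, 8, 15], 0.5))
print(dequant_linear([0, 8, 15], 0.5))
```

The extra indirection through `KVALUES` is the per-weight cost being discussed: it is cheap where the hardware has a fast byte-shuffle or small-table path and comparatively expensive where it does not.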
* Try IQ4_NL with blocks of 64 - does not look good
* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32
* iq4_xs: CUDA works - 133.2 t/s
* iq4_xs: AVX2 dot product
* iq4_xs: ARM_NEON dot product
* iq4_nl: Metal implementation

  As usual, Metal / Apple Silicon don't like my quants.
* iq3_xs: minor fix
* iq4_xs: shrink by using IQ3_S for attn_k and attn_q
* iq4_xs: revert using IQ3_S for attn_k and attn_v

  PPL vs size is good, but CPU performance suffers: on M2 Max TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when using IQ3_S vs 133 t/s with pure IQ4_XS.
* Fix CI
* iq4_xs: Added forgotten check for 256 divisibility

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
@ikawrakow |
This is basically the same as `IQ4_NL`, but in super-blocks of 256 with 6-bit scales for the blocks of 32 weights. It looks pretty good on the quantization error vs quantized model size curve:

[Figure: quantization error vs quantized model size]

It is possible to move the point closer to the `IQ2_XXS...IQ3_M` fit line by using `IQ3_S` for the `attn_k` and `attn_q` tensors. This reduces the quantized model size to about 4.1 bpw at the expense of a ~0.3% increase in PPL. But given that currently CPU performance for `IQ3_S` is pretty bad, I decided against this.

Speaking of performance, it is excellent on all platforms where I can test except Metal (as usual):

[Performance table with `Q4_0` as the reference; values not recovered]
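As a sanity check on the 4.25 bpw figure in the title, the layout described above (super-blocks of 256, blocks of 32, 4-bit weights, 6-bit block scales) can be tallied as follows. The 16-bit super-block scale is an assumption on my part, not something stated in the description, and `bits_per_weight` is a hypothetical helper.

```python
def bits_per_weight(super_block=256, block=32, weight_bits=4,
                    scale_bits=6, super_scale_bits=16):
    """Average storage cost per weight for one super-block.

    super_scale_bits=16 assumes one half-precision scale per super-block;
    that detail is an assumption, not taken from the PR text.
    """
    n_blocks = super_block // block
    total_bits = (super_block * weight_bits      # 4-bit indices
                  + n_blocks * scale_bits        # one 6-bit scale per block of 32
                  + super_scale_bits)            # assumed per-super-block scale
    return total_bits / super_block

# 256*4 + 8*6 + 16 = 1088 bits over 256 weights
print(bits_per_weight())
```

Under that assumption the total is 1088 bits, or 136 bytes, per 256 weights, which matches the 4.25 bpw in the title.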