IQ4_XS: a 4.25 bpw quantization #5747

Merged: ikawrakow merged 11 commits into master from ik/iq4_nl_xs on Feb 27, 2024

Contributor

@ikawrakow ikawrakow commented Feb 27, 2024

This is basically the same as IQ4_NL, but in super-blocks of 256 with 6-bit scales for the blocks of 32 weights. It looks pretty good on the quantization error vs quantized model size curve:

[Figure: legacy_vs_iq_l2_13 (quantization error vs. quantized model size)]
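The 4.25 bpw in the title can be reproduced from the layout described above. The following is only a sketch of the arithmetic. It assumes that each super-block of 256 weights also stores one 16-bit scale, and that IQ4_NL stores one 16-bit scale per block of 32. Neither detail is stated in this thread; both were chosen because they reproduce the 4.25 and 4.5 bpw figures quoted here. The function name and parameters are hypothetical.

```python
def bits_per_weight(n_weights, quant_bits, block_size, block_scale_bits, extra_bits):
    """Average storage cost per weight for one (super-)block.

    n_weights        -- weights per (super-)block
    quant_bits       -- bits stored per quantized weight
    block_size       -- weights per sub-block that has its own scale
    block_scale_bits -- bits per sub-block scale
    extra_bits       -- any additional per-(super-)block bits (assumed fp16 scale)
    """
    n_blocks = n_weights // block_size
    total_bits = n_weights * quant_bits + n_blocks * block_scale_bits + extra_bits
    return total_bits / n_weights


# IQ4_XS: 256 weights, 4-bit quants, 6-bit scales per 32 weights,
# plus an ASSUMED 16-bit super-block scale: (1024 + 48 + 16) / 256
iq4_xs_bpw = bits_per_weight(256, 4, 32, 6, 16)

# IQ4_NL: 32 weights, 4-bit quants, ASSUMED 16-bit scale per block: (128 + 16) / 32
iq4_nl_bpw = bits_per_weight(32, 4, 32, 16, 0)

print(iq4_xs_bpw, iq4_nl_bpw)
```

Under these assumptions the saving relative to IQ4_NL comes entirely from replacing a 16-bit scale per 32 weights with a 6-bit scale plus a shared super-block scale.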

It is possible to move the point closer to the IQ2_XXS...IQ3_M fit line by using IQ3_S for the attn_k and attn_q tensors. This reduces the quantized model size to about 4.1 bpw at the expense of a ~0.3% increase in PPL. But given that currently CPU performance for IQ3_S is pretty bad, I decided against this. Speaking of performance, it is excellent on all platforms where I can test except Metal (as usual):

  • 133.7 t/s on CUDA (RTX-4080) vs 128.8 t/s for Q4_0
  • 15.8 t/s on AVX2 (Ryzen-7950X) vs 14.5 t/s for Q4_0
  • 28.8 t/s on ARM_NEON (M2 Max CPU) vs 28.2 t/s for Q4_0
  • 53.9 t/s on Metal (30-core M2 Max GPU) vs 63.1 t/s for Q4_0

@Nexesenex
Contributor

Great work!

Why not make an IQ4_XXS by using IQ3_S for the attn_k and attn_q ?
At 4.1bpw, that fits the bill!

@sorasoras

@ikawrakow
I think I found a bug:

get_k_quant_type : tensor cols 13696 x 5120 are not divisible by 256, required for iq4_xs
llama_model_quantize: failed to quantize: Unsupported tensor size encountered

Shouldn't it fall back to IQ4_NL?

@ikawrakow
Contributor Author

Why not make an IQ4_XXS by using IQ3_S for the attn_k and attn_q ? At 4.1bpw, that fits the bill!

Because I need to fix IQ3_S performance on the CPU first. With attn_q and attn_k quantized with IQ3_S (these two tensors contain about 16% of the model weights for a 7B LLaMA model), performance on my M2 Max CPU drops from 28.8 t/s to 21 t/s. On a Ryzen-7950X CPU, performance goes down from 15.8 t/s (achieved with 4 threads) to 12.5 t/s (4 threads) or 14.5 t/s (8 threads). I think IQ4_XS is a really nice alternative to Q4_0, with a 6% smaller model size combined with better inference performance (except on Metal), so I don't want to destroy the performance benefit. Let me look further into the best way to get to 4 bpw quants.
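As a rough cross-check of the "about 4.1 bpw" figure, the effect of moving 16% of the weights to IQ3_S can be estimated with a weighted average. This is a sketch under assumptions: the 16% share comes from the comment above, while the nominal 3.4375 bpw for IQ3_S is an assumed value that does not appear in this thread, and real models also contain tensors stored in other types. The function name is hypothetical.

```python
def blended_bpw(base_bpw, alt_bpw, alt_fraction):
    """Weighted-average bits per weight when a fraction of weights uses another type."""
    return (1.0 - alt_fraction) * base_bpw + alt_fraction * alt_bpw


IQ4_XS_BPW = 4.25        # from the PR title
IQ3_S_BPW = 3.4375       # ASSUMED nominal value, not stated in this thread
ATTN_QK_FRACTION = 0.16  # share of weights in attn_q + attn_k (7B LLaMA, per the comment)

mixed = blended_bpw(IQ4_XS_BPW, IQ3_S_BPW, ATTN_QK_FRACTION)
print(f"{mixed:.2f} bpw")
```

The estimate lands close to the 4.1 bpw quoted in the PR description, which suggests the size saving is almost entirely due to those two tensors.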

@ikawrakow
Contributor Author

@sorasoras Thanks! I keep forgetting this check. It should be fixed now.
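The kind of check discussed in this exchange can be sketched as follows. This is not the code from the PR: the function name and string type names are hypothetical, and falling back to IQ4_NL (which works on blocks of 32) is the behaviour suggested in the bug report, assumed here for illustration.

```python
SUPER_BLOCK_SIZE = 256  # IQ4_XS works on super-blocks of 256 weights
BLOCK_SIZE = 32         # IQ4_NL works on blocks of 32 weights


def pick_quant_type(n_cols, requested="iq4_xs"):
    """Return a quant type usable for a tensor with n_cols columns.

    Hypothetical helper: falls back from iq4_xs to iq4_nl when the row
    length is not a multiple of the super-block size.
    """
    if requested == "iq4_xs" and n_cols % SUPER_BLOCK_SIZE != 0:
        if n_cols % BLOCK_SIZE != 0:
            raise ValueError(f"{n_cols} columns are not divisible by {BLOCK_SIZE}")
        return "iq4_nl"
    return requested


# The tensor from the error message: 13696 = 53 * 256 + 128
print(pick_quant_type(13696))
print(pick_quant_type(5120))
```

With the tensor from the error message, 13696 columns leave a remainder of 128 when divided by 256 but divide evenly by 32, so the sketch selects the block-of-32 type for it.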

@sorasoras

sorasoras commented Feb 27, 2024

@sorasoras Thanks! I keep forgetting this check. It should be fixed now.

It's working.

Quant   PPL                   Size
Q4KM    4.6321                8.79 GB
Q3KXS   4.6299 +/- 0.04409    6.12 GB
IQ4NL   4.6048 +/- 0.04419    7.61 GB
IQ4XS   4.5885 +/- 0.04395    7.30 GB
Q6K     4.5787 +/- 0.04407    11.4 GB
Q5_KS   4.5761 +/- 0.04412    9.33 GB

That's great!

| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| qwen 13B IQ4_NL - 4.5 bpw      |   7.61 GiB |    14.17 B | ROCm       |  99 | pp 512     |  1488.78 ± 11.45 |
| qwen 13B IQ4_NL - 4.5 bpw      |   7.61 GiB |    14.17 B | ROCm       |  99 | tg 128     |     73.13 ± 0.18 |
| qwen 13B IQ4_XS - 4.25 bpw     |   7.30 GiB |    14.17 B | ROCm       |  99 | pp 512     |   1547.23 ± 9.30 |
| qwen 13B IQ4_XS - 4.25 bpw     |   7.30 GiB |    14.17 B | ROCm       |  99 | tg 128     |     76.88 ± 0.78 |

@CyborgArmy83

@ikawrakow Thanks a lot for your hard work! It is very much appreciated. Do you think that we can fix the slower Metal speeds with better kernels or does it require a whole new quantisation type? I am wondering why there is such a difference. Is it because of the additional overhead/calculations that are required for the new IQ quant methods?

@Artefact2
Collaborator

Artefact2 commented Feb 27, 2024

KL-divergence data for Mistral-7B


| Quant      | Bits per weight | KL-divergence median | KL-divergence q99 | Top tokens differ | ln(PPL(Q)/PPL(base)) |
| ---------- | --------------: | -------------------: | ----------------: | ----------------: | -------------------: |
| IQ1_S      |            1.78 |               0.5495 |            5.5174 |            0.3840 |               0.9235 |
| IQ2_XXS    |            2.20 |               0.1751 |            2.4983 |            0.2313 |               0.2988 |
| IQ2_XS     |            2.43 |               0.1146 |            1.7693 |            0.1943 |               0.2046 |
| IQ2_S      |            2.55 |               0.0949 |            1.6284 |            0.1806 |               0.1722 |
| IQ2_M      |            2.76 |               0.0702 |            1.0935 |            0.1557 |               0.1223 |
| Q2_K_S     |            2.79 |               0.0829 |            1.5111 |            0.1735 |               0.1600 |
| Q2_K       |            3.00 |               0.0588 |            1.0337 |            0.1492 |               0.1103 |
| IQ3_XXS    |            3.21 |               0.0330 |            0.5492 |            0.1137 |               0.0589 |
| IQ3_XS     |            3.32 |               0.0296 |            0.4550 |            0.1071 |               0.0458 |
| Q3_K_S     |            3.50 |               0.0304 |            0.4481 |            0.1068 |               0.0511 |
| IQ3_S      |            3.52 |               0.0205 |            0.3018 |            0.0895 |               0.0306 |
| IQ3_M      |            3.63 |               0.0186 |            0.2740 |            0.0859 |               0.0268 |
| Q3_K_M     |            3.89 |               0.0171 |            0.2546 |            0.0839 |               0.0258 |
| Q3_K_L     |            4.22 |               0.0152 |            0.2202 |            0.0797 |               0.0205 |
| >>> IQ4_XS |            4.32 |               0.0088 |            0.1082 |            0.0606 |               0.0079 |
| IQ4_NL     |            4.56 |               0.0085 |            0.1077 |            0.0605 |               0.0074 |
| Q4_K_S     |            4.57 |               0.0083 |            0.1012 |            0.0600 |               0.0081 |
| Q4_K_M     |            4.83 |               0.0075 |            0.0885 |            0.0576 |               0.0060 |
| Q5_K_S     |            5.52 |               0.0045 |            0.0393 |            0.0454 |               0.0005 |
| Q5_K_M     |            5.67 |               0.0043 |            0.0368 |            0.0444 |               0.0005 |
| Q6_K       |            6.57 |               0.0032 |            0.0222 |            0.0394 |              −0.0008 |

Very nice, this seems to be a solid replacement for Q4_K_S, which was my default recommendation.
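For readers more used to thinking in terms of relative perplexity increase, the last column of the table can be converted with an exponential. This is a small sketch; the helper name is hypothetical, and the input values are copied from the table.

```python
import math


def ppl_increase_percent(log_ratio):
    """Convert ln(PPL(Q)/PPL(base)) into a percentage PPL increase over the base model."""
    return (math.exp(log_ratio) - 1.0) * 100.0


# Values from the last column of the table above
for name, log_ratio in [("IQ4_XS", 0.0079), ("IQ4_NL", 0.0074), ("Q4_K_S", 0.0081)]:
    print(f"{name}: {ppl_increase_percent(log_ratio):.2f}% higher PPL than base")
```

For ratios this small the percentage is nearly identical to 100 times the logarithm, so the column can be read directly as an approximate fractional PPL increase.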

@ikawrakow
Contributor Author

Do you think that we can fix the slower Metal speeds with better kernels or does it require a whole new quantisation type?

The quantization in this PR is non-linear, hence it requires a table lookup. If you compare to Q4_0, there are two quants packed in one uint8_t, so getting these is just a matter of q & 0xf and q >> 4. Here we need lookup_table[q & 0xf] and lookup_table[q >> 4]. On the other platforms this makes zero difference. At least on my GPU the calculation is almost always memory bound, so this one additional lookup doesn't matter. On the CPU there are vector shuffle instructions that are very fast, so the cost is negligible too. But for some reason the Apple GPU very much dislikes this additional memory load. I'm already putting the lookup table in shared memory (and that gave a ~30% boost in performance compared to having the lookup table in constant memory), so not sure what else can be done. But I would very much appreciate if someone more knowledgeable than me in Apple GPU matters would find a better approach compared to my implementation.
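The difference between the two unpacking schemes described above can be illustrated with a short sketch. It is not the kernel code: it ignores scales and offsets, the function names are hypothetical, and the 16 table values are only illustrative of a non-linear codebook (the actual table should be taken from the ggml sources).

```python
# Illustrative 16-entry non-linear codebook. ASSUMPTION for this sketch;
# consult the ggml sources for the table actually used by IQ4_NL / IQ4_XS.
LOOKUP_TABLE = [-127, -104, -83, -65, -49, -35, -22, -10,
                1, 13, 25, 38, 53, 69, 89, 113]


def unpack_linear(q):
    """Q4_0-style: the two 4-bit fields of the byte are used directly."""
    return q & 0xF, q >> 4


def unpack_nonlinear(q):
    """IQ4-style: each 4-bit field is an index into the lookup table."""
    return LOOKUP_TABLE[q & 0xF], LOOKUP_TABLE[q >> 4]


packed = 0xA3  # low nibble 3, high nibble 10
print(unpack_linear(packed))
print(unpack_nonlinear(packed))
```

The extra indexing step in `unpack_nonlinear` is the additional memory load that, according to the comment above, is cheap on CUDA and on CPUs with fast vector shuffles but costly on the Apple GPU.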

@ikawrakow ikawrakow merged commit 0becb22 into master Feb 27, 2024
60 of 61 checks passed
@ikawrakow ikawrakow deleted the ik/iq4_nl_xs branch February 27, 2024 14:34
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* Try IQ4_NL with blocks of 64 - does not look good

* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32

* iq4_xs: CUDA works - 133.2 t/s

* iq4_xs: AVX2 dot product

* iq4_xs: ARM_NEON dot product

* iq4_nl: Metal implementation

As usual, Metal / Apple Silicon don't like my quants.

* iq3_xs: minor fix

* iq4_xs: shrink by using IQ3_S for attn_k and attn_q

* iq4_xs: revert using IQ3_S for attn_k and attn_v

PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.

* Fix CI

* iq4_xs: Added forgotten check for 256 divisibility

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
@sorasoras

@ikawrakow
The IQ quants don't seem to support forced dmmv.
Forced dmmv is about 7-8 percent faster for Q5_K_S. Do you have any plans to implement it in the future to further improve the performance of the IQ quants?

@mofosyne mofosyne added the Review Complexity : High (generally requires in-depth knowledge of LLMs or GPUs) and Tensor Encoding Scheme (https://github.com/ggerganov/llama.cpp/wiki/Tensor-Encoding-Schemes) labels May 25, 2024