[Feature Request] Why can't we use Q8 quants? #173

Closed
alexcardo opened this issue Jun 7, 2024 · 2 comments
@alexcardo

What is your request?

I discovered that the only way to run a quantized model is with q4 and q6 quants. Why not add q8 quants? That seems very strange. Is there a chance to enable them?

What is your motivation for this change?

As a rule, a q8 quant is the best option when you don't want the model to lose quality.

Any other details?

No response

@alexcardo alexcardo added the enhancement New feature or request label Jun 7, 2024
@dukebw
Contributor

dukebw commented Jun 9, 2024

Hi! Thanks for pointing this out. We implemented the block-wise quantization used in GGML's k-quants, so the 4-bit and 6-bit element quantization encodings are GGML's Q4_K and Q6_K, respectively.

In contrast to row-wise or full-tensor quantization, block-wise quantization offers a relatively good compression/accuracy tradeoff. In particular, Q6_K gives model quality comparable to float16. See, for example, the chart of perplexities in the original k-quants PR in llama.cpp.
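
To make the block-wise idea concrete, here is a minimal NumPy sketch of 8-bit block-wise quantization in the spirit of GGML's Q8_0: each block of values stores int8 codes plus one float scale. This is an illustration only, not GGML's or MAX's actual implementation; the function names, block size, and symmetric rounding are assumptions chosen for the example.

```python
import numpy as np

def quantize_blockwise_q8(x: np.ndarray, block_size: int = 32):
    """Quantize a 1-D float array into int8 codes with one scale per block."""
    assert x.size % block_size == 0
    blocks = x.reshape(-1, block_size)
    # One scale per block: map the largest magnitude in the block to 127.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    codes = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return codes, scales.astype(np.float32)

def dequantize_blockwise_q8(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float values from codes and per-block scales."""
    return (codes.astype(np.float32) * scales).reshape(-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.standard_normal(1024).astype(np.float32)
    codes, scales = quantize_blockwise_q8(weights)
    restored = dequantize_blockwise_q8(weights_codes := codes, scales)
    print("max abs error:", np.abs(weights - restored).max())
```

The k-quants (Q4_K, Q6_K) refine this idea further with nested scales within super-blocks, which is how they keep accuracy high at lower bit widths.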

That said, Q8_0 is on our roadmap and will be no trouble to add soon 😄. I'll update here once we've done so!

@BradLarson
Contributor

Thank you again for the suggestion. As Brendan noted earlier, we prioritized q4_k and q6_k quantization schemes, given that those were the most popular ones we'd observed in the field. While we may still eventually add q8_0 quantization, our current focus is on making MAX work as well as it can on GPUs.

I didn't want to have you wait for this feature forever, so I'm going to close this issue until such time as we revisit q8_0 weight quantization support. Again, sorry that we're not able to add this feature just yet, but we appreciate you bringing it to our attention.

@BradLarson BradLarson closed this as not planned Oct 14, 2024