What is your request?
I discovered that the only way to run a quantized model is with q4 or q6 quants. Why not add q8 quants? It seems like a strange omission. Is there any chance of enabling it?
What is your motivation for this change?
As a rule, the q8 quant is the best option when you don't want the model to lose quality.
Any other details?
No response
Hi! Thanks for pointing this out. We actually implemented the block-wise quantization used in GGML k-quants, so the 4- and 6-bit element quantization encodings are GGML's Q4_K and Q6_K, respectively.
In contrast to row-wise or full-tensor quantization, block-wise quantization offers a relatively good compression/accuracy tradeoff. In particular, Q6_K gives model quality comparable to float16. See, for example, this chart of perplexities from the original k-quants PR in llama.cpp.
That said, Q8_0 is on our roadmap and will be no trouble to add soon 😄. I can update here once we've done so!
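For anyone curious what the block-wise idea looks like in practice, here's a rough NumPy sketch of Q8_0-style quantization (purely illustrative, not MAX's or GGML's actual implementation): each block of 32 values stores one scale plus 32 signed 8-bit values, while the k-quants like Q4_K and Q6_K pack values into larger super-blocks with additionally quantized scales to save more bits.

```python
import numpy as np

BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32 values

def quantize_q8_0(weights: np.ndarray):
    """Quantize a 1-D float array into Q8_0-style blocks:
    one scale per block of 32 values, plus 32 int8 values."""
    blocks = weights.reshape(-1, BLOCK_SIZE)
    # One scale per block, chosen so the largest-magnitude value maps to 127.
    scales = np.abs(blocks).max(axis=1) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.round(blocks / scales[:, None]).astype(np.int8)
    # GGML stores the scale as float16; we mirror that here for illustration.
    return scales.astype(np.float16), q

def dequantize_q8_0(scales, q):
    """Reconstruct approximate float weights from per-block scales and int8 data."""
    return (q.astype(np.float32) * scales.astype(np.float32)[:, None]).reshape(-1)

# Round-trip a random weight row and check the quantization error.
w = np.random.randn(4096).astype(np.float32)
scales, q = quantize_q8_0(w)
w_hat = dequantize_q8_0(scales, q)
print("max abs error:", np.abs(w - w_hat).max())
```

Because every block of 32 gets its own scale, a single outlier only degrades its own block rather than the whole row or tensor, which is where the compression/accuracy advantage over row-wise or full-tensor schemes comes from.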
Thank you again for the suggestion. As Brendan noted earlier, we prioritized q4_k and q6_k quantization schemes, given that those were the most popular ones we'd observed in the field. While we may still eventually add q8_0 quantization, our current focus is on making MAX work as well as it can on GPUs.
I didn't want to leave you waiting on this feature indefinitely, so I'm going to close this issue until we revisit q8_0 weight quantization support. Again, sorry that we're not able to add this just yet, but we appreciate you bringing it to our attention.