[Feature Request] Why can't we use Q8 quants? #173

Closed
alexcardo opened this issue Jun 7, 2024 · 2 comments
@alexcardo

What is your request?

I discovered that the only way to run a quantized model is with q4 and q6 quants. Why not add q8 quants? That seems very strange. Is there a chance to enable them?

What is your motivation for this change?

As a rule, a q8 quant is the best option when you don't want the model to lose quality.

Any other details?

No response

@alexcardo alexcardo added the enhancement New feature or request label Jun 7, 2024
@dukebw
Contributor

dukebw commented Jun 9, 2024

Hi! Thanks for pointing this out. We implemented the block-wise quantization used in GGML's k-quants, so the 4-bit and 6-bit element quantization encodings are GGML's Q4_K and Q6_K, respectively.

In contrast to row-wise or full-tensor quantization, block-wise quantization offers a relatively good compression/accuracy tradeoff. In particular, Q6_K gives model quality comparable to float16. See, for example, the chart of perplexities in the original k-quants PR in llama.cpp.
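
To make the block-wise idea concrete, here is a minimal NumPy sketch of 8-bit block-wise quantization in the spirit of GGML's Q8_0: each block of values stores int8 codes plus one float scale. This is an illustration only, not GGML's or MAX's actual implementation; the function names, block size, and symmetric rounding are assumptions chosen for the example.

```python
import numpy as np

def quantize_blockwise_q8(x: np.ndarray, block_size: int = 32):
    """Quantize a 1-D float array into int8 codes with one scale per block."""
    assert x.size % block_size == 0
    blocks = x.reshape(-1, block_size)
    # One scale per block: map the largest magnitude in the block to 127.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    codes = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return codes, scales.astype(np.float32)

def dequantize_blockwise_q8(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float values from codes and per-block scales."""
    return (codes.astype(np.float32) * scales).reshape(-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.standard_normal(1024).astype(np.float32)
    codes, scales = quantize_blockwise_q8(weights)
    restored = dequantize_blockwise_q8(weights_codes := codes, scales)
    print("max abs error:", np.abs(weights - restored).max())
```

The k-quants (Q4_K, Q6_K) refine this idea further with nested scales within super-blocks, which is how they keep accuracy high at lower bit widths.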

That said, Q8_0 is on our roadmap and will be no trouble to add soon 😄. I'll update here once we've done so!

@BradLarson
Contributor

Thank you again for the suggestion. As Brendan noted earlier, we prioritized q4_k and q6_k quantization schemes, given that those were the most popular ones we'd observed in the field. While we may still eventually add q8_0 quantization, our current focus is on making MAX work as well as it can on GPUs.

I didn't want to have you wait for this feature forever, so I'm going to close this issue until such time as we revisit q8_0 weight quantization support. Again, sorry that we're not able to add this feature just yet, but we appreciate you bringing it to our attention.

@BradLarson BradLarson closed this as not planned Oct 14, 2024