
Fix Qwen docs about gptq quantization quality + sharding + bad quants + autogptq #817

Open
Qubitium opened this issue Dec 11, 2024 · 2 comments

Comments

@Qubitium (Collaborator) commented Dec 11, 2024

Hi @JustinLin610 @huybery,

Is there anyone we can connect with on the Qwen team to fix the issues I see in the Qwen documentation regarding GPTQ? I feel many parts of it are the result of using outdated code (auto-gptq) along with bad quants, and they do not reflect the quality of properly quantized GPTQ models.

https://qwen.readthedocs.io/en/latest/quantization/gptq.html

Specific issues:

  1. The Qwen docs suggest using auto-gptq, which has not been updated in almost a year. (I am a part-time maintainer of auto-gptq.)
  2. They state that save_quantized has no sharding support. GPTQModel supports sharded saving, and it is unit-tested (see the quantize-and-save sketch right after this list).
  3. They describe repetition and non-stopping problems on vLLM. I personally believe this is due to bad quantization, not necessarily a 4-bit issue or even a GPTQ issue. We have quantized some very good Qwen models and did not experience this. If the Qwen team tests our quantized models at https://hf.co/modelcloud/ and still hits the vLLM issues, we are happy to debug them (see the vLLM sketch below).
  4. Unlike auto-gptq, we have multiple kernels with auto-padding, so users do not need to do anything special to run inference on quantized GPTQ models whose in/out features are not evenly divisible by group_size.
  5. They hint at switching to AWQ or 8-bit to work around inference issues that we believe are not inherent to GPTQ and are the result of bad quantization.
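
For reference, a minimal quantize-and-save sketch with GPTQModel is below. This is written against the GPTQModel API as I recall it in late 2024 (`GPTQModel.load`, `QuantizeConfig`, `model.quantize`, `model.save`); the Qwen model id, the calibration set, and the `max_shard_size` keyword are illustrative assumptions and may differ between versions.

```python
# Sketch only: quantize a Qwen model to 4-bit GPTQ and save it as
# sharded checkpoints. Method/keyword names are assumptions based on
# the GPTQModel README and may vary by version.
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "Qwen/Qwen2.5-7B-Instruct"          # illustrative model id
quant_path = "Qwen2.5-7B-Instruct-gptq-4bit"

# A small, representative calibration corpus.
calibration = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration, batch_size=2)

# Save as sharded safetensors; the max_shard_size keyword is assumed.
model.save(quant_path, max_shard_size="4GB")
```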

We would like to work with the Qwen doc team to make sure the docs reflect the most up-to-date data and performance of the best GPTQ quantization toolkit, with properly quantized and validated models. Most, if not all, of the issues are addressed by GPTQModel or by models quantized by ModelCloud using GPTQModel.
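
To reproduce (or rule out) the repetition and non-stopping behaviour, a quick vLLM check against one of our quants could look like the sketch below. The repo id is a placeholder, not a specific ModelCloud repository; pick any GPTQ quant from https://hf.co/modelcloud/.

```python
# Sketch only: load a GPTQ quant in vLLM and check that generations
# terminate normally instead of repeating or running to the token limit.
# The model id is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="ModelCloud/Qwen2.5-7B-Instruct-gptq-4bit", quantization="gptq")

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
prompts = [
    "Explain GPTQ quantization in two sentences.",
    "Write a short poem about autumn.",
]

for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
    print("---")
```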

Thanks!

@MDR-EX1000

BTW, is this implementation superior to AWQ at 4-bit?
AutoGPTQ and AutoAWQ are easy to use, but sometimes you have to hack the code to adapt them to custom models.
What's more, the support for VLM models is kind of weak.

@Qubitium (Collaborator, Author) commented Dec 12, 2024

@MDR-EX1000 Our GPTQ models are benchmarked. https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2

Can't say the same for random AWQ, or even GPTQ, models.

When it comes to model quality, I have shown you our scores; now AWQ can present their own models and scores. We would rather present benchmarks than claim one is better than the other. You decide.
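
If anyone wants to reproduce these numbers themselves, a rough sketch using lm-evaluation-harness is below; the model repo id and task list are placeholders, not the exact setup behind the collection linked above.

```python
# Sketch only: benchmark a quantized model with lm-evaluation-harness.
# Repo id and task names are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ModelCloud/Qwen2.5-7B-Instruct-gptq-4bit",
    tasks=["arc_challenge", "hellaswag"],
    batch_size=8,
)
print(results["results"])
```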

As for VL models, better multi-modal support is coming, and we support 100% of the models that are relevant in 2024.
