
Fix Qwen docs about gptq quantization quality + sharding + bad quants + autogptq #817

Open
Qubitium opened this issue Dec 11, 2024 · 2 comments

Comments

@Qubitium (Collaborator) commented Dec 11, 2024

Hi @JustinLin610 @huybery,

Is there anyone we can connect with on the Qwen team to fix the issues I see in the Qwen documentation regarding GPTQ? I feel many parts of it are the result of using outdated code (auto-gptq) along with bad quants, and they do not reflect the quality of properly quantized GPTQ models.

https://qwen.readthedocs.io/en/latest/quantization/gptq.html

Specific issues:

  1. The Qwen docs suggest using auto-gptq, which has not been updated in almost a year. (I am a part-time maintainer of auto-gptq.)
  2. They state that save_quantized has no sharding support. GPTQModel supports sharded saving, and it is unit-tested (see the quantize-and-save sketch right after this list).
  3. They describe repetition and non-stopping problems on vLLM. I personally believe this is due to bad quantization, not necessarily a 4-bit issue or even a GPTQ issue. We have quantized some very good Qwen models and did not experience this. If the Qwen team tests our quantized models at https://hf.co/modelcloud/ and still hits the vLLM issues, we are happy to debug them (see the vLLM sketch below).
  4. Unlike auto-gptq, we have multiple kernels with auto-padding, so users do not need to do anything special to run inference on quantized GPTQ models whose in/out features are not evenly divisible by group_size.
  5. They hint at switching to AWQ or 8-bit to work around inference issues that we believe are not inherent to GPTQ and are the result of bad quantization.
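
For reference, a minimal quantize-and-save sketch with GPTQModel is below. This is written against the GPTQModel API as I recall it in late 2024 (`GPTQModel.load`, `QuantizeConfig`, `model.quantize`, `model.save`); the Qwen model id, the calibration set, and the `max_shard_size` keyword are illustrative assumptions and may differ between versions.

```python
# Sketch only: quantize a Qwen model to 4-bit GPTQ and save it as
# sharded checkpoints. Method/keyword names are assumptions based on
# the GPTQModel README and may vary by version.
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "Qwen/Qwen2.5-7B-Instruct"          # illustrative model id
quant_path = "Qwen2.5-7B-Instruct-gptq-4bit"

# A small, representative calibration corpus.
calibration = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration, batch_size=2)

# Save as sharded safetensors; the max_shard_size keyword is assumed.
model.save(quant_path, max_shard_size="4GB")
```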

We would like to work with the Qwen doc team to make sure the docs reflect the most up-to-date data and performance of the best GPTQ quantization toolkit, with properly quantized and validated models. Most, if not all, of the issues are addressed by GPTQModel or by models quantized by ModelCloud using GPTQModel.
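
To reproduce (or rule out) the repetition and non-stopping behaviour, a quick vLLM check against one of our quants could look like the sketch below. The repo id is a placeholder, not a specific ModelCloud repository; pick any GPTQ quant from https://hf.co/modelcloud/.

```python
# Sketch only: load a GPTQ quant in vLLM and check that generations
# terminate normally instead of repeating or running to the token limit.
# The model id is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="ModelCloud/Qwen2.5-7B-Instruct-gptq-4bit", quantization="gptq")

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
prompts = [
    "Explain GPTQ quantization in two sentences.",
    "Write a short poem about autumn.",
]

for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
    print("---")
```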

Thanks!

@MDR-EX1000

BTW, is this implementation superior to AWQ at 4-bit?
AutoGPTQ and AutoAWQ are easy to use, but sometimes you have to hack the code to adapt them to custom models.
What's more, the support for VLM models is kind of weak.

@Qubitium (Collaborator, Author) commented Dec 12, 2024

@MDR-EX1000 Our GPTQ models are benchmarked. https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2

Can't say the same for random AWQ, or even GPTQ, models.

When it comes to model quality, I have shown you our scores; now AWQ can present their own models and scores. We would rather present benchmarks than claim one is better than the other. You decide.
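
If anyone wants to reproduce these numbers themselves, a rough sketch using lm-evaluation-harness is below; the model repo id and task list are placeholders, not the exact setup behind the collection linked above.

```python
# Sketch only: benchmark a quantized model with lm-evaluation-harness.
# Repo id and task names are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ModelCloud/Qwen2.5-7B-Instruct-gptq-4bit",
    tasks=["arc_challenge", "hellaswag"],
    batch_size=8,
)
print(results["results"])
```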

As for VL models, better multi-modal support is coming, and we support 100% of the models that are relevant in 2024.
