Hi @JustinLin610 @huybery,

Is there anyone we can connect with on the Qwen team to see whether we can fix the issues I see in the Qwen documentation regarding GPTQ? I feel that many of them are the result of using outdated code (auto-gptq) along with bad quants, and they do not reflect the quality of properly quantized GPTQ models.

https://qwen.readthedocs.io/en/latest/quantization/gptq.html

Specific issues:
- The docs suggest using auto-gptq, which has not been updated in almost a year. (I am a part-time maintainer of auto-gptq.)
- They state that save_quantized has no sharding support. GPTQModel has sharding support, and it is unit-tested (see the sketch after this list).
- They describe repeating and non-stopping generation problems on vLLM. I personally believe this is due to bad quantization, not a 4-bit issue or even a GPTQ issue. We have quantized some very good Qwen models and did not experience this problem. If the Qwen team can test our quantized models at https://hf.co/modelcloud/ and still hit the vLLM issues, we are happy to debug this (a vLLM loading sketch follows below).
- Unlike auto-gptq, we ship multiple kernels with auto-padding, so users do not need to do anything special to run inference on quantized GPTQ models whose in/out features are not evenly divisible by the group_size.
- The docs hint at using AWQ or 8-bit to work around inference issues; we believe this is not the right fix and that those issues are the result of bad quantization.
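To make the sharding point concrete, here is a minimal sketch of the workflow we have in mind, assuming the GPTQModel API as documented in its README (GPTQModel.load, QuantizeConfig, quantize, save); the model id, calibration strings, and output path are illustrative placeholders, not a tested recipe:

```python
# Minimal sketch, assuming the GPTQModel API from its README
# (GPTQModel.load / QuantizeConfig / quantize / save); the model id,
# calibration strings, and output path below are illustrative placeholders.
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "Qwen/Qwen2.5-7B-Instruct"            # example model, not prescriptive
quant_path = "./Qwen2.5-7B-Instruct-gptq-4bit"   # local output directory

# 4-bit with group size 128 is the common GPTQ setting; adjust as needed.
quant_config = QuantizeConfig(bits=4, group_size=128)

# A real quantization run needs a few hundred calibration samples;
# two strings are only here to keep the sketch short.
calibration = [
    "GPTQModel is a GPTQ quantization toolkit maintained by ModelCloud.",
    "Qwen models can be quantized to 4-bit weights with per-group scales.",
]

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration)

# Unlike the old auto-gptq save_quantized path, the saved checkpoint is
# sharded, so large models do not end up in a single oversized file.
model.save(quant_path)
```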
We would like to work with the Qwen docs team to make sure the documentation reflects the most up-to-date data and performance for the best GPTQ quantization toolkit, backed by properly quantized and validated models. Most, if not all, of the issues above are addressed by GPTQModel or by models quantized by ModelCloud using GPTQModel.
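If it helps reproduction, this is roughly how we would expect one of those quants to be exercised under vLLM; the repo id is a placeholder for any model under https://hf.co/modelcloud/, and the prompt and sampling settings are arbitrary choices for the sake of the sketch:

```python
# Sketch of a vLLM smoke test for a ModelCloud GPTQ quant; the repo id is a
# placeholder -- substitute any model from https://hf.co/modelcloud/.
from vllm import LLM, SamplingParams

model_id = "ModelCloud/your-chosen-qwen-gptq-quant"  # placeholder repo id

llm = LLM(model=model_id, quantization="gptq")

# Arbitrary sampling settings for the sketch; the point is simply to check
# whether generation stops cleanly instead of repeating.
params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)

outputs = llm.generate(
    ["Give me a short introduction to large language models."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```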
Thanks!
BTW, is this implementation superior to AWQ at 4-bit?
AutoGPTQ and AutoAWQ are easy to use, but sometimes you have to hack the code to adapt them to custom models.
What's more, the support for VLMs is kind of weak.
I can't say the same for random AWQ models, or even random GPTQ models.
When it comes to model quality, I have shown you our scores; now AWQ can present their own models and scores. We would rather present benchmarks than claim that one is better than the other (a sketch of such a run follows below). You decide.
As for VL models, better multi-modal support is coming, and we support 100% of all models that are relevant in 2024.
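For reference, a benchmark comparison of that kind can be run through the lm-evaluation-harness Python entry point; the task list, few-shot setting, and batch size below are illustrative assumptions, not the exact setup behind our published scores:

```python
# Sketch of a like-for-like benchmark run via lm-evaluation-harness
# (lm_eval.simple_evaluate); tasks, few-shot count, and batch size are
# illustrative assumptions, not the setup behind any published score.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # Placeholder repo id -- point this at the GPTQ (or AWQ) quant under test.
    model_args="pretrained=ModelCloud/your-chosen-qwen-gptq-quant",
    tasks=["arc_challenge", "hellaswag", "mmlu"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```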