feat(vllm): upgrade vllm and expose more params for bfloat16 quant compatibility #835
Comments
Example of TheBloke's model quantizations being outdated: vllm-project/vllm#2422 (comment)
The configuration we pass to vLLM should not include […]. Also, […]
The screenshots above compare models, with Phi-3-mini-128k-instruct generally outperforming all of the Mistral-7b-instruct variants. An outside spike is in progress to create a quantized version of Phi-3-mini-128k-instruct: https://github.com/justinthelaw/gptqmodel-pipeline
Describe what should be investigated or refactored
vLLM is currently not compatible with all GPTQ BFLOAT16-quantized models due to the pinned dependency version (0.4.2). The dependency needs to be upgraded to the next patch version (0.4.3), or upgraded all the way to the next minor version (0.5.2).
The following test model should work once this issue is fixed (it fits on an RTX 4060–4090): https://huggingface.co/TheBloke/phi-2-orange-GPTQ/blob/main/config.json
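As a quick check once the dependency is upgraded, here is a minimal verification sketch (assuming a local CUDA GPU and an upgraded `vllm` install; the model ID is the test checkpoint linked above) that loads the model through vLLM's Python API with the bfloat16 dtype the checkpoint declares:

```python
# Minimal verification sketch, assuming vllm >= 0.4.3 and a local CUDA GPU.
# On vllm 0.4.2 this load is expected to fail for GPTQ + bfloat16 checkpoints.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/phi-2-orange-GPTQ",
    quantization="gptq",   # explicit, though vLLM can also auto-detect GPTQ
    dtype="bfloat16",      # matches the torch_dtype declared in config.json
)
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```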
Links to any relevant code
Example model that wouldn't work, but should: https://huggingface.co/TheBloke/phi-2-orange-GPTQ/blob/main/config.json
Issue related to the vLLM GPTQ BFLOAT16 PR: vllm-project/vllm#2149
Additional context
This issue was confirmed when deploying Nous-Hermes-2-8x7b-DPO-GPTQ (8-bit, 128g group size, and Act Order) to an H100 GPU. Changing the dtype in `config.json` to `float16`, despite the loss of precision, allows the model to be inferenced by vLLM.