
feat(vllm): upgrade vllm and expose more params for bfloat16 quant compatibility #835

Closed · justinthelaw opened this issue Jul 25, 2024 · 3 comments · Fixed by #854

Labels: dependencies (Pull requests that update a dependency file), tech-debt (Not a feature, but still necessary)

Comments

justinthelaw (Contributor) commented Jul 25, 2024

Describe what should be investigated or refactored

vLLM is currently not compatible with all GPTQ BFLOAT16 quantized models due to the pinned dependency version (0.4.2). The dependency needs to be bumped to at least the next patch version (0.4.3), or upgraded entirely to the next minor version (0.5.2).

The following test model should work once this issue is fixed (it fits on an RTX 4060 - 4090): https://huggingface.co/TheBloke/phi-2-orange-GPTQ/blob/main/config.json
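
A minimal smoke test for that model, assuming the upgraded vLLM (>= 0.4.3) and its offline Python API, might look like the sketch below; the prompt and sampling values are placeholders:

```python
# Sketch: check that a BFLOAT16 GPTQ model loads and generates on the
# upgraded vLLM. Assumes vllm >= 0.4.3 is installed and a GPU with
# enough VRAM (e.g., RTX 4060 - 4090) is available.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/phi-2-orange-GPTQ",
    dtype="bfloat16",        # the dtype that trips up vLLM 0.4.2 for GPTQ models
    trust_remote_code=True,  # may be needed for Phi-2's custom modeling code
)

outputs = llm.generate(
    ["Explain GPTQ quantization in one sentence."],  # placeholder prompt
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```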

Links to any relevant code

Example model that wouldn't work, but should: https://huggingface.co/TheBloke/phi-2-orange-GPTQ/blob/main/config.json

Issue related to the vLLM GPTQ BFLOAT16 PR: vllm-project/vllm#2149

Additional context

This issue was confirmed when deploying Nous-Hermes-2-8x7b-DPO-GPTQ (8-bit, 128g group size, and Act Order) to an H100 GPU. Changing the dtype in config.json to float16, despite the loss of precision, allows vLLM to run inference on the model.
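
The workaround described above boils down to rewriting the model's declared dtype before loading it; a sketch of that edit (the local model path is hypothetical):

```python
# Sketch of the float16 workaround: patch the downloaded model's
# config.json so its declared torch_dtype is float16, which vLLM 0.4.2
# accepts for GPTQ models (trading away bfloat16 precision).
import json
from pathlib import Path

model_dir = Path("/models/nous-hermes-2-8x7b-dpo-gptq")  # hypothetical local path

config_path = model_dir / "config.json"
config = json.loads(config_path.read_text())
config["torch_dtype"] = "float16"  # originally "bfloat16"
config_path.write_text(json.dumps(config, indent=2))
```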

justinthelaw added the dependencies and tech-debt labels on Jul 25, 2024
justinthelaw (Contributor, Author) commented Jul 29, 2024

Example of TheBloke's model quantizations being outdated: vllm-project/vllm#2422 (comment)

justinthelaw (Contributor, Author) commented Jul 29, 2024

The configuration we pass to vLLM should not include quantization, as that prevents automatic gptq_marlin quantization, which uses a different kernel for faster inference and lower memory usage. Quantization settings are already defined in each model's quantization config (e.g., quantize_config.json or the quantization_config section of config.json).

Also, trust_remote_code refers to custom code shipped with the model repository, so it can safely be turned on as long as we review the extra Python scripts that come with the model. These scripts usually just tell vLLM how to configure itself for the model's architecture (e.g., Phi-3 GPTQ).
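
A sketch of how both points could look when constructing the vLLM engine (the model name and surrounding config handling are illustrative, not the backend's actual defaults):

```python
# Sketch: leave quantization unset so vLLM auto-detects the method from
# the model's quantization config (and can pick the faster Marlin-based
# GPTQ kernels where supported), and enable trust_remote_code on the
# assumption that the model's bundled Python scripts have been reviewed.
from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3-mini-128k-instruct",  # illustrative model only
    quantization=None,        # do not hard-code "gptq" here
    trust_remote_code=True,   # allow the model's custom modeling code
    dtype="auto",
)
```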

justinthelaw (Contributor, Author) commented

[Screenshot 2024-07-30 103756]
[Screenshot 2024-07-30 103906]

The screenshots above compare Phi-3-mini-128k-instruct against the Mistral-7b-instruct variants, with Phi-3 generally outperforming all of them.

Working on an outside spike to create a quantized version of Phi-3-mini-128k-instruct: https://github.com/justinthelaw/gptqmodel-pipeline

@justinthelaw justinthelaw changed the title chore(vllm): upgrade vllm for gptq bfloat16 inferencing chore(vllm): upgrade vllm to latest for SOTA model compatibility Jul 30, 2024
@justinthelaw justinthelaw changed the title chore(vllm): upgrade vllm to latest for SOTA model compatibility chore(vllm): upgrade vllm for bfloat16 quant compatibility Aug 19, 2024
@justinthelaw justinthelaw changed the title chore(vllm): upgrade vllm for bfloat16 quant compatibility feat(vllm): upgrade vllm and expose more params for bfloat16 quant compatibility Sep 4, 2024