Support Phi3 mini and medium #1299
Conversation
@janimo Nice work. Thank you!
Could you provide the results of these two?
One note: this only works with --disable-flashinfer, since only the Triton kernels were made to support the non-power-of-two attention head sizes.
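For anyone wanting to try this, a launch command along these lines should exercise the Triton path. The model path is only illustrative; the --disable-flashinfer flag is the one mentioned above:

```
python -m sglang.launch_server --model-path microsoft/Phi-3-mini-4k-instruct --disable-flashinfer
```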
Makes sense.
For the Phi-3.5-mini model:
REFERENCE:
SGLANG:

For Phi-3-mini-4k-instruct:
When testing other existing models (stable-lm, gemma-2, qwen), the two scripts do not output exactly the same logits. I ran the script with --disable-flashinfer too, for the reasons stated above.
Great work!! What do you use for REFERENCE, whereas SGLANG is for…?
Confusing naming, sorry. I ran them in a single wrapper script and pasted these.
@zhyncs should the logits be mostly the same (neglecting numerical errors)?
FWIW, the logits did not seem to match even for other existing non-Phi3 models I have tested using those scripts.
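For reference, the kind of comparison being discussed here is just an elementwise check between the two logits tensors. A minimal sketch follows; the function name and tolerances are illustrative and this is not the actual comparison script used above:

```python
import torch

def compare_logits(reference_logits: torch.Tensor, sglang_logits: torch.Tensor,
                   rtol: float = 1e-3, atol: float = 1e-3) -> None:
    """Report how closely two same-shaped logits tensors agree."""
    diff = (reference_logits - sglang_logits).abs()
    print(f"max abs diff:  {diff.max().item():.6f}")
    print(f"mean abs diff: {diff.mean().item():.6f}")
    print("allclose:", torch.allclose(reference_logits, sglang_logits, rtol=rtol, atol=atol))
```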
@janimo The Triton backend is currently slower than the flashinfer backend. What error message did you get with the flashinfer backend? We can file an issue with flashinfer.
Filed the issue: flashinfer-ai/flashinfer#455
Hi, since this was merged into v0.3.0 without #1281, I presume we still have to wait for the next version bump for Phi3 / Phi3.5 support? I have tried loading Phi3 / 3.5 with the latest v0.3.0 to verify this; it is still blocked because the attention head size changes for Triton are not included.
Yes.
After we review and merge the changes to the Triton kernel, we will update the supported models in the README. It has not been updated yet. |
Well noted, thank you so much for the hard work, @zhyncs
Tested with all Phi-3 mini and medium variants and Phi-3.5 mini.
Fixes #1283
It requires the recent #1281 fix for supporting attention head sizes that are not powers of two.
Motivation
Support for Phi-3 and Phi-3.5 mini and medium models.
Modifications
It uses the Llama 2 architecture, so, just like vllm, we have it handled by the llama2.py model, with the updates to the weight names copied from vllm.
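As a rough illustration of what the copied weight-name handling refers to: loaders in this style keep a small table mapping checkpoint parameter names onto the fused projections used by the Llama implementation. The entries below are a generic vLLM-style sketch, not necessarily the exact table added in this PR:

```python
# (fused param name in the model, param name in the checkpoint, shard id)
# Illustrative mapping; the concrete names for Phi-3 may differ.
stacked_params_mapping = [
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]
```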
Additionally, for the 128k-context variants, handle a missing factor field in the config.
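A minimal sketch of how such a missing field can be filled in when the config is loaded; the attribute names follow the Hugging Face Phi-3 config, but this is an assumption about the shape of the fix, not the actual patch:

```python
def fill_missing_rope_factor(config) -> None:
    """If the rope_scaling block of a 128k-context config lacks a 'factor' entry,
    derive one from the extended vs. original context lengths (illustrative sketch)."""
    rope_scaling = getattr(config, "rope_scaling", None)
    if rope_scaling is not None and "factor" not in rope_scaling:
        rope_scaling["factor"] = (
            config.max_position_embeddings / config.original_max_position_embeddings
        )
```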