
Support Phi3 mini and medium #1299

Merged
3 commits merged from janimo:phi3 into sgl-project:main
Sep 3, 2024
Conversation

janimo
Contributor

janimo commented Sep 2, 2024

Tested with all Phi-3 mini and medium variants and Phi-3.5 mini.

Fixes #1283

It requires the recent #1281 fix, which adds support for attention head sizes that are not powers of two.

Motivation

Support for Phi-3 and Phi-3.5 mini and medium models.

Modifications

Phi-3 uses the Llama 2 architecture, so, as vLLM does, it is handled by the llama2.py model, with the weight-name updates copied from vLLM.
Additionally, for the 128k-context variants, a missing factor field in the config is handled.
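As a side note, the missing-field handling could be sketched like this (function and argument names are illustrative, not the PR's actual code); the factor for a long-context variant can be derived from the ratio of the extended context length to the original one:

```python
# Illustrative sketch (not the PR's actual code): default a missing "factor"
# field in a Phi-3 style rope_scaling config dict.
def fill_rope_scaling_factor(rope_scaling, max_position, original_max_position):
    """Return a copy of rope_scaling with "factor" filled in if missing."""
    if rope_scaling is None:
        return None
    scaling = dict(rope_scaling)
    if "factor" not in scaling:
        # e.g. 131072 / 4096 = 32.0 for a 128k variant of a 4k base model
        scaling["factor"] = max_position / original_max_position
    return scaling
```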

@zhyncs
Member

zhyncs commented Sep 2, 2024

@janimo Nice work. Thank you!

@zhyncs
Member

zhyncs commented Sep 2, 2024

Could you provide the results of these two?

python3 scripts/playground/reference_hf.py --model [new model]
python3 -m sglang.bench_latency --model [new model] --correct --output-len 16 --trust-remote-code

@janimo
Contributor Author

janimo commented Sep 2, 2024

One note, this only works with --disable-flashinfer since only the Triton kernels were made to support non-square attention heads.
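For example, a correctness run forced onto the Triton backend could look like this (the model path is illustrative; the flags are the ones quoted earlier in the thread):

```shell
# Correctness run with flashinfer disabled, so the Triton kernels
# (which handle Phi-3's attention head sizes) are used instead.
python3 -m sglang.bench_latency \
    --model microsoft/Phi-3.5-mini-instruct \
    --correct --output-len 16 \
    --trust-remote-code \
    --disable-flashinfer
```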

@zhyncs
Member

zhyncs commented Sep 2, 2024

One note, this only works with --disable-flashinfer since only the Triton kernels were made to support non-square attention heads.

Makes sense.

@janimo
Contributor Author

janimo commented Sep 2, 2024

For the Phi 3.5-mini model.

REFERENCE:

prefill logits tensor([38.2188, 42.3125, 39.1562,  ..., 31.5156, 31.5312, 31.5312],
       device='cuda:0')
The capital of France is Paris.

Paris is the capital city of France. It is located
prefill logits tensor([39.9375, 42.5000, 38.6562,  ..., 33.7188, 33.7188, 33.7188],
       device='cuda:0')
The capital of the United Kindom is London.

What is the capital of the United Kindom?


prefill logits tensor([40.1250, 42.0312, 39.8750,  ..., 34.2500, 34.2500, 34.2500],
       device='cuda:0')
Today is a sunny day and I like to go outside and play. I have a red ball and a blue kite

SGLANG:

INFO 09-02 10:08:32 weight_utils.py:236] Using model weights format ['*.safetensors']
max_total_num_tokens=33107

input_ids=[[450, 7483, 310, 3444, 338], [450, 7483, 310, 278, 3303, 13187, 290, 338], [20628, 338, 263, 6575, 1460, 2462, 322, 306, 763]]

prefill logits (first half): tensor([[38.5000, 39.2500, 41.2500,  ..., 33.7500, 33.7500, 33.7500],
        [38.2500, 40.7500, 39.5000,  ..., 34.5000, 34.5000, 34.5000],
        [36.2500, 37.7500, 34.7500,  ..., 32.0000, 32.0000, 32.0000]],
       device='cuda:0') 

prefill logits (final): tensor([[38.2500, 42.2500, 39.2500,  ..., 31.5000, 31.5000, 31.5000],
        [40.0000, 42.5000, 38.5000,  ..., 33.5000, 33.5000, 33.5000],
        [40.0000, 42.0000, 40.0000,  ..., 34.2500, 34.2500, 34.2500]],
       device='cuda:0') 

========== Prompt 0 ==========
The capital of France is Paris.

Paris is the capital city of France. It is located in 

========== Prompt 1 ==========
The capital of the United Kindom is London.

What is the capital of the United Kingdom?

The capital 

========== Prompt 2 ==========
Today is a sunny day and I like to go outside and play. I have a red ball and a blue kite. 

@janimo
Contributor Author

janimo commented Sep 2, 2024

For Phi 3-mini-4k-instruct
REFERENCE:

prefill logits tensor([27.8438, 29.4531, 28.0000,  ..., 20.4062, 20.4062, 20.4062],
       device='cuda:0')
The capital of France is Paris.


### Response:The capital of France is Paris.
prefill logits tensor([32.0938, 32.5625, 29.4688,  ..., 24.2656, 24.2812, 24.2812],
       device='cuda:0')
The capital of the United Kindom is London.

### Message:
What is the capital of the United
prefill logits tensor([35.0625, 34.0312, 33.5938,  ..., 27.6406, 27.6406, 27.6406],
       device='cuda:0')
Today is a sunny day and I like to go for a walk in the park.


### Response:

SGLANG:

INFO 09-02 10:06:58 weight_utils.py:236] Using model weights format ['*.safetensors']
max_total_num_tokens=33299

input_ids=[[450, 7483, 310, 3444, 338], [450, 7483, 310, 278, 3303, 13187, 290, 338], [20628, 338, 263, 6575, 1460, 2462, 322, 306, 763]]

prefill logits (first half): tensor([[33.0000, 31.8750, 36.0000,  ..., 27.7500, 27.7500, 27.7500],
        [32.2500, 31.7500, 32.2500,  ..., 27.3750, 27.3750, 27.3750],
        [25.7500, 23.3750, 23.3750,  ..., 18.6250, 18.6250, 18.6250]],
       device='cuda:0') 

prefill logits (final): tensor([[27.8750, 29.5000, 28.1250,  ..., 20.5000, 20.5000, 20.5000],
        [32.0000, 32.5000, 29.3750,  ..., 24.2500, 24.2500, 24.2500],
        [35.0000, 34.0000, 33.5000,  ..., 27.6250, 27.6250, 27.6250]],
       device='cuda:0') 

========== Prompt 0 ==========
The capital of France is Paris.


### Response:The capital of France is Paris.<|endoftext|> 

========== Prompt 1 ==========
The capital of the United Kindom is London.

### Message:
What is the capital of the United States 

========== Prompt 2 ==========
Today is a sunny day and I like to go for a walk in the park.


### Response:T 

@janimo
Contributor Author

janimo commented Sep 2, 2024

When testing other existing models (stable-lm, gemma-2, qwen), the two scripts do not output exactly the same logits either.

I also ran the bench_latency script with --disable-flashinfer, for the reason stated above.

@ByronHsu
Collaborator

ByronHsu commented Sep 2, 2024

Great work!! What do you use for REFERENCE?

@janimo
Contributor Author

janimo commented Sep 2, 2024

Great work!! What do you use for REFERENCE?
It is the output of

python3 scripts/playground/reference_hf.py --model [new model]

whereas SGLANG is for

python3 -m sglang.bench_latency --model [new model] --correct --output-len 16 --trust-remote-code

Sorry for the confusing naming; I ran both commands from a single wrapper script and pasted the labeled outputs here.

@ByronHsu
Collaborator

ByronHsu commented Sep 2, 2024

@zhyncs should the logits be mostly the same? (neglect numerical errors)

@janimo
Contributor Author

janimo commented Sep 2, 2024

@zhyncs should the logits be mostly the same? (neglect numerical errors)

FWIW the logits did not seem to match even for other existing non-Phi3 models I have tested using those scripts.
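For comparisons like the ones in this thread, an absolute-tolerance check can make "mostly the same" concrete. This is a hypothetical helper, not part of the PR, and the tolerance value is an arbitrary assumption rather than project policy:

```python
import math

# Hypothetical helper (not part of this PR): check that two logit vectors
# agree within an absolute tolerance, since half-precision kernels are not
# expected to match bit-for-bit. abs_tol=0.25 is an arbitrary choice.
def logits_close(ref, out, abs_tol=0.25):
    return len(ref) == len(out) and all(
        math.isclose(a, b, abs_tol=abs_tol) for a, b in zip(ref, out)
    )
```

With the Phi-3.5-mini numbers above, logits_close([38.2188, 42.3125, 39.1562], [38.25, 42.25, 39.25]) holds at this tolerance.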

(Resolved review comment on python/sglang/srt/models/llama2.py; outdated)
merrymercy merged commit 474317f into sgl-project:main on Sep 3, 2024
1 of 8 checks passed
@merrymercy
Contributor

@janimo The Triton backend is currently slower than the flashinfer backend. What error message did you get with the flashinfer backend? We can file an issue with flashinfer.

@merrymercy
Contributor

merrymercy commented Sep 3, 2024

Filed the issue: flashinfer-ai/flashinfer#455

@intellimouseftw

intellimouseftw commented Sep 4, 2024

Hi, since this is merged into v0.3.0 without #1281, I presume we would still have to wait until the next version bump for Phi3 / Phi3.5 support?

I have tried loading Phi3 / 3.5 with the latest v0.3.0 to verify this; it is still blocked because the attention-head-size changes for Triton are not included.

@zhyncs
Member

zhyncs commented Sep 4, 2024

Hi, since this is merged into v0.3.0 without #1281, I presume we would still have to wait until the next version bump for Phi3 / Phi3.5 support?

Yes

@zhyncs
Member

zhyncs commented Sep 4, 2024

After we review and merge the changes to the Triton kernel, we will update the supported models in the README. It has not been updated yet.

@intellimouseftw

Well noted. Thank you so much for the hard work, @zhyncs!

janimo deleted the phi3 branch on September 17, 2024, 12:29
zhyncs mentioned this pull request on Sep 22, 2024
Successfully merging this pull request may close these issues.

[Feature] Support phi-3 model