
Support Phi3 mini and medium #1299

Merged
3 commits merged from janimo:phi3 into sgl-project:main
Sep 3, 2024
Conversation

janimo
Contributor

janimo commented Sep 2, 2024

Tested with all Phi-3 mini and medium variants and Phi-3.5 mini.

Fixes #1283

It requires the recent #1281 fix, which adds support for attention head sizes that are not powers of two.

Motivation

Support for Phi-3 and Phi-3.5 mini and medium models.

Modifications

Phi-3 uses the Llama 2 architecture, so, as vLLM does, it is handled by the llama2.py model, with the weight-name updates copied from vLLM.
Additionally, for the 128k-context variants, a missing factor field in the config is handled.
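As a side note, the missing-field handling could be sketched like this (function and argument names are illustrative, not the PR's actual code); the factor for a long-context variant can be derived from the ratio of the extended context length to the original one:

```python
# Illustrative sketch (not the PR's actual code): default a missing "factor"
# field in a Phi-3 style rope_scaling config dict.
def fill_rope_scaling_factor(rope_scaling, max_position, original_max_position):
    """Return a copy of rope_scaling with "factor" filled in if missing."""
    if rope_scaling is None:
        return None
    scaling = dict(rope_scaling)
    if "factor" not in scaling:
        # e.g. 131072 / 4096 = 32.0 for a 128k variant of a 4k base model
        scaling["factor"] = max_position / original_max_position
    return scaling
```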

@zhyncs
Member

zhyncs commented Sep 2, 2024

@janimo Nice work. Thank you!

@zhyncs
Member

zhyncs commented Sep 2, 2024

Could you provide the results of these two?

python3 scripts/playground/reference_hf.py --model [new model]
python3 -m sglang.bench_latency --model [new model] --correct --output-len 16 --trust-remote-code

@janimo
Contributor Author

janimo commented Sep 2, 2024

One note, this only works with --disable-flashinfer since only the Triton kernels were made to support non-square attention heads.
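For example, a correctness run forced onto the Triton backend could look like this (the model path is illustrative; the flags are the ones quoted earlier in the thread):

```shell
# Correctness run with flashinfer disabled, so the Triton kernels
# (which handle Phi-3's attention head sizes) are used instead.
python3 -m sglang.bench_latency \
    --model microsoft/Phi-3.5-mini-instruct \
    --correct --output-len 16 \
    --trust-remote-code \
    --disable-flashinfer
```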

@zhyncs
Member

zhyncs commented Sep 2, 2024

One note, this only works with --disable-flashinfer since only the Triton kernels were made to support non-square attention heads.

Makes sense.

@janimo
Contributor Author

janimo commented Sep 2, 2024

For the Phi 3.5-mini model.

REFERENCE:

prefill logits tensor([38.2188, 42.3125, 39.1562,  ..., 31.5156, 31.5312, 31.5312],
       device='cuda:0')
The capital of France is Paris.

Paris is the capital city of France. It is located
prefill logits tensor([39.9375, 42.5000, 38.6562,  ..., 33.7188, 33.7188, 33.7188],
       device='cuda:0')
The capital of the United Kindom is London.

What is the capital of the United Kindom?


prefill logits tensor([40.1250, 42.0312, 39.8750,  ..., 34.2500, 34.2500, 34.2500],
       device='cuda:0')
Today is a sunny day and I like to go outside and play. I have a red ball and a blue kite

SGLANG:

INFO 09-02 10:08:32 weight_utils.py:236] Using model weights format ['*.safetensors']
max_total_num_tokens=33107

input_ids=[[450, 7483, 310, 3444, 338], [450, 7483, 310, 278, 3303, 13187, 290, 338], [20628, 338, 263, 6575, 1460, 2462, 322, 306, 763]]

prefill logits (first half): tensor([[38.5000, 39.2500, 41.2500,  ..., 33.7500, 33.7500, 33.7500],
        [38.2500, 40.7500, 39.5000,  ..., 34.5000, 34.5000, 34.5000],
        [36.2500, 37.7500, 34.7500,  ..., 32.0000, 32.0000, 32.0000]],
       device='cuda:0') 

prefill logits (final): tensor([[38.2500, 42.2500, 39.2500,  ..., 31.5000, 31.5000, 31.5000],
        [40.0000, 42.5000, 38.5000,  ..., 33.5000, 33.5000, 33.5000],
        [40.0000, 42.0000, 40.0000,  ..., 34.2500, 34.2500, 34.2500]],
       device='cuda:0') 

========== Prompt 0 ==========
The capital of France is Paris.

Paris is the capital city of France. It is located in 

========== Prompt 1 ==========
The capital of the United Kindom is London.

What is the capital of the United Kingdom?

The capital 

========== Prompt 2 ==========
Today is a sunny day and I like to go outside and play. I have a red ball and a blue kite. 

@janimo
Contributor Author

janimo commented Sep 2, 2024

For Phi 3-mini-4k-instruct
REFERENCE:

prefill logits tensor([27.8438, 29.4531, 28.0000,  ..., 20.4062, 20.4062, 20.4062],
       device='cuda:0')
The capital of France is Paris.


### Response:The capital of France is Paris.
prefill logits tensor([32.0938, 32.5625, 29.4688,  ..., 24.2656, 24.2812, 24.2812],
       device='cuda:0')
The capital of the United Kindom is London.

### Message:
What is the capital of the United
prefill logits tensor([35.0625, 34.0312, 33.5938,  ..., 27.6406, 27.6406, 27.6406],
       device='cuda:0')
Today is a sunny day and I like to go for a walk in the park.


### Response:

SGLANG:

INFO 09-02 10:06:58 weight_utils.py:236] Using model weights format ['*.safetensors']
max_total_num_tokens=33299

input_ids=[[450, 7483, 310, 3444, 338], [450, 7483, 310, 278, 3303, 13187, 290, 338], [20628, 338, 263, 6575, 1460, 2462, 322, 306, 763]]

prefill logits (first half): tensor([[33.0000, 31.8750, 36.0000,  ..., 27.7500, 27.7500, 27.7500],
        [32.2500, 31.7500, 32.2500,  ..., 27.3750, 27.3750, 27.3750],
        [25.7500, 23.3750, 23.3750,  ..., 18.6250, 18.6250, 18.6250]],
       device='cuda:0') 

prefill logits (final): tensor([[27.8750, 29.5000, 28.1250,  ..., 20.5000, 20.5000, 20.5000],
        [32.0000, 32.5000, 29.3750,  ..., 24.2500, 24.2500, 24.2500],
        [35.0000, 34.0000, 33.5000,  ..., 27.6250, 27.6250, 27.6250]],
       device='cuda:0') 

========== Prompt 0 ==========
The capital of France is Paris.


### Response:The capital of France is Paris.<|endoftext|> 

========== Prompt 1 ==========
The capital of the United Kindom is London.

### Message:
What is the capital of the United States 

========== Prompt 2 ==========
Today is a sunny day and I like to go for a walk in the park.


### Response:T 

@janimo
Contributor Author

janimo commented Sep 2, 2024

When testing other existing models (stable-lm, gemma-2, qwen), the two scripts do not output exactly the same logits either.

I also ran the bench_latency script with --disable-flashinfer, for the reason stated above.

@ByronHsu
Collaborator

ByronHsu commented Sep 2, 2024

Great work!! What do you use for REFERENCE?

@janimo
Contributor Author

janimo commented Sep 2, 2024

Great work!! What do you use for REFERENCE?
It is the output of

python3 scripts/playground/reference_hf.py --model [new model]

whereas SGLANG is for

python3 -m sglang.bench_latency --model [new model] --correct --output-len 16 --trust-remote-code

Sorry for the confusing naming; I ran both commands from a single wrapper script and pasted the labeled outputs here.

@ByronHsu
Collaborator

ByronHsu commented Sep 2, 2024

@zhyncs should the logits be mostly the same? (neglect numerical errors)

@janimo
Contributor Author

janimo commented Sep 2, 2024

@zhyncs should the logits be mostly the same? (neglect numerical errors)

FWIW the logits did not seem to match even for other existing non-Phi3 models I have tested using those scripts.
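For comparisons like the ones in this thread, an absolute-tolerance check can make "mostly the same" concrete. This is a hypothetical helper, not part of the PR, and the tolerance value is an arbitrary assumption rather than project policy:

```python
import math

# Hypothetical helper (not part of this PR): check that two logit vectors
# agree within an absolute tolerance, since half-precision kernels are not
# expected to match bit-for-bit. abs_tol=0.25 is an arbitrary choice.
def logits_close(ref, out, abs_tol=0.25):
    return len(ref) == len(out) and all(
        math.isclose(a, b, abs_tol=abs_tol) for a, b in zip(ref, out)
    )
```

With the Phi-3.5-mini numbers above, logits_close([38.2188, 42.3125, 39.1562], [38.25, 42.25, 39.25]) holds at this tolerance.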

(Resolved review comment on python/sglang/srt/models/llama2.py; outdated)
merrymercy merged commit 474317f into sgl-project:main on Sep 3, 2024
1 of 8 checks passed
@merrymercy
Contributor

@janimo The Triton backend is currently slower than the flashinfer backend. What error message did you get with the flashinfer backend? We can file an issue with flashinfer.

@merrymercy
Contributor

merrymercy commented Sep 3, 2024

Filed the issue: flashinfer-ai/flashinfer#455

@intellimouseftw

intellimouseftw commented Sep 4, 2024

Hi, since this is merged into v0.3.0 without #1281, I presume we would still have to wait until the next version bump for Phi3 / Phi3.5 support?

I have tried loading Phi3 / 3.5 with the latest v0.3.0 to verify this; it is still blocked because the attention-head-size changes for Triton are not included.

@zhyncs
Member

zhyncs commented Sep 4, 2024

Hi, since this is merged into v0.3.0 without #1281, I presume we would still have to wait until the next version bump for Phi3 / Phi3.5 support?

Yes

@zhyncs
Member

zhyncs commented Sep 4, 2024

After we review and merge the changes to the Triton kernel, we will update the supported models in the README. It has not been updated yet.

@intellimouseftw

Well noted. Thank you so much for the hard work, @zhyncs!

janimo deleted the phi3 branch on September 17, 2024, 12:29
zhyncs mentioned this pull request on Sep 22, 2024
Successfully merging this pull request may close these issues.

[Feature] Support phi-3 model