feat: use FlashInfer rmsnorm and silu #907

zhyncs · 2024-08-03T22:46:41Z

Motivation

As titled. cc @merrymercy @Ying1123 @hnyls2002

This PR is waiting on the FlashInfer review, the FlashInfer PR, and a new FlashInfer release. @yzh119
https://github.com/flashinfer-ai/flashinfer/actions/runs/10330691213

Modification

Use FlashInfer's fused_add_rmsnorm and silu_and_mul.
Use InternLM2 and Llama for testing.
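
For readers unfamiliar with the two operations named above, here is a minimal reference sketch of what they compute, written in pure Python on a single row of values. It is not the FlashInfer kernel: the real functions operate on GPU tensors, and the return convention, argument order, and the default eps shown here are assumptions made for illustration.

```python
import math


def silu(x):
    # SiLU (swish) activation: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))


def silu_and_mul(row):
    """Split `row` in half: the first half is the gate, the second half is
    the up-projection. Return silu(gate) * up, element-wise."""
    d = len(row) // 2
    return [silu(g) * u for g, u in zip(row[:d], row[d:])]


def fused_add_rmsnorm(x, residual, weight, eps=1e-6):
    """Add `x` into the residual, then RMS-normalize the sum and scale it
    by `weight`. Returns (normalized, updated_residual)."""
    new_residual = [a + b for a, b in zip(x, residual)]
    mean_sq = sum(v * v for v in new_residual) / len(new_residual)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normalized = [v * inv_rms * w for v, w in zip(new_residual, weight)]
    return normalized, new_residual
```

Fusing the residual add with the normalization saves one pass over the hidden states per layer, which is the motivation for calling a single kernel instead of two separate ops.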

Checklist

Ensure pre-commit (pre-commit run --all-files) or other linting tools are used to fix potential lint issues.
Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
Modify documentation as needed, such as docstrings or example tutorials.

zhyncs · 2024-08-03T22:54:19Z

python3 -m sglang.bench_latency --model internlm/internlm2-chat-7b --correct --output-len 16 --trust-remote-code

prefill logits (first half) tensor([[123.1250, 122.6250, 143.7500,  ..., 123.3125, 138.7500, 107.5000],
        [123.1250, 122.6250, 143.7500,  ..., 123.3125, 138.7500, 107.5000],
        [ 51.8438,  48.6250,  88.4375,  ...,  54.0000,  65.6875,  55.8438]],
       device='cuda:0')
prefill logits (final) tensor([[288.7500, 291.2500, 361.5000,  ..., 305.0000, 329.5000, 255.1250],
        [313.7500, 319.5000, 396.2500,  ..., 331.7500, 358.5000, 279.0000],
        [331.5000, 333.5000, 403.5000,  ..., 350.7500, 373.7500, 292.5000]],
       device='cuda:0')
 <s>The capital of France is Paris. Paris is the most visited city in the world, and it is known for
 <s>The capital of the United Kindom is London, and it is the largest city in the country. London is a global city
 <s>Today is a sunny day and I like to go out for a walk. I put on my shoes and go out. I

python3 -m sglang.bench_serving --backend sglang --num-prompts 5000

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  244.55
Total input tokens:                      1125946
Total generated tokens:                  1027605
Total generated tokens (retokenized):    1031579
Request throughput (req/s):              20.45
Input token throughput (tok/s):          4604.21
Output token throughput (tok/s):         4202.08
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   127461.40
Median E2E Latency (ms):                 128010.06
---------------Time to First Token----------------
Mean TTFT (ms):                          82007.71
Median TTFT (ms):                        74332.81
P99 TTFT (ms):                           193344.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          296.22
Median TPOT (ms):                        245.38
P99 TPOT (ms):                           1540.70
---------------Inter-token Latency----------------
Mean ITL (ms):                           629.50
Median ITL (ms):                         210.30
P99 ITL (ms):                            757.17
==================================================

zhyncs · 2024-08-03T23:18:40Z

There is a trivial issue, let me fix it.

python/sglang/srt/layers/layernorm.py

zhyncs · 2024-08-04T06:20:08Z

There is a trivial issue, let me fix it.

fixed

zhyncs · 2024-08-04T06:49:09Z

update:

python3 python/sglang/test/test_layernorm.py

test_rms_norm (__main__.TestRMSNorm.test_rms_norm) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.608s

OK

@yzh119

ref sgl-project/sglang#907 cc @yzh119

zhyncs · 2024-08-04T08:07:24Z

Currently, the e2e and unit tests are failing because fused_add_rmsnorm is only available in the new FlashInfer version. These failures can be ignored for now.
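
One way to check locally whether the installed FlashInfer already ships the new kernel is a small feature probe. This is a hedged sketch, not code from this PR: the helper name flashinfer_has is hypothetical, and it assumes the kernel is exposed as an attribute of the flashinfer.norm module.

```python
import importlib


def flashinfer_has(name, module="flashinfer.norm"):
    """Return True if `module` can be imported and exposes `name`.

    Hypothetical helper: the default module path is an assumption about
    where FlashInfer exposes its normalization kernels.
    """
    try:
        mod = importlib.import_module(module)
    except ImportError:
        # The package is not installed (or the submodule does not exist).
        return False
    return hasattr(mod, name)


if __name__ == "__main__":
    print("fused_add_rmsnorm available:", flashinfer_has("fused_add_rmsnorm"))
```

If the probe prints False, the installed version predates the kernel and the failing tests are expected.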

zhyncs · 2024-08-11T05:08:23Z

Thanks so much for @yzh119's help!

zhyncs requested review from Ying1123, yzh119, merrymercy and hnyls2002 August 3, 2024 22:46

zhyncs mentioned this pull request Aug 3, 2024

feat: support fused add rmsnorm flashinfer-ai/flashinfer#419

Merged

Ying1123 requested changes Aug 4, 2024

python/sglang/srt/layers/layernorm.py (outdated, resolved)

python/sglang/srt/layers/layernorm.py (resolved)

python/sglang/srt/layers/layernorm.py (resolved)

zhyncs force-pushed the upd branch from 2761f44 to 568cc0e Compare August 4, 2024 06:25

yzh119 pushed a commit to flashinfer-ai/flashinfer that referenced this pull request Aug 4, 2024

feat: support fused add rmsnorm (#419)

b781513

ref sgl-project/sglang#907 cc @yzh119

zhyncs force-pushed the upd branch 2 times, most recently from 93f96d0 to c795239 Compare August 4, 2024 08:03

zhyncs force-pushed the upd branch 5 times, most recently from f572809 to 095eb05 Compare August 9, 2024 12:22

zhyncs changed the title ~~[DO NOT MERGE] feat: use FlashInfer rmsnorm~~ [DO NOT MERGE] feat: use FlashInfer rmsnorm and silu Aug 9, 2024

zhyncs force-pushed the upd branch 2 times, most recently from d9c6bbf to 74c805d Compare August 10, 2024 15:52

zhyncs changed the title ~~[DO NOT MERGE] feat: use FlashInfer rmsnorm and silu~~ feat: use FlashInfer rmsnorm and silu Aug 10, 2024

zhyncs force-pushed the upd branch from 74c805d to 717d1d1 Compare August 10, 2024 18:21

feat: use FlashInfer rmsnorm and silu

0d1ee26

zhyncs force-pushed the upd branch from 717d1d1 to 0d1ee26 Compare August 11, 2024 02:37

zhyncs enabled auto-merge (squash) August 11, 2024 02:40

zhyncs requested a review from Ying1123 August 11, 2024 02:40

zhyncs disabled auto-merge August 11, 2024 04:57

zhyncs merged commit 94752ac into sgl-project:main Aug 11, 2024
3 checks passed

zhyncs deleted the upd branch August 11, 2024 04:57

objnf-dev mentioned this pull request Aug 19, 2024

[Bug] --disable-flashinfer is broken #1146

Closed

5 tasks
