feat: use FlashInfer rmsnorm and silu #907

Merged 1 commit on Aug 11, 2024

zhyncs (Member) commented Aug 3, 2024

Motivation

As titled. cc @merrymercy @Ying1123 @hnyls2002

Waiting on the FlashInfer review, PR, and new release. @yzh119
https://github.com/flashinfer-ai/flashinfer/actions/runs/10330691213

Modification

  1. Use the fused_add_rmsnorm and silu_and_mul kernels from FlashInfer
  2. Test with InternLM2 and Llama
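For readers unfamiliar with the two kernels named above, here is a minimal pure-Python sketch of the math they are commonly understood to compute. It is a reference for the semantics only, not FlashInfer's implementation or API: the function names, the `eps` default, and the assumption that the gate occupies the first half of the input are illustrative choices, and the real kernels operate on GPU tensors.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then scale elementwise by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def fused_add_rms_norm(x, residual, weight, eps=1e-6):
    """Add the residual first, then normalize the sum.

    Returns (normalized, new_residual). A fused kernel does both steps in
    one pass; this sketch just shows the result (assumed semantics).
    """
    new_residual = [a + b for a, b in zip(x, residual)]
    return rms_norm(new_residual, weight, eps), new_residual


def silu_and_mul(x):
    """Split x in half: apply SiLU to the first half (assumed to be the
    gate), then multiply elementwise by the second half."""
    d = len(x) // 2
    gate, up = x[:d], x[d:]
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]
```

Fusing the residual add into the norm, and the activation into the multiply, saves a round trip through memory for the intermediate tensor, which is the usual motivation for swapping in kernels like these.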

Checklist

  1. Ensure pre-commit (pre-commit run --all-files) or other linting tools are used to fix potential lint issues.
  2. Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
  3. Modify documentation as needed, such as docstrings or example tutorials.

zhyncs (Member Author) commented Aug 3, 2024

python3 -m sglang.bench_latency --model internlm/internlm2-chat-7b --correct --output-len 16 --trust-remote-code
prefill logits (first half) tensor([[123.1250, 122.6250, 143.7500,  ..., 123.3125, 138.7500, 107.5000],
        [123.1250, 122.6250, 143.7500,  ..., 123.3125, 138.7500, 107.5000],
        [ 51.8438,  48.6250,  88.4375,  ...,  54.0000,  65.6875,  55.8438]],
       device='cuda:0')
prefill logits (final) tensor([[288.7500, 291.2500, 361.5000,  ..., 305.0000, 329.5000, 255.1250],
        [313.7500, 319.5000, 396.2500,  ..., 331.7500, 358.5000, 279.0000],
        [331.5000, 333.5000, 403.5000,  ..., 350.7500, 373.7500, 292.5000]],
       device='cuda:0')
 <s>The capital of France is Paris. Paris is the most visited city in the world, and it is known for
 <s>The capital of the United Kindom is London, and it is the largest city in the country. London is a global city
 <s>Today is a sunny day and I like to go out for a walk. I put on my shoes and go out. I

python3 -m sglang.bench_serving --backend sglang --num-prompts 5000
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  244.55
Total input tokens:                      1125946
Total generated tokens:                  1027605
Total generated tokens (retokenized):    1031579
Request throughput (req/s):              20.45
Input token throughput (tok/s):          4604.21
Output token throughput (tok/s):         4202.08
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   127461.40
Median E2E Latency (ms):                 128010.06
---------------Time to First Token----------------
Mean TTFT (ms):                          82007.71
Median TTFT (ms):                        74332.81
P99 TTFT (ms):                           193344.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          296.22
Median TPOT (ms):                        245.38
P99 TPOT (ms):                           1540.70
---------------Inter-token Latency----------------
Mean ITL (ms):                           629.50
Median ITL (ms):                         210.30
P99 ITL (ms):                            757.17
==================================================

zhyncs (Member Author) commented Aug 3, 2024

There is a trivial issue, let me fix it.

3 resolved review threads on python/sglang/srt/layers/layernorm.py (1 outdated)

zhyncs (Member Author) commented Aug 4, 2024

> There is a trivial issue, let me fix it.

Fixed.

zhyncs (Member Author) commented Aug 4, 2024

update:

python3 python/sglang/test/test_layernorm.py
test_rms_norm (__main__.TestRMSNorm.test_rms_norm) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.608s

OK

yzh119 pushed a commit to flashinfer-ai/flashinfer that referenced this pull request on Aug 4, 2024
zhyncs force-pushed the upd branch 2 times, most recently from 93f96d0 to c795239, on August 4, 2024 08:03

zhyncs (Member Author) commented Aug 4, 2024

The e2e and unit tests are currently failing because fused_add_rmsnorm is only available in the new FlashInfer version. These failures can be ignored for now.
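One general way to tolerate a kernel that only exists in a newer release is to probe for it at import time and fall back to a reference path. The sketch below illustrates that pattern only; it is not necessarily what this PR does. `resolve_kernel` is a hypothetical helper, and the `flashinfer.norm` module path mentioned in the note after the code is an assumption.

```python
import importlib


def resolve_kernel(module_name, attr, fallback):
    """Return module_name.attr if it can be imported, else the fallback.

    Hypothetical helper: covers both a missing package and an older
    release that lacks the requested function.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Package (or submodule) is not installed at all.
        return fallback
    # Package is installed but may predate the function.
    return getattr(module, attr, fallback)
```

A caller could then write something like `fused = resolve_kernel("flashinfer.norm", "fused_add_rmsnorm", reference_impl)`, where `reference_impl` is a slower implementation with the same call signature.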

zhyncs force-pushed the upd branch 5 times, most recently from f572809 to 095eb05, on August 9, 2024 12:22
zhyncs changed the title from "[DO NOT MERGE] feat: use FlashInfer rmsnorm" to "[DO NOT MERGE] feat: use FlashInfer rmsnorm and silu" on Aug 9, 2024
zhyncs force-pushed the upd branch 2 times, most recently from d9c6bbf to 74c805d, on August 10, 2024 15:52
zhyncs changed the title from "[DO NOT MERGE] feat: use FlashInfer rmsnorm and silu" to "feat: use FlashInfer rmsnorm and silu" on Aug 10, 2024
zhyncs enabled auto-merge (squash) on August 11, 2024 02:40
zhyncs requested a review from Ying1123 on August 11, 2024 02:40
zhyncs disabled auto-merge on August 11, 2024 04:57
zhyncs merged commit 94752ac into sgl-project:main on Aug 11, 2024
3 checks passed
zhyncs deleted the upd branch on August 11, 2024 04:57

zhyncs (Member Author) commented Aug 11, 2024

Thanks so much for @yzh119's help!
