
Optimize the update flashinfer indices #1262

Merged: 5 commits into sgl-project:main on Sep 1, 2024

Conversation

xiaobochen123 (Contributor) commented Aug 30, 2024

When running large batches, sglang also hits a CPU bottleneck. One of the bottlenecks occurs when updating the flashinfer KV indices: the naive PyTorch implementation is slow when the batch is very large.
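
For context, here is a minimal sketch of the kind of per-request gather this update performs (the names req_to_token, req_pool_indices, and seq_lens are illustrative and may not match sglang's internals exactly). The Python-level loop over requests, plus the implicit GPU-to-CPU syncs when slicing with tensor lengths, is what makes a naive version expensive once the batch size reaches the thousands:

import torch

def update_kv_indices_naive(req_to_token, req_pool_indices, seq_lens):
    # Naive PyTorch sketch: pack each request's KV-cache token indices
    # from a (num_slots, max_context_len) table into one flat buffer.
    # The per-request slicing below launches many tiny ops and syncs the
    # GPU on every iteration, so CPU time grows linearly with batch size.
    bs = seq_lens.numel()
    kv_indptr = torch.zeros(bs + 1, dtype=torch.int32, device=seq_lens.device)
    kv_indptr[1:] = torch.cumsum(seq_lens, dim=0)
    kv_indices = torch.cat(
        [req_to_token[req_pool_indices[i], : seq_lens[i]] for i in range(bs)]
    )
    return kv_indptr, kv_indices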

  • Hardware: 1xH800
  • Model: Llama-3-8B
# launch server
python3 -m sglang.launch_server    \
    --trust-remote-code                  \
    --disable-cuda-graph    \
    --model  xxxx   \
    --context-length 4096    \
    --max-running-requests 4096        \
    --tensor-parallel-size 1        \
    --chunked-prefill-size -1    \
    --disable-radix-cache 

# test
python3 bench_serving.py \
        --backend sglang    \
        --tokenizer xxxxx    \
        --dataset-name random     \
        --num-prompts 5000    \
        --random-output-len 128 \
        --random-input-len 20  

When the running batch is very large, this PR reduces CPU time between steps (decode stage) by about 30% and improves end-to-end performance by about 10%.

[image: profiling screenshot]
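
The optimization moves this packing into a single kernel launch. The Triton code below is a rough sketch of the idea (one program per request copies that request's indices in parallel); it is illustrative only and not necessarily the exact kernel added by this PR:

import torch
import triton
import triton.language as tl

@triton.jit
def fill_kv_indices_kernel(
    req_to_token_ptr,      # (num_slots, max_context_len) int32 table, assumed contiguous
    req_pool_indices_ptr,  # (bs,) table slot of each request
    seq_lens_ptr,          # (bs,) current sequence length of each request
    kv_indptr_ptr,         # (bs + 1,) exclusive prefix sum of seq_lens
    kv_indices_ptr,        # (sum(seq_lens),) packed output buffer
    max_context_len,
    BLOCK: tl.constexpr,
):
    # One program instance per request: copy its token indices from the
    # pooled table into the packed kv_indices buffer, BLOCK elements at a time.
    pid = tl.program_id(0)
    slot = tl.load(req_pool_indices_ptr + pid)
    seq_len = tl.load(seq_lens_ptr + pid)
    out_start = tl.load(kv_indptr_ptr + pid)
    for base in range(0, max_context_len, BLOCK):
        offs = base + tl.arange(0, BLOCK)
        mask = offs < seq_len
        vals = tl.load(req_to_token_ptr + slot * max_context_len + offs, mask=mask)
        tl.store(kv_indices_ptr + out_start + offs, vals, mask=mask)

def update_kv_indices_triton(req_to_token, req_pool_indices, seq_lens):
    # Same interface as the naive sketch above; all per-request work now
    # happens inside one kernel launch instead of a Python loop.
    bs = seq_lens.numel()
    kv_indptr = torch.zeros(bs + 1, dtype=torch.int32, device=seq_lens.device)
    kv_indptr[1:] = torch.cumsum(seq_lens, dim=0)
    kv_indices = torch.empty(int(kv_indptr[-1]), dtype=torch.int32, device=seq_lens.device)
    fill_kv_indices_kernel[(bs,)](
        req_to_token, req_pool_indices, seq_lens,
        kv_indptr, kv_indices, req_to_token.shape[1], BLOCK=512,
    )
    return kv_indptr, kv_indices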

zhyncs (Member) commented Aug 30, 2024

Hi @xiaobochen123, nice work! Could you take a look at the unit test failure?

leo6022 commented Aug 30, 2024

> Hi @xiaobochen123, nice work! Could you take a look at the unit test failure?

@zhyncs Yes, I am looking at it.

zhyncs changed the title from "Optimize the update flash-infer indices" to "Optimize the update flashinfer indices" on Aug 30, 2024
merrymercy (Contributor) commented:
@xiaobochen123 @leo6022 This is pretty good! How did you find this bottleneck? Can we fix the test cases and merge this as soon as possible?

xiaobochen123 (Contributor, Author) commented:
@merrymercy I used nsys to profile the server and found it that way.

The original code hit a Triton error (cause unknown), so I wrote it in a different way to avoid the error.

zhyncs (Member) commented Aug 30, 2024

@xiaobochen123 Could you share updated benchmark results?

zhyncs (Member) commented Aug 30, 2024

CI E2E Test

this PR

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Successful requests:                     400       
Benchmark duration (s):                  159.43    
Total input tokens:                      817076    
Total generated tokens:                  408097    
Total generated tokens (retokenized):    408188    
Request throughput (req/s):              2.51      
Input token throughput (tok/s):          5124.83   
Output token throughput (tok/s):         2559.65   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   98060.09  
Median E2E Latency (ms):                 105310.69 
---------------Time to First Token----------------
Mean TTFT (ms):                          45585.95  
Median TTFT (ms):                        33951.94  
P99 TTFT (ms):                           66.38 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          57.78     
Median TPOT (ms):                        57.19     
P99 TPOT (ms):                           139.83    
---------------Inter-token Latency----------------
Mean ITL (ms):                           51.49     
Median ITL (ms):                         47.75     
P99 ITL (ms):                            194.01    
==================================================

main

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Successful requests:                     400       
Benchmark duration (s):                  160.64    
Total input tokens:                      817076    
Total generated tokens:                  408097    
Total generated tokens (retokenized):    408200    
Request throughput (req/s):              2.49      
Input token throughput (tok/s):          5086.49   
Output token throughput (tok/s):         2540.50   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   97010.70  
Median E2E Latency (ms):                 102937.17 
---------------Time to First Token----------------
Mean TTFT (ms):                          43815.59  
Median TTFT (ms):                        32881.86  
P99 TTFT (ms):                           114076.64 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          57.97     
Median TPOT (ms):                        58.53     
P99 TPOT (ms):                           183.71    
---------------Inter-token Latency----------------
Mean ITL (ms):                           52.20     
Median ITL (ms):                         48.81     
P99 ITL (ms):                            184.93    
==================================================

Based on the CI results, there is almost no improvement. Is the performance of the new implementation as expected? @xiaobochen123

xiaobochen123 (Contributor, Author) commented:
@zhyncs You're not testing with enough concurrency and batch size. The CPU bottleneck only shows up at very high concurrency, e.g. 4000+ running requests in my tests. That's why my bench_serving.py run sets input-len=32 and output-len=128: just to drive a lot of concurrency.

My latest benchmark results:

  • Base: QPS=178,In-throughput=2961, Out-throughput=11486
  • This PR: QPS=190,In-throughput=3155, Out-throughput=12237

zhyncs (Member) commented Aug 30, 2024

> my bench_serving.py set input-len=32 and output-len=128

Where is this information mentioned?

zhyncs (Member) commented Aug 30, 2024

> (quoting the CI E2E Test results and comment posted above)

In this case, the Median TTFT even got worse.

zhyncs (Member) commented Aug 30, 2024

Hold on, I'll verify with your benchmark settings.

zhyncs (Member) commented Aug 30, 2024

# H100 SXM

# server
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --disable-radix-cache

# client
python3 -m sglang.bench_serving --backend sglang --dataset-name random  --num-prompts 5000 --random-output-len 128 --random-input-len 32

# main
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  26.01
Total input tokens:                      82922
Total generated tokens:                  321605
Total generated tokens (retokenized):    321028
Request throughput (req/s):              192.25
Input token throughput (tok/s):          3188.31
Output token throughput (tok/s):         12365.56
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   18592.04
Median E2E Latency (ms):                 20878.07
---------------Time to First Token----------------
Mean TTFT (ms):                          10666.49
Median TTFT (ms):                        8186.85
P99 TTFT (ms):                           21438.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          129.69
Median TPOT (ms):                        152.35
P99 TPOT (ms):                           182.63
---------------Inter-token Latency----------------
Mean ITL (ms):                           145.53
Median ITL (ms):                         97.44
P99 ITL (ms):                            540.48
==================================================

# pr
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  26.41
Total input tokens:                      82922
Total generated tokens:                  321605
Total generated tokens (retokenized):    321065
Request throughput (req/s):              189.31
Input token throughput (tok/s):          3139.66
Output token throughput (tok/s):         12176.88
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   18240.40
Median E2E Latency (ms):                 20422.47
---------------Time to First Token----------------
Mean TTFT (ms):                          10233.91
Median TTFT (ms):                        6640.07
P99 TTFT (ms):                           23509.95
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          131.43
Median TPOT (ms):                        159.41
P99 TPOT (ms):                           191.21
---------------Inter-token Latency----------------
Mean ITL (ms):                           150.97
Median ITL (ms):                         106.25
P99 ITL (ms):                            570.93
==================================================

This is my benchmark result on the H100 SXM. Compared to main, there is no improvement, and even some decline. I think this PR still needs further confirmation of the details. cc @merrymercy @Ying1123

xiaobochen123 (Contributor, Author) commented:
@zhyncs I profiled the Triton kernel and the torch-native implementation. With batch=4096 and max_context_len=4096, the Triton kernel took only about 70 us, while the torch-native implementation took about 15 ms.

I tested the server a few times and found fluctuations in performance. I will look into the reason.
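
For reference, a rough sketch of how such a microbenchmark can be timed with CUDA events; it reuses the illustrative update_kv_indices_naive / update_kv_indices_triton sketches from earlier in this thread, and the shapes mirror the batch=4096, max_context_len=4096 setting quoted here (actual numbers will vary by hardware):

import torch

def time_cuda_ms(fn, iters=20):
    # Average wall time per call in milliseconds, measured with CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()  # warm-up (also triggers Triton JIT compilation on the first call)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

bs, max_context_len = 4096, 4096
req_to_token = torch.arange(bs * max_context_len, dtype=torch.int32, device="cuda").view(bs, max_context_len)
req_pool_indices = torch.arange(bs, device="cuda")
seq_lens = torch.randint(1, max_context_len, (bs,), dtype=torch.int32, device="cuda")

print("naive torch:", time_cuda_ms(lambda: update_kv_indices_naive(req_to_token, req_pool_indices, seq_lens)), "ms")
print("triton     :", time_cuda_ms(lambda: update_kv_indices_triton(req_to_token, req_pool_indices, seq_lens)), "ms")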

Ying1123 (Member) commented Aug 31, 2024

I also observe non-trivial fluctuations. Overall, the end-to-end performance improvement could be ~2-3%. The code change is straightforward. Although the performance check is not fully conclusive, I think this is a safe merge. @zhyncs

Ying1123 merged commit d134c13 into sgl-project:main on Sep 1, 2024 (8 checks passed)