
Optimize the update flashinfer indices #1262

Merged: 5 commits into sgl-project:main on Sep 1, 2024

Conversation

xiaobochen123 (Contributor) commented Aug 30, 2024

When running large batches, sglang also hits a CPU bottleneck. One of the bottlenecks occurs when updating the flashinfer KV indices: the naive PyTorch implementation is slow when the batch is very large.
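
For context, here is a minimal sketch of the kind of per-request gather this update performs (the names req_to_token, req_pool_indices, and seq_lens are illustrative and may not match sglang's internals exactly). The Python-level loop over requests, plus the implicit GPU-to-CPU syncs when slicing with tensor lengths, is what makes a naive version expensive once the batch size reaches the thousands:

import torch

def update_kv_indices_naive(req_to_token, req_pool_indices, seq_lens):
    # Naive PyTorch sketch: pack each request's KV-cache token indices
    # from a (num_slots, max_context_len) table into one flat buffer.
    # The per-request slicing below launches many tiny ops and syncs the
    # GPU on every iteration, so CPU time grows linearly with batch size.
    bs = seq_lens.numel()
    kv_indptr = torch.zeros(bs + 1, dtype=torch.int32, device=seq_lens.device)
    kv_indptr[1:] = torch.cumsum(seq_lens, dim=0)
    kv_indices = torch.cat(
        [req_to_token[req_pool_indices[i], : seq_lens[i]] for i in range(bs)]
    )
    return kv_indptr, kv_indices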

  • Hardware: 1xH800
  • Model: Llama-3-8B
# launch server
python3 -m sglang.launch_server    \
    --trust-remote-code                  \
    --disable-cuda-graph    \
    --model  xxxx   \
    --context-length 4096    \
    --max-running-requests 4096        \
    --tensor-parallel-size 1        \
    --chunked-prefill-size -1    \
    --disable-radix-cache 

# test
python3 bench_serving.py \
        --backend sglang    \
        --tokenizer xxxxx    \
        --dataset-name random     \
        --num-prompts 5000    \
        --random-output-len 128 \
        --random-input-len 20  

When the running batch is very large, this PR reduces CPU time between steps (decode stage) by about 30% and improves end-to-end performance by about 10%.

[image: profiling screenshot]
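
The optimization moves this packing into a single kernel launch. The Triton code below is a rough sketch of the idea (one program per request copies that request's indices in parallel); it is illustrative only and not necessarily the exact kernel added by this PR:

import torch
import triton
import triton.language as tl

@triton.jit
def fill_kv_indices_kernel(
    req_to_token_ptr,      # (num_slots, max_context_len) int32 table, assumed contiguous
    req_pool_indices_ptr,  # (bs,) table slot of each request
    seq_lens_ptr,          # (bs,) current sequence length of each request
    kv_indptr_ptr,         # (bs + 1,) exclusive prefix sum of seq_lens
    kv_indices_ptr,        # (sum(seq_lens),) packed output buffer
    max_context_len,
    BLOCK: tl.constexpr,
):
    # One program instance per request: copy its token indices from the
    # pooled table into the packed kv_indices buffer, BLOCK elements at a time.
    pid = tl.program_id(0)
    slot = tl.load(req_pool_indices_ptr + pid)
    seq_len = tl.load(seq_lens_ptr + pid)
    out_start = tl.load(kv_indptr_ptr + pid)
    for base in range(0, max_context_len, BLOCK):
        offs = base + tl.arange(0, BLOCK)
        mask = offs < seq_len
        vals = tl.load(req_to_token_ptr + slot * max_context_len + offs, mask=mask)
        tl.store(kv_indices_ptr + out_start + offs, vals, mask=mask)

def update_kv_indices_triton(req_to_token, req_pool_indices, seq_lens):
    # Same interface as the naive sketch above; all per-request work now
    # happens inside one kernel launch instead of a Python loop.
    bs = seq_lens.numel()
    kv_indptr = torch.zeros(bs + 1, dtype=torch.int32, device=seq_lens.device)
    kv_indptr[1:] = torch.cumsum(seq_lens, dim=0)
    kv_indices = torch.empty(int(kv_indptr[-1]), dtype=torch.int32, device=seq_lens.device)
    fill_kv_indices_kernel[(bs,)](
        req_to_token, req_pool_indices, seq_lens,
        kv_indptr, kv_indices, req_to_token.shape[1], BLOCK=512,
    )
    return kv_indptr, kv_indices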

zhyncs (Member) commented Aug 30, 2024

Hi @xiaobochen123, nice work! Could you take a look at the unit test failure?

leo6022 commented Aug 30, 2024

> Hi @xiaobochen123, nice work! Could you take a look at the unit test failure?

@zhyncs Yes, I am looking at it.

zhyncs changed the title from "Optimize the update flash-infer indices" to "Optimize the update flashinfer indices" on Aug 30, 2024
merrymercy (Contributor) commented:
@xiaobochen123 @leo6022 This is pretty good! How did you find this bottleneck? Can we fix the test cases and merge this as soon as possible?

xiaobochen123 (Contributor, Author) commented:
@merrymercy I used nsys to profile the server and found it that way.

The original code hit a Triton error (cause unknown), so I wrote it in a different way to avoid the error.

zhyncs (Member) commented Aug 30, 2024

@xiaobochen123 Could you share updated benchmark results?

zhyncs (Member) commented Aug 30, 2024

CI E2E Test

this PR

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Successful requests:                     400       
Benchmark duration (s):                  159.43    
Total input tokens:                      817076    
Total generated tokens:                  408097    
Total generated tokens (retokenized):    408188    
Request throughput (req/s):              2.51      
Input token throughput (tok/s):          5124.83   
Output token throughput (tok/s):         2559.65   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   98060.09  
Median E2E Latency (ms):                 105310.69 
---------------Time to First Token----------------
Mean TTFT (ms):                          45585.95  
Median TTFT (ms):                        33951.94  
P99 TTFT (ms):                           66.38 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          57.78     
Median TPOT (ms):                        57.19     
P99 TPOT (ms):                           139.83    
---------------Inter-token Latency----------------
Mean ITL (ms):                           51.49     
Median ITL (ms):                         47.75     
P99 ITL (ms):                            194.01    
==================================================

main

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Successful requests:                     400       
Benchmark duration (s):                  160.64    
Total input tokens:                      817076    
Total generated tokens:                  408097    
Total generated tokens (retokenized):    408200    
Request throughput (req/s):              2.49      
Input token throughput (tok/s):          5086.49   
Output token throughput (tok/s):         2540.50   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   97010.70  
Median E2E Latency (ms):                 102937.17 
---------------Time to First Token----------------
Mean TTFT (ms):                          43815.59  
Median TTFT (ms):                        32881.86  
P99 TTFT (ms):                           114076.64 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          57.97     
Median TPOT (ms):                        58.53     
P99 TPOT (ms):                           183.71    
---------------Inter-token Latency----------------
Mean ITL (ms):                           52.20     
Median ITL (ms):                         48.81     
P99 ITL (ms):                            184.93    
==================================================

Based on the CI results, there is almost no improvement. Is the performance of the new implementation as expected? @xiaobochen123

xiaobochen123 (Contributor, Author) commented:
@zhyncs You're not testing with enough concurrency and batch size. The CPU bottleneck only shows up at very high concurrency, e.g. 4000+ running requests in my tests. That's why my bench_serving.py run sets input-len=32 and output-len=128: just to drive a lot of concurrency.

My latest benchmark results:

  • Base: QPS=178,In-throughput=2961, Out-throughput=11486
  • This PR: QPS=190,In-throughput=3155, Out-throughput=12237

zhyncs (Member) commented Aug 30, 2024

> my bench_serving.py set input-len=32 and output-len=128

Where is this information mentioned?

zhyncs (Member) commented Aug 30, 2024

> (quoting the CI E2E Test results and comment posted above)

In this case, the Median TTFT even got worse.

zhyncs (Member) commented Aug 30, 2024

Hold on, I'll verify with your benchmark settings.

zhyncs (Member) commented Aug 30, 2024

# H100 SXM

# server
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --disable-radix-cache

# client
python3 -m sglang.bench_serving --backend sglang --dataset-name random  --num-prompts 5000 --random-output-len 128 --random-input-len 32

# main
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  26.01
Total input tokens:                      82922
Total generated tokens:                  321605
Total generated tokens (retokenized):    321028
Request throughput (req/s):              192.25
Input token throughput (tok/s):          3188.31
Output token throughput (tok/s):         12365.56
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   18592.04
Median E2E Latency (ms):                 20878.07
---------------Time to First Token----------------
Mean TTFT (ms):                          10666.49
Median TTFT (ms):                        8186.85
P99 TTFT (ms):                           21438.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          129.69
Median TPOT (ms):                        152.35
P99 TPOT (ms):                           182.63
---------------Inter-token Latency----------------
Mean ITL (ms):                           145.53
Median ITL (ms):                         97.44
P99 ITL (ms):                            540.48
==================================================

# pr
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  26.41
Total input tokens:                      82922
Total generated tokens:                  321605
Total generated tokens (retokenized):    321065
Request throughput (req/s):              189.31
Input token throughput (tok/s):          3139.66
Output token throughput (tok/s):         12176.88
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   18240.40
Median E2E Latency (ms):                 20422.47
---------------Time to First Token----------------
Mean TTFT (ms):                          10233.91
Median TTFT (ms):                        6640.07
P99 TTFT (ms):                           23509.95
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          131.43
Median TPOT (ms):                        159.41
P99 TPOT (ms):                           191.21
---------------Inter-token Latency----------------
Mean ITL (ms):                           150.97
Median ITL (ms):                         106.25
P99 ITL (ms):                            570.93
==================================================

This is my benchmark result on the H100 SXM. Compared to main, there is no improvement, and even some decline. I think this PR still needs further confirmation of the details. cc @merrymercy @Ying1123

xiaobochen123 (Contributor, Author) commented:
@zhyncs I profiled the Triton kernel and the torch-native implementation. With batch=4096 and max_context_len=4096, the Triton kernel took only about 70 us, while the torch-native implementation took about 15 ms.

I tested the server a few times and found fluctuations in performance. I will look into the reason.
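
For reference, a rough sketch of how such a microbenchmark can be timed with CUDA events; it reuses the illustrative update_kv_indices_naive / update_kv_indices_triton sketches from earlier in this thread, and the shapes mirror the batch=4096, max_context_len=4096 setting quoted here (actual numbers will vary by hardware):

import torch

def time_cuda_ms(fn, iters=20):
    # Average wall time per call in milliseconds, measured with CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()  # warm-up (also triggers Triton JIT compilation on the first call)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

bs, max_context_len = 4096, 4096
req_to_token = torch.arange(bs * max_context_len, dtype=torch.int32, device="cuda").view(bs, max_context_len)
req_pool_indices = torch.arange(bs, device="cuda")
seq_lens = torch.randint(1, max_context_len, (bs,), dtype=torch.int32, device="cuda")

print("naive torch:", time_cuda_ms(lambda: update_kv_indices_naive(req_to_token, req_pool_indices, seq_lens)), "ms")
print("triton     :", time_cuda_ms(lambda: update_kv_indices_triton(req_to_token, req_pool_indices, seq_lens)), "ms")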

Ying1123 (Member) commented Aug 31, 2024

I also observe non-trivial fluctuations. Overall, the end-to-end performance improvement could be ~2-3%. The code change is straightforward. Although the performance check is not fully conclusive, I think this is a safe merge. @zhyncs

Ying1123 merged commit d134c13 into sgl-project:main on Sep 1, 2024 (8 checks passed)