Optimize the flashinfer indices update #1262
Conversation
Hi @xiaobochen123, nice work! Could you take a look at the unit test failure?
@zhyncs Yes, I am looking at it.
@xiaobochen123 @leo6022 This is pretty good! How did you find this bottleneck? Can we fix the test cases and merge this as soon as possible?
@merrymercy The code hit a Triton error (cause unknown). I rewrote it in a different way to avoid the error.
@xiaobochen123 Could you update with the latest benchmark results?
CI E2E Test results: this PR vs. main
Based on the CI results, there is almost no improvement. Is the performance of the new implementation as expected? @xiaobochen123
@zhyncs You're not testing with enough concurrency and batch size. The CPU bottleneck only shows up at very high concurrency, such as 4000+ running requests in my tests. That's why my bench_serving.py run sets input-len=32 and output-len=128: just to drive a lot of concurrency. My new profile result:
Where is this information mentioned?
In this case, the median TTFT even got worse.
Hold on, I'll verify with your benchmark settings.
This is my benchmark result on the H100 SXM; compared to main, there is no improvement, and even some decline. I think this PR still needs further confirmation on the details. cc @merrymercy @Ying1123
@zhyncs I profiled the Triton kernel and the torch-native implementation. With batch=4096 and max_context_len=4096, the Triton kernel took only 70us, while the torch-native implementation took 15ms. I tested the server a few times and found fluctuations in performance. I will check the reason.
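For reference, a minimal timing sketch for that kind of kernel-level comparison. The callables `build_kv_indices_torch` and `build_kv_indices_triton` are hypothetical stand-ins for the two implementations being compared (a sketch of what they might look like appears after the PR description below); wall-clock timing is used on purpose, since the quantity of interest is CPU-side overhead per step rather than pure GPU time.

```python
import time
import torch


def bench(fn, *args, iters: int = 100) -> float:
    """Average wall-clock time per call in milliseconds.

    Wall-clock (not CUDA-event) timing is deliberate: the bottleneck
    being measured here is CPU-side launch/indexing overhead.
    """
    for _ in range(10):  # warm-up
        fn(*args)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1000 / iters


# Hypothetical usage with batch=4096, max_context_len=4096:
# print(bench(build_kv_indices_torch, req_to_token, req_pool_indices, paged_kernel_lens, kv_indptr))
# print(bench(build_kv_indices_triton, req_to_token, req_pool_indices, paged_kernel_lens, kv_indptr))
```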
I also observe non-trivial fluctuations. Overall, the e2e performance improvement could be ~2-3%. The code change is straightforward. Although the performance check is not fully conclusive, I think this is a safe merge. @zhyncs
When running a large batch, sglang also hits a CPU bottleneck. One of the bottlenecks occurs when updating the flashinfer KV indices: the naive PyTorch implementation is slow when the batch is very large. A sketch of the change is shown below.
When the running batch is very large, this PR reduces CPU time between steps (decoding stage) by about 30% and improves e2e performance by about 10%.
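For context, here is a rough sketch of the kind of work being moved into a Triton kernel: copying each request's token indices from the request-to-token table into the flat `kv_indices` array that flashinfer expects. The names (`req_to_token`, `req_pool_indices`, `paged_kernel_lens`, `kv_indptr`, `kv_indices`) follow common sglang/flashinfer conventions but are used here illustratively; the actual kernel in this PR may differ in its details.

```python
import torch
import triton
import triton.language as tl


def build_kv_indices_torch(req_to_token, req_pool_indices, paged_kernel_lens, kv_indptr):
    # Naive per-request loop: one small slice copy per request, so the
    # CPU-side overhead grows linearly with batch size.
    kv_indices = torch.empty(int(kv_indptr[-1]), dtype=torch.int32, device="cuda")
    for i in range(len(req_pool_indices)):
        kv_indices[kv_indptr[i] : kv_indptr[i + 1]] = req_to_token[
            req_pool_indices[i], : paged_kernel_lens[i]
        ]
    return kv_indices


@triton.jit
def _build_kv_indices_kernel(
    req_to_token_ptr,       # int32 [max_num_reqs, max_context_len]
    req_pool_indices_ptr,   # int32 [batch]
    paged_kernel_lens_ptr,  # int32 [batch]
    kv_indptr_ptr,          # int32 [batch + 1]
    kv_indices_ptr,         # int32 [sum(paged_kernel_lens)]
    max_context_len: tl.constexpr,
    BLOCK: tl.constexpr,
):
    # One program per request: copy its token indices into the flat output.
    pid = tl.program_id(0)
    req_idx = tl.load(req_pool_indices_ptr + pid)
    seq_len = tl.load(paged_kernel_lens_ptr + pid)
    out_start = tl.load(kv_indptr_ptr + pid)

    for start in range(0, max_context_len, BLOCK):
        offs = start + tl.arange(0, BLOCK)
        mask = offs < seq_len
        vals = tl.load(req_to_token_ptr + req_idx * max_context_len + offs, mask=mask)
        tl.store(kv_indices_ptr + out_start + offs, vals, mask=mask)


def build_kv_indices_triton(req_to_token, req_pool_indices, paged_kernel_lens, kv_indptr):
    # Single kernel launch for the whole batch instead of one copy per request.
    kv_indices = torch.empty(int(kv_indptr[-1]), dtype=torch.int32, device="cuda")
    batch = len(req_pool_indices)
    _build_kv_indices_kernel[(batch,)](
        req_to_token, req_pool_indices, paged_kernel_lens, kv_indptr, kv_indices,
        max_context_len=req_to_token.shape[1], BLOCK=512,
    )
    return kv_indices
```

The design point is that the per-request Python loop issues thousands of tiny copy ops from the CPU at large batch sizes, while a single batched kernel launch keeps the host-side cost roughly constant regardless of batch size.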