
Combine async postprocessor and multi-step - first WIP version #7743

Closed

Conversation

alexm-neuralmagic
Collaborator

This PR combines the async postprocessor with multi-step execution. The key idea is to run sampler pythonization + process_model_outputs of the previous step concurrently with the GPU forward pass of the current step, which avoids running the postprocessor for all generated steps at once at the end of the multi-step and being blocked on it. The code here is WIP and depends on #7049 landing first.
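
A minimal sketch of the overlap idea (hypothetical names, not the actual vLLM code; the real implementation hooks into the model runner and output processor):

# Conceptual sketch: overlap CPU postprocessing of step N-1 with the GPU
# forward pass of step N. run_forward and pythonize_and_process are
# hypothetical stand-ins for the decode step and for sampler
# pythonization + process_model_outputs.
from concurrent.futures import ThreadPoolExecutor
import time

def run_forward(step):
    time.sleep(0.010)  # stand-in for the GPU forward pass + sampling of one step
    return {"step": step}

def pythonize_and_process(raw):
    time.sleep(0.006)  # stand-in for CPU-side pythonization + process_model_outputs

def multi_step_with_async_postprocessing(num_steps=8):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None  # postprocessing of the previous step, if any
        for step in range(num_steps):
            # While this forward pass runs, the previous step's postprocessing
            # (if submitted) executes in the worker thread.
            raw = run_forward(step)
            if pending is not None:
                pending.result()  # previous step's CPU work has overlapped the forward
            pending = pool.submit(pythonize_and_process, raw)
        if pending is not None:
            pending.result()  # drain the final step's postprocessing

multi_step_with_async_postprocessing()

Without the overlap, all of the steps' postprocessing would run back-to-back once the multi-step finishes, which is the stall this PR aims to remove.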

Here are some first results for Llama 3.1 8B on H100 with RPS=10, 8-step multi-step scheduling, 1024 prompt tokens, and 512 decode tokens. The mean TPOT improves by 31%, from 27.71 ms to 21.05 ms.

8-multi-step execution

python benchmark_serving.py --backend vllm --host localhost --port 8888 --endpoint /v1/completions --model meta-llama/Meta-Llama-3.1-8B-Instruct  --num-prompts 200 --dataset-name sonnet --dataset-path sonnet.txt --sonnet-input-len 1024 --sonnet-prefix-len 512 --sonnet-output-len 512 --request-rate 10

============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  30.65     
Total input tokens:                      191053    
Total generated tokens:                  102400    
Request throughput (req/s):              6.53      
Input token throughput (tok/s):          6233.42   
Output token throughput (tok/s):         3340.97   
---------------Time to First Token----------------
Mean TTFT (ms):                          155.89    
Median TTFT (ms):                        144.32    
P99 TTFT (ms):                           419.22    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          27.71     
Median TPOT (ms):                        29.04     
P99 TPOT (ms):                           32.20     
---------------Inter-token Latency----------------
Mean ITL (ms):                           220.88    
Median ITL (ms):                         216.06    
P99 ITL (ms):                            429.27    
==================================================

8-multi-step execution + async postprocessing

============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  27.44     
Total input tokens:                      191053    
Total generated tokens:                  102400    
Request throughput (req/s):              7.29      
Input token throughput (tok/s):          6963.68   
Output token throughput (tok/s):         3732.37   
---------------Time to First Token----------------
Mean TTFT (ms):                          120.86    
Median TTFT (ms):                        113.91    
P99 TTFT (ms):                           239.73    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.05     
Median TPOT (ms):                        21.91     
P99 TPOT (ms):                           25.17     
---------------Inter-token Latency----------------
Mean ITL (ms):                           167.91    
Median ITL (ms):                         156.76    
P99 ITL (ms):                            366.86    
==================================================


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run the full CI, as it is required for merging (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@alexm-neuralmagic
Collaborator Author

@comaniac
Collaborator

Will review after #7049 is merged.

@alexm-neuralmagic
Collaborator Author

Closed in favor of #7921
