
Combine async postprocessor and multi-step - first WIP version #7743

Closed

Conversation

alexm-neuralmagic
Collaborator

This PR combines the async postprocessor with multi-step execution. The key idea is to run sampler pythonization + process_model_outputs of the previous step concurrently with the GPU forward pass of the current step, which avoids running the postprocessor for all generated steps at once at the end of the multi-step and being blocked on it. The code here is WIP and depends on #7049 landing first.
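
A minimal sketch of the overlap idea (hypothetical names, not the actual vLLM code; the real implementation hooks into the model runner and output processor):

# Conceptual sketch: overlap CPU postprocessing of step N-1 with the GPU
# forward pass of step N. run_forward and pythonize_and_process are
# hypothetical stand-ins for the decode step and for sampler
# pythonization + process_model_outputs.
from concurrent.futures import ThreadPoolExecutor
import time

def run_forward(step):
    time.sleep(0.010)  # stand-in for the GPU forward pass + sampling of one step
    return {"step": step}

def pythonize_and_process(raw):
    time.sleep(0.006)  # stand-in for CPU-side pythonization + process_model_outputs

def multi_step_with_async_postprocessing(num_steps=8):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None  # postprocessing of the previous step, if any
        for step in range(num_steps):
            # While this forward pass runs, the previous step's postprocessing
            # (if submitted) executes in the worker thread.
            raw = run_forward(step)
            if pending is not None:
                pending.result()  # previous step's CPU work has overlapped the forward
            pending = pool.submit(pythonize_and_process, raw)
        if pending is not None:
            pending.result()  # drain the final step's postprocessing

multi_step_with_async_postprocessing()

Without the overlap, all of the steps' postprocessing would run back-to-back once the multi-step finishes, which is the stall this PR aims to remove.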

Here are some first results for Llama 3.1 8B on H100 with RPS=10, 8-step multi-step scheduling, 1024 prompt tokens, and 512 decode tokens. The mean TPOT improves by 31%, from 27.71 ms to 21.05 ms.

8-multi-step execution

python benchmark_serving.py --backend vllm --host localhost --port 8888 --endpoint /v1/completions --model meta-llama/Meta-Llama-3.1-8B-Instruct  --num-prompts 200 --dataset-name sonnet --dataset-path sonnet.txt --sonnet-input-len 1024 --sonnet-prefix-len 512 --sonnet-output-len 512 --request-rate 10

============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  30.65     
Total input tokens:                      191053    
Total generated tokens:                  102400    
Request throughput (req/s):              6.53      
Input token throughput (tok/s):          6233.42   
Output token throughput (tok/s):         3340.97   
---------------Time to First Token----------------
Mean TTFT (ms):                          155.89    
Median TTFT (ms):                        144.32    
P99 TTFT (ms):                           419.22    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          27.71     
Median TPOT (ms):                        29.04     
P99 TPOT (ms):                           32.20     
---------------Inter-token Latency----------------
Mean ITL (ms):                           220.88    
Median ITL (ms):                         216.06    
P99 ITL (ms):                            429.27    
==================================================

8-multi-step execution + async postprocessing

============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  27.44     
Total input tokens:                      191053    
Total generated tokens:                  102400    
Request throughput (req/s):              7.29      
Input token throughput (tok/s):          6963.68   
Output token throughput (tok/s):         3732.37   
---------------Time to First Token----------------
Mean TTFT (ms):                          120.86    
Median TTFT (ms):                        113.91    
P99 TTFT (ms):                           239.73    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.05     
Median TPOT (ms):                        21.91     
P99 TPOT (ms):                           25.17     
---------------Inter-token Latency----------------
Mean ITL (ms):                           167.91    
Median ITL (ms):                         156.76    
P99 ITL (ms):                            366.86    
==================================================


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run the full CI, as it is required for merging (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@alexm-neuralmagic
Collaborator Author

@comaniac
Collaborator

Will review after #7049 is merged.

@alexm-neuralmagic
Collaborator Author

Closed in favor of #7921
