[Performance]: The impact of CPU on vLLM performance is significant. #8147
Comments
Related experiments: #7540
@WoosukKwon @youkaichao Please provide some assistance.
What vLLM version are you using?
0.5.5
We are optimizing the CPU time; please stay tuned. It should not be so dependent on CPU performance in the future.
What is the reason for vLLM's current heavy dependence on the CPU, and what are the directions for optimization?
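To illustrate why a fast GPU makes CPU-side work the bottleneck, here is a minimal toy model (not vLLM's actual code; the 5 ms per-step CPU cost is an assumed figure for illustration). Each decode step pays a fixed CPU cost for scheduling, sampling bookkeeping, and detokenization, so as GPU kernels get faster, the CPU share of each step grows:

```python
def cpu_share(num_steps, gpu_step_s, cpu_overhead_s):
    """Toy model of an autoregressive decode loop: every step pays a
    fixed CPU-side cost (scheduling, sampling bookkeeping,
    detokenization) on top of the GPU kernel time.  Returns the total
    wall time and the fraction of it spent on CPU work."""
    cpu_total = cpu_overhead_s * num_steps
    gpu_total = gpu_step_s * num_steps
    total = cpu_total + gpu_total
    return total, cpu_total / total

# Illustrative numbers: as the GPU step shrinks, a fixed 5 ms of CPU
# work per step takes over more of the runtime.
for gpu_ms in (30.0, 10.0, 5.0):
    _, frac = cpu_share(1000, gpu_ms / 1000.0, 0.005)
    print(f"GPU step {gpu_ms:4.1f} ms -> CPU share {frac:.0%}")
```

This is why a faster CPU helps disproportionately on fast GPUs: it shrinks the fixed per-step term rather than the kernel time.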
Our team has developed some speculative decoding features on top of vLLM, which have been used internally and have yielded good performance benefits. How can we join the vLLM project, and where would be a good place to start?
You are welcome to send emails to vllm-questions@lists.berkeley.edu
Really interesting. Thanks for reporting. The GPUs are getting fast :)
Hi @skylee-01, thanks for reporting this! We also recently discovered the same problem. We plan to do more optimizations to mitigate the CPU effect. vLLM is a fully open, community-driven project, so we'd appreciate any contributions, including submitting or reviewing PRs, answering questions, and helping with documentation.
May I know whether this CPU optimization has been completed in the latest master? @youkaichao
This problem has been optimized in the latest version, and it will be improved further in the future.
Proposal to improve performance
We used the same GPU on two machines with different CPUs and drew the following conclusions.
Experimental setup: the GPU is a 3090, and the CPU was upgraded from a Xeon Gold 6240 to an i9-12900K. The impact is as follows.
a. vLLM achieved a 3.8x speedup in the agent scenario.
b. TGI achieved a 1.23x speedup in the agent scenario.
c. vLLM still has latency issues, but the latency has been reduced to 100 ms (previously 300 ms).
d. GPU utilization increased from 70% to 90%.
From the stress-test data, it is evident that vLLM relies heavily on CPU performance.
What are the main factors affecting CPU performance, and how can they be optimized?
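The reported speedups can be turned into a rough lower bound on how CPU-bound each engine was, via an Amdahl's-law argument. This is a back-of-the-envelope sketch that assumes the CPU swap changed only CPU-side time:

```python
def min_cpu_fraction(observed_speedup):
    """If replacing only the CPU yields an end-to-end speedup S, then
    Amdahl's law (1/S = (1 - f) + f/k, where k is the CPU speedup
    factor and f the CPU-bound fraction of the original runtime)
    implies f >= 1 - 1/S, even for an arbitrarily faster new CPU."""
    return 1.0 - 1.0 / observed_speedup

# Figures from the experiment above (3090 GPU, Xeon 6240 -> i9-12900K).
print(f"vLLM was at least {min_cpu_fraction(3.8):.0%} CPU-bound")
print(f"TGI was at least {min_cpu_fraction(1.23):.0%} CPU-bound")
```

By this estimate, vLLM spent at least ~74% of its runtime on the CPU before the upgrade, versus at least ~19% for TGI, which is consistent with vLLM benefiting far more from the faster CPU.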