[RFC]: Performance Roadmap #6801
This is a meta RFC tracking some of the performance enhancement work we are prioritizing.
Let's pin this issue!
@SolitaryThinker please pin the multi-step inference issue
If we're talking performance, can someone explain how sglang is 50-100% faster than vLLM while using vLLM as a backend? Their code base looks like a pure Python vLLM wrapper. How are they that much faster?
To be fully transparent, we are still figuring it out. @KuntaiDu is reproducing benchmark here: #6794 (comment)
@w013nad There are some niche details. For example, HF duplicated the KV heads from 8 to 16 but hasn't updated the config; we are waiting for HF to update the config and the weights from 16 back to 8. However, sglang directly patches it, which means they use less memory for the KV cache and will have better throughput.
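To make the KV-head point concrete, here is a rough back-of-envelope sketch of per-token KV-cache memory. The layer count and head dimension below are illustrative placeholders, not values taken from the affected model's config:

```python
# Rough estimate of KV-cache bytes per token, to illustrate why halving the
# number of KV heads (16 -> 8) doubles how many tokens fit in the same memory.
# num_layers and head_dim are illustrative placeholders.

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # 2x for storing both K and V per layer; fp16/bf16 cache assumed.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

dup = kv_bytes_per_token(num_layers=126, num_kv_heads=16, head_dim=128)
fix = kv_bytes_per_token(num_layers=126, num_kv_heads=8, head_dim=128)
print(f"duplicated heads: {dup / 2**20:.2f} MiB/token, patched: {fix / 2**20:.2f} MiB/token")
# Same KV-cache budget -> roughly twice the tokens (and thus larger batches)
# once the duplicated heads are removed.
```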
Hi @w013nad I am a member of the SGLang team. We are delighted that you have taken an interest in our project, SGLang https://github.com/sgl-project/sglang. We recently published a blog post introducing the performance advantages of v0.2: https://lmsys.org/blog/2024-07-25-sglang-llama3/. The performance data in the blog shows that on the Llama 3 models, from 8B to 70B to 405B, and from BF16 to FP8, on both A100 and H100, SGLang is far ahead in performance thanks to extensive engineering optimizations such as efficient scheduling, CUDA Graphs, Torch Compile, etc. In fact, SGLang currently does not use vLLM as a backend; everything from attention to scheduling and the various optimizations is implemented by SGLang itself. At present we only reuse some linear layer classes from vLLM. I hope this answers your question. We greatly appreciate and thank vLLM for the excellent work and look forward to continuing outstanding work on LLM inference acceleration, cheers.
@w013nad It is incorrect to say that "SGLang is a pure Python vLLM wrapper." Otherwise, it cannot explain the speedup you mentioned.
SGLang has a clean and well-designed architecture. It is also a fully open community with many experienced inference optimization developers. I encourage everyone to give it a try and run the benchmark. Both SGLang and vLLM are great projects, and they can learn from each other.
Admittedly, I didn't look too much through your code base. I just saw the GitHub code overview showing 99% Python and that you were importing a lot of vLLM components in your model executor. I was very impressed by your project though. I did a few of my own tests on my A100s and found that your codebase was much faster across the board.
@w013nad Great to hear that you got good numbers from SGLang.
As you said, SGLang is 99% Python. It reflects our philosophy of modular and minimal design.
Thanks for the clarification! Really hope HF can fix the kv heads issue so that the community can benefit from it without using hacks.
I tested sglang and vllm using the llama3.1-8B model on a single L40S GPU. I found that at low batch sizes the speeds of sglang and vllm are similar; however, at higher batch sizes, sglang performs better than vllm.
Hi @WangErXiao Thanks for your interest. For enterprise-level users, increasing the batch size is very necessary and reasonable, as they have a huge amount of traffic. SGLang scales very well in this respect, supporting batch sizes of up to thousands.
When BS is small, vllm/sglang/trt-llm performance is similar. The difference only shows up when BS is large.
Hi @mpjlu Thanks for sharing your insights. In fact, real online traffic is measured in terms of request rate, not batch size. Even at a request rate of 1, the batch size can still accumulate to be very large. Many benchmarks today have a problem: they do not account for the impact of the number of requests. For example, if the request rate increases from 1 to 2 to 4 and up to 16, it is unreasonable to always use 1000 num prompts, because aside from server warmup and cooldown taking up some time, with too few requests the proportion of the run during which the server is truly at capacity is very small. That cannot reflect the server's real capability, and on line charts it makes different frameworks' performance appear to converge when that is actually not the case. Conversely, if the request rate exceeds the server's capacity, requests queue up, and a larger number of prompts then causes higher end-to-end latency. Real online traffic, especially at large internet companies (search, recommendation, advertising), has peaks and troughs but a very high overall volume. If you have more in-depth thoughts, feel free to reach out, thanks!
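For illustration, here is a small back-of-envelope sketch of the num-prompts point above. The ramp-up/drain time is a made-up constant; the point is only that a fixed 1000 prompts covers very different amounts of steady-state time at different request rates:

```python
# Back-of-envelope: at request rate R (req/s) with N total prompts, the whole
# benchmark only lasts about N / R seconds of arrivals. If warmup/ramp-up and
# drain take a roughly fixed amount of time, a small N means the server spends
# only a small fraction of the run at its steady-state concurrency.
# All numbers here are hypothetical, for illustration only.

def steady_state_fraction(num_prompts, request_rate, ramp_seconds=60.0):
    arrival_window = num_prompts / request_rate          # seconds of arrivals
    steady = max(arrival_window - ramp_seconds, 0.0)     # time actually at capacity
    return steady / (arrival_window + ramp_seconds)

for rate in (1, 2, 4, 8, 16):
    print(rate, f"{steady_state_fraction(num_prompts=1000, request_rate=rate):.2f}")
# At rate 16, 1000 prompts arrive in ~62 s, so almost none of the run measures
# the server at sustained capacity; scaling num_prompts with the rate avoids this.
```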
[By batch size I mean the real input batch of vllm.model_execution.] In many cases we find that a huge performance difference is caused by some arguments not being set properly. For example, llama 1 7b and llama3 7b: the network is the same, but if you use the same script to test performance, the results will be different.
@WangErXiao You can try to add
Hi @mpjlu You might not have carefully understood the true meaning of input length and output length here. This uses the random dataset with a random ratio of 0, meaning the input length is drawn from [1, 1024] and the output length from [1, 1024], both discrete uniform.
Hi @zhyncs, thanks for your kind reply.
Hi @mpjlu Thank you for your detailed and patient explanation; your understanding is correct. The computational load of the forward pass can be roughly estimated as 2·N·D (N parameters, D tokens). I apologize for my previous aggressive reply, thank you!
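For reference, a quick worked example of the 2·N·D estimate (N = parameter count, D = tokens processed), using an illustrative 8B model and batch size:

```python
# Rough forward-pass FLOPs via the 2*N*D rule of thumb:
# N = parameter count, D = tokens processed. Numbers are illustrative.
N = 8e9        # e.g. an 8B-parameter model
D = 256 * 1    # one decode step for a batch of 256, one new token per sequence
flops = 2 * N * D
print(f"{flops / 1e12:.2f} TFLOPs per decode step")   # ~4.10 TFLOPs
# Against an accelerator capable of a few hundred TFLOP/s, a single decode step
# at this batch size is far from compute bound; memory traffic dominates.
```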
I have done some profiling of the performance difference between sglang and vllm. When the request rate is small (e.g. fewer than ~50 running requests), the performance gap between sglang and vllm is small (less than 10%). When the request rate is large, or offline with many requests sent to the server at the same time, the gap is large. The key reason is vLLM's max_num_seqs argument, which controls the batch size of the model forward; the default value is 256, but for sglang this value is set by profiling and can be very large. So if you test vLLM with default args, the largest forward batch size is 256, while sglang's can be much larger. The other reason vLLM's performance is not good at large BS is that the time between two steps is long. This includes sampling time, prepare-input time, process_model_output time, and scheduler time. When BS is 200, all of this together can be 20 ms. In my tests, the scheduler / torch compile difference is not the key reason for the performance gap.
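A tiny model of that inter-step gap, using the 20 ms figure above and a hypothetical 40 ms forward time, shows how much decode throughput it costs:

```python
# How the CPU-side time between two forward steps (sampling, prepare inputs,
# process outputs, scheduling) erodes decode throughput. step_ms is a
# hypothetical GPU forward time; the 20 ms gap at BS=200 comes from the
# profile quoted above.

def decode_tokens_per_second(batch_size, step_ms, gap_ms):
    # One new token per sequence per step; the GPU is idle during gap_ms.
    return batch_size * 1000.0 / (step_ms + gap_ms)

bs, step_ms, gap_ms = 200, 40.0, 20.0
real = decode_tokens_per_second(bs, step_ms, gap_ms)
ideal = decode_tokens_per_second(bs, step_ms, 0.0)
print(f"{real:.0f} tok/s vs {ideal:.0f} tok/s ideal "
      f"-> {100 * gap_ms / (step_ms + gap_ms):.0f}% of each step lost to CPU overhead")
```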
Thanks for the details! We agree the time between steps is too long and are actively working on each of these pieces. But you bring up a good point about our defaults; it's a good idea to do a sweep over these default parameters as well. Thanks!
@mpjlu this is a good insight about the max_num_seqs default being 256. Any reason it is set to this specific value?
@alexm-neuralmagic I think 256 was a very large value 6 months ago, because most models at that time didn't use GQA, so the KV cache was large and you couldn't run a very large BS.
If manually setting a larger
In our blog, we mentioned that vLLM suffers from high CPU scheduling overhead. This does not refer to pure scheduling like FCFS, but rather the scheduling between CPU and GPU workloads.
This is inaccurate. Real-world production involves more than online scenarios. My experience in search and recommendation, possibly the highest-traffic area in the internet industry, is that we used both offline and online serving. Offline, our focus was solely on throughput, without concern for latency. In the online scenario, enabling streaming allows us to significantly increase batch size as long as TTFT and ITL stay within acceptable limits.
@mpjlu Thanks for the good insight!
Right now I don't have an 80 GB GPU to test a larger max_num_seqs. I just found that on a 46 GB GPU, the llama 3 8b model (input 1024, output 1024) can reach a BS of about 220, while on an 80 GB GPU this BS can be about 700. So vLLM's default BS of 256 should impact performance.
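As a sanity check on those numbers, here is a rough KV-cache capacity estimate for an 8B GQA model. The weight size, utilization factor, and assumed average live context length are my own assumptions, so treat the outputs as order-of-magnitude only:

```python
# Ballpark of how many concurrent 1024-in / 1024-out sequences the KV cache of
# an 8B GQA model can hold. Model dimensions (32 layers, 8 KV heads, head_dim
# 128, fp16 cache), weight size, and average live context are illustrative.

def max_running_seqs(gpu_gib, weights_gib=16, util=0.9, avg_ctx_tokens=1024):
    kv_bytes_per_token = 2 * 32 * 8 * 128 * 2         # K+V, layers, heads, dim, fp16
    kv_budget = (gpu_gib * util - weights_gib) * 2**30
    return int(kv_budget / (kv_bytes_per_token * avg_ctx_tokens))

print("46 GiB GPU:", max_running_seqs(46))   # on the order of a couple hundred
print("80 GiB GPU:", max_running_seqs(80))   # well above the 256 default
```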
So these are your assumptions rather than actual test results? If you have concrete data to verify this, feel free to share it.
Yes, this is my assumption. But I have some other tests: https://doc.weixin.qq.com/doc/w3_AH8AWgYyACkl1A4yKimS6SluHVrZ8
@mpjlu Raising
@zhyncs, thanks very much. I will find the reason when I have an 80 GB GPU.
Hi @mpjlu, we did notice that default value of 256, and we tried increasing it but did not get consistently better numbers. To give you one data point, in one of our offline benchmarks of 4000 requests with input_len and output_len in [0, 1024], batch size 256 delivers 1816.09 output tok/s, and batch size 512 delivers 1687.79 output tok/s.
vLLM also has --gpu-memory-utilization set to 0.9 by default. On an 80 GB GPU, this means that 8 GB are potentially not used. I have seen that increasing --gpu-memory-utilization to 0.98 is possible and does provide better results. In general, the 0.9 default is a safe choice for now because some memory (like CUDA graphs) is not profiled accurately, but this may change with more precise bootstrap profiling.
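For anyone who wants to try this, a minimal sketch of overriding both defaults with the offline API follows; the argument names assume current vLLM engine arguments and may differ across versions, and the model name is just an example, not one from this thread:

```python
# Minimal sketch of overriding the two defaults discussed above
# (gpu_memory_utilization=0.9 and max_num_seqs=256) via the offline LLM API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    gpu_memory_utilization=0.95,  # leave a little headroom for CUDA graphs, etc.
    max_num_seqs=512,             # allow larger decode batches than the 256 default
)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```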
@Ying1123 increasing the default max batch size in vLLM increases the probability of request preemption (which currently has high overhead). Did you see a log of preempted requests when you increased from 256 to 512? (vLLM usually warns the user when this happens)
It is OK on A100: increasing BS from 256 to 512 does not increase throughput; only when memory bound does increasing BS increase throughput. For A100, BS 256 is compute bound; for H100, BS 256 is still memory bound, so on H100 the default value should be increased.
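A rough roofline calculation supports that A100-vs-H100 claim. The peak numbers below are approximate datasheet values, and the model is assumed to decode with bf16 weights:

```python
# Rough roofline check of the claim above. For decode GEMMs, each bf16 weight
# (2 bytes) is read once and reused for ~2*B FLOPs across a batch of B
# sequences, so arithmetic intensity ~= B FLOP/byte. Peak figures are
# approximate datasheet numbers.

def crossover_batch(peak_tflops, hbm_tb_per_s):
    # Batch size where decode stops being memory bound (intensity == ops:byte).
    return peak_tflops / hbm_tb_per_s

print(f"A100: compute bound above BS ~{crossover_batch(312, 2.0):.0f}")   # ~156
print(f"H100: compute bound above BS ~{crossover_batch(989, 3.35):.0f}")  # ~295
# BS 256 sits above the A100 crossover but below the H100 one, which is why a
# larger default max_num_seqs mainly helps on H100-class GPUs.
```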
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!