
[RFC]: Performance Roadmap #6801

Open
3 of 5 tasks
simon-mo opened this issue Jul 25, 2024 · 37 comments

Comments

@simon-mo
Collaborator

simon-mo commented Jul 25, 2024

Anything you want to discuss about vllm.

This is a meta RFC tracking some of the performance enhancement work we are prioritizing.

@simon-mo simon-mo added misc RFC and removed misc labels Jul 25, 2024
@richardliaw
Collaborator

Let's pin this issue!

@simon-mo simon-mo pinned this issue Jul 25, 2024
@zhisbug
Collaborator

zhisbug commented Jul 26, 2024

@SolitaryThinker please pin the multi-step inference issue

@w013nad

w013nad commented Jul 26, 2024

If we're talking performance, can someone explain how sglang is 50-100% faster than vLLM while using vLLM as a backend? Their code base looks like a pure Python vLLM wrapper. How are they that much faster?

@simon-mo
Collaborator Author

To be fully transparent, we are still figuring it out. @KuntaiDu is reproducing the benchmark here: #6794 (comment)

@youkaichao
Member

@w013nad there are some niche details. For example, HF duplicated the KV heads from 8 to 16 but hasn't updated the config; we are waiting for HF to update the config and the weights from 16 back to 8.

However, sglang directly patches it:

https://github.com/sgl-project/sglang/blob/8628ab9c8bdf9b01c4671e3c6caabf49afd73395/python/sglang/srt/managers/controller/model_runner.py#L126

which means they use less memory for the KV cache and therefore get better throughput.
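
For illustration, a minimal sketch of the KV-cache arithmetic behind this; the layer count, head count, and head dimension below are assumed Llama-3-8B-like values, not figures taken from this thread:

```python
# Rough per-token KV-cache size: 2 (K and V) * kv_heads * head_dim * layers * bytes per element.
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    return 2 * num_kv_heads * head_dim * dtype_bytes * num_layers

# Assumed Llama-3-8B-like dimensions: 32 layers, head_dim 128, fp16/bf16 cache.
for kv_heads in (8, 16):
    size = kv_bytes_per_token(num_layers=32, num_kv_heads=kv_heads, head_dim=128)
    print(f"{kv_heads:>2} KV heads -> {size / 1024:.0f} KiB per cached token")
# With a fixed KV-cache pool, the duplicated 16-head layout caches only half as many tokens.
```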

@zhyncs
Contributor

zhyncs commented Jul 26, 2024

If we're talking performance, can someone explain how sglang is 50-100% faster than vLLM while using vLLM as a backend? Their code base looks like a pure Python vLLM wrapper. How are they that much faster?

Hi @w013nad, I am a member of the SGLang team. We are delighted that you have taken an interest in our project, SGLang (https://github.com/sgl-project/sglang). We recently published a blog post introducing the performance advantages of v0.2 (https://lmsys.org/blog/2024-07-25-sglang-llama3/). The data in the blog shows that on the Llama 3 models, from 8B to 70B to 405B, and from BF16 to FP8, on both A100 and H100, SGLang is far ahead in performance thanks to extensive engineering optimizations such as efficient scheduling, CUDA Graphs, torch.compile, etc. In fact, SGLang does not currently use vLLM as a backend; everything from attention to scheduling and various optimizations is implemented by SGLang itself. At present we only utilize some linear layer classes from vLLM. Hope this answers your question. We greatly appreciate and thank vLLM for its excellent work, and we look forward to continuing outstanding work on LLM inference acceleration, cheers.

@Ying1123

Ying1123 commented Jul 26, 2024

@w013nad It is incorrect to say that "SGLang is a pure Python vLLM wrapper." Otherwise, it cannot explain the speedup you mentioned.

  1. SGLang has its own efficient memory management and batch scheduler. Essentially, it only imports some fused kernels and layers from vLLM. SGLang uses vLLM as a kernel library, similar to how it uses FlashInfer. All other upper-level code in vLLM is irrelevant to SGLang.
  2. What @youkaichao mentioned is an optimization we implemented for the 405B model. However, it is not used in any of the benchmark results we have released. We always make fair comparisons. In our blog post, we use dummy weights with 8 heads for all frameworks to ensure fairness.

SGLang has a clean and well-designed architecture. It is also a fully open community with many experienced inference optimization developers. I encourage everyone to give it a try and run the benchmark. Both SGLang and vLLM are great projects and they can learn from each other.

@w013nad

w013nad commented Jul 26, 2024

Hi @w013nad, I am a member of the SGLang team. We are delighted that you have taken an interest in our project, SGLang. [...]

Admittedly I didn't look too much through your code base. I just saw the GitHub code overview that showed 99% Python and that you were importing a lot of vLLM components in your model executor. I was very impressed by your project though. I did a few of my own tests on my A100s and found that your codebase was much faster across the board.

https://www.reddit.com/r/LocalLLaMA/comments/1ebztye/the_milestone_release_of_sglang_runtime_v02/leynwa2/

@Ying1123

Ying1123 commented Jul 26, 2024

@w013nad Great to hear that you got good numbers from SGLang.

  1. For the llama-8b model, you can add --enable-torch-compile, which will give you an even better speedup.
  2. Speculative decoding is on our roadmap.
  3. A tokenization endpoint should be easy to add. Your contribution is also welcome!
  4. Outlines support is there and is also much faster than vLLM. This is an example of how you can use it with the OpenAI-compatible API.

As you said, SGLang is 99% Python. It reflects our philosophy of modular and minimal design.

@youkaichao
Member

What @youkaichao mentioned is an optimization we implemented for 405B model. However, it is not used in any of the benchmark results we have released. We always make fair comparisons. In our blog post, we use dummy weights with 8 heads for all frameworks to ensure fairness.

Thanks for the clarification! Really hope HF can fix the kv heads issue so that the community can benefit from it without using hacks.

@SolitaryThinker
Contributor

@simon-mo please update with the RFC for multi-step: #6854

@WangErXiao

Hi @w013nad, I am a member of the SGLang team. We are delighted that you have taken an interest in our project, SGLang. [...]

I tested sglang and vllm using the llama3.1-8B model on a single L40s GPU. I found that at low batch sizes, the speeds of sglang and vllm are similar. However, at higher batch sizes, sglang performs better than vllm.

@zhyncs
Contributor

zhyncs commented Jul 31, 2024

Hi @WangErXiao Thanks for your interest. For enterprise-level users, increasing the batch size is very necessary and reasonable, as they have a huge amount of traffic. SGLang scales very well, supporting batch sizes of up to thousands.

@mpjlu

mpjlu commented Jul 31, 2024

When BS is small, vLLM/SGLang/TRT-LLM performance is similar. The difference only shows up when BS is large.
When the batch is large, vLLM performance is low. The reason may be that vLLM uses max_num_batched_tokens to profile memory: when max_num_batched_tokens is large, the profiled memory is huge, so the number of blocks that can be used for the KV cache is small.
Just use a small max_num_batched_tokens for the benchmark and vLLM performance will be good at large batch sizes.
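
For reference, a minimal sketch of setting these knobs through vLLM's offline API; the argument names are assumed to match your vLLM version, and the values are illustrative, not recommendations:

```python
from vllm import LLM, SamplingParams

# Sketch: pass the scheduler-related engine arguments explicitly instead of relying on defaults.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model choice
    max_num_seqs=256,               # max sequences scheduled per step
    max_num_batched_tokens=8192,    # token cap per step, used when profiling memory;
                                    # typically must be >= the model's max length unless
                                    # chunked prefill is enabled
    gpu_memory_utilization=0.90,    # fraction of GPU memory vLLM may claim
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```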

@zhyncs
Contributor

zhyncs commented Jul 31, 2024

Hi @mpjlu Thanks for sharing your insights. In fact, real online requests are measured in terms of request rate, not batch size. Even at a request rate of 1, the batch size can still accumulate to be very large. Many benchmark tests currently have a problem: they do not account for the impact of the number of requests during the benchmark. For example, if the request rate increases from 1 to 2 to 4 and up to 16, it is unreasonable to always use 1000 prompts, because aside from server warmup and cooldown taking up some time, if the number is too small, the proportion of the benchmark period during which the server is truly at capacity is very small. This cannot reflect the server's true capability. On line charts it would appear as though different frameworks' performance is converging, when that is actually not the case. If the request rate exceeds the server's capacity, requests queue up, and a larger number of prompts then causes higher E2E latency. Actual online traffic, especially the online services of large internet companies (search, recommendation, advertising), has peaks and troughs but still a very high overall volume. If you have more in-depth thoughts, feel free to reach out, thanks!
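
A small back-of-the-envelope sketch of the num-prompts vs. request-rate point (the warmup/cooldown figure is an assumption for illustration):

```python
# With a fixed number of prompts, higher request rates shrink the benchmark window,
# so warmup/cooldown consume a larger share of it and the server spends less time at capacity.

num_prompts = 1000
warmup_cooldown_s = 30.0  # assumed overhead, for illustration only

for request_rate in (1, 2, 4, 8, 16):
    send_window_s = num_prompts / request_rate  # time spent issuing requests at this rate
    at_capacity_share = max(0.0, 1 - warmup_cooldown_s / send_window_s)
    print(f"rate={request_rate:>2} req/s  window={send_window_s:7.1f}s  steady-state share ~{at_capacity_share:.0%}")
```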

@mpjlu

mpjlu commented Jul 31, 2024

[Batch size here means the real input batch of vLLM model execution.] We find in many cases that the huge performance difference is caused by some arguments not being set properly. For example, Llama 1 7B and Llama 3 7B: the network is the same, but if you use the same script to test performance, the results will be different.

@merrymercy
Contributor

merrymercy commented Jul 31, 2024

@WangErXiao You can try to add --enable-torch-compile for smaller models. It will greatly accelerate small batch sizes.
@mpjlu We checked the KV cache pool size. In most cases, SGLang and vLLM use a very similar KV cache pool size, with a difference of less than 5%. vLLM sometimes even allocates a larger KV cache pool.

@mpjlu

mpjlu commented Aug 1, 2024

(image: 8b_latency)
Hi @merrymercy @zhyncs, I have a question about this picture. My calculation shows that an A100 cannot support a 7B model with a total length of 2k (1k input + 1k output) at 16 qps.
My method is:
16 * 7B * 2 * 2k = 448 TFLOP/s > 312 TFLOP/s (A100 fp16).
Is this calculation wrong, or does this online test not use the full 2k length?

@zhyncs
Contributor

zhyncs commented Aug 1, 2024

Hi @mpjlu You might not have carefully understood the true meaning of input length and output length here. This uses a random dataset with a random ratio of 0, meaning that the input length is in [1, 1024] and the output length is in [1, 1024], both discrete uniform.

@mpjlu

mpjlu commented Aug 1, 2024

Hi @zhyncs, thanks for your kind reply.
First, since the lengths are random, the calculation becomes:
16 * 7B * 2 * 2k / 2 = 224 TFLOP/s < 312 TFLOP/s (A100 fp16), so supporting 16 qps is feasible.
If the framework can support 16 qps, then no matter how long a request waits, it still amounts to processing 16 requests per second.
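
Written out as a quick script, a sketch of the same 2ND estimate; it ignores attention FLOPs and assumes perfect hardware utilization:

```python
# Rough compute demand: ~2 FLOPs per parameter per token (the "2ND" rule), ignoring attention.
params = 7e9                   # ~7B-parameter model, as in the estimate above
avg_total_tokens = 2048 / 2    # input and output lengths uniform in [1, 1024] average to ~1k total
qps = 16

required_tflops = qps * 2 * params * avg_total_tokens / 1e12
print(f"required ~{required_tflops:.0f} TFLOP/s vs. ~312 TFLOP/s A100 fp16 dense peak")
# ~229 TFLOP/s < 312 TFLOP/s, so 16 qps is feasible in this idealized model.
```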

@zhyncs
Contributor

zhyncs commented Aug 1, 2024

Hi @mpjlu Thank you for your detailed and patient explanation; your understanding is correct. The computational load of the forward pass can be roughly estimated with the 2ND calculation. I apologize for my previous aggressive reply, thank you!

@mpjlu

mpjlu commented Aug 8, 2024

I have done some profiling of the performance difference between SGLang and vLLM. When the request rate is small (e.g. fewer than 50 running requests), the performance gap between SGLang and vLLM is small (less than 10%). When the request rate is large, or offline with many requests sent to the server at the same time, the performance gap is large.

The key reason is the max_num_seqs arg of vLLM, which controls the batch size of the model forward pass; the default value is 256. For SGLang, this value is set by profiling and can be very large. So for vLLM, if you use default args to test performance, the largest forward batch size is 256, while SGLang's can be much larger.

The other reason vLLM performance is not good when BS is large is that the time between two steps is long. This includes sampling time, prepare-input time, process_model_output time, and scheduler time. When BS is 200, all of this together can be 20 ms.

In my test, the scheduler / torch.compile difference is not the key reason for the performance gap.
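
As a rough illustration of what a fixed 20 ms gap between steps costs (the GPU step times below are assumptions, not measurements from this thread):

```python
# If every decode step pays a fixed CPU gap between GPU forward passes,
# achievable throughput scales by t_gpu / (t_gpu + t_cpu).

cpu_gap_ms = 20.0  # sampling + prepare-input + process-output + scheduler time (figure from the comment above)
for gpu_step_ms in (15.0, 25.0, 50.0):  # assumed per-step GPU forward times at large batch
    efficiency = gpu_step_ms / (gpu_step_ms + cpu_gap_ms)
    print(f"GPU step {gpu_step_ms:4.0f} ms + {cpu_gap_ms:.0f} ms CPU gap -> ~{efficiency:.0%} of ideal throughput")
```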

@robertgshaw2-neuralmagic
Collaborator

I have done some profiling of the performance difference between SGLang and vLLM. [...]

Thanks for the details!

We agree the time between steps is too long and are actively working on each of these pieces.

But you bring up a good point about our defaults. It's a good idea to do a sweep over these default parameters as well.

Thanks!

@alexm-neuralmagic
Collaborator

@mpjlu this is a good insight about max_num_seqs default being 256. Any reason it is set to this specific value?

@mpjlu

mpjlu commented Aug 9, 2024

@alexm-neuralmagic I think 256 was a very large value 6 months ago, because most models at that time didn't use GQA; the KV cache was large, so you could not run a very large BS.
Now it is possible to run a large BS with GQA models like Llama 3, but in most real production cases you will not run such a large BS, because in production a request has to be finished in 3 to 10 seconds.

@zhyncs
Contributor

zhyncs commented Aug 9, 2024

@mpjlu

if you use default args to test the performance, the largest BS of model forward is 256

Does manually setting a larger max_num_seqs have a significant impact on performance? If there is an improvement, could you share your data?

This includes sampling time, prepare input time, process_model_output time, and scheduler time.

In our blog, we mentioned that vLLM suffers from high CPU scheduling overhead. This does not refer to pure scheduling like FCFS, but rather the scheduling between CPU and GPU workloads.

but in most of real production cases,you will not run so large bs. Because in production a request have to be done in 3 or 10 seconds.

This is inaccurate. Real-world production involves more than online scenarios. My experience in search and recommendation, possibly the highest-traffic area in the internet industry, is that we used both offline and online serving. Offline, our focus was solely on throughput, without concern for latency. In the online scenario, enabling streaming allows us to significantly increase the batch size as long as TTFT and ITL stay within acceptable limits.

@WoosukKwon
Collaborator

@mpjlu Thanks for the good insight!

@mpjlu

mpjlu commented Aug 9, 2024

Right now I don't have an 80 GB GPU to test a larger max_num_seqs. I just found that on a 46 GB GPU, the Llama 3 8B model (1024 in, 1024 out) can reach a BS of 220; on an 80 GB GPU, this BS could be about 700. So the vLLM default BS of 256 should impact performance.
@zhyncs
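
For reference, a very rough way to estimate the achievable concurrent batch from the KV-cache pool size; all numbers in the sketch are assumptions for illustration, not the measurements quoted above:

```python
GiB = 1024 ** 3

def max_concurrent_seqs(gpu_mem_gib, weights_gib, mem_util=0.9,
                        kv_bytes_per_token=128 * 1024, avg_tokens_per_seq=1024):
    """Very rough: KV pool = usable memory minus weights, divided by the average live context per request."""
    kv_pool_bytes = (gpu_mem_gib * mem_util - weights_gib) * GiB
    return int(kv_pool_bytes / (kv_bytes_per_token * avg_tokens_per_seq))

# Assumed: ~16 GiB of fp16 weights for an 8B model, ~128 KiB of KV cache per token,
# and ~1k tokens of live context per request on average (requests are at different stages).
for mem_gib in (46, 80):
    print(f"{mem_gib} GiB GPU -> roughly {max_concurrent_seqs(mem_gib, weights_gib=16)} concurrent sequences")
```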

@zhyncs
Contributor

zhyncs commented Aug 9, 2024

Right now I don't have an 80 GB GPU to test a larger max_num_seqs. [...]

So these are your assumptions rather than actual test results? If you have concrete data to verify this, feel free to share it.

@mpjlu

mpjlu commented Aug 9, 2024

Yes, this is my assumption. But I have some other tests: https://doc.weixin.qq.com/doc/w3_AH8AWgYyACkl1A4yKimS6SluHVrZ8
They can partly support my view.
This is not a hard test; I think you can do it in 10 minutes. If my assumption is wrong, just show me the result. @zhyncs

@zhyncs
Contributor

zhyncs commented Aug 9, 2024

Yes, this is my assumption. [...] If my assumption is wrong, just show me the result.

@mpjlu Raising max-num-reqs to 512 appears ineffective based on the benchmark result. The result is close to that measured with the default 256. You can check this yourself.

@mpjlu

mpjlu commented Aug 9, 2024

@zhyncs, thanks very much. I will find the reason when I have an 80 GB GPU.

@Ying1123

Ying1123 commented Aug 9, 2024

@zhyncs, thanks very much. I will find the reason when I have an 80 GB GPU.

Hi @mpjlu, we did notice the default value of 256, and we tried increasing it but did not get stably better numbers. To give you one data point, in one of our offline benchmarks of 4000 requests with input_len and output_len in [0, 1024], batch size 256 delivers 1816.09 output tok/s and batch size 512 delivers 1687.79 output tok/s.
Generally, from our observations, 256 seems not to be a random number but a tested one, at least for v0.5.2. I believe vLLM will improve a lot though. It is a great project with a great community and momentum, and on the SGLang team we have also tried our best to make fair comparisons with respect.

@alexm-neuralmagic
Collaborator

vLLM also has --gpu-memory-utilization set to 0.9 by default. On an 80 GB GPU, this means that 8 GB are potentially not used. I have seen that increasing --gpu-memory-utilization to 0.98 is possible and does provide better results. In general, the 0.9 default is a safe choice for now because some memory (like CUDA graphs) is not profiled accurately, but this may change with more precise bootstrap profiling.
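
A quick sketch of what that headroom buys in KV-cache terms (the per-token size is an assumed Llama-3-8B-like figure, not from this thread):

```python
# Raising --gpu-memory-utilization from 0.90 to 0.98 on an 80 GiB GPU frees extra
# memory that can go to the KV-cache pool.

gpu_mem_gib = 80
kv_kib_per_token = 128  # assumed ~128 KiB per cached token (8 KV heads, 128 head_dim, 32 layers, fp16)

extra_gib = (0.98 - 0.90) * gpu_mem_gib
extra_tokens = extra_gib * 1024 * 1024 / kv_kib_per_token
print(f"extra ~{extra_gib:.1f} GiB -> room for ~{extra_tokens:,.0f} more cached tokens")
# ~6.4 GiB -> ~52,000 extra tokens, i.e. a few dozen more 1k-2k-token sequences in flight.
```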

@alexm-neuralmagic
Collaborator

@Ying1123 Increasing the default max batch size in vLLM increases the probability of request preemption (which currently has high overhead). Did you see a log of preempted requests when you increased from 256 to 512? (vLLM usually warns the user when this happens.)

@mpjlu

mpjlu commented Aug 9, 2024

It makes sense that on A100 increasing BS from 256 to 512 does not increase throughput; increasing BS only increases throughput when the model is memory bound. On A100, BS 256 is compute bound; on H100, BS 256 is still memory bound, so the default value should be increased on H100.
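
A rough roofline check of this claim (a sketch using commonly quoted dense FP16/BF16 peak numbers and ignoring KV-cache reads, so it is only a ballpark):

```python
# Decode with batch size B roughly does 2*N*B FLOPs per step while reading the N weights
# once (2 bytes each in fp16), so its arithmetic intensity is about B FLOPs/byte.
# Decode is memory bound while B is below the GPU's FLOPs/bandwidth ratio ("ridge point").

gpus = {
    "A100 80GB": (312e12, 2.0e12),   # ~312 TFLOP/s fp16 dense, ~2.0 TB/s HBM
    "H100 SXM":  (990e12, 3.35e12),  # ~990 TFLOP/s bf16 dense, ~3.35 TB/s HBM
}

batch_size = 256
for name, (peak_flops, bandwidth) in gpus.items():
    ridge = peak_flops / bandwidth
    regime = "memory bound" if batch_size < ridge else "compute bound"
    print(f"{name}: ridge ~{ridge:.0f} FLOPs/byte, so BS {batch_size} is roughly {regime}")
```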

@simon-mo simon-mo mentioned this issue Sep 4, 2024
1 task
@zhuohan123 zhuohan123 unpinned this issue Sep 5, 2024

github-actions bot commented Nov 8, 2024

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Nov 8, 2024
@hmellor hmellor added keep-open and removed stale labels Nov 20, 2024