
[RFC]: Performance Roadmap #6801

Open
3 of 5 tasks
simon-mo opened this issue Jul 25, 2024 · 37 comments

Comments

@simon-mo
Collaborator

simon-mo commented Jul 25, 2024

Anything you want to discuss about vllm.

This is a meta RFC tracking some of the performance enhancement work we are prioritizing.

@simon-mo simon-mo added misc RFC and removed misc labels Jul 25, 2024
@richardliaw
Collaborator

Let's pin this issue!

@simon-mo simon-mo pinned this issue Jul 25, 2024
@zhisbug
Collaborator

zhisbug commented Jul 26, 2024

@SolitaryThinker please pin the multi-step inference issue

@w013nad

w013nad commented Jul 26, 2024

If we're talking performance, can someone explain how sglang is 50-100% faster than vLLM while using vLLM as a backend? Their code base looks like a pure Python vLLM wrapper. How are they that much faster?

@simon-mo
Collaborator Author

To be fully transparent, we are still figuring it out. @KuntaiDu is reproducing the benchmark here: #6794 (comment)

@youkaichao
Member

@w013nad there are some niche details. For example, HF duplicated the KV heads from 8 to 16 but hasn't updated the config; we are waiting for HF to update the config and the weights from 16 back to 8.

However, sglang directly patches it:

https://github.com/sgl-project/sglang/blob/8628ab9c8bdf9b01c4671e3c6caabf49afd73395/python/sglang/srt/managers/controller/model_runner.py#L126

which means they use less memory for the KV cache and therefore get better throughput.
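
For illustration, a minimal sketch of the KV-cache arithmetic behind this; the layer count, head count, and head dimension below are assumed Llama-3-8B-like values, not figures taken from this thread:

```python
# Rough per-token KV-cache size: 2 (K and V) * kv_heads * head_dim * layers * bytes per element.
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    return 2 * num_kv_heads * head_dim * dtype_bytes * num_layers

# Assumed Llama-3-8B-like dimensions: 32 layers, head_dim 128, fp16/bf16 cache.
for kv_heads in (8, 16):
    size = kv_bytes_per_token(num_layers=32, num_kv_heads=kv_heads, head_dim=128)
    print(f"{kv_heads:>2} KV heads -> {size / 1024:.0f} KiB per cached token")
# With a fixed KV-cache pool, the duplicated 16-head layout caches only half as many tokens.
```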

@zhyncs
Contributor

zhyncs commented Jul 26, 2024

If we're talking performance, can someone explain how sglang is 50-100% faster than vLLM while using vLLM as a backend? Their code base looks like a pure Python vLLM wrapper. How are they that much faster?

Hi @w013nad, I am a member of the SGLang team. We are delighted that you have taken an interest in our project, SGLang (https://github.com/sgl-project/sglang). We recently published a blog post introducing the performance advantages of v0.2 (https://lmsys.org/blog/2024-07-25-sglang-llama3/). The data in the blog shows that on the Llama 3 models, from 8B to 70B to 405B, and from BF16 to FP8, on both A100 and H100, SGLang is far ahead in performance thanks to extensive engineering optimizations such as efficient scheduling, CUDA Graphs, torch.compile, etc. In fact, SGLang does not currently use vLLM as a backend; everything from attention to scheduling and various optimizations is implemented by SGLang itself. At present we only utilize some linear layer classes from vLLM. Hope this answers your question. We greatly appreciate and thank vLLM for its excellent work, and we look forward to continuing outstanding work on LLM inference acceleration, cheers.

@Ying1123

Ying1123 commented Jul 26, 2024

@w013nad It is incorrect to say that "SGLang is a pure Python vLLM wrapper." Otherwise, it cannot explain the speedup you mentioned.

  1. SGLang has its own efficient memory management and batch scheduler. Essentially, it only imports some fused kernels and layers from vLLM. SGLang uses vLLM as a kernel library, similar to how it uses FlashInfer. All other upper-level code in vLLM is irrelevant to SGLang.
  2. What @youkaichao mentioned is an optimization we implemented for the 405B model. However, it is not used in any of the benchmark results we have released. We always make fair comparisons. In our blog post, we use dummy weights with 8 heads for all frameworks to ensure fairness.

SGLang has a clean and well-designed architecture. It is also a fully open community with many experienced inference optimization developers. I encourage everyone to give it a try and run the benchmark. Both SGLang and vLLM are great projects and they can learn from each other.

@w013nad

w013nad commented Jul 26, 2024

Hi @w013nad, I am a member of the SGLang team. We are delighted that you have taken an interest in our project, SGLang. [...]

Admittedly I didn't look too much through your code base. I just saw the GitHub code overview that showed 99% Python and that you were importing a lot of vLLM components in your model executor. I was very impressed by your project though. I did a few of my own tests on my A100s and found that your codebase was much faster across the board.

https://www.reddit.com/r/LocalLLaMA/comments/1ebztye/the_milestone_release_of_sglang_runtime_v02/leynwa2/

@Ying1123

Ying1123 commented Jul 26, 2024

@w013nad Great to hear that you got good numbers from SGLang.

  1. For the llama-8b model, you can add --enable-torch-compile, which will give you an even better speedup.
  2. Speculative decoding is on our roadmap.
  3. A tokenization endpoint should be easy to add. Your contribution is also welcome!
  4. Outlines support is there and is also much faster than vLLM. This is an example of how you can use it with the OpenAI-compatible API.

As you said, SGLang is 99% Python. It reflects our philosophy of modular and minimal design.

@youkaichao
Member

What @youkaichao mentioned is an optimization we implemented for 405B model. However, it is not used in any of the benchmark results we have released. We always make fair comparisons. In our blog post, we use dummy weights with 8 heads for all frameworks to ensure fairness.

Thanks for the clarification! Really hope HF can fix the kv heads issue so that the community can benefit from it without using hacks.

@SolitaryThinker
Contributor

@simon-mo please update with the RFC for multi-step: #6854

@WangErXiao

Hi @w013nad, I am a member of the SGLang team. We are delighted that you have taken an interest in our project, SGLang. [...]

I tested sglang and vllm using the llama3.1-8B model on a single L40s GPU. I found that at low batch sizes, the speeds of sglang and vllm are similar. However, at higher batch sizes, sglang performs better than vllm.

@zhyncs
Contributor

zhyncs commented Jul 31, 2024

Hi @WangErXiao Thanks for your interest. For enterprise-level users, increasing the batch size is very necessary and reasonable, as they have a huge amount of traffic. SGLang scales very well, supporting batch sizes of up to thousands.

@mpjlu

mpjlu commented Jul 31, 2024

When BS is small, vLLM/SGLang/TRT-LLM performance is similar. The difference only shows up when BS is large.
When the batch is large, vLLM performance is low. The reason may be that vLLM uses max_num_batched_tokens to profile memory: when max_num_batched_tokens is large, the profiled memory is huge, so the number of blocks that can be used for the KV cache is small.
Just use a small max_num_batched_tokens for the benchmark and vLLM performance will be good at large batch sizes.
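
For reference, a minimal sketch of setting these knobs through vLLM's offline API; the argument names are assumed to match your vLLM version, and the values are illustrative, not recommendations:

```python
from vllm import LLM, SamplingParams

# Sketch: pass the scheduler-related engine arguments explicitly instead of relying on defaults.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model choice
    max_num_seqs=256,               # max sequences scheduled per step
    max_num_batched_tokens=8192,    # token cap per step, used when profiling memory;
                                    # typically must be >= the model's max length unless
                                    # chunked prefill is enabled
    gpu_memory_utilization=0.90,    # fraction of GPU memory vLLM may claim
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```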

@zhyncs
Contributor

zhyncs commented Jul 31, 2024

Hi @mpjlu Thanks for sharing your insights. In fact, real online requests are measured in terms of request rate, not batch size. Even at a request rate of 1, the batch size can still accumulate to be very large. Many benchmark tests currently have a problem: they do not account for the impact of the number of requests during the benchmark. For example, if the request rate increases from 1 to 2 to 4 and up to 16, it is unreasonable to always use 1000 prompts, because aside from server warmup and cooldown taking up some time, if the number is too small, the proportion of the benchmark period during which the server is truly at capacity is very small. This cannot reflect the server's true capability. On line charts it would appear as though different frameworks' performance is converging, when that is actually not the case. If the request rate exceeds the server's capacity, requests queue up, and a larger number of prompts then causes higher E2E latency. Actual online traffic, especially the online services of large internet companies (search, recommendation, advertising), has peaks and troughs but still a very high overall volume. If you have more in-depth thoughts, feel free to reach out, thanks!
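
A small back-of-the-envelope sketch of the num-prompts vs. request-rate point (the warmup/cooldown figure is an assumption for illustration):

```python
# With a fixed number of prompts, higher request rates shrink the benchmark window,
# so warmup/cooldown consume a larger share of it and the server spends less time at capacity.

num_prompts = 1000
warmup_cooldown_s = 30.0  # assumed overhead, for illustration only

for request_rate in (1, 2, 4, 8, 16):
    send_window_s = num_prompts / request_rate  # time spent issuing requests at this rate
    at_capacity_share = max(0.0, 1 - warmup_cooldown_s / send_window_s)
    print(f"rate={request_rate:>2} req/s  window={send_window_s:7.1f}s  steady-state share ~{at_capacity_share:.0%}")
```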

@mpjlu

mpjlu commented Jul 31, 2024

[Batch size here means the real input batch of vLLM model execution.] We find in many cases that the huge performance difference is caused by some arguments not being set properly. For example, Llama 1 7B and Llama 3 7B: the network is the same, but if you use the same script to test performance, the results will be different.

@merrymercy
Contributor

merrymercy commented Jul 31, 2024

@WangErXiao You can try to add --enable-torch-compile for smaller models. It will greatly accelerate small batch sizes.
@mpjlu We checked the KV cache pool size. In most cases, SGLang and vLLM use a very similar KV cache pool size, with a difference of less than 5%. vLLM sometimes even allocates a larger KV cache pool.

@mpjlu

mpjlu commented Aug 1, 2024

(image: 8b_latency)
Hi @merrymercy @zhyncs, I have a question about this picture. My calculation shows that an A100 cannot support a 7B model with a total length of 2k (1k input + 1k output) at 16 qps.
My method is:
16 * 7B * 2 * 2k = 448 TFLOP/s > 312 TFLOP/s (A100 fp16).
Is this calculation wrong, or does this online test not use the full 2k length?

@zhyncs
Contributor

zhyncs commented Aug 1, 2024

Hi @mpjlu You might not have carefully understood the true meaning of input length and output length here. This uses a random dataset with a random ratio of 0, meaning that the input length is in [1, 1024] and the output length is in [1, 1024], both discrete uniform.

@mpjlu

mpjlu commented Aug 1, 2024

Hi @zhyncs, thanks for your kind reply.
First, since the lengths are random, the calculation becomes:
16 * 7B * 2 * 2k / 2 = 224 TFLOP/s < 312 TFLOP/s (A100 fp16), so supporting 16 qps is feasible.
If the framework can support 16 qps, then no matter how long a request waits, it still amounts to processing 16 requests per second.
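
Written out as a quick script, a sketch of the same 2ND estimate; it ignores attention FLOPs and assumes perfect hardware utilization:

```python
# Rough compute demand: ~2 FLOPs per parameter per token (the "2ND" rule), ignoring attention.
params = 7e9                   # ~7B-parameter model, as in the estimate above
avg_total_tokens = 2048 / 2    # input and output lengths uniform in [1, 1024] average to ~1k total
qps = 16

required_tflops = qps * 2 * params * avg_total_tokens / 1e12
print(f"required ~{required_tflops:.0f} TFLOP/s vs. ~312 TFLOP/s A100 fp16 dense peak")
# ~229 TFLOP/s < 312 TFLOP/s, so 16 qps is feasible in this idealized model.
```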

@zhyncs
Contributor

zhyncs commented Aug 1, 2024

Hi @mpjlu Thank you for your detailed and patient explanation; your understanding is correct. The computational load of the forward pass can be roughly estimated with the 2ND calculation. I apologize for my previous aggressive reply, thank you!

@mpjlu

mpjlu commented Aug 8, 2024

I have done some profiling of the performance difference between SGLang and vLLM. When the request rate is small (e.g. fewer than 50 running requests), the performance gap between SGLang and vLLM is small (less than 10%). When the request rate is large, or offline with many requests sent to the server at the same time, the performance gap is large.

The key reason is the max_num_seqs arg of vLLM, which controls the batch size of the model forward pass; the default value is 256. For SGLang, this value is set by profiling and can be very large. So for vLLM, if you use default args to test performance, the largest forward batch size is 256, while SGLang's can be much larger.

The other reason vLLM performance is not good when BS is large is that the time between two steps is long. This includes sampling time, prepare-input time, process_model_output time, and scheduler time. When BS is 200, all of this together can be 20 ms.

In my test, the scheduler / torch.compile difference is not the key reason for the performance gap.
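
As a rough illustration of what a fixed 20 ms gap between steps costs (the GPU step times below are assumptions, not measurements from this thread):

```python
# If every decode step pays a fixed CPU gap between GPU forward passes,
# achievable throughput scales by t_gpu / (t_gpu + t_cpu).

cpu_gap_ms = 20.0  # sampling + prepare-input + process-output + scheduler time (figure from the comment above)
for gpu_step_ms in (15.0, 25.0, 50.0):  # assumed per-step GPU forward times at large batch
    efficiency = gpu_step_ms / (gpu_step_ms + cpu_gap_ms)
    print(f"GPU step {gpu_step_ms:4.0f} ms + {cpu_gap_ms:.0f} ms CPU gap -> ~{efficiency:.0%} of ideal throughput")
```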

@robertgshaw2-neuralmagic
Collaborator

I have done some profiling of the performance difference between SGLang and vLLM. [...]

Thanks for the details!

We agree the time between steps is too long and are actively working on each of these pieces.

But you bring up a good point about our defaults. It's a good idea to do a sweep over these default parameters as well.

Thanks!

@alexm-neuralmagic
Collaborator

@mpjlu this is a good insight about max_num_seqs default being 256. Any reason it is set to this specific value?

@mpjlu

mpjlu commented Aug 9, 2024

@alexm-neuralmagic I think 256 was a very large value 6 months ago, because most models at that time didn't use GQA; the KV cache was large, so you could not run a very large BS.
Now it is possible to run a large BS with GQA models like Llama 3, but in most real production cases you will not run such a large BS, because in production a request has to be finished in 3 to 10 seconds.

@zhyncs
Contributor

zhyncs commented Aug 9, 2024

@mpjlu

if you use default args to test the performance, the largest BS of model forward is 256

Does manually setting a larger max_num_seqs have a significant impact on performance? If there is an improvement, could you share your data?

This includes sampling time, prepare input time, process_model_output time, and scheduler time.

In our blog, we mentioned that vLLM suffers from high CPU scheduling overhead. This does not refer to pure scheduling like FCFS, but rather the scheduling between CPU and GPU workloads.

but in most of real production cases,you will not run so large bs. Because in production a request have to be done in 3 or 10 seconds.

This is inaccurate. Real-world production involves more than online scenarios. My experience in search and recommendation, possibly the highest-traffic area in the internet industry, is that we used both offline and online serving. Offline, our focus was solely on throughput, without concern for latency. In the online scenario, enabling streaming allows us to significantly increase the batch size as long as TTFT and ITL stay within acceptable limits.

@WoosukKwon
Collaborator

@mpjlu Thanks for the good insight!

@mpjlu

mpjlu commented Aug 9, 2024

Right now I don't have an 80 GB GPU to test a larger max_num_seqs. I just found that on a 46 GB GPU, the Llama 3 8B model (1024 in, 1024 out) can reach a BS of 220; on an 80 GB GPU, this BS could be about 700. So the vLLM default BS of 256 should impact performance.
@zhyncs
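
For reference, a very rough way to estimate the achievable concurrent batch from the KV-cache pool size; all numbers in the sketch are assumptions for illustration, not the measurements quoted above:

```python
GiB = 1024 ** 3

def max_concurrent_seqs(gpu_mem_gib, weights_gib, mem_util=0.9,
                        kv_bytes_per_token=128 * 1024, avg_tokens_per_seq=1024):
    """Very rough: KV pool = usable memory minus weights, divided by the average live context per request."""
    kv_pool_bytes = (gpu_mem_gib * mem_util - weights_gib) * GiB
    return int(kv_pool_bytes / (kv_bytes_per_token * avg_tokens_per_seq))

# Assumed: ~16 GiB of fp16 weights for an 8B model, ~128 KiB of KV cache per token,
# and ~1k tokens of live context per request on average (requests are at different stages).
for mem_gib in (46, 80):
    print(f"{mem_gib} GiB GPU -> roughly {max_concurrent_seqs(mem_gib, weights_gib=16)} concurrent sequences")
```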

@zhyncs
Contributor

zhyncs commented Aug 9, 2024

Right now I don't have an 80 GB GPU to test a larger max_num_seqs. [...]

So these are your assumptions rather than actual test results? If you have concrete data to verify this, feel free to share it.

@mpjlu

mpjlu commented Aug 9, 2024

Yes, this is my assumption. But I have some other tests: https://doc.weixin.qq.com/doc/w3_AH8AWgYyACkl1A4yKimS6SluHVrZ8
They can partly support my view.
This is not a hard test; I think you can do it in 10 minutes. If my assumption is wrong, just show me the result. @zhyncs

@zhyncs
Contributor

zhyncs commented Aug 9, 2024

Yes, this is my assumption. [...] If my assumption is wrong, just show me the result.

@mpjlu Raising max-num-reqs to 512 appears ineffective based on the benchmark result. The result is close to that measured with the default 256. You can check this yourself.

@mpjlu

mpjlu commented Aug 9, 2024

@zhyncs, thanks very much. I will find the reason when I have an 80 GB GPU.

@Ying1123

Ying1123 commented Aug 9, 2024

@zhyncs, thanks very much. I will find the reason when I have an 80 GB GPU.

Hi @mpjlu, we did notice the default value of 256, and we tried increasing it but did not get stably better numbers. To give you one data point, in one of our offline benchmarks of 4000 requests with input_len and output_len in [0, 1024], batch size 256 delivers 1816.09 output tok/s and batch size 512 delivers 1687.79 output tok/s.
Generally, from our observations, 256 seems not to be a random number but a tested one, at least for v0.5.2. I believe vLLM will improve a lot though. It is a great project with a great community and momentum, and on the SGLang team we have also tried our best to make fair comparisons with respect.

@alexm-neuralmagic
Collaborator

vLLM also has --gpu-memory-utilization set to 0.9 by default. On an 80 GB GPU, this means that 8 GB are potentially not used. I have seen that increasing --gpu-memory-utilization to 0.98 is possible and does provide better results. In general, the 0.9 default is a safe choice for now because some memory (like CUDA graphs) is not profiled accurately, but this may change with more precise bootstrap profiling.
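
A quick sketch of what that headroom buys in KV-cache terms (the per-token size is an assumed Llama-3-8B-like figure, not from this thread):

```python
# Raising --gpu-memory-utilization from 0.90 to 0.98 on an 80 GiB GPU frees extra
# memory that can go to the KV-cache pool.

gpu_mem_gib = 80
kv_kib_per_token = 128  # assumed ~128 KiB per cached token (8 KV heads, 128 head_dim, 32 layers, fp16)

extra_gib = (0.98 - 0.90) * gpu_mem_gib
extra_tokens = extra_gib * 1024 * 1024 / kv_kib_per_token
print(f"extra ~{extra_gib:.1f} GiB -> room for ~{extra_tokens:,.0f} more cached tokens")
# ~6.4 GiB -> ~52,000 extra tokens, i.e. a few dozen more 1k-2k-token sequences in flight.
```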

@alexm-neuralmagic
Collaborator

@Ying1123 Increasing the default max batch size in vLLM increases the probability of request preemption (which currently has high overhead). Did you see a log of preempted requests when you increased from 256 to 512? (vLLM usually warns the user when this happens.)

@mpjlu

mpjlu commented Aug 9, 2024

It makes sense that on A100 increasing BS from 256 to 512 does not increase throughput; increasing BS only increases throughput when the model is memory bound. On A100, BS 256 is compute bound; on H100, BS 256 is still memory bound, so the default value should be increased on H100.
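
A rough roofline check of this claim (a sketch using commonly quoted dense FP16/BF16 peak numbers and ignoring KV-cache reads, so it is only a ballpark):

```python
# Decode with batch size B roughly does 2*N*B FLOPs per step while reading the N weights
# once (2 bytes each in fp16), so its arithmetic intensity is about B FLOPs/byte.
# Decode is memory bound while B is below the GPU's FLOPs/bandwidth ratio ("ridge point").

gpus = {
    "A100 80GB": (312e12, 2.0e12),   # ~312 TFLOP/s fp16 dense, ~2.0 TB/s HBM
    "H100 SXM":  (990e12, 3.35e12),  # ~990 TFLOP/s bf16 dense, ~3.35 TB/s HBM
}

batch_size = 256
for name, (peak_flops, bandwidth) in gpus.items():
    ridge = peak_flops / bandwidth
    regime = "memory bound" if batch_size < ridge else "compute bound"
    print(f"{name}: ridge ~{ridge:.0f} FLOPs/byte, so BS {batch_size} is roughly {regime}")
```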

@simon-mo simon-mo mentioned this issue Sep 4, 2024
1 task
@zhuohan123 zhuohan123 unpinned this issue Sep 5, 2024

github-actions bot commented Nov 8, 2024

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Nov 8, 2024
@hmellor hmellor added keep-open and removed stale labels Nov 20, 2024