LightLLM benchmark #670
As mentioned in the blog, one point particularly caught my attention, and I feel that this feature may be compatible with the current design of vLLM.
After briefly reviewing LightLLM, I found that its kernels are written in OpenAI Triton rather than CUDA. This makes it easier for anyone who wants to contribute optimizations in the future to get started, and the kernels still perform well. For multi-GPU inference, it uses RPyC instead of Ray. Both vLLM and LightLLM use tensor parallelism across multiple GPUs for faster inference. For now, it's hard to say whether these technology choices alone account for such a large performance difference. I think the improvement comes more from TokenAttention and the Efficient Router, as well as from making tokenization and detokenization asynchronous. I counted the lines of code in LightLLM: looking only at Llama 2 (with Bloom and LLaMA removed), there are only a little over 2,000 lines in total, which is quite remarkable.
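To illustrate the point about Triton being approachable, here is a minimal sketch (not taken from LightLLM or vLLM) of an elementwise-add kernel written in OpenAI Triton. A comparable CUDA kernel would require C++ source, explicit launch configuration, and a separate build step; in Triton the kernel is a JIT-compiled Python function.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # one program instance per block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```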
Hi! Thanks for bringing this up. We are excited to see new open-source efforts based on the vLLM project. I'm reproducing the results from LightLLM right now, but at first glance:
However, these are all my initial guesses. We will perform a more thorough benchmark and update the results here.
I tried to rerun the LLaMA 7B benchmark on an 80GB A100 on GCP. With the latest vLLM main branch:
When commenting out all tokenization and using batched argmax sampling (branch):
The improvement in throughput (1.76x) is very close to the improvement reported in LightLLM (1.92x). I will update this thread after reproducing LightLLM's results.
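For readers unfamiliar with the term, below is a rough sketch of what "batched argmax sampling" means here (this is not the code from the linked branch): greedy decoding is done with a single tensor operation over the whole batch of last-position logits instead of per-sequence sampling logic.

```python
import torch

def batched_greedy_sample(logits: torch.Tensor) -> torch.Tensor:
    """logits: [batch_size, vocab_size] for the last position of each sequence."""
    # One batched argmax replaces a per-sequence Python loop on the hot path.
    return torch.argmax(logits, dim=-1)  # [batch_size] next-token ids
```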
Hi @zhuohan123, thanks for your detailed reply. As mentioned above, the throughput advantage at 7B is not obvious. Perhaps, in addition to reproducing the 7B results, we could also investigate the 65B results, where the throughput difference becomes even larger as the model size increases. Thanks.
I found that the LightLLM router uses a lot of coroutines; maybe these also bring some throughput improvements?
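As a hypothetical sketch of that pattern (not LightLLM's actual router code; `run_batch` here is a stand-in for the model forward pass): requests enqueue themselves and await a future, while a single scheduling coroutine drains the queue and runs whatever is waiting as one batch, so tokenization and I/O never block the event loop.

```python
import asyncio

async def handle_request(queue: asyncio.Queue, prompt: str) -> str:
    # Each request hands itself to the router and awaits its result without blocking.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut

async def router_loop(queue: asyncio.Queue, run_batch) -> None:
    # A single scheduling coroutine batches all pending requests together.
    while True:
        prompt, fut = await queue.get()     # wait for at least one request
        batch = [(prompt, fut)]
        while not queue.empty():            # drain whatever else is waiting
            batch.append(queue.get_nowait())
        outputs = await run_batch([p for p, _ in batch])
        for (_, f), out in zip(batch, outputs):
            f.set_result(out)
```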
@zhuohan123 Can you update the link to the branch where only batched argmax sampling is used, since it's not working anymore? Thanks in advance.
This optimization has been integrated into the latest main branch of vLLM. Please try it out!
Hi vLLM geniuses @zhuohan123 @WoosukKwon,
I found a new project: https://github.com/ModelTC/lightllm
After reading their blog, the performance advantage on the 7B model is not very obvious, but the gap is larger at 65B. We will also do some verification and comparison later. The reason for opening this issue is the hope that we can look at what LightLLM does well, so that we can reference and port similar optimizations to vLLM. Cheers.