anyone tests lora's throughput #3316

Open
white-wolf-tech opened this issue Mar 11, 2024 · 3 comments

Comments
@white-wolf-tech

I installed vLLM from the latest code.

I found that it supports the Qwen2 series of models.

I tested Qwen1.8B with 16 concurrent requests and got the following results:

With the LoRA weights merged into Qwen1.8B, latency (ms):
min: 222, average: 400, max: 418

Without merging, applying the LoRA adapter dynamically per request, latency (ms):
min: 307, average: 780, max: 874

The vLLM dynamic LoRA path is much slower than the merged version. Is this expected?
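
For context, a rough sketch of the two setups being compared, based on vLLM's offline LoRA API (the model id, adapter path, and sampling settings below are placeholders, not the exact ones used in this benchmark):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

params = SamplingParams(temperature=0.0, max_tokens=128)

# (a) merged: the LoRA weights are already folded into the checkpoint,
#     so it is served like any plain model
merged_llm = LLM(model="/path/to/qwen1.8b-merged", trust_remote_code=True)
merged_llm.generate(["hello"], params)

# (b) dynamic LoRA: the base model is loaded once and the adapter is
#     applied per request through a LoRARequest
base_llm = LLM(model="Qwen/Qwen1.5-1.8B", enable_lora=True)
base_llm.generate(
    ["hello"],
    params,
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)
```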

@Nipi64310

It's normal for the speed to increase after merging, because each LoRA layer no longer needs to perform the extra matrix multiplications between the input and the LoRA weights.
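
To make that concrete, a toy PyTorch sketch of a single linear layer (ignoring the LoRA scaling factor; dimensions are arbitrary):

```python
import torch

d, r = 1024, 16                 # hidden size, LoRA rank
x = torch.randn(1, d)
W = torch.randn(d, d)           # base weight
A = torch.randn(r, d)           # LoRA down-projection
B = torch.randn(d, r)           # LoRA up-projection

# dynamic LoRA: every forward pass pays for two extra (small) matmuls per LoRA layer
y_dynamic = x @ W.T + (x @ A.T) @ B.T

# merged: the low-rank update is folded into W once, offline,
# so the serving cost is identical to the plain base model
W_merged = W + B @ A
y_merged = x @ W_merged.T

# both paths produce the same output, up to floating-point error
assert torch.allclose(y_dynamic, y_merged, atol=1e-3)
```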

@changun

changun commented Mar 18, 2024

Thanks @Nipi64310 for the explanation. We are seeing a similar slowdown as well. It is a trade-off between convenience and speed for now.

@FerranAgulloLopez

When merging, there is no overhead in memory or computation. However, if you want to serve multiple subtasks, you will need to replicate the full model for each one.
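
For illustration, a sketch of the merge step using PEFT's merge_and_unload (paths and model id are placeholders): each adapter merged this way produces its own full-size checkpoint, which is why serving several subtasks via merging means keeping a full copy of the model per task.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# one full-size merged checkpoint per adapter / subtask
for adapter_path, out_path in [
    ("/path/to/lora_task_a", "/path/to/qwen1.8b-task-a"),
    ("/path/to/lora_task_b", "/path/to/qwen1.8b-task-b"),
]:
    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-1.8B")
    merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()
    merged.save_pretrained(out_path)
```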
