anyone tests lora's throughput #3316

Open
white-wolf-tech opened this issue Mar 11, 2024 · 3 comments

Comments
@white-wolf-tech

I installed vLLM from the latest code.

I found that it supports the Qwen2 series of models.

I tested Qwen1.8B with 16 concurrent requests and got the following results:

With the LoRA weights merged into Qwen1.8B, latency (ms):
min: 222, average: 400, max: 418

Without merging, applying the LoRA adapter dynamically per request, latency (ms):
min: 307, average: 780, max: 874

The vLLM dynamic LoRA path is much slower than the merged version. Is this expected?
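
For context, a rough sketch of the two setups being compared, based on vLLM's offline LoRA API (the model id, adapter path, and sampling settings below are placeholders, not the exact ones used in this benchmark):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

params = SamplingParams(temperature=0.0, max_tokens=128)

# (a) merged: the LoRA weights are already folded into the checkpoint,
#     so it is served like any plain model
merged_llm = LLM(model="/path/to/qwen1.8b-merged", trust_remote_code=True)
merged_llm.generate(["hello"], params)

# (b) dynamic LoRA: the base model is loaded once and the adapter is
#     applied per request through a LoRARequest
base_llm = LLM(model="Qwen/Qwen1.5-1.8B", enable_lora=True)
base_llm.generate(
    ["hello"],
    params,
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)
```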

@Nipi64310

It's normal for the speed to increase after merging, because each LoRA layer no longer needs to perform the extra matrix multiplications between the input and the LoRA weights.
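
To make that concrete, a toy PyTorch sketch of a single linear layer (ignoring the LoRA scaling factor; dimensions are arbitrary):

```python
import torch

d, r = 1024, 16                 # hidden size, LoRA rank
x = torch.randn(1, d)
W = torch.randn(d, d)           # base weight
A = torch.randn(r, d)           # LoRA down-projection
B = torch.randn(d, r)           # LoRA up-projection

# dynamic LoRA: every forward pass pays for two extra (small) matmuls per LoRA layer
y_dynamic = x @ W.T + (x @ A.T) @ B.T

# merged: the low-rank update is folded into W once, offline,
# so the serving cost is identical to the plain base model
W_merged = W + B @ A
y_merged = x @ W_merged.T

# both paths produce the same output, up to floating-point error
assert torch.allclose(y_dynamic, y_merged, atol=1e-3)
```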

@changun

changun commented Mar 18, 2024

Thanks @Nipi64310 for the explanation. We are seeing a similar slowdown as well. It is a trade-off between convenience and speed for now.

@FerranAgulloLopez

When merging, there is no overhead in memory or computation. However, if you want to serve multiple subtasks, you will need to replicate the full model for each one.
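
For illustration, a sketch of the merge step using PEFT's merge_and_unload (paths and model id are placeholders): each adapter merged this way produces its own full-size checkpoint, which is why serving several subtasks via merging means keeping a full copy of the model per task.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# one full-size merged checkpoint per adapter / subtask
for adapter_path, out_path in [
    ("/path/to/lora_task_a", "/path/to/qwen1.8b-task-a"),
    ("/path/to/lora_task_b", "/path/to/qwen1.8b-task-b"),
]:
    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-1.8B")
    merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()
    merged.save_pretrained(out_path)
```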
