I measured the speed of serving multiple LoRAs with SGLang and vLLM. Why is vLLM faster than SGLang? What acceleration methods does SGLang use, and is there something I haven't enabled yet?
GPU: RTX 4090
sglang server:
python -m sglang.launch_server --model-path /mnt/models/source/model/qwen2_5-7b-instruct/Qwen2___5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tp-size 1 \
    --mem-fraction-static 0.9 \
    --served-model-name "Qwen2.5-7B-Instruct" \
    --chunked-prefill-size 4096 \
    --disable-cuda-graph \
    --disable-radix-cache \
    --show-time-cost \
    --enable-torch-compile \
    --schedule-conservativeness 0.03 \
    --schedule-policy fcfs \
    --lora-paths lora0="" lora_batch="" \
    --max-loras-per-batch 32 \
    --dtype bfloat16
vllm server:
python -m vllm.entrypoints.openai.api_server --model /mnt/models/source/model/qwen2_5-7b-instruct/Qwen2___5-7B-Instruct \
    --port 8899 \
    --served-model-name Qwen2.5-7B-Instruct \
    --enable-lora \
    --lora-modules lora0="" lora_batch="" \
    --gpu_memory_utilization 0.90 \
    --enable-prefix-caching \
    --max-num-seqs 128
sglang request:
import time
import requests

url = "http://localhost:8000"

# Batch request against SGLang's native /generate endpoint.
# problems_token_completions is the list of input prompts (defined elsewhere).
json_data = {
    "text": problems_token_completions,
    "sampling_params": {"max_new_tokens": 10, "temperature": 0, "top_p": 1, "top_k": 1},
    # One LoRA adapter name per prompt: the two adapters alternate across the batch.
    "lora_path": ["lora0", "lora_batch"] * 32,
}

time_start = time.time()
response = requests.post(url + "/generate", json=json_data)
time_end = time.time()
print(time_end - time_start)
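To put both servers on the same footing, the elapsed time above can be turned into a client-side tokens/s number. A minimal sketch, assuming the batched /generate response is a JSON list whose items include a meta_info dict with a completion_tokens field (this may differ between SGLang versions):

# Rough client-side tokens/s for the batch above; assumes each item of the
# /generate response carries meta_info["completion_tokens"].
results = response.json()
total_tokens = sum(r["meta_info"]["completion_tokens"] for r in results)
print(f"throughput: {total_tokens / (time_end - time_start):.2f} tokens/s")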
vllm request:
import time
import requests

url = "http://localhost:8899"

# Single request against vLLM's OpenAI-compatible chat completions endpoint.
# problem[10] is one of the input prompts (defined elsewhere).
json_data = {
    "model": "reranker_classify_catalog_rough_model",
    "messages": [{"role": "user", "content": problem[10]}],
    "max_tokens": 100, "temperature": 0, "top_p": 1,
}

time_start = time.time()
response = requests.post(url + "/v1/chat/completions", json=json_data)
time_end = time.time()
print(time_end - time_start)
print(response.json())
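The equivalent client-side figure for vLLM can be read from the OpenAI-style usage block in the response. This sketch assumes usage.completion_tokens is populated, which should hold for /v1/chat/completions but is worth verifying on your version:

# Client-side throughput estimate from the OpenAI-compatible response;
# assumes the response JSON contains usage.completion_tokens.
usage = response.json().get("usage", {})
completion_tokens = usage.get("completion_tokens", 0)
print(f"throughput: {completion_tokens / (time_end - time_start):.2f} tokens/s")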
sglang speed:
gen throughput (token/s): 33.28
vllm speed:
Avg generation throughput: 55.9 tokens/s