lora speed #2559

Open
qingzhong1 opened this issue Dec 23, 2024 · 1 comment
Labels: enhancement (New feature or request)

qingzhong1 commented Dec 23, 2024

I measured the speed of serving multiple LoRAs with both sglang and vllm. Why is vllm faster than sglang? Does sglang have an acceleration method that I haven't enabled yet?
GPU: RTX 4090
sglang server:
python -m sglang.launch_server --model-path /mnt/models/source/model/qwen2_5-7b-instruct/Qwen2___5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tp-size 1 \
    --mem-fraction-static 0.9 \
    --served-model-name "Qwen2.5-7B-Instruct" \
    --chunked-prefill-size 4096 \
    --disable-cuda-graph \
    --disable-radix-cache \
    --show-time-cost \
    --enable-torch-compile \
    --schedule-conservativeness 0.03 \
    --schedule-policy fcfs \
    --lora-paths lora0="" lora_batch="" \
    --max-loras-per-batch 32 \
    --dtype bfloat16

vllm server:
python -m vllm.entrypoints.openai.api_server --model /mnt/models/source/model/qwen2_5-7b-instruct/Qwen2___5-7B-Instruct \
    --port 8899 \
    --served-model-name Qwen2.5-7B-Instruct \
    --enable-lora \
    --lora-modules lora0="" lora_batch="" \
    --gpu_memory_utilization 0.90 \
    --enable-prefix-caching \
    --max-num-seqs 128

sglang post:

import time
import requests

url = "http://localhost:8000"
json_data = {
    "text": problems_token_completions,
    "sampling_params": {"max_new_tokens": 10, "temperature": 0, "top_p": 1, "top_k": 1},
    "lora_path": ["lora0", "lora_batch"] * 32,
}

time_start = time.time()
response = requests.post(
    url + "/generate",
    json=json_data,
)
time_end = time.time()
print(time_end - time_start)
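
Since the server log's throughput counter and the client's wall-clock time measure different things, it may also help to compute tokens/s on the client side. A minimal sketch, assuming each item of the /generate response carries a meta_info dict with a completion_tokens count (field names can vary across sglang versions):

results = response.json()
if isinstance(results, dict):  # a single prompt returns one dict instead of a list
    results = [results]
# Sum generated tokens across the batch and divide by wall-clock time.
total_tokens = sum(r["meta_info"]["completion_tokens"] for r in results)
print("client-side throughput (tokens/s):", total_tokens / (time_end - time_start))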

vllm post:

import time
import requests

url = "http://localhost:8899"
json_data = {
    "model": "reranker_classify_catalog_rough_model",
    "messages": [{"role": "user", "content": problem[10]}],
    "max_tokens": 100,
    "temperature": 0,
    "top_p": 1,
}
time_start = time.time()
response = requests.post(
    url + "/v1/chat/completions",
    json=json_data,
)
time_end = time.time()
print(time_end - time_start)
print(response.json())
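
Note that with --enable-lora, vllm's OpenAI-compatible server routes a request through a LoRA adapter when the model field matches a name registered via --lora-modules; an unregistered name is rejected. A hypothetical request body targeting the lora0 adapter registered above:

json_data = {
    "model": "lora0",  # adapter name from --lora-modules, not the served model name
    "messages": [{"role": "user", "content": problem[10]}],
    "max_tokens": 100,
    "temperature": 0,
    "top_p": 1,
}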

sglang speed: gen throughput (token/s): 33.28
vllm speed: Avg generation throughput: 55.9 tokens/s
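
The two numbers above come from different request shapes (one batched /generate call vs. a single chat completion), so they may not be directly comparable. A minimal sketch of a matched wall-clock benchmark, assuming both servers are launched as above; the prompt list, adapter names, and worker count are placeholders:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

prompts = ["..."] * 64  # hypothetical: use the identical prompt set for both servers

def bench_sglang():
    # One batched request; sglang schedules the whole batch internally.
    t0 = time.time()
    r = requests.post("http://localhost:8000/generate", json={
        "text": prompts,
        "sampling_params": {"max_new_tokens": 10, "temperature": 0},
        "lora_path": ["lora0", "lora_batch"] * (len(prompts) // 2),
    })
    tokens = sum(item["meta_info"]["completion_tokens"] for item in r.json())
    return tokens / (time.time() - t0)

def one_vllm_request(p):
    r = requests.post("http://localhost:8899/v1/chat/completions", json={
        "model": "lora0",  # adapter registered via --lora-modules
        "messages": [{"role": "user", "content": p}],
        "max_tokens": 10,
        "temperature": 0,
    })
    return r.json()["usage"]["completion_tokens"]

def bench_vllm():
    # Send requests concurrently so vllm can batch them, mirroring the sglang batch.
    t0 = time.time()
    with ThreadPoolExecutor(max_workers=32) as pool:
        tokens = sum(pool.map(one_vllm_request, prompts))
    return tokens / (time.time() - t0)

print("sglang tokens/s:", bench_sglang())
print("vllm tokens/s:", bench_vllm())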

zhaochenyang20 (Collaborator) commented:

@Ying1123

zhaochenyang20 self-assigned this Dec 23, 2024
zhaochenyang20 added the enhancement (New feature or request) label Dec 23, 2024