I measured the speed of serving multiple LoRAs with SGLang and vLLM. Why is vLLM faster than SGLang? What acceleration methods does SGLang use, and is there something I haven't enabled yet?
GPU: RTX 4090
sglang server:
python -m sglang.launch_server --model-path /mnt/models/source/model/qwen2_5-7b-instruct/Qwen2___5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tp-size 1 \
    --mem-fraction-static 0.9 \
    --served-model-name "Qwen2.5-7B-Instruct" \
    --chunked-prefill-size 4096 \
    --disable-cuda-graph \
    --disable-radix-cache \
    --show-time-cost \
    --enable-torch-compile \
    --schedule-conservativeness 0.03 \
    --schedule-policy fcfs \
    --lora-paths lora0="" lora_batch="" \
    --max-loras-per-batch 32 \
    --dtype bfloat16
vllm server:
python -m vllm.entrypoints.openai.api_server --model /mnt/models/source/model/qwen2_5-7b-instruct/Qwen2___5-7B-Instruct \
    --port 8899 \
    --served-model-name Qwen2.5-7B-Instruct \
    --enable-lora \
    --lora-modules lora0="" lora_batch="" \
    --gpu_memory_utilization 0.90 \
    --enable-prefix-caching \
    --max-num-seqs 128
sglang request:
import time
import requests

url = "http://localhost:8000"

# Batch request against SGLang's native /generate endpoint.
# problems_token_completions is the list of input prompts (defined elsewhere).
json_data = {
    "text": problems_token_completions,
    "sampling_params": {"max_new_tokens": 10, "temperature": 0, "top_p": 1, "top_k": 1},
    # One LoRA adapter name per prompt: the two adapters alternate across the batch.
    "lora_path": ["lora0", "lora_batch"] * 32,
}

time_start = time.time()
response = requests.post(url + "/generate", json=json_data)
time_end = time.time()
print(time_end - time_start)
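To put both servers on the same footing, the elapsed time above can be turned into a client-side tokens/s number. A minimal sketch, assuming the batched /generate response is a JSON list whose items include a meta_info dict with a completion_tokens field (this may differ between SGLang versions):

# Rough client-side tokens/s for the batch above; assumes each item of the
# /generate response carries meta_info["completion_tokens"].
results = response.json()
total_tokens = sum(r["meta_info"]["completion_tokens"] for r in results)
print(f"throughput: {total_tokens / (time_end - time_start):.2f} tokens/s")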
vllm request:
import time
import requests

url = "http://localhost:8899"

# Single request against vLLM's OpenAI-compatible chat completions endpoint.
# problem[10] is one of the input prompts (defined elsewhere).
json_data = {
    "model": "reranker_classify_catalog_rough_model",
    "messages": [{"role": "user", "content": problem[10]}],
    "max_tokens": 100, "temperature": 0, "top_p": 1,
}

time_start = time.time()
response = requests.post(url + "/v1/chat/completions", json=json_data)
time_end = time.time()
print(time_end - time_start)
print(response.json())
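The equivalent client-side figure for vLLM can be read from the OpenAI-style usage block in the response. This sketch assumes usage.completion_tokens is populated, which should hold for /v1/chat/completions but is worth verifying on your version:

# Client-side throughput estimate from the OpenAI-compatible response;
# assumes the response JSON contains usage.completion_tokens.
usage = response.json().get("usage", {})
completion_tokens = usage.get("completion_tokens", 0)
print(f"throughput: {completion_tokens / (time_end - time_start):.2f} tokens/s")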
sglang speed:
gen throughput (token/s): 33.28
vllm speed:
Avg generation throughput: 55.9 tokens/s