使用FastChat的一些脚本 #64

Valdanitooooo · 2024-04-24T16:52:02Z

Valdanitooooo
Apr 24, 2024
Maintainer

4个脚本

1_start_controller.sh

nohup python -m fastchat.serve.controller --host 0.0.0.0 --port 21001  > log1_controller.txt 2>&1 &

2_start_worker.sh

nohup python -m fastchat.serve.model_worker --model-path /home/valdanito/workspace/llmops/models/llama/Meta-Llama-3-8B-Instruct --model-names Meta-Llama-3-8B-Instruct --num-gpus 1 > log2_worker.txt 2>&1 &

3_start_api_server.sh

nohup python -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000 > log3_api_server.txt 2>&1 &

4_start_web_ui.sh

nohup python -m fastchat.serve.gradio_web_server --host 0.0.0.0 --port 9000 > log4_web_ui.txt 2>&1 &

vllm

使用vllm就改下2_start_worker.sh
model_worker改成vllm_worker

nohup python -m fastchat.serve.vllm_worker --model-path /home/valdanito/workspace/llmops/models/llama/Meta-Llama-3-8B-Instruct --model-names Meta-Llama-3-8B-Instruct --num-gpus 1 > log2_worker.txt 2>&1 &

Tips

主要在woker部分可玩的方式比较多

首先worker和llm是一对一关系，一个worker只负责部署一个model，
但worker和显卡是多对多关系，一个模型可以单卡或多卡部署，一张显卡也可以部署一个或多个模型
可以根据参数灵活分配显存
比如用阿里云 A10 24GB * 8卡的服务器，部署qwen 14B、chatglm3-6b、wizardcoder 34B、llama2-7b四个模型
可以分给wizardcoder 4卡，qwen2卡，llama2和chatglm各1卡这样部署
也可以给wizardcoder 8卡，但显存限制最多用50%，给qwen4卡，显存最多占用50%，给llama2和chatglm各2卡，显存占用最多50%
方法很多，需要根据实际部署的模型占用显存的多少，以及并发量合理分配显存

worker支持的参数很多，可以 python -m fastchat.serve.model_worker --help查看

--gpus
用来指定显卡，比如 --gpus 0,1,2,3 这样就使用了8卡中的前四个，我记得这个参数也不是随便写的，使用gpu的数量要能被64整除
--max-gpu-memory
限制使用显存使用量，单位Gib，如 - --max-gpu-memory 22Gib，这个使用量针对的是单张卡

查看 python -m fastchat.serve.vllm_worker --help 和model_worker有些区别

它不能直接限制显存使用量，它的参数是

--gpu-memory-utilization
传入的是小数，如 - --gpu-memory-utilization 0.5，即使用显存量不超过单张卡的50%

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

使用FastChat的一些脚本 #64

{{title}}

Replies: 0 comments

Select a reply

使用FastChat的一些脚本 #64

Valdanitooooo Apr 24, 2024 Maintainer

4个脚本

vllm

Tips

Replies: 0 comments

Valdanitooooo
Apr 24, 2024
Maintainer