Support vllm engine. #40

Isotr0py · 2024-01-14T05:34:38Z

Related issue: #39

支持vllm推理后端

TODO:

支持流式输出 (AsyncLLMEngine) -> ~~Update: need to fix~~ Fixed
支持非流式输出 (LLM)
测试GPTQ/AWQ量化模型推理
~~添加 requirements~~ -> Update: conflict with transformers==4.33.2, won't add to requirements.txt

~~尚未完成所有测试，先提个draft~~
Done.

Install

Run:

# cu121
pip3 install https://github.com/vllm-project/vllm/releases/download/v0.2.7/vllm-0.2.7-cp310-cp310-manylinux1_x86_64.whl

or

# cu118
pip3 install https://github.com/vllm-project/vllm/releases/download/v0.2.7/vllm-0.2.7+cu118-cp310-cp310-manylinux1_x86_64.whl

before running pip3 install transformers==4.33.2 sentencepiece xformers

utils/model.py

utils/cli.py

utils/model.py

kurikomoe · 2024-01-14T09:33:31Z

I will put a pr to group arguments in different groups after this pr merged. Currently it's kind of messy.

Isotr0py · 2024-01-14T14:53:14Z

Test results

Sakura-13B-LNovel-v0_8-4bit GPTQ model works with tensor_parallel_size=1.
Sakura-7B 4bit AWQ model works with tensor_parallel_size=2.

Benchmark (`Sakura-13B-LNovel-v0_8-4bit`)

vllm (tensor_parallel_size=1, enforce_eager=True, T4 GPU)

INFO 01-14 14:30:33 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 13.2%, CPU KV cache usage: 0.0%
INFO 01-14 14:30:38 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 15.8%, CPU KV cache usage: 0.0%
INFO 01-14 14:30:43 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 18.3%, CPU KV cache usage: 0.0%
Processed prompts: 100%|██████████████████████████| 1/1 [00:19<00:00, 19.72s/it]
2024-01-14 14:30:47 07e90ad7a5a0 utils.model[4917] INFO Output generated in 19.72 seconds (25.96 tokens/s, 512 tokens, context 505 tokens)

autogptq (distributed inference, T4 GPU*2)

2024-01-14 14:47:21 07e90ad7a5a0 utils.model[4972] INFO Output generated in 140.95 seconds (3.63 tokens/s, 512 tokens, context 505 tokens)

Known issues

Sakura-13B-LNovel-v0_8-4bit GPTQ model doesn't work with tensor_parallel_size=2.
Because input_size_per_partition % (quant_config.group_size * tensor_parallel_size) != 0 is not met, where input_size_per_partition=13696 and quantization_config.group_size=128

StreamOutput doesn't work yet, will be fixed soon.

sakura-umi · 2024-01-16T13:22:28Z

So is there any solution about the fake stream output?
If the stream model only output after all inference is done then there is no sense.

Maybe we can implement the true stream output referring to Qwen's implementation or vllm openai api server implementation.
I'll soon on it.

kurikomoe · 2024-01-16T14:25:51Z

vllm 0.2.7 requires pydantic==1.10.13, but you have pydantic 2.5.3 which is incompatible.
vllm 0.2.7 requires transformers>=4.36.0, but you have transformers 4.33.2 which is incompatible.

AFAIK, Sakura-13B-LNovel-v0_8-4bit is based on Baichuan which needs transformer == 4.33.2, or we need to change some code like sp_model to fix the compatibility issue.

As for the streaming output and nested_async, give me some time to make this pr run on my PC. The plain Sakura-13B-LNovel-v0_8-4bit with the following command failed to start. Maybe something related to the torch version. I will try the 0.8 awq later.

python3 server.py --listen 0.0.0.0:5000 --trust_remote_code --model_name_or_path ./models/Sakura-13B-LNovel-v0_8-4bit  --model_version 0.8 --no-auth --log debug  --vllm

ok my fault, just because 3090 out of memory when try to run Sakura-13B-LNovel-v0_8-4bit with --vllm

Isotr0py · 2024-01-16T14:41:28Z

It's strange that 3090 will OOM. It seems that it's because --use_gptq_model missing.
My cmd to run Sakura-13B-LNovel-v0_8-4bit on a 15G T4:

python server.py \
    --model_name_or_path SakuraLLM/Sakura-13B-LNovel-v0_8-4bit \
    --vllm \
    --use_gptq_model \
    --model_version 0.8 \
    --trust_remote_code \
    --no-auth \
    --tensor_parallel_size 1 \
    --enforce_eager \
    --gpu_memory_utilization 0.95

kurikomoe · 2024-01-16T14:52:39Z

It's strange that 3090 will OOM. It seems that it's because --use_gptq_model missing. My cmd to run Sakura-13B-LNovel-v0_8-4bit on a 15G T4:
python server.py \
    --model_name_or_path SakuraLLM/Sakura-13B-LNovel-v0_8-4bit \
    --vllm \
    --use_gptq_model \
    --model_version 0.8 \
    --trust_remote_code \
    --no-auth \
    --tensor_parallel_size 1 \
    --enforce_eager \
    --gpu_memory_utilization 0.95

ok, --use_gptq_model works.
Then, I need to think of a method to validate the model_name/version/quant/quant_methods against the command-line options in the next cli patch.

Isotr0py · 2024-01-16T14:54:50Z

OK, the fake stream output problem should be solved now.

utils/model.py

sakura-umi · 2024-01-16T16:18:14Z

Then, I need to think of a method to validate the model_name/version/quant/quant_methods against the command-line options in the next cli patch.

This can be very helpful for those who don't know much about how the params work.

utils/model.py

kurikomoe

LGTM, tested on 0.8 4bit gptq

Still, we need to solve the transformer == 4.33.2 issue soon or later.

sakura-umi · 2024-01-17T14:43:39Z

I'll update README.md and pyinstaller settings soon.

Isotr0py added 2 commits January 13, 2024 20:49

initialize: add vllm backend

8a7613b

add vllm args

267161b

sakura-umi added enhancement New feature or request server This issue is about Sakura Server API labels Jan 14, 2024

sakura-umi requested review from sakura-umi and kurikomoe January 14, 2024 07:08

add vllm quant

9f31892

kurikomoe requested changes Jan 14, 2024

View reviewed changes

utils/model.py Outdated Show resolved Hide resolved

utils/model.py Outdated Show resolved Hide resolved

utils/cli.py Show resolved Hide resolved

utils/model.py Outdated Show resolved Hide resolved

utils/model.py Outdated Show resolved Hide resolved

Isotr0py added 3 commits January 14, 2024 20:07

support vllm batch inference

78b9b0f

fix vllm generate

9a4241d

add eager mode to reduce vram usage

b5c013d

fix vllm streaming and gpu usage

5d1f7cd

Isotr0py marked this pull request as ready for review January 16, 2024 06:12

Isotr0py requested a review from kurikomoe January 16, 2024 09:11

fix vllm streamoutput

bff717d

refactor vllm generate

43f5269

sakura-umi reviewed Jan 16, 2024

View reviewed changes

utils/model.py Outdated Show resolved Hide resolved

kurikomoe requested changes Jan 16, 2024

View reviewed changes

utils/model.py Show resolved Hide resolved

utils/model.py Show resolved Hide resolved

utils/model.py Outdated Show resolved Hide resolved

utils/model.py Outdated Show resolved Hide resolved

Isotr0py added 3 commits January 17, 2024 13:22

add requirement, fix vllm id

6546039

remove nest_asyncio

a0039bf

mark conflict package

9708ad8

Isotr0py requested a review from sakura-umi January 17, 2024 09:23

Isotr0py requested a review from kurikomoe January 17, 2024 09:23

kurikomoe approved these changes Jan 17, 2024

View reviewed changes

sakura-umi approved these changes Jan 17, 2024

View reviewed changes

sakura-umi merged commit 0adef2b into SakuraLLM:main Jan 17, 2024
3 checks passed

Isotr0py deleted the vllm branch January 17, 2024 15:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support vllm engine. #40

Support vllm engine. #40

Isotr0py commented Jan 14, 2024 •

edited

Loading

kurikomoe commented Jan 14, 2024

Isotr0py commented Jan 14, 2024 •

edited

Loading

sakura-umi commented Jan 16, 2024 •

edited

Loading

kurikomoe commented Jan 16, 2024 •

edited

Loading

Isotr0py commented Jan 16, 2024

kurikomoe commented Jan 16, 2024 •

edited

Loading

Isotr0py commented Jan 16, 2024

sakura-umi commented Jan 16, 2024

kurikomoe left a comment •

edited

Loading

sakura-umi commented Jan 17, 2024

Support vllm engine. #40

Support vllm engine. #40

Conversation

Isotr0py commented Jan 14, 2024 • edited Loading

TODO:

Install

kurikomoe commented Jan 14, 2024

Isotr0py commented Jan 14, 2024 • edited Loading

Test results

Benchmark (Sakura-13B-LNovel-v0_8-4bit)

Known issues

sakura-umi commented Jan 16, 2024 • edited Loading

kurikomoe commented Jan 16, 2024 • edited Loading

Isotr0py commented Jan 16, 2024

kurikomoe commented Jan 16, 2024 • edited Loading

Isotr0py commented Jan 16, 2024

sakura-umi commented Jan 16, 2024

kurikomoe left a comment • edited Loading

Choose a reason for hiding this comment

sakura-umi commented Jan 17, 2024

Isotr0py commented Jan 14, 2024 •

edited

Loading

Isotr0py commented Jan 14, 2024 •

edited

Loading

Benchmark (`Sakura-13B-LNovel-v0_8-4bit`)

sakura-umi commented Jan 16, 2024 •

edited

Loading

kurikomoe commented Jan 16, 2024 •

edited

Loading

kurikomoe commented Jan 16, 2024 •

edited

Loading

kurikomoe left a comment •

edited

Loading