
[New Model] GLM-4-9B-Chat #5306

Closed
Geaming2002 opened this issue Jun 6, 2024 · 18 comments · Fixed by #10561
Labels
new model, unstale

Comments

@Geaming2002

The model to consider.

https://huggingface.co/THUDM/glm-4-9b-chat

The closest model vllm already supports.

chatglm

What's your difficulty of supporting the model you want?

No response

Geaming2002 added the new model label on Jun 6, 2024
@jeejeelee
Collaborator

As described in https://huggingface.co/THUDM/glm-4-9b-chat:
vLLM can support GLM-4-9B-Chat directly.

@godcrying

When I use the openai_api_server to call the model, it can't stop talking.

@lonngxiang

When I use the openai_api_server to call the model, it can't stop talking.

Same error here.

@ShangmingCai
Contributor

ShangmingCai commented Jun 6, 2024

When I use the openai_api_server to call the model, it can't stop talking.

Did you add the stop_token_ids of ChatGLM when configuring openai_api_server? Maybe you can configure the generate function in api_server.py with non-default SamplingParams, or just pass the stop_token_ids parameter through the request_dict (see vllm/entrypoints/api_server.py line 47).
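
For example, with the demo server from vllm/entrypoints/api_server.py running locally (a minimal sketch; the host, port, prompt, and max_tokens are placeholders, and the stop token IDs are the GLM-4 ones used in the snippets below), any extra fields in the JSON body are forwarded to SamplingParams:

# Hypothetical request against the demo /generate endpoint.
curl http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "你好",
        "max_tokens": 256,
        "stop_token_ids": [151329, 151336, 151338]
    }'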

@jeejeelee
Collaborator

I think you can refer to:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# GLM-4-9B-Chat-1M
# max_model_len, tp_size = 1048576, 4

# GLM-4-9B-Chat
# If you hit OOM, reduce max_model_len or increase tp_size
max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    # For GLM-4-9B-Chat-1M, if you hit OOM, enable the parameters below
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)

This snippet is copied from https://huggingface.co/THUDM/glm-4-9b-chat.

@lonngxiang

lonngxiang commented Jun 6, 2024

(Quoting @jeejeelee's offline inference snippet above.)

How do I use this through the OpenAI API?

@lonngxiang

AsyncEngineArgs

Specifically, how should the OpenAI API server be configured? Could you please write a more complete example?

@lonngxiang

@godcrying Solved: with the OpenAI API, just pass the corresponding parameters via extra_body.

from openai import OpenAI
# from openai._client import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://192.1****:10860/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
## Streaming response
stream = client.chat.completions.create(
    model="/glm-9b",
    messages=[{"role": "user", "content": "你是谁"}],
    extra_body={
        "stop_token_ids": [151329, 151336, 151338]
    },
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        # print(chunk.choices[0].delta.content)
        print(chunk.choices[0].delta.content, end="")


@orderer0001

for chunk in stream:

Is it possible to achieve streaming output without using the OpenAI interface? Is there an example?

@lonngxiang

Is it possible to achieve streaming output without using the OpenAI interface? Is there an example?

import requests

url = "http://*****:10860/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "EMPTY"
}
data = {
    "model": "/glm-9b",
    "messages": [{"role": "user", "content": "你是谁"}],
    "stop_token_ids": [151329, 151336, 151338],
    "stream": True
}

response = requests.post(url, headers=headers, json=data, stream=True)

# Print the raw SSE chunks as they arrive.
for line in response.iter_lines():
    if line:
        print(line.decode("utf-8"))

@orderer0001

(Quoting @lonngxiang's requests-based streaming example above.)

Could you please share how the server side is set up?

@lonngxiang

lonngxiang commented Jun 7, 2024

Could you please share how the server side is set up?

https://blog.csdn.net/weixin_42357472/article/details/139504731

https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#

@lllloda

lllloda commented Jun 12, 2024

Hey guys, does anyone know how to use function calling with glm-4-9b-chat through vLLM?

@lockmatrix

Could you please share how the server side is set up?

Run cp config.json generation_config.json in the model directory.

@gabohouhou

When I use vLLM to serve the glm-4-9b-chat-1m model, I get an error: RuntimeError: Failed to load the model config. If the model is a custom model not yet available in the HuggingFace transformers library, consider setting trust_remote_code=True in LLM or using the --trust-remote-code flag in the CLI.
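
(For reference, applying the error message's suggested fix when launching the OpenAI-compatible server would look roughly like this; the model path is a placeholder.)

# Hypothetical launch command; --trust-remote-code is the flag named in the error.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/glm-4-9b-chat-1m \
    --trust-remote-code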

@SunLemuria

SunLemuria commented Jun 28, 2024

Could you please share how the server side is set up?

python -m vllm.entrypoints.openai.api_server \
    --model /path/to/glm-4-9b-chat \
    --served-model-name glm-4-9b-chat \
    --max-model-len 1024 \
    --trust-remote-code \
    --host=0.0.0.0 --port=8001 --enforce-eager

Run the command above to launch the server, then use the request code provided by @lonngxiang. Works perfectly.

@github-actions

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label on Oct 26, 2024
@BlueSkyyyyyy

(Quoting @SunLemuria's server launch command and @lonngxiang's request example above.)

With the setup above, I get: 'message': 'As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.'
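
(A possible workaround, sketched here under the assumption that a GLM-4 Jinja chat template has been saved locally as glm4.jinja, for example extracted from the chat_template field of the model repo's tokenizer_config.json: pass it explicitly via the OpenAI-compatible server's --chat-template flag.)

# Hypothetical launch command; glm4.jinja is an assumed local template file.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/glm-4-9b-chat \
    --served-model-name glm-4-9b-chat \
    --trust-remote-code \
    --chat-template ./glm4.jinja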

github-actions bot added the unstale label and removed the stale label on Nov 8, 2024
DarkLight1337 changed the title to [New Model] GLM-4-9B-Chat on Nov 27, 2024