
[New Model] GLM-4-9B-Chat #5306

Closed
Geaming2002 opened this issue Jun 6, 2024 · 18 comments · Fixed by #10561
Labels
new model, unstale

Comments

@Geaming2002

The model to consider.

https://huggingface.co/THUDM/glm-4-9b-chat

The closest model vllm already supports.

chatglm

What's your difficulty of supporting the model you want?

No response

Geaming2002 added the new model label on Jun 6, 2024
@jeejeelee
Collaborator

As described in https://huggingface.co/THUDM/glm-4-9b-chat:
vLLM can support GLM-4-9B-Chat directly.

@godcrying

When I use the openai_api_server to call the model, it can't stop talking.

@lonngxiang

When I use the openai_api_server to call the model, it can't stop talking.

Same error here.

@ShangmingCai
Contributor

ShangmingCai commented Jun 6, 2024

When I use the openai_api_server to call the model, it can't stop talking.

Did you add the stop_token_ids of ChatGLM when configuring openai_api_server? Maybe you can configure the generate function in api_server.py with non-default SamplingParams, or just pass the stop_token_ids parameter through the request_dict (see vllm/entrypoints/api_server.py line 47).
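
For example, with the demo server from vllm/entrypoints/api_server.py running locally (a minimal sketch; the host, port, prompt, and max_tokens are placeholders, and the stop token IDs are the GLM-4 ones used in the snippets below), any extra fields in the JSON body are forwarded to SamplingParams:

# Hypothetical request against the demo /generate endpoint.
curl http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "你好",
        "max_tokens": 256,
        "stop_token_ids": [151329, 151336, 151338]
    }'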

@jeejeelee
Collaborator

I think you can refer to:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# GLM-4-9B-Chat-1M
# max_model_len, tp_size = 1048576, 4

# GLM-4-9B-Chat
# If you hit OOM, reduce max_model_len or increase tp_size
max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    # For GLM-4-9B-Chat-1M, if you hit OOM, enable the parameters below
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)

This snippet is copied from https://huggingface.co/THUDM/glm-4-9b-chat.

@lonngxiang

lonngxiang commented Jun 6, 2024

(Quoting @jeejeelee's offline inference snippet above.)

How do I use this through the OpenAI API?

@lonngxiang

AsyncEngineArgs

Specifically, how should the OpenAI API server be configured? Could you please write a more complete example?

@lonngxiang

@godcrying Solved: with the OpenAI API, just pass the corresponding parameters via extra_body.

from openai import OpenAI
# from openai._client import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://192.1****:10860/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
## Streaming response
stream = client.chat.completions.create(
    model="/glm-9b",
    messages=[{"role": "user", "content": "你是谁"}],
    extra_body={
        "stop_token_ids": [151329, 151336, 151338]
    },
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        # print(chunk.choices[0].delta.content)
        print(chunk.choices[0].delta.content, end="")


@orderer0001

for chunk in stream:

Is it possible to achieve streaming output without using the OpenAI interface? Is there an example?

@lonngxiang

Is it possible to achieve streaming output without using the OpenAI interface? Is there an example?

import requests

url = "http://*****:10860/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "EMPTY"
}
data = {
    "model": "/glm-9b",
    "messages": [{"role": "user", "content": "你是谁"}],
    "stop_token_ids": [151329, 151336, 151338],
    "stream": True
}

response = requests.post(url, headers=headers, json=data, stream=True)

# Print the raw SSE chunks as they arrive.
for line in response.iter_lines():
    if line:
        print(line.decode("utf-8"))

@orderer0001

(Quoting @lonngxiang's requests-based streaming example above.)

Could you please share how the server side is set up?

@lonngxiang

lonngxiang commented Jun 7, 2024

Could you please share how the server side is set up?

https://blog.csdn.net/weixin_42357472/article/details/139504731

https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#

@lllloda

lllloda commented Jun 12, 2024

Hey guys, does anyone know how to use function calling with glm-4-9b-chat through vLLM?

@lockmatrix

Could you please share how the server side is set up?

Run cp config.json generation_config.json in the model directory.

@gabohouhou

When I use vLLM to serve the glm-4-9b-chat-1m model, I get an error: RuntimeError: Failed to load the model config. If the model is a custom model not yet available in the HuggingFace transformers library, consider setting trust_remote_code=True in LLM or using the --trust-remote-code flag in the CLI.
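
(For reference, applying the error message's suggested fix when launching the OpenAI-compatible server would look roughly like this; the model path is a placeholder.)

# Hypothetical launch command; --trust-remote-code is the flag named in the error.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/glm-4-9b-chat-1m \
    --trust-remote-code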

@SunLemuria

SunLemuria commented Jun 28, 2024

Could you please share how the server side is set up?

python -m vllm.entrypoints.openai.api_server \
    --model /path/to/glm-4-9b-chat \
    --served-model-name glm-4-9b-chat \
    --max-model-len 1024 \
    --trust-remote-code \
    --host=0.0.0.0 --port=8001 --enforce-eager

Run the command above to launch the server, then use the request code provided by @lonngxiang. Works perfectly.

@github-actions

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label on Oct 26, 2024
@BlueSkyyyyyy

(Quoting @SunLemuria's server launch command and @lonngxiang's request example above.)

With the setup above, I get: 'message': 'As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.'
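
(A possible workaround, sketched here under the assumption that a GLM-4 Jinja chat template has been saved locally as glm4.jinja, for example extracted from the chat_template field of the model repo's tokenizer_config.json: pass it explicitly via the OpenAI-compatible server's --chat-template flag.)

# Hypothetical launch command; glm4.jinja is an assumed local template file.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/glm-4-9b-chat \
    --served-model-name glm-4-9b-chat \
    --trust-remote-code \
    --chat-template ./glm4.jinja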

github-actions bot added the unstale label and removed the stale label on Nov 8, 2024
DarkLight1337 changed the title to [New Model] GLM-4-9B-Chat on Nov 27, 2024