[New Model] GLM-4-9B-Chat #5306
Comments
As described in https://huggingface.co/THUDM/glm-4-9b-chat:
When I use openai_api_server to call the model, it can't stop talking.
Same error.
Did you add the
I think you can refer to this:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
# GLM-4-9B-Chat-1M
# max_model_len, tp_size = 1048576, 4
# GLM-4-9B-Chat
# If you run into OOM, reduce max_model_len or increase tp_size
max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
model=model_name,
tensor_parallel_size=tp_size,
max_model_len=max_model_len,
trust_remote_code=True,
enforce_eager=True,
# GLM-4-9B-Chat-1M: if you run into OOM, enable the parameters below
# enable_chunked_prefill=True,
# max_num_batched_tokens=8192
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)
inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
This snippet is copied from https://huggingface.co/THUDM/glm-4-9b-chat
How should this be used with the OpenAI-compatible API?
Specifically, how should the OpenAI API call be modified? Could you write it out in full?
@godcrying Solved; just pass the corresponding parameters via extra_body when calling the OpenAI-compatible API.
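For reference, a minimal sketch of what that can look like with the openai Python client pointed at a vLLM OpenAI-compatible server; the base_url, api_key placeholder, and served model name below are assumptions, so adjust them to your deployment:
from openai import OpenAI

# Assumed local vLLM OpenAI-compatible server; change base_url/model to match your setup
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4-9b-chat",
    messages=[{"role": "user", "content": "你好"}],
    temperature=0.95,
    max_tokens=1024,
    # vLLM-specific sampling fields go through extra_body; these are the GLM-4 stop token ids from the snippet above
    extra_body={"stop_token_ids": [151329, 151336, 151338]},
)
print(response.choices[0].message.content)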
Can streaming output be achieved without using the OpenAI API? Is there an example?
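Not an official answer, but a rough sketch of token streaming without the OpenAI server, assuming vLLM's AsyncLLMEngine interface of that era (from_engine_args plus an async generate(prompt, sampling_params, request_id) generator); treat the exact signatures as assumptions and check them against your vLLM version:
import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

async def main():
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="THUDM/glm-4-9b-chat",
                        trust_remote_code=True, enforce_eager=True))
    params = SamplingParams(temperature=0.95, max_tokens=1024,
                            stop_token_ids=[151329, 151336, 151338])
    # For chat use, the prompt should first go through the tokenizer's chat template
    # (as in the snippet earlier in this thread); a raw prompt is used here for brevity.
    printed = 0
    async for output in engine.generate("你好", params, request_id=str(uuid.uuid4())):
        text = output.outputs[0].text
        print(text[printed:], end="", flush=True)  # print only the newly generated piece
        printed = len(text)

asyncio.run(main())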
Could you share how the server side is written?
https://blog.csdn.net/weixin_42357472/article/details/139504731 https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#
Hey guys, I wonder whether you know how to use function calling with chatglm4-9b-chat through vLLM?
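One possible direction, not confirmed anywhere in this thread: the openai client can send the standard tools parameter to the OpenAI-compatible endpoint, but whether GLM-4 tool calls are actually rendered and parsed depends on the served chat template and the vLLM version, so treat everything below (endpoint, tool definition, served model name) as a hypothetical sketch to verify:
from openai import OpenAI

# Hypothetical local endpoint; server-side tool-call support must be verified separately
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4-9b-chat",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
    extra_body={"stop_token_ids": [151329, 151336, 151338]},
)
print(response.choices[0].message)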
When I use vllm to start glm-4-9b-chat-1m model serving, there is an error: RuntimeError: Failed to load the model config. If the model is a custom model not yet available in the HuggingFace transformers library, consider setting
python -m vllm.entrypoints.openai.api_server \
--model /path/to/glm-4-9b-chat \
--served-model-name glm-4-9b-chat \
--max-model-len 1024 \
--trust-remote-code \
--host=0.0.0.0 --port=8001 --enforce-eager
Running the command above to launch the server and using the request code provided by @lonngxiang works perfectly.
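For completeness, a small sketch of a raw HTTP request against the server launched above (host, port 8001, and served model name glm-4-9b-chat taken from that command; the stop_token_ids are the GLM-4 ones from earlier in the thread):
import requests

# Assumes the server launched above is reachable at this address
resp = requests.post(
    "http://0.0.0.0:8001/v1/chat/completions",
    json={
        "model": "glm-4-9b-chat",
        "messages": [{"role": "user", "content": "你好"}],
        "temperature": 0.95,
        "max_tokens": 512,
        # GLM-4 stop token ids so generation terminates properly
        "stop_token_ids": [151329, 151336, 151338],
    },
)
print(resp.json()["choices"][0]["message"]["content"])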
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
'message': 'As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.',
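If you hit that error, one likely fix (an assumption based on the error text, not confirmed in this thread) is to pass an explicit chat template to the server, for example by adding something like the following to the launch command above, where the template path is hypothetical:
--chat-template /path/to/glm-4-chat-template.jinja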
The model to consider.
https://huggingface.co/THUDM/glm-4-9b-chat
The closest model vllm already supports.
chatglm
What's your difficulty of supporting the model you want?
No response