
Inference with glm9b-chat fails on an H20 GPU, version 0.16.3 #2544

Closed
1 of 3 tasks
yangyu6 opened this issue Nov 12, 2024 · 10 comments


yangyu6 commented Nov 12, 2024

System Info

Version 0.16.3
GPU: H20
CUDA 12.1

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

0.16.3

The command used to start Xinference

nohup env XINFERENCE_HOME=/home/root/.cache XINFERENCE_MODEL_SRC=modelscope xinference-local --log-level debug --host 0.0.0.0 --port 9997 > output.log 2>&1 &

Reproduction

Loaded the model from the web UI.
[screenshot]

Error log:
2024-11-12 08:45:41,477 xinference.core.model 5261 DEBUG [request 7f8510d6-a0d2-11ef-af1b-06cfd44f9164] Enter chat, args: ModelActor(glm4-chat-0),[{'role': 'user', 'content': '你好'}],{'frequency_penalty': 0.0, 'max_tokens': 512, 'presence_penalty': 0.0, 'temperature': 0.7, 'top_p': ..., kwargs: raw_params={'frequency_penalty': 0.0, 'max_tokens': 512, 'presence_penalty': 0.0, 'stream': True, 'temperature'...
2024-11-12 08:45:41,478 xinference.core.model 5261 DEBUG [request 7f8510d6-a0d2-11ef-af1b-06cfd44f9164] Leave chat, elapsed time: 0 s
2024-11-12 08:45:41,478 xinference.core.model 5261 DEBUG After request chat, current serve request count: 0 for the model glm4-chat
2024-11-12 08:45:41,486 transformers.generation.configuration_utils 5261 INFO loading configuration file /home/root/.cache/cache/glm4-chat-pytorch-9b/generation_config.json
2024-11-12 08:45:41,486 transformers.generation.configuration_utils 5261 INFO Generate config GenerationConfig {
"do_sample": true,
"eos_token_id": [
151329,
151336,
151338
],
"max_length": 128000,
"pad_token_id": 151329,
"temperature": 0.8,
"top_p": 0.8
}

2024-11-12 08:45:42,850 xinference.api.restful_api 4440 ERROR Chat completion stream got an error: Remote server 0.0.0.0:33031 closed
Traceback (most recent call last):
File "/root/miniconda3/envs/yu/lib/python3.10/site-packages/xinference/api/restful_api.py", line 1974, in stream_results
async for item in iterator:
File "/root/miniconda3/envs/yu/lib/python3.10/site-packages/xoscar/api.py", line 340, in anext
return await self._actor_ref.xoscar_next(self._uid)
File "/root/miniconda3/envs/yu/lib/python3.10/site-packages/xoscar/backends/context.py", line 230, in send
result = await self._wait(future, actor_ref.address, send_message) # type: ignore
File "/root/miniconda3/envs/yu/lib/python3.10/site-packages/xoscar/backends/context.py", line 115, in _wait
return await future
File "/root/miniconda3/envs/yu/lib/python3.10/site-packages/xoscar/backends/core.py", line 84, in _listen
raise ServerClosed(
xoscar.errors.ServerClosed: Remote server 0.0.0.0:33031 closed
2024-11-12 08:45:43,146 xinference.core.worker 4582 WARNING Process 0.0.0.0:33031 is down.

Symptom: the model had already been loaded into GPU memory; as soon as the chat endpoint was called, the error above appeared and the model was reloaded.

Expected behavior

Normal chat responses.

@XprobeBot XprobeBot added the gpu label Nov 12, 2024
@XprobeBot XprobeBot added this to the v0.16 milestone Nov 12, 2024
qinxuye (Contributor) commented Nov 13, 2024

This error usually means OOM or the process exiting unexpectedly. An H20 should be more than enough to run glm9b, so it feels like a driver issue or something similar is causing the unexpected exit.
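As a rough sanity check on the OOM hypothesis, here is a back-of-envelope sizing sketch (assumptions: fp16 weights at 2 bytes per parameter, plus ~20% overhead for KV cache and activations; 96 GB is the H20's advertised memory). By this estimate, glm4-9b should fit with a great deal of room to spare:

```python
def fits_in_vram(n_params_billion: float, vram_gb: float,
                 bytes_per_param: float = 2.0, overhead: float = 1.2) -> bool:
    """Rough estimate: fp16 weights plus ~20% headroom vs. available GPU memory."""
    needed_gb = n_params_billion * bytes_per_param * overhead
    return needed_gb <= vram_gb

# glm4-9b in fp16: roughly 9 * 2 * 1.2 = 21.6 GB, well under an H20's 96 GB.
print(fits_in_vram(9, 96))   # → True
```

Under these assumptions an OOM on an H20 would be surprising for a 9B model, which is consistent with suspecting the driver or libraries instead.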

This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Nov 20, 2024
@tianbo-che commented

I ran into the same problem: deploying qwen72B works fine on 2× A100 40G. On an H20 it deploys successfully, but after a chat request it errors out and the service restarts.

@github-actions github-actions bot removed the stale label Nov 22, 2024
@XprobeBot XprobeBot modified the milestones: v0.16, v1.x Nov 25, 2024

jiusi9 commented Dec 4, 2024

Hi, I ran into this problem too.

A Qwen2.5-32B model runs fine on an A30 GPU, but on our company's newly bought H20 it crashes as soon as it runs; the GPU-memory monitoring shows it went down and restarted.

I tried swapping the transformers version, and upgrading xinference to 1.0.1; neither helped...

qinxuye (Contributor) commented Dec 4, 2024

> Hi, I ran into this problem too.
>
> A Qwen2.5-32B model runs fine on an A30 GPU, but on our company's newly bought H20 it crashes as soon as it runs; the GPU-memory monitoring shows it went down and restarted.
>
> I tried swapping the transformers version, and upgrading xinference to 1.0.1; neither helped...

Are there any logs from the crash?

@github-actions github-actions bot removed the stale label Dec 4, 2024

jiusi9 commented Dec 5, 2024

After the model starts up normally, it goes down as soon as a request comes in; no obvious error shows up:

2024-12-05 09:33:52,971 transformers.models.llama.modeling_llama 419 WARNING  We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
2024-12-05 09:33:55,205 xinference.api.restful_api 1 ERROR    Remote server 0.0.0.0:41183 closed
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/xinference/api/restful_api.py", line 1771, in create_chat_completion
    data = await model.chat(
  File "/usr/local/lib/python3.8/dist-packages/xoscar/backends/context.py", line 226, in send
    result = await self._wait(future, actor_ref.address, send_message)  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/xoscar/backends/context.py", line 115, in _wait
    return await future
  File "/usr/local/lib/python3.8/dist-packages/xoscar/backends/context.py", line 106, in _wait
    await asyncio.shield(future)
  File "/usr/local/lib/python3.8/dist-packages/xoscar/backends/core.py", line 84, in _listen
    raise ServerClosed(
xoscar.errors.ServerClosed: Remote server 0.0.0.0:41183 closed
2024-12-05 09:33:55,872 xinference.core.worker 143 WARNING  Process 0.0.0.0:41183 is down.
2024-12-05 09:33:55,873 xinference.core.worker 143 WARNING  Recreating model actor sparrowx-openbuddy-llama3.1-8b-v22.2-131k-1-0 ...
2024-12-05 09:33:58,908 xinference.model.llm.llm_family 143 INFO     Caching from URI: file:///opt/models/openbuddy-llama3.1-8b-v22.2-131k
2024-12-05 09:33:58,909 xinference.model.llm.llm_family 143 INFO     Cache /opt/models/openbuddy-llama3.1-8b-v22.2-131k exists
2024-12-05 09:33:58,963 transformers.tokenization_utils_base 615 INFO     loading file tokenizer.json
2024-12-05 09:33:58,963 transformers.tokenization_utils_base 615 INFO     loading file added_tokens.json
2024-12-05 09:33:58,963 transformers.tokenization_utils_base 615 INFO     loading file special_tokens_map.json
2024-12-05 09:33:58,964 transformers.tokenization_utils_base 615 INFO     loading file tokenizer_config.json
2024-12-05 09:33:59,220 transformers.tokenization_utils_base 615 INFO     Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-12-05 09:33:59,221 transformers.configuration_utils 615 INFO     loading configuration file /opt/models/openbuddy-llama3.1-8b-v22.2-131k/config.json
2024-12-05 09:33:59,223 transformers.configuration_utils 615 INFO     Model config LlamaConfig {
  "_name_or_path": "/opt/models/openbuddy-llama3.1-8b-v22.2-131k",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009,
    128048
  ],
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.43.4",
  "use_cache": true,
  "vocab_size": 128256
}
2024-12-05 09:33:59,421 transformers.modeling_utils 615 INFO     loading weights file /opt/models/openbuddy-llama3.1-8b-v22.2-131k/model.safetensors.index.json
2024-12-05 09:33:59,421 transformers.modeling_utils 615 INFO     Instantiating LlamaForCausalLM model under default dtype torch.float16.
2024-12-05 09:33:59,422 transformers.generation.configuration_utils 615 INFO     Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009,
    128048
  ]
}

Loading checkpoint shards:   0%|                         | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:  25%|████▎            | 1/4 [00:02<00:07,  2.64s/it]
Loading checkpoint shards:  50%|████████▌        | 2/4 [00:05<00:05,  2.85s/it]
Loading checkpoint shards:  75%|████████████▊    | 3/4 [00:08<00:02,  2.91s/it]
Loading checkpoint shards: 100%|█████████████████| 4/4 [00:09<00:00,  1.94s/it]
Loading checkpoint shards: 100%|█████████████████| 4/4 [00:09<00:00,  2.27s/it]
2024-12-05 09:34:08,808 transformers.modeling_utils 615 INFO     All model checkpoint weights were used when initializing LlamaForCausalLM.


jiusi9 commented Dec 5, 2024

I'm using the transformers backend, but I found a solution in the vllm issues, and after testing it the model runs:

pip install nvidia-cublas-cu12==12.4.5.8

Even though torch 2.3.1 requires nvidia-cublas-cu12==12.1.3.1, everything still works after the upgrade.

vllm-project/vllm#9215
vllm-project/vllm#7893
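For anyone checking whether their environment already carries the fix, here is a small stdlib-only sketch (assuming only that the wheel is named nvidia-cublas-cu12, as in the pip command above) that reports which cuBLAS wheel is actually installed:

```python
from importlib import metadata
from typing import Optional

def cublas_wheel_version() -> Optional[str]:
    """Return the installed nvidia-cublas-cu12 wheel version, or None if absent."""
    try:
        return metadata.version("nvidia-cublas-cu12")
    except metadata.PackageNotFoundError:
        return None

# A stock torch 2.3.1 environment should report 12.1.3.1;
# after the workaround above it should report 12.4.5.8.
print(cublas_wheel_version())
```

Note this only reflects the pip-installed wheel; a cuBLAS bundled with the system CUDA toolkit would not show up here.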

This issue was closed because it has been inactive for 5 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 17, 2024