[Bug]: Nonsense output for Qwen2.5 72B after upgrading to latest vllm 0.6.3.post1 [REPROs] #9769

Closed
pseudotensor opened this issue Oct 28, 2024 · 14 comments
Labels: bug Something isn't working

@pseudotensor

Your current environment

docker 0.6.3.post1
8*A100

docker pull vllm/vllm-openai:latest
docker stop qwen25_72b ; docker remove qwen25_72b
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=4,5,6,7"' \
    --shm-size=10.24gb \
    -p 5001:5001 \
        -e NCCL_IGNORE_DISABLED_P2P=1 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -v "${HOME}"/.cache:$HOME/.cache/ -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name qwen25_72b \
     vllm/vllm-openai:latest \
        --port=5001 \
        --host=0.0.0.0 \
        --model=Qwen/Qwen2.5-72B-Instruct \
        --tensor-parallel-size=4 \
        --seed 1234 \
        --trust-remote-code \
        --max-model-len=32768 \
        --max-num-batched-tokens 131072 \
        --max-log-len=100 \
        --api-key=EMPTY \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.qwen25_72b.txt
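
Since the container runs detached with output appended to a log file, a quick smoke test helps confirm the server actually came up. A minimal sketch, assuming vLLM's standard /health endpoint and the port and API key from the command above:

# Sketch: confirm the server is live before sending real traffic.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:5001/health
# List the served model (the server was started with --api-key=EMPTY).
curl -s http://localhost:5001/v1/models -H "Authorization: Bearer EMPTY"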

Model Input Dumps

No response

🐛 Describe the bug

No such issues with prior vLLM 0.6.2.

Trivial queries work:

from openai import OpenAI

client = OpenAI(base_url='FILL ME', api_key='FILL ME')

messages = [
    {
        "role": "user",
        "content": "Who are you?",
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=messages,
    temperature=0.0,
    max_tokens=4096,
)

print(response.choices[0])

But longer inputs lead to nonsense output only with the new vLLM:

qwentest1.py.zip

Gives:

Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='A\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text\n</text>\n\n</text>\n\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text>\n\n</text\n</text>\n\n</text\n</text\n\n</text\n</text>\n\n</text\n</text\n</text\n</text>\n\n</text\n</text\n</text>\n\n</text>\n\n</text\n\n</text\n</text\n</text\n</text>\n\n</text>\n\n</text\n</text>\n\n</text\n\n\n</text>\n\n</text\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text\n</text\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text\n</text</text</text\n</text</text\n</text</text</text\n</text\n</text\n</text>\n\n</text</text</text</text>\n\n</text>\n\n</text</text>\n\n</text>\n\n</text\n</text</text\n</text\n</text>\n\n</text\n</text>\n\n</text\n</text>\n\n</text\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text</text>\n\n</text</text</text</text</text</text>\n\n</text</text</text</text</text</text</text</text</text</text</text>\n\n</text>\n\n</text</text>\n\n</text</text</text</text</text>\n\n</text</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text>\n\n</text\n</text>\n\n</text\n</text>\n\n</text>\n\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text\n</text>\n', refusal=None, role='assistant', function_call=None, tool_calls=[]), stop_reason=None)
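
The attached qwentest1.py is not inlined above; as a rough stand-in, a long-input request of the same shape reproduces this class of failure. The filler text, token count, and endpoint below are assumptions, not the contents of the attached script:

# Build a prompt of very roughly 20K tokens by repeating filler text (rough estimate only).
FILLER=$(printf 'The quick brown fox jumps over the lazy dog. %.0s' {1..2000})
curl -s http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d "{\"model\": \"Qwen/Qwen2.5-72B-Instruct\",
       \"messages\": [{\"role\": \"user\", \"content\": \"Summarize the following text: ${FILLER}\"}],
       \"temperature\": 0.0, \"max_tokens\": 1024}"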

Full logs from that running state; the server had been running overnight, handling some benchmarks.

qwen25_72b.bad.log.zip

Related or not? #9732

@pseudotensor pseudotensor added the bug Something isn't working label Oct 28, 2024
@pseudotensor
Author

pseudotensor commented Oct 28, 2024

nvidia-smi:

ubuntu@h2ogpt-a100-node-1:~$ nvidia-smi
Mon Oct 28 19:41:45 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:0F:00.0 Off |                    0 |
| N/A   43C    P0             69W /  400W |   69883MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:15:00.0 Off |                    0 |
| N/A   41C    P0             71W /  400W |   69787MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  |   00000000:50:00.0 Off |                    0 |
| N/A   41C    P0             72W /  400W |   69787MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  |   00000000:53:00.0 Off |                    0 |
| N/A   41C    P0             67W /  400W |   69499MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  |   00000000:8C:00.0 Off |                    0 |
| N/A   68C    P0            332W /  400W |   77735MiB /  81920MiB |     96%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  |   00000000:91:00.0 Off |                    0 |
| N/A   60C    P0            318W /  400W |   77639MiB /  81920MiB |     92%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  |   00000000:D6:00.0 Off |                    0 |
| N/A   63C    P0            331W /  400W |   77639MiB /  81920MiB |     93%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  |   00000000:DA:00.0 Off |                    0 |
| N/A   72C    P0            331W /  400W |   77351MiB /  81920MiB |     94%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   1815338      C   /usr/bin/python3                            69864MiB |
|    1   N/A  N/A   1815472      C   /usr/bin/python3                            69768MiB |
|    2   N/A  N/A   1815473      C   /usr/bin/python3                            69768MiB |
|    3   N/A  N/A   1815474      C   /usr/bin/python3                            69480MiB |
|    4   N/A  N/A   1980777      C   /usr/bin/python3                            77716MiB |
|    5   N/A  N/A   1981060      C   /usr/bin/python3                            77620MiB |
|    6   N/A  N/A   1981061      C   /usr/bin/python3                            77620MiB |
|    7   N/A  N/A   1981062      C   /usr/bin/python3                            77332MiB |
+-----------------------------------------------------------------------------------------+

The other 4 GPUs are running Qwen VL 2 76B.

ubuntu@h2ogpt-a100-node-1:~$ docker ps
CONTAINER ID   IMAGE                     COMMAND                  CREATED        STATUS        PORTS     NAMES
78dce1c637ec   vllm/vllm-openai:latest   "python3 -m vllm.ent…"   27 hours ago   Up 27 hours             qwen25_72b
d2918b1209aa   vllm/vllm-openai:latest                  "python3 -m vllm.ent…"   4 weeks ago    Up 5 days               qwen72bvll

@pseudotensor
Author

Even after restarting the docker image, I get back the same result.

So the above script is a fine repro. It isn't the only way, of course; all our longer inputs fail with 0.6.3.post1.

@pseudotensor pseudotensor changed the title [Bug]: Nonsense output for Qwen2.5 72B after upgrading to latest vllm 0.6.3.post1 [Bug]: Nonsense output for Qwen2.5 72B after upgrading to latest vllm 0.6.3.post1 [REPROs] Oct 28, 2024
@pseudotensor
Author

Note this model is an extremely good, competitive model for coding and agents, so it really needs to be a first-class citizen for the vLLM team in terms of testing etc.

@osilverstein

I just posted a similar issue, but with totally different params. I wonder if it's related at all: issue

@HoboRiceone

Facing similar problems.

@cedonley
Contributor

I had issues with long context. They are related to the issue fixed in this PR: #9549
If you get better results with --enforce-eager, then this is likely the culprit. I've seen several similar issues over the past few days.
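
For anyone who wants to test that workaround against the original setup, a trimmed sketch of relaunching the same server with CUDA graphs disabled (the container name here is made up, several flags from the original command are omitted for brevity, and throughput will be lower):

docker run -d --runtime=nvidia --gpus '"device=4,5,6,7"' \
    --shm-size=10.24gb --network host \
    --name qwen25_72b_eager \
    vllm/vllm-openai:latest \
        --port=5001 \
        --host=0.0.0.0 \
        --model=Qwen/Qwen2.5-72B-Instruct \
        --tensor-parallel-size=4 \
        --max-model-len=32768 \
        --api-key=EMPTY \
        --enforce-eager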

@pseudotensor
Author

Got it. I can try that if I want to upgrade again, but I'll stick to 0.6.2 for this model for now.

@SinanAkkoyun

I fixed my nonsense issue by installing the latest dev version of vLLM #9732 (comment)

Maybe that fixes your issue too @pseudotensor
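
For reference, one way to get the development version at that point was a source install straight from the main branch; this is only a sketch (it requires a CUDA build toolchain) and not necessarily the wheel used in the linked comment:

pip install -U git+https://github.com/vllm-project/vllm.git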

@why11699

Same situation when processing 32K-context input on Qwen2.5-7B.
It works fine after rolling vLLM back to 0.6.2.
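
For anyone rolling back the same way, a sketch of pinning to the previous release instead of :latest (the v0.6.2 image tag and PyPI version are assumed to follow vLLM's usual release naming):

# Docker deployment: pin the image tag instead of :latest.
docker pull vllm/vllm-openai:v0.6.2
# Or, for a pip-managed install:
pip install vllm==0.6.2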

@frei-x

frei-x commented Oct 31, 2024

I have this problem when using AWQ and GPTQ. Adding --enforce-eager resolves it, but it is slower.

@cedonley
Contributor

The issue is resolved in main with this fix: #9549

You can install the nightly or use --enforce-eager until v0.6.4. You may be able to revert to 0.6.2, but I had issues with 0.6.2 due to a transformers change that breaks Qwen2.5 when you enable long context (>32k).

@xinfanmeng

same problem

@pseudotensor
Author

@cedonley --enforce-eager produces the same nonsense output in more general cases.

@DarkLight1337
Member

DarkLight1337 commented Dec 25, 2024

Closing as #9549 has been released. Please upgrade vLLM to v0.6.4 or above.
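
For completeness, a sketch of that upgrade path; the image tag and version specifier are assumed to follow vLLM's usual release naming:

# Docker deployment: pull a release at or above v0.6.4 (or re-pull :latest).
docker pull vllm/vllm-openai:v0.6.4
# Pip-managed install:
pip install -U "vllm>=0.6.4"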
