
[Bug]: vllm crashes when preemption of priority scheduling is triggered on vllm-0.6.3.dev173+g36ea7907.d20241011 #9342

Open
tonyaw opened this issue Oct 14, 2024 · 7 comments
Labels: bug (Something isn't working)

Comments

tonyaw commented Oct 14, 2024

Your current environment

The output of `python collect_env.py` was not provided.

Model Input Dumps

No response

🐛 Describe the bug

As I mentioned in #9272, even when the priority is propagated successfully, vllm always crashes as soon as preemption happens.
I just tested with vllm-0.6.3.dev173+g36ea7907.d20241011. The only changes I made are the fix below plus some extra log lines:
#9277

Could you please help check it?

Reproduce procedure:

  1. Start vllm:
python3 -m vllm.entrypoints.openai.api_server --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
        --host 0.0.0.0 --port 8080  --seed 42 --trust-remote-code --scheduling-policy priority \
        --tensor-parallel-size 2 --max-num-seqs 10 --enable_chunked_prefill False
  2. Use an OpenAI client to generate a load of 15 concurrent requests.
  3. Use another OpenAI client to send some requests with priority -100 (a rough client-side sketch is included after the traceback below).
  4. As soon as preemption is triggered, vllm crashes:
INFO 10-11 06:51:31 engine.py:292] Added request chat-29ccd21f37fb4a8ea96c1c8c189a6a49.
INFO 10-11 06:51:31 engine.py:294] tonyaw:Added request -100.
INFO 10-11 06:51:31 scheduler.py:1025] tonyaw: len(prefills.seq_groups) = 0
INFO 10-11 06:51:31 scheduler.py:1025] tonyaw: len(prefills.seq_groups) = 0
INFO 10-11 06:51:31 scheduler.py:807] tonyaw: _schedule_priority_preemption: waiting_queue is not None.
INFO 10-11 06:51:31 scheduler.py:808] tonyaw: seq_group chat-29ccd21f37fb4a8ea96c1c8c189a6a49 priority:-100
INFO 10-11 06:51:31 scheduler.py:837] tonyaw: _schedule_priority_preemption: vseq_group chat-55b8bdce4ae14eb8869839561fac50f9 is pop up, and will preempt.
WARNING 10-11 06:51:31 scheduler.py:1493] Sequence group chat-55b8bdce4ae14eb8869839561fac50f9 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
INFO 10-11 06:51:31 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241011-065131.pkl...
WARNING 10-11 06:51:31 model_runner_base.py:143] Failed to pickle inputs of failed execution: Can't get local object 'weak_bind.<locals>.weak_bound'
INFO:     10.254.17.246:54142 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
INFO:     10.254.17.246:54086 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
ERROR 10-11 06:51:31 engine.py:160] ValueError('Error in model execution: seq_group.get_last_latency() should not be called if the seq_group is in prefill phase.')
ERROR 10-11 06:51:31 engine.py:160] Traceback (most recent call last):
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 10-11 06:51:31 engine.py:160]     return func(*args, **kwargs)
ERROR 10-11 06:51:31 engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1698, in execute_model
ERROR 10-11 06:51:31 engine.py:160]     model_input.async_callback()
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 1122, in weak_bound
ERROR 10-11 06:51:31 engine.py:160]     unbound(inst, *args, **kwargs)
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1210, in _process_model_outputs
ERROR 10-11 06:51:31 engine.py:160]     self.do_log_stats(scheduler_outputs, outputs, finished_before,
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1543, in do_log_stats
ERROR 10-11 06:51:31 engine.py:160]     stats = self._get_stats(scheduler_outputs, model_output,
ERROR 10-11 06:51:31 engine.py:160]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1664, in _get_stats
ERROR 10-11 06:51:31 engine.py:160]     latency = seq_group.get_last_latency(now)
ERROR 10-11 06:51:31 engine.py:160]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/sequence.py", line 772, in get_last_latency
ERROR 10-11 06:51:31 engine.py:160]     raise ValueError(
ERROR 10-11 06:51:31 engine.py:160] ValueError: seq_group.get_last_latency() should not be called if the seq_group is in prefill phase.
ERROR 10-11 06:51:31 engine.py:160] 
ERROR 10-11 06:51:31 engine.py:160] The above exception was the direct cause of the following exception:
ERROR 10-11 06:51:31 engine.py:160] 
ERROR 10-11 06:51:31 engine.py:160] Traceback (most recent call last):
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 158, in start
ERROR 10-11 06:51:31 engine.py:160]     self.run_engine_loop()
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 221, in run_engine_loop
ERROR 10-11 06:51:31 engine.py:160]     request_outputs = self.engine_step()
ERROR 10-11 06:51:31 engine.py:160]                       ^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 239, in engine_step
ERROR 10-11 06:51:31 engine.py:160]     raise e
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 230, in engine_step
ERROR 10-11 06:51:31 engine.py:160]     return self.engine.step()
ERROR 10-11 06:51:31 engine.py:160]            ^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1386, in step
ERROR 10-11 06:51:31 engine.py:160]     outputs = self.model_executor.execute_model(
ERROR 10-11 06:51:31 engine.py:160]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 82, in execute_model
ERROR 10-11 06:51:31 engine.py:160]     driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 10-11 06:51:31 engine.py:160]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 155, in _driver_execute_model
ERROR 10-11 06:51:31 engine.py:160]     return self.driver_worker.execute_model(execute_model_req)
ERROR 10-11 06:51:31 engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 10-11 06:51:31 engine.py:160]     output = self.model_runner.execute_model(
ERROR 10-11 06:51:31 engine.py:160]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-11 06:51:31 engine.py:160]     return func(*args, **kwargs)
ERROR 10-11 06:51:31 engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 10-11 06:51:31 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 146, in _wrapper
ERROR 10-11 06:51:31 engine.py:160]     raise type(err)(f"Error in model execution: "
ERROR 10-11 06:51:31 engine.py:160] ValueError: Error in model execution: seq_group.get_last_latency() should not be called if the seq_group is in prefill phase.
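For reference, here is a minimal client-side sketch of the load described in steps 2-3. It is only a sketch: the prompt, max_tokens, and the assumption that the OpenAI-compatible server accepts a priority field forwarded via extra_body are mine, not taken from the original report.

# Rough sketch of the client load from steps 2-3 (assumptions noted above).
import concurrent.futures
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
MODEL = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"

def send(priority: int):
    # "priority" is assumed to be forwarded to the scheduler; lower values are scheduled first.
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Write a long story about preemption."}],
        max_tokens=512,
        extra_body={"priority": priority},
    )

with concurrent.futures.ThreadPoolExecutor(max_workers=17) as pool:
    futures = [pool.submit(send, 0) for _ in range(15)]       # step 2: background load
    futures += [pool.submit(send, -100) for _ in range(2)]    # step 3: high-priority requests
    for f in concurrent.futures.as_completed(futures):
        print(f.result().choices[0].message.content[:80])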

tonyaw added the bug label on Oct 14, 2024

tonyaw commented Oct 16, 2024

Any update or suggestion for this?
The priority scheduling feature is very important to us.
Thanks in advance!


gilljon commented Oct 28, 2024

Encountering the exact same thing.

@hediyuan

Encountering the same situation, using AsyncLLMEngine directly rather than the multiprocessing engine.
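For completeness, here is a rough sketch of hitting the same code path through AsyncLLMEngine directly. The exact constructor arguments and the priority keyword of generate() are assumptions about this vLLM version and may need adjusting.

import asyncio
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
        tensor_parallel_size=2,
        max_num_seqs=10,
        scheduling_policy="priority",
    ))

async def one_request(i: int, priority: int):
    params = SamplingParams(max_tokens=512)
    # Consume the stream; the crash is raised inside the engine loop once preemption fires.
    async for _ in engine.generate(f"prompt {i}", params, request_id=f"req-{i}", priority=priority):
        pass

async def main():
    # 15 requests at default priority plus 2 high-priority (-100) requests, mirroring the repro above.
    await asyncio.gather(*[one_request(i, 0) for i in range(15)],
                         *[one_request(100 + i, -100) for i in range(2)])

asyncio.run(main())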

@ZJUFangzh

Encountering the exact same thing. Has anyone found a fix?


ZJUFangzh commented Oct 30, 2024

I use --disable-async-output-proc to avoid this problem, so I think PR #7049 introduced the bug.
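For anyone else hitting this, the workaround applied to the launch command from the original report would look roughly like this (only --disable-async-output-proc is added; everything else is unchanged):

python3 -m vllm.entrypoints.openai.api_server --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
        --host 0.0.0.0 --port 8080 --seed 42 --trust-remote-code --scheduling-policy priority \
        --tensor-parallel-size 2 --max-num-seqs 10 --enable_chunked_prefill False \
        --disable-async-output-proc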


gilljon commented Oct 31, 2024

I use --disable-async-output-proc to avoid this problem, so I think PR #7049 introduced the bug.

This solved it for me, thank you! 🙏


kar9999 commented Dec 20, 2024

Does vLLM priority scheduling actually work? I set max-num-seqs to 32. First I sent 30 requests with priority 0 together with 1 request with priority -3. Then I repeated the run without setting priorities, i.e. 31 requests all at priority 0. The token_per_s of the priority -3 request in the first run is the same as that of the priority 0 requests in the second run. What's the problem?
(screenshot of the token_per_s comparison attached)
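For context, this is roughly how such a comparison could be measured from the client side; the model name, the extra_body priority field, and the use of usage.completion_tokens for token_per_s are assumptions, not taken from the screenshot.

# Rough sketch: compare token_per_s of one priority -3 request against 30 priority 0 requests.
import time
import concurrent.futures
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
MODEL = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"  # assumed model name

def timed_request(priority: int) -> float:
    start = time.time()
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Explain how the KV cache works."}],
        max_tokens=256,
        extra_body={"priority": priority},  # assumed pass-through to the scheduler
    )
    return resp.usage.completion_tokens / (time.time() - start)

with concurrent.futures.ThreadPoolExecutor(max_workers=31) as pool:
    background = [pool.submit(timed_request, 0) for _ in range(30)]
    tagged = pool.submit(timed_request, -3)
    print("priority -3 token/s:", tagged.result())
    print("priority  0 token/s:", sum(f.result() for f in background) / len(background))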
