[Bug]: vLLM v0.6.1 Instability issue under load. #8219
Comments
I changed some arguments and ran the load test again.
The engine still died, but the error message is different.
I have observed the same issue while load testing 0.6.0. I also observed the error when the GPU KV cache usage was close to 100%, though I'm not sure there is a causal relation. There is no graceful degradation; vLLM needs to be restarted.
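Since the thread notes that vLLM has to be restarted once the engine dies, here is a minimal watchdog sketch that polls the OpenAI-compatible server's /health endpoint and relaunches the process when it stops responding. The launch command, model name, URL, and polling interval are illustrative assumptions, not taken from this report.

```python
# Hypothetical watchdog sketch: restart the vLLM server when the engine dies,
# since the thread reports no graceful recovery once AsyncLLMEngine is dead.
# The command, model name, URL, and interval below are illustrative assumptions.
import subprocess
import time

import requests

SERVER_CMD = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "your-model-name",  # hypothetical model name
]
HEALTH_URL = "http://localhost:8000/health"

def start_server() -> subprocess.Popen:
    return subprocess.Popen(SERVER_CMD)

proc = start_server()
while True:
    time.sleep(10)
    healthy = False
    try:
        healthy = requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        pass
    if not healthy or proc.poll() is not None:
        # Engine dead or unreachable: clean up and relaunch.
        proc.kill()
        proc.wait()
        proc = start_server()
```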
I have now observed another exception:
Both exceptions have in common that they expect content, but there is nothing there. It seems that some sequences get lost under high load.
Also, adding … does not help.
Neither does increasing …
cc @SolitaryThinker is it caused by multi-step?
I don't think I can reproduce it, but it's probably not caused by multi-step, as it's disabled by default.
I see the same exception under load (running 32B-W4A16 on an RTX 4090, --dtype auto, vLLM version 0.6.0):
CRITICAL 09-07 01:49:38 launcher.py:98] AsyncLLMEngine is already dead, terminating server process
Facing the same issue. It used to handle running out of KV cache space gracefully; now it throws the … instead.
Update: …
I'm working on a fix; it should be ready soon.
@alexm-neuralmagic @youkaichao @SolitaryThinker @robertgshaw2-neuralmagic
Now the timing of the error is a little different.
I ran a load test on v0.6.1 today, and the engine crashed. vLLM version 0.6.1.
The issue disappeared in v0.6.1.post1. Thank you guys!!
@ashgold Cool!
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
I ran a load test on vLLM v0.6.0 using conversation history as input.
I've run the test about three times and encountered the issue every time. I'm not sure whether it only appears after GPU KV cache usage reaches 100%, but so far it has always been reproduced under load after the cache usage reaches 100%.
The conversation histories we used for testing averaged about 1,600 input prompt tokens, and the model's answers averaged about 150 tokens.
The following are the vLLM startup arguments.
This error did not occur in versions prior to v0.5.5. (I ran the load test three times with exactly the same arguments.)
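For reproduction purposes, a minimal sketch of this kind of load test against the OpenAI-compatible endpoint is shown below. The base URL, model name, concurrency, request count, and placeholder prompt are assumptions, not the exact setup used in this report.

```python
# Minimal load-test sketch against a vLLM OpenAI-compatible server.
# Base URL, model name, concurrency, and prompts are illustrative assumptions;
# the report used real conversation histories (~1,600 prompt tokens on average).
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "your-model-name"        # hypothetical; use the served model name
CONCURRENCY = 64                 # hypothetical load level
NUM_REQUESTS = 2000
semaphore = asyncio.Semaphore(CONCURRENCY)

async def one_request(i: int) -> None:
    # Long prompt, short answer, roughly matching the reported token counts.
    prompt = "conversation history placeholder " * 400
    async with semaphore:
        try:
            await client.chat.completions.create(
                model=MODEL,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=200,
            )
        except Exception as exc:
            # Under heavy load the engine reportedly dies instead of degrading
            # gracefully; surface failed requests here.
            print(f"request {i} failed: {exc!r}")

async def main() -> None:
    await asyncio.gather(*(one_request(i) for i in range(NUM_REQUESTS)))

if __name__ == "__main__":
    asyncio.run(main())
```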
During the load test, I also got the following warning message five minutes before the system died. Could this be related to the issue?
WARNING 09-05 17:10:02 scheduler.py:1355] Sequence group cmpl-0c49e59124fe4f9c8b8e6e0f4bae49d7-0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
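For reference, the two settings that warning suggests increasing correspond to engine parameters like the ones sketched below with the offline LLM API. The model name and values are illustrative assumptions only, not a recommendation for this particular deployment.

```python
# Sketch of the knobs the preemption warning points at; values are illustrative
# assumptions only, not tuned for the setup in this report.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-model-name",      # hypothetical model name
    gpu_memory_utilization=0.95,  # reserve more GPU memory for the KV cache (default 0.9)
    tensor_parallel_size=2,       # shard across 2 GPUs, adding total KV cache capacity
    max_model_len=8192,           # capping context length also reduces per-sequence KV usage
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))
```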