[Bug]: ValueError: could not broadcast input array from shape (513,) into shape (512,) #8068
Comments
Great investigation! I think the problem here is that CUDA graph only works for a block table with 512 blocks, but somehow we are using CUDA graph even when it has 513 blocks. If you can provide the input example, or even find out which code made the wrong decision, it would be very helpful. You should be able to find the code to blame in vllm/vllm/worker/model_runner.py or vllm/attention/backends/flash_attn.py.
I'm using the conversation history to generate load, but the timing of when vLLM throws the error keeps varying, so it seems like it's not due to a specific prompt but rather a specific condition that has to be met while the system is under load. Anyway, if I set the --max-seq-len-to-capture option to the same value as --max-model-len, would I be able to avoid the current error?
I think so, but I'm not sure. You're welcome to give it a try and report back.
I think this PR may solve this problem: #8145
@Ximingwang-09 thanks for the investigation!
I just noticed that @ashgold uses multi-step scheduling, which indeed has lookahead slots.
I can confirm I have the same issue, but only when using multi-step scheduling; without it everything works fine.
This is just an assumption, but with multi-step scheduling the engine needs contiguous memory space for the extra steps; in that case, is there any chance it goes beyond the 512 blocks that were allocated?
This may provide a temporary solution: #8340. It would be good to know whether it solves your issue.
I'll see whether this issue is reproduced when the next release comes out. Thank you!
@ashgold you don't need to wait for a release; we have per-commit wheels. After that PR is merged, you can follow https://docs.vllm.ai/en/latest/getting_started/installation.html to install the wheel for the commit. In addition, if you want to give it a try, you can even just add the lines from the PR into your code; it's only a few Python lines.
Okay. |
I checked, and the issue is not reproduced in v0.6.1.post1. I think the problem is solved!
It seems the issue is reproduced again in v0.6.3. Downgrading to v0.6.1.post1 solved the problem in my case.
@andrea-veritas please open a new issue with detailed info |
|
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
This is the same issue raised in #5563. From what I've researched, it is related to the size of the space allocated for the CUDA graph block tables.
Here are the arguments used to run vLLM.
The error occurs at the following location.
vllm/vllm/attention/backends/flash_attn.py, lines 457 to 467 (commit 5b86b19)
The shape involved here comes from max context len // block size.
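The embedded snippet doesn't render here, so as a paraphrased sketch (not the verbatim lines from commit 5b86b19): the CUDA-graph path copies each request's block table into a row of the preallocated array, which is where the broadcast fails.

```python
# Paraphrased sketch of the CUDA-graph path in flash_attn.py, not the exact
# lines from commit 5b86b19: each request's block table is copied into a row
# of the preallocated graph block table.
input_block_tables = self.runner.graph_block_tables[:batch_size]
for i, block_table in enumerate(self.block_tables):
    if block_table:
        # Fails with "could not broadcast input array from shape (513,)
        # into shape (512,)" when len(block_table) exceeds the row width.
        input_block_tables[i, :len(block_table)] = block_table
block_tables = torch.from_numpy(input_block_tables).to(device, non_blocking=True)
```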
I looked at the part that allocates self.runner.graph_block_tables, and it is allocated like this.
vllm/vllm/worker/model_runner.py, lines 871 to 873 (commit 5b86b19)
vllm/vllm/worker/model_runner.py, lines 1011 to 1013 (commit 5b86b19)
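Again the embedded snippets are not shown; roughly (a paraphrase, not the exact lines from commit 5b86b19), the array is sized from max_seq_len_to_capture divided by the block size.

```python
# Paraphrased sketch of the allocation in model_runner.py, not the exact lines
# from commit 5b86b19: the row width is the number of blocks needed to cover
# max_seq_len_to_capture.
def get_max_block_per_batch(self) -> int:
    block_size = self.block_size
    return (self.max_seq_len_to_capture + block_size - 1) // block_size

self.graph_block_tables = np.zeros(
    (max(_BATCH_SIZES_TO_CAPTURE), self.get_max_block_per_batch()),
    dtype=np.int32)

# With the defaults (max_seq_len_to_capture = 8192, block_size = 16) each row
# has 8192 // 16 = 512 entries, so a sequence that needs 513 blocks overflows.
```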
max_seq_len_to_capture defaults to 8192 unless otherwise specified.
vllm/vllm/engine/arg_utils.py, line 100 (commit 5b86b19)
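The referenced line is presumably just the default value; as a sketch (assumed, not copied from commit 5b86b19):

```python
# Assumed default in EngineArgs (arg_utils.py): CUDA graphs capture sequences
# up to 8192 tokens unless overridden on the command line.
max_seq_len_to_capture: int = 8192
```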
Ultimately, the value of self.max_seq_len_to_capture was determined by the following logic.
vllm/vllm/config.py, lines 333 to 337 (commit 5b86b19)
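Paraphrased (not the verbatim lines from commit 5b86b19), the clamp looks roughly like this:

```python
# Paraphrased sketch of ModelConfig in config.py: the capture length is
# clamped so CUDA graphs are never captured beyond the model's context window.
self.max_seq_len_to_capture = min(self.max_seq_len_to_capture,
                                  self.max_model_len)
```

Per the comments above, the overflow seems to come from the extra lookahead slots added by multi-step scheduling rather than from this clamp by itself.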
I think this can be fixed by replacing min with max.
I'm curious what your intentions were in taking the min value.
Before submitting a new issue...