[Bug]: Incomplete tool calling response for pipeline-parallel vllm with ray #7194
Comments
I also encounter this error with Yi-34B.
cc @andoorve
How about using a single node with PP size 2 (so that it uses the multiprocessing backend)? Does it still have this issue?
I just tested it, but the bug appears there as well:
Again, the incomplete response:
I solved this: you only need to add `"min_tokens": **`. I set it to 50 to match my completion length.
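The workaround above can be sketched as a request payload for vLLM's OpenAI-compatible server. This is a hypothetical illustration, not the commenter's actual request; the model name and message are placeholders, and `min_tokens` is the vLLM extension field being passed as an extra parameter.

```python
# Hypothetical request body illustrating the "min_tokens" workaround.
# Model name and message content are placeholders, not from the report.
payload = {
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    # Workaround: force a minimum number of generated tokens so the
    # guided/tool-call output is not cut off after a few tokens.
    "min_tokens": 50,
}
```

With the official `openai` Python client, a field like this would be passed via `extra_body` rather than as a top-level argument.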
I did some investigation; the problem is quite deep. In the case of guided decoding we are using stateful logits processors. However, with PP, the logits processors get cloned when sent to workers other than the driver worker. This is fine for the TP use cases, since there the logits processor is not cloned and its state lives as long as the sequence group does. It is an issue beyond PP, however, and will affect anything SPMD. For those cases, one solution might be to have the logits processor live on the worker. Pinging @njhill, as this is quite similar to the seed issue and he might have suggestions on the best way to resolve it.
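The cloning problem described above can be shown in a minimal sketch (this is illustrative toy code, not vLLM's implementation): a stateful logits processor tracks an FSM position that must advance on every decoding step, so a deep copy shipped to another worker immediately falls out of sync with the driver's copy.

```python
import copy

class StatefulLogitsProcessor:
    """Toy stand-in for a guided-decoding logits processor with FSM state."""

    def __init__(self):
        self.fsm_state = 0  # position in the guided-decoding FSM

    def __call__(self, token_id, logits):
        # Real processors mask logits based on fsm_state; here we only
        # advance the state to show the divergence.
        self.fsm_state += 1
        return logits

driver = StatefulLogitsProcessor()
clone = copy.deepcopy(driver)  # roughly what PP does when dispatching

for tok in range(3):
    driver(tok, logits=[])

# The clone's state is frozen at copy time and no longer matches the
# driver's, so guided decoding on the other worker goes wrong.
print(driver.fsm_state, clone.fsm_state)
```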
I think we need to overhaul the logits processor part. It should not be part of the sampling parameters; the sampling parameters should just store the constraint, such as a regex or JSON schema, and the constrained-decoding state (i.e. the FSM state) should live in each sequence. The FSM itself, with masks for every state, should be retrieved from a cache that maps from regex to FSM.
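The proposed split can be sketched as follows. This is a toy design sketch, not vLLM code: the constraint (a regex string) is the only thing stored with the request, a shared cache maps each constraint to its compiled FSM, and each sequence carries only its own current FSM state. The FSM here is a trivial placeholder rather than a real token-level automaton.

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def get_fsm(regex: str) -> dict:
    # Placeholder for compiling `regex` into a token-level FSM with
    # precomputed per-state token masks; here it is a toy two-state FSM.
    return {"start": 0, "transitions": {(0, "a"): 1, (1, "a"): 1}}

class Sequence:
    """Each sequence stores only the constraint and its current FSM state."""

    def __init__(self, regex: str):
        self.regex = regex                        # the stateless constraint
        self.fsm_state = get_fsm(regex)["start"]  # per-sequence decode state

    def step(self, token: str) -> None:
        fsm = get_fsm(self.regex)  # shared, cached; safe to look up anywhere
        self.fsm_state = fsm["transitions"][(self.fsm_state, token)]

s1, s2 = Sequence("a+"), Sequence("a+")
s1.step("a")
# s1 advanced independently of s2, while both share one cached FSM, so
# cloning/moving a Sequence never duplicates mutable decoding machinery.
```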
Yes, exactly. We can keep the FSM as part of the logits processor but map from sequence to FSM in the worker, similar to the torch Generator. Ideally this is done in a clean and future-proof way. @rkooo567 you mention support for guided decoding in #7109. Is this something you're looking into?
I also totally agree with @youkaichao. I think that's the direction we should go (and what we wanted to achieve with #5423).
Yes @rkooo567, I think a lot of the SPMD work is basically directly applicable to PP.
There is already a PR to make the stateful logits processor shareable. The idea is as described above: pass around logits-processor factories instead of instances, and then each Sequence (which lives on each worker) has its own instance of the logits processor.
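The factory approach can be sketched like this. The names are illustrative, not vLLM's actual API: instead of shipping a stateful processor instance across workers, only a factory is shipped, and each worker-side sequence constructs its own fresh instance, so no mutable state ever needs to be cloned.

```python
from typing import Callable

class JSONLogitsProcessor:
    """Toy stateful processor; real ones mask logits per FSM state."""

    def __init__(self, schema: str):
        self.schema = schema   # the stateless constraint
        self.fsm_state = 0     # per-instance state, local to one sequence

def make_processor_factory(schema: str) -> Callable[[], JSONLogitsProcessor]:
    # The factory closes over only immutable data, so it is cheap and
    # safe to send to any worker.
    return lambda: JSONLogitsProcessor(schema)

factory = make_processor_factory('{"type": "object"}')

# Each worker-side sequence builds its own independent instance:
proc_a, proc_b = factory(), factory()
proc_a.fsm_state = 5
# proc_b is unaffected: no mutable state is shared across sequences.
```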
@maxdebayser, are you going to continue your work in #5329?
Absolutely, this has been planned for some time. A larger overhaul of how logits processors work is planned (e.g. #5423) but is stalled a bit given how it needs to fit with other changes. I feel we can make some incremental improvements in the meantime, potentially starting with #5329, which @jon-chuang referenced. Re: needing the state to be in the worker for PP etc., I know it's also required for SPMD, but IMO it should just be part of making the batches stateful (same for the torch.Generators).
@jon-chuang, I've been focusing on other issues over the past couple of weeks, but if this can be an incremental step toward the broader refactoring @njhill mentioned, I'll sync the PR with main again.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Version: vllm==0.6.4.post1

We've encountered an issue which I strongly assume is related to this. When hosting Qwen/Qwen2-VL-72B-Instruct-AWQ in distributed mode via Ray (with "--tensor-parallel-size 4 --pipeline-parallel-size 2"), it is not possible to use guided_json mode. It does not return an error code, but stops generating after 3 tokens: `{"content"`. Increasing min_tokens via extra_body, as mentioned above, does not resolve the issue. When disabling guided_json, the model generates a response without issue, and when not using "--pipeline-parallel-size 2", guided_json works properly. If needed, I can provide more details/output, but this currently looks quite related.
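The failing setup described above can be sketched as a raw request body for vLLM's OpenAI-compatible server. The JSON schema and message are illustrative placeholders; `guided_json` is vLLM's extension field for constrained JSON output, which the report says is the trigger under pipeline parallelism.

```python
# Hypothetical reproduction payload; schema and prompt are placeholders.
payload = {
    "model": "Qwen/Qwen2-VL-72B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "Describe the image as JSON."}],
    # vLLM extension field: constrain the output to this JSON schema.
    # Per the report, generation stops after `{"content"` when PP is on.
    "guided_json": {
        "type": "object",
        "properties": {"content": {"type": "string"}},
        "required": ["content"],
    },
}
```

When using the `openai` Python client instead of raw HTTP, this field would go through `extra_body`, as the commenter describes.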
@njhill as far as I can tell we still expect this to be broken, right? Some of the PRs planned around this are still open.
I'm removing the ray label, since the issue appears with the MP executor as well. cc @richardliaw
Your current environment
vllm v0.5.4
Setup A) single docker container with vllm, no pipeline-parallelism
Setup B) two docker containers with ray + vllm (pipeline parallelism)
The issue does not depend on the model; e.g. it also appears with meta-llama/Meta-Llama-3-70B-Instruct instead of Llama-3.1.

🐛 Describe the bug
Without pipeline parallelism, a request with tool calling receives a correct response with a valid tool call.
With pipeline parallelism, the same request receives an incomplete tool call (just a few tokens long, but still status 200).
Example request:
Response for setup A (no pipeline-parallelism)
The response for setup B (pipeline-parallelism with ray)