[RFC]: Improve guided decoding (logit_processor) APIs and performance. #5423
I have a few questions:
Can you elaborate on why you think placing the guided decoding parameters in `SamplingParams` is the right approach?
Do you maybe mean stateless? If not, what do you mean exactly?

Regarding the topic of statefulness: we probably don't want to limit ourselves to stateless logits processors. If we manage to make the API such that it is easy to implement stateful logits processors, we would already make things much better. E.g., I think a very good thing to address would be to add infrastructure for pooling stateful objects, making it easy to declare that such an object must not be shared across sequences and requests, or at least must be reset before being used.

Could you also please elaborate on the new `prepare`/`apply` API? Are there maybe some type annotations missing for the return values of, e.g., `prepare`?

I might have misunderstood the proposal, though, so I'd be really happy if you could elaborate on it. All in all, I would be very interested in improvements in this area, so I'm glad you're working on it!
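For illustration, pooling infrastructure along those lines could look roughly like this (a minimal sketch; nothing here is an existing vLLM API, and the `reset()` hook on the pooled object is an assumption):

```python
import queue
from typing import Callable


class LogitsProcessorPool:
    """Hypothetical pool for stateful logits processors.

    A processor is checked out by exactly one sequence at a time and is
    reset before reuse, so no state leaks across sequences or requests.
    """

    def __init__(self, factory: Callable[[], object], size: int):
        self._pool: "queue.Queue[object]" = queue.Queue()
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self):
        processor = self._pool.get()  # blocks while all processors are in use
        processor.reset()  # assumed reset() hook on the pooled object
        return processor

    def release(self, processor) -> None:
        self._pool.put(processor)
```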
It's like moving the functionality into the core API. Right now it is implemented like an add-on (it only works with the OpenAI server), and it doesn't work with tools like https://github.com/anyscale/ray-llm (because we directly use the core API). It requires code that breaks the abstraction barrier (i.e., creating the logits processor yourself), and given that guided decoding is a core function, I feel like having the API in SamplingParams makes sense.
To improve the time it takes to prepare masks for JSON mode, we want to use parallel processing tools such as a threadpool or Ray. This requires the logits processor to be "stateful", because we don't want to recreate actors or threadpools every time a logits processor is requested (it should be created in advance and reused).
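As an illustration of that parallel mask preparation idea, a long-lived thread pool could fan out the per-sequence work roughly like this (a sketch; `BatchMaskPreparer` and the callable it wraps are assumptions, not vLLM code):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


class BatchMaskPreparer:
    """Hypothetical helper that prepares guided-decoding masks in parallel.

    The thread pool is created once and reused across decoding steps,
    which is why the logits processor needs to be stateful: recreating
    the pool (or Ray actors) for every request would defeat the purpose.
    """

    def __init__(self, compute_mask: Callable, num_workers: int = 8):
        self._compute_mask = compute_mask  # e.g. an FSM-based mask function
        self._executor = ThreadPoolExecutor(max_workers=num_workers)

    def prepare(self, token_id_lists: Sequence[List[int]]) -> list:
        # Fan out one mask computation per sequence in the batch, then
        # gather the results in the original order.
        futures = [self._executor.submit(self._compute_mask, ids)
                   for ids in token_id_lists]
        return [f.result() for f in futures]
```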
+1. I think it'd be an implementation of part 2.
It will replace
You are right that prep and apply are stateful. We could make it that way as well, but I found it easier to just make it fully stateful. Hope this clarifies the proposal a little bit!
We should make this work with the following RFC from @NadavShmayo: #4769
My initial thoughts:
Some ideas:
With an additional post-sampling callback, this would subsume my SequenceController #4775:
I see. I found that API limited for our particular use case because, as you know, it is applied after sampling is done (whereas we want to apply the logits processor to the final logits). It's great if we can subsume it.
I am open to it, but right now there are no specific use cases.
How is this guaranteed now?
@rkooo567 thanks, let me see if I understand it: the idea is that the logits processors will be asked to `prepare` and then to `apply`, in lockstep. This means that the whole process needs to guarantee that there is one logits processor instance per request per sequence. Correct? The implementation will need to be very careful to avoid contention issues.

Regarding the combination of this with the other PRs: I'm still struggling a bit to understand what general design we need. Let me explain: the logits processors are now applied in the models, so the general signature of the operation is roughly `(token_ids, logits) -> logits`.
We want to support ff-tokens or backtracking (e.g. #4775). These things happen a few layers above the model and don't fit the API above. So we're talking about different things in different abstraction layers at the same time. Am I the only one? Is the design clear to you folks? If so, I would appreciate it a lot if someone could describe where which types of objects would play which role.
@br3no One thing that took me a while to see is that there is only one logits processor per sequence. There was some discussion of allowing a list of those, but IMHO it's easy to write a combining processor. I'm the one asking for ff-tokens.
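For instance, under vLLM's current per-row convention (a callable taking the token ids generated so far and one row of logits), a combining processor can be a few lines (sketch):

```python
from typing import Callable, List

import torch

# vLLM's per-row logits processor convention: takes the token ids
# generated so far and one row of logits, returns the adjusted row.
LogitsProcessor = Callable[[List[int], torch.Tensor], torch.Tensor]


class CombinedLogitsProcessor:
    """Chains several logits processors so they behave as a single one."""

    def __init__(self, processors: List[LogitsProcessor]):
        self.processors = processors

    def __call__(self, token_ids: List[int],
                 logits: torch.Tensor) -> torch.Tensor:
        for processor in self.processors:
            logits = processor(token_ids, logits)
        return logits
```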
@rkooo567 @simon-mo @mmoskal some additional thoughts after we talked offline yesterday: it's a concern that the current support is kind of broken; it doesn't work for input batches or beam search due to the stateful/concurrency thing. So I wonder if we could prioritize some simpler immediate fixes for that, along with the egregious performance overhead of JSON mode due to having to construct a new processor for every request. A couple of other thoughts about the proposed interface:
@mmoskal thanks for your answer! I also would like to support ff-tokens, since I think this would contribute to alleviating the performance issues. @njhill I'm not familiar with lm-format-enforcer, but of the Outlines processors, only the CFG one is problematic now; the others are stateless. Should we concentrate on a "fix" for the output_format: json issue? This would involve an object pool for the CFGGuide for that particular use case. Or am I missing other aspects here?
I also agree with that. I have the impression the current interface is a little over-designed, with some vague implementation in mind. For ff-tokens and backtracking, I would like to see the implementation first; otherwise it is very difficult to design the interface (that's why we punted). I think the interface I propose here is not going to prevent us from getting there (the logits processor API also doesn't feel like a very stable API yet, so we have time to iterate).

Does it mean supporting stateful logits processors first (i.e., merging the open PR)? I am okay with this.

I think a regular constructor could work. The main reason for the current approach was that we need to pass the decoding config to the logits processor, and since it is inside the model, the required change was big. I think a constructor makes more sense, actually.

Yeah, it is a good point. For our internal implementation, we just need seq_data, seq_ids, request_id, and sampling params.
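A rough sketch of what a constructor-based interface carrying those fields could look like (all names here are illustrative, not a settled design):

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class LogitsProcessorContext:
    """Hypothetical bundle of the per-request state mentioned above."""
    request_id: str
    seq_ids: List[int]
    seq_data: Dict[int, Any]  # seq_id -> sequence data
    sampling_params: Any      # the request's SamplingParams


class GuidedLogitsProcessor:
    """Sketch: state arrives once through the constructor instead of
    being threaded through the model's forward pass."""

    def __init__(self, decoding_config: Any, context: LogitsProcessorContext):
        self.decoding_config = decoding_config
        self.context = context
```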
I did a first pass on this in #6273.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Hi all, for those who are following this thread: I started benchmarking current performance for guided decoding in vLLM in #10046.
Motivation.
Currently, the guided decoding & logits processor API is incomplete and has several issues. This RFC is intended to bring up the problems and proposed solutions. Some of the issues may have already been addressed and have PRs out already.
There are 3 major issues.
Proposed Change.
API
Guided decoding parameters are not supported in SamplingParams. This is addressed in #4130.
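As a sketch of the direction (the `guided_json` field and its handling below are illustrative assumptions, not the final API from #4130):

```python
from vllm import LLM, SamplingParams

# Hypothetical: guided decoding configured directly on SamplingParams, so
# it works through the core API (LLM / AsyncLLMEngine), not only the
# OpenAI-compatible server.
params = SamplingParams(
    temperature=0.0,
    max_tokens=128,
    guided_json={"type": "object",
                 "properties": {"name": {"type": "string"}}},
)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["Produce a JSON object:"], params)
```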
Performance
Currently, logits processors are applied row by row, blocking (see vllm/model_executor/layers/logits_processor.py, line 112 at commit 246598a).
To parallelize mask preparation (e.g., with a threadpool or Ray), we propose splitting the interface into a batched `prepare` step and an `apply` step. This requires the logits processor to be stateful. `prepare` and `apply` assume 1:1 calls; e.g., once `prepare` is called, `apply` has to be called before another `prepare` is called. I think that is a safe assumption. Alternatively, we could make `prepare` return a class, but that would make the interface surface larger, so I don't prefer that solution (but I am open to feedback!).

This is an example usage of the API:
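(A sketch: only the `prepare`/`apply` names come from this RFC; every other name here is illustrative.)

```python
from typing import List

import torch


class StatefulGuidedProcessor:
    """Sketch of the proposed stateful prepare/apply interface."""

    def __init__(self, vocab_size: int):
        self.vocab_size = vocab_size
        self._masks = None  # state carried from prepare() to apply()

    def prepare(self, batch_token_ids: List[List[int]]) -> None:
        # Compute masks for the whole batch up front; this is the step a
        # threadpool or Ray actors could parallelize per sequence.
        self._masks = [self._compute_mask(ids) for ids in batch_token_ids]

    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        # 1:1 with the preceding prepare(): consume the masks, then clear
        # them so a stray second apply() fails loudly.
        assert self._masks is not None, "apply() called without prepare()"
        for row, mask in enumerate(self._masks):
            logits[row][~mask] = float("-inf")
        self._masks = None
        return logits

    def _compute_mask(self, token_ids: List[int]) -> torch.Tensor:
        # Placeholder: a real guide would derive the mask from an FSM or
        # grammar; this one allows every token.
        return torch.ones(self.vocab_size, dtype=torch.bool)


processor = StatefulGuidedProcessor(vocab_size=32000)
processor.prepare([[1, 2, 3], [4, 5]])            # batched, parallelizable
out = processor.apply(torch.randn(2, 32000))      # must pair with prepare()
```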
We are also considering upstreaming a Ray-based batch processing implementation with lm-format-enforcer.
Failure Handling
When using a stateful logits processor, it is possible for requests to fail. For example, if we use Ray, Ray actors can die. Or there could be an issue with the user's schema that cannot be caught ahead of time.
When that happens, we should fail the seq_group immediately. We will introduce a new status `FINISHED_INTERNAL_ERROR = enum.auto()` in the sequence status enum in vllm/sequence.py (line 42 at commit 246598a).
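Sketched out, with the surrounding members abbreviated and illustrative rather than copied from vllm/sequence.py:

```python
import enum


class SequenceStatus(enum.Enum):
    """Abbreviated sketch of the enum in vllm/sequence.py."""
    WAITING = enum.auto()
    RUNNING = enum.auto()
    FINISHED_STOPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()
    # Proposed: set on every sequence in the seq_group when a stateful
    # logits processor fails (e.g., a dead Ray actor or a user schema
    # error that only surfaces mid-decode).
    FINISHED_INTERNAL_ERROR = enum.auto()

    @staticmethod
    def is_finished(status: "SequenceStatus") -> bool:
        return status.name.startswith("FINISHED")
```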
Feedback Period.
No response
CC List.
cc @simon-mo @Yard1
Any Other Things.
No response