[Feature]: Unblock LLM while handling long sequences / Handling multiple prefills at the same time #10774
Comments
Hey, thanks for taking the time to share this. Have you tried giving …? In case you're referring to a more elaborate policy where you're actively discouraging long-running requests, this is not implemented AFAIK and is indeed non-trivial, requiring modifications to https://github.com/vllm-project/vllm/blob/main/vllm/core/scheduler.py#L507 (the chunked prefill policy being one example of how to do it). You would probably want the running request queue to be a priority queue, but that would break the FIFO assumption, which is something of a founding design choice in vLLM. Please keep in mind that the scheduler is also targeted by the v1 arch changes (#8779), so it may be hard to plan out big changes to core components before the re-design.
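For illustration, here is a minimal sketch of the "priority queue instead of FIFO" idea mentioned above. The class and helper names are assumptions made for this sketch, not vLLM's actual scheduler code; it only shows how ordering by remaining prefill work would break arrival-order (FIFO) scheduling.

```python
# Hypothetical sketch, NOT vLLM internals: order waiting requests by how many
# prefill chunks they still need, so short requests jump ahead of long ones.
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class PrefillItem:
    remaining_chunks: int                      # sort key: fewer chunks => scheduled first
    arrival_seq: int                           # tie-break to keep ordering deterministic
    request_id: str = field(compare=False)     # excluded from comparison


def schedule_next(queue: list[PrefillItem]) -> PrefillItem | None:
    """Pop the request with the fewest remaining prefill chunks."""
    return heapq.heappop(queue) if queue else None


# A long request (16 chunks) arrives before a short one (1 chunk),
# but the short one is scheduled first -- this is exactly the FIFO break.
q: list[PrefillItem] = []
heapq.heappush(q, PrefillItem(16, 0, "long-req"))
heapq.heappush(q, PrefillItem(1, 1, "short-req"))
assert schedule_next(q).request_id == "short-req"
```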
Thanks @NickLucche for your answer. I clearly shouldn't have typed this up in a hurry. I am well aware of chunked prefill. As I said, it's the only reason I think there is a solution to this problem at all.
Technically it doesn't stall, but to the user it feels like it. Let's say chunked prefill is enabled and a very long sequence comes in (number of tokens a significant multiple of max_num_batched_tokens): every engine step is then spent on chunks of that one prefill until it is done. What I am proposing is a method to handle multiple prefills simultaneously, even if one of them could easily fill up max_num_batched_tokens on its own. I also don't think any of this can just be handled by a priority queue. First of all, there is already a priority mechanism in vLLM, and it doesn't change the fact that a single long prefill occupies the model; also, the proposed method necessarily means looking at multiple requests at the same time when planning the next engine step.
Thanks, I saw that. I'm currently contemplating whether I should base my implementation directly on v1 instead of having to do the whole thing again once v1 comes out. Also, I am very open to other suggestions; I'd much rather not spend all the energy implementing this. But at the moment, using vLLM to serve models in a customer-facing product is actually a bit suboptimal, because no matter how many replicas I deploy and no matter how many GPUs I throw at the problem, a single long request can still slow down response times for other users as long as they are unlucky enough to have their requests routed to the same engine instance.
Thanks for elaborating @schoennenbeck, a lot clearer now, apologies for the misunderstanding!
Totally, I was just naively suggesting ranking inversely based on e.g. the number of chunks a request is split into, but I see now you would have to unify running+pending to avoid prioritizing already running ones. That clearly requires more work. Anyways, I think my input here is not needed; let's wait until a maintainer swings by to comment on adding new scheduling policies. @DarkLight1337 @simon-mo @WoosukKwon
Interesting idea. The current scheduling strategy (first-come-first-served) only prefills the first request; if that request is very large, it blocks everything else. So @schoennenbeck hopes the top_k requests will be prefilled fairly, with top_k = 1 degenerating back to first-come-first-served.
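A rough illustration of this "fair prefill among the top_k waiting requests" framing (the function name and the even-split heuristic are assumptions for this sketch, not vLLM code): split the per-step token budget evenly across the first top_k requests, and top_k = 1 reduces to the current behaviour.

```python
# Hypothetical sketch: divide the per-step prefill budget among the first
# top_k waiting requests instead of giving it all to the first one.
def split_prefill_budget(waiting_token_counts: list[int],
                         max_num_batched_tokens: int,
                         top_k: int) -> list[int]:
    """Return how many prefill tokens each of the first top_k requests gets this step."""
    candidates = waiting_token_counts[:top_k]
    if not candidates:
        return []
    share = max_num_batched_tokens // len(candidates)
    # A request never gets more tokens than it still needs.
    return [min(share, remaining) for remaining in candidates]


# top_k = 1: the long request takes the whole 512-token budget (current behaviour).
print(split_prefill_budget([8000, 300], 512, top_k=1))  # [512]
# top_k = 2: the budget is shared, so the short request finishes its prefill quickly.
print(split_prefill_budget([8000, 300], 512, top_k=2))  # [256, 256]
```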
We're already prototyping a solution within IBM here: #10235. It seems to be working okay so far; we wanted to gather more data internally before really pushing for it to be merged. We'd love more feedback!
@joerunde Thank you so much for the link. I was already wondering if we were the only ones having this problem. Let me know if I can help in any form with the PR. |
@schoennenbeck Yeah, if you wanna try out the changes or give any feedback on what the CLI flags should be named, I'd really appreciate it.
🚀 The feature, motivation and pitch
Motivation
If an engine is currently handling a single long sequence in the prefill stage, any other incoming sequence has to wait until the LLM is done with the long one before it gets to be handled. This means that in a situation with multiple users, a single user's ill-conceived (or simply long) request can easily make the LLM unresponsive for all other users.
Initial ideas
There are a couple of ways one can currently approach this.
However, most of these ideas come with their own problems and/or don't actually solve the problem.
Suggestion
I don't know of any approach that would work without chunked prefill. However, if we do use chunked prefill, the following approach could work (a rough sketch follows the example below):
- Introduce a new parameter min_num_concurrent_sequences (with default set to 1, which is just the current behaviour).
- When planning the next engine step, make sure there are at least min_num_concurrent_sequences sequences that are handled during that step (or, if there aren't enough total sequences, that all of them are handled).

Example
Say min_num_concurrent_sequences=2, max_num_batched_tokens=512, and we have two sequences with 8000 and 300 tokens respectively. Then we would do chunked prefill for both sequences with 256 tokens each.

Expected result
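For illustration, here is a rough simulation of the proposed behaviour; the function name, the even-split heuristic, and the step loop are assumptions made for this sketch, not an actual vLLM implementation. It reproduces the numbers from the example above and shows the short request finishing its prefill after two steps instead of waiting behind the long one.

```python
# Rough simulation of the proposed min_num_concurrent_sequences behaviour.
# All names and the even-split heuristic are assumptions for illustration.
def simulate_prefill(remaining: dict[str, int],
                     max_num_batched_tokens: int = 512,
                     min_num_concurrent_sequences: int = 2) -> None:
    step = 0
    while any(tokens > 0 for tokens in remaining.values()):
        step += 1
        active = [name for name, tokens in remaining.items() if tokens > 0]
        # Prefill up to min_num_concurrent_sequences sequences per step
        # (or all of them if fewer are waiting), splitting the budget evenly.
        # A real implementation would redistribute any unused share.
        batch = active[:max(min_num_concurrent_sequences, 1)]
        share = max_num_batched_tokens // len(batch)
        for name in batch:
            remaining[name] -= min(share, remaining[name])
            if remaining[name] == 0:
                print(f"step {step}: {name} finished prefill")


simulate_prefill({"long (8000 tok)": 8000, "short (300 tok)": 300})
# step 2: short (300 tok) finished prefill
# The long sequence finishes many steps later (step 17 here), but the short
# request no longer has to wait for it.
```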
Implementing this would mean that no single user could block other users from getting their answers in a timely manner. Clearly the long sequence would now take longer to be handled, but it would make for slightly fairer handling of requests. It is still very much possible to get slow answers if the LLM is under high load, but we could service more users at the same time at the cost of higher ITL for each user individually, which I personally think is, in a lot of cases, preferable to one user being serviced fast while everybody else has to wait.
Call for comments
I am currently trying my hand at a prototype implementation (basically because I need this for my use case), but it is hardly trivial. Any thoughts, comments and suggestions are welcome.