
[Feature]: Unblock LLM while handling long sequences / Handling multiple prefills at the same time #10774

schoennenbeck opened this issue Nov 29, 2024 · 7 comments


@schoennenbeck
Contributor

schoennenbeck commented Nov 29, 2024

🚀 The feature, motivation and pitch

Motivation

If an engine is currently handling a single long sequence in the prefill stage, any other incoming sequence has to wait until the LLM is done with the long one before it gets handled. This means that in a situation with multiple users it can easily happen that a single user's ill-conceived (or simply long) request makes the LLM unresponsive for all other users.

Initial ideas

There are a couple of ways one can currently approach this.

  • Simply accept this fact: we do first come, first served and people have to wait.
  • Upscale the LLM and host multiple replicas, or scale a single replica over multiple GPUs, to alleviate this a little.
  • Use priority scheduling and give longer requests lower priority.

However, most of these ideas come with their own problems and/or don't actually solve the underlying issue.

Suggestion

I don't know of any approach that would work without chunked prefill. However, if chunked prefill is enabled, the following approach could work:

  • Introduce a new engine parameter min_num_concurrent_sequences (default 1, which is exactly the current behaviour).
  • While scheduling, first schedule decodes (as is currently the case), as these take only a single token each from the budget.
  • Spread the remaining tokens over enough chunked prefills to ensure that at least min_num_concurrent_sequences sequences are handled during the next step (or, if there aren't that many sequences in total, that all of them are).

Example

Say min_num_concurrent_sequences=2, max_num_batched_tokens=512 and we have two sequences with 8000 and 300 tokens respectively. Then we would do chunked prefill for both sequences with 256 tokens each.
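
A minimal sketch of the budget-splitting step, just to make the idea concrete (the parameter name min_num_concurrent_sequences comes from the proposal above; the helper itself is hypothetical and not part of vLLM):

```python
def split_prefill_budget(remaining_budget: int,
                         prefill_lengths: list[int],
                         min_num_concurrent_sequences: int) -> list[int]:
    """Spread the token budget left over after decodes across enough
    prefills so that at least min_num_concurrent_sequences sequences
    make progress in the next engine step."""
    if remaining_budget <= 0 or not prefill_lengths:
        return []
    # Never try to serve more prefills than actually exist.
    n = min(min_num_concurrent_sequences, len(prefill_lengths))
    per_seq = remaining_budget // n
    # A sequence never needs more tokens than it has left to prefill.
    return [min(per_seq, length) for length in prefill_lengths[:n]]

# The example above: budget 512, prefills of 8000 and 300 tokens, min 2.
print(split_prefill_budget(512, [8000, 300], 2))  # -> [256, 256]
```

A real implementation would also have to redistribute any budget left unused by short sequences and respect KV-cache block constraints, but the core idea is just this even split.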

Expected result

Implementing this would mean that no single user could block other users from getting their answers in a timely manner. Clearly the long sequence would now take longer to be handled, but it would make for slightly fairer handling of requests. It is still very much possible to get slow answers if the LLM is under high load, but we could service more users at the same time at the cost of higher ITL for each user individually, which I personally think is, in a lot of cases, preferable to one user being serviced fast while everybody else has to wait.

Call for comments

I am currently trying my hand at a prototype implementation (basically because I need this for my use case), but it is hardly trivial. Any thoughts, comments and suggestions are welcome.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@NickLucche
Contributor

Hey, thanks for taking the time to share this.
I am not sure I understood your proposal correctly; at least in online/async mode the engine does not stall on multiple reqs.

Have you given enable-chunked-prefill a try?
With it enabled, the scheduling policy switches to decode-first (the usual default is prefill-first to optimize TTFT), and you can set the number of sequences with --max-num-seqs.
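
For completeness, roughly the offline-API equivalent of those flags (the model name and the concrete numbers here are just placeholders, not recommendations):

```python
from vllm import LLM, SamplingParams

# Placeholder model and limits; tune max_num_batched_tokens / max_num_seqs
# to your hardware and workload.
llm = LLM(
    model="facebook/opt-125m",
    enable_chunked_prefill=True,   # chunk long prefills across engine steps
    max_num_batched_tokens=512,    # token budget per engine step
    max_num_seqs=64,               # cap on sequences scheduled per step
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```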

In case you're referring to a more elaborate policy where you're actively discouraging long-running requests, this is not implemented afaik and is indeed non-trivial, requiring modifications to https://github.com/vllm-project/vllm/blob/main/vllm/core/scheduler.py#L507 (the chunked prefill policy being one example of how to do it). You would probably want the running request queue to be a priority queue, but that would break the FIFO assumption that is kind of a founding design choice in vLLM.

Please keep in mind the scheduler is also targeted in the v1 arch changes #8779, so it may be harder to plan out big changes to core components before the re-design.

@schoennenbeck
Contributor Author

Thanks @NickLucche for your answer. I clearly shouldn't have typed this up in a hurry.

I am well aware of chunked prefill. As I said, it's the only reason I think there is a solution to this problem at all.

at least in online/async mode the engine does not stall on multiple reqs

Technically it doesn't stall, but for the user it feels like it. Let's say chunked prefill is enabled and a very long sequence comes in (with a number of tokens that is a significant multiple of max_num_batched_tokens). Any sequence that comes in afterwards has to hang around in pending until the long sequence's prefill stage is complete. As soon as the long sequence enters the decode stage there is enough room in the batch to handle the prefill stage for the sequences in pending. For a user sending a short request in the meantime, it looks like the LLM is stuck because the TTFT is very long. None of this is in any way affected by setting max_num_seqs. As long as there is one lone sequence in the prefill stage, the LLM can't really do much else (though it can do decodes when in chunked-prefill mode, which is already great).
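
To put rough numbers on this (the prompt length and per-step latency below are illustrative assumptions, not measurements):

```python
import math

max_num_batched_tokens = 512   # token budget per engine step
long_prompt_tokens = 40_000    # one very long incoming request
step_time_s = 0.05             # assumed time per engine step

# With plain chunked prefill, the long request alone occupies this many
# prefill steps before any later request's prefill can be scheduled:
blocking_steps = math.ceil(long_prompt_tokens / max_num_batched_tokens)
print(blocking_steps)  # 79 steps, i.e. roughly 4 s of added TTFT here
print(f"extra TTFT for the next user: ~{blocking_steps * step_time_s:.1f}s")
```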

What I am proposing is a method to be able to handle multiple prefills simultaneously - even if one of them could easily fill up max_num_batched_tokens on its own - by distributing the token budget more evenly over the current and pending prefills. I am aware that this breaks FIFO but as I mentioned I believe there are a lot of situations where slower handling for long sequences is well worth the significantly improved response time for all other users who interact with the model at the same time. (Basically a min-response vs median-response trade-off).

I also don't think any of this can be handled by a priority queue alone. First of all, there is already a priority mechanism in vLLM, and it doesn't change the fact that a single long prefill occupies the model. Moreover, the proposed method necessarily means looking at multiple requests at the same time when planning the next engine step.

Please keep in mind the scheduler is also targeted in the v1 arch changes #8779, so it may be harder to plan out big changes to core components before the re-design.

Thanks, I saw that. I am currently contemplating whether I should base my implementation directly on v1 instead of having to do the whole thing again once v1 comes out.

Also, I am very open to other suggestions; I'd much rather not spend all the energy implementing this. But at the moment, using vLLM to serve models in a customer-facing product is a bit suboptimal, because no matter how many replicas I deploy and no matter how many GPUs I throw at the problem, a single long request can still slow down response times for other users as long as they are unlucky enough to have their request routed to the same engine instance.

@NickLucche
Contributor

NickLucche commented Nov 29, 2024

Thanks for elaborating @schoennenbeck, a lot clearer now, apologies for the misunderstanding!
I think new scheduling policies are always worth taking into consideration, and you bring up a very real use case; unfortunately, adding a new policy is not seamless code-wise yet, so I would advise basing it on v1 and hoping the new abstraction makes the coding work a lot less painful.

I also don't think any of this can just be handled by a priority queue..single long prefill occupies the model and also the proposed method necessarily means looking at multiple request at the same time when planning the next engine step.

Totally, I was just naively suggesting to rank requests inversely by e.g. the number of chunks they are split into, but I see now you would have to unify running+pending to avoid prioritizing already running ones. But this clearly requires more work.
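
A naive sketch of that ranking, purely as an illustration (the Req dataclass and its remaining_tokens field are made up for the example and are not existing vLLM structures):

```python
import math
from dataclasses import dataclass

@dataclass
class Req:
    req_id: str
    remaining_tokens: int   # prompt tokens not yet prefilled

def remaining_chunks(req: Req, max_num_batched_tokens: int) -> int:
    """How many more chunked-prefill steps this request still needs."""
    return math.ceil(req.remaining_tokens / max_num_batched_tokens)

# Unify running and pending prefills, then rank so that requests with the
# fewest remaining chunks go first and one huge request can't starve the rest.
running = [Req("long", 7744)]
pending = [Req("short", 300)]
ranked = sorted(running + pending, key=lambda r: remaining_chunks(r, 512))
print([r.req_id for r in ranked])   # ['short', 'long']
```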

Anyway, I think my input here is not needed; let's wait until a maintainer swings by to comment on adding new scheduling policies @DarkLight1337 @simon-mo @WoosukKwon.

@noooop
Contributor

noooop commented Nov 30, 2024

Interesting idea.

The current scheduling strategy (first-come-first-served) only prefills the first request, so if that request is very large it blocks everything else.

So @schoennenbeck hopes the top_k requests will be prefilled fairly; top_k = 1 degenerates to first-come-first-served.

@joerunde
Collaborator

We're already prototyping a solution within IBM here: #10235

Seems to be working okay so far; we wanted to gather more data internally before really pushing for it to be merged. We'd love more feedback!

@schoennenbeck
Contributor Author

@joerunde Thank you so much for the link. I was already wondering if we were the only ones having this problem. Let me know if I can help in any form with the PR.

@joerunde
Collaborator

@schoennenbeck Yeah, if you want to try out the changes or give any feedback on what the CLI flags should be named, I'd really appreciate it.
