[RFC]: Priority Scheduling #6077
Comments
I'm curious to learn more about the details of the user-facing and configuration API.
Currently, we are thinking of passing priority as a single optional variable in the generate function of the LLM engine and specifying the policy (e.g. fcfs, sp, etc.) in the scheduler config options. The priority variable could be replaced in the future with a more general data class if required to support other policies. Open to any other suggestions.
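For concreteness, a rough sketch of how that could look from the user side (the `scheduling_policy` option and the `priority` argument here are assumptions based on this proposal, not a finalized vLLM API):

```python
from vllm import LLM, SamplingParams

# Hypothetical engine configuration: the scheduling policy is chosen once at startup.
llm = LLM(
    model="facebook/opt-125m",
    scheduling_policy="priority",  # assumed option, e.g. "fcfs" (default) or "priority"
)

params = SamplingParams(max_tokens=64)

# Proposed optional per-request priority; in this sketch a higher value means
# more urgent, but the exact convention (and whether it stays a plain integer) is still open.
outputs = llm.generate(
    ["interactive chat prompt ...", "background batch prompt ..."],
    params,
    priority=[10, 0],
)
```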
Something I would like is the ability to set a lower maximum number of concurrent requests for batch requests than for API requests. I would like to maintain a high t/s for users while allowing for a high number of users if necessary.
Can you give more rationale on why you need this inside vLLM? vLLM is an inference engine; its goal is to return results as quickly as possible. In my opinion, it does not make sense to consider request priority here. Request prioritization should happen one layer above vLLM, e.g. in the load-balancing layer or the cross-instance scheduling layer.
Yes, it makes sense to have the majority of the prioritization logic outside vLLM, in a global orchestration framework for load balancing, autoscaling, etc. However, we still need some prioritization support inside vLLM to move high-priority requests from the waiting queue to the running queue. This cannot be done outside vLLM because there is no way to inform the scheduler. Otherwise, high-priority requests would sit in the waiting queue for several seconds due to head-of-line (HOL) blocking (especially with longer contexts).
We have an abort interface, and similarly maybe we can have some kind of cut-in interface. I don't think you need a full-fledged sort of all requests in every step to achieve your goal.
I'd like to know your answers to the questions listed in #6867 (comment) to help further discussion.
Yes, a cut-in interface would also work, so we could eliminate sorting. If we skip sorting the running queue, one of the choices to make with a cut-in interface is whether to maintain FCFS order within each priority. If FCFS order is maintained, high-priority requests need to be inserted into the middle of the running queue (a relatively expensive operation with Python lists); if FCFS is ignored, cut-ins become cheap additions to the front of the queue (see the sketch after this comment). Either option is faster than sorting. Regarding the other questions:
We only need to preempt the minimum number of requests so that sufficient budget is available to schedule the high-priority request.
No, we do not change the priority between preemptions; it is static. However, as you hinted earlier, priorities could be made dynamic by the global orchestration logic outside vLLM, by aborting and re-sending the request with a different priority.
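To make the two cut-in options above concrete, a small illustrative sketch (not vLLM code; `priority_of` is a stand-in for however a request's priority is looked up, and a larger value means more urgent here):

```python
from collections import deque

def cut_in_front(running: deque, seq_group) -> None:
    # Option A: ignore FCFS within a priority class; with a deque the cut-in
    # is an O(1) append to the front of the running queue.
    running.appendleft(seq_group)

def cut_in_keep_fcfs(running: deque, seq_group, priority_of) -> None:
    # Option B: keep FCFS order within each priority class by inserting after the
    # last request of equal or higher priority. O(n) in the worst case, but still
    # cheaper than re-sorting the whole queue every scheduling step.
    for idx, existing in enumerate(running):
        if priority_of(existing) < priority_of(seq_group):
            running.insert(idx, seq_group)
            return
    running.append(seq_group)
```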
Got it.
I appreciate this idea. vLLM should only see requests with static priority.
I have a somewhat different opinion about this. The priority should be a priority in the waiting queue, i.e. when we can schedule a new request, the request with the largest priority is selected. It should not apply to the running queue, i.e. it should not preempt existing running requests. Implementation-wise, I think we should still remove the sorting policy, but move the interface to when we add the request, with a priority argument.
It may still be beneficial to preempt the running queue, because high-priority requests could be blocked at the front of the waiting queue if no token blocks are available. This blocking time could be several seconds or longer if the high-priority request has a large token requirement. On the downside, it increases preemptions and drops throughput for the low-priority requests, but perhaps a configurable flag could be provided to decide between waiting-queue and running-queue cut-in? Then it is not a decision that vLLM needs to make and can be deferred to the global orchestration.
Yes, this idea sounds good, and sorting can be dropped in favor of the cut-in interface.
Thanks for the discussion. I think we might use a priority queue in the future, so that priority scheduling is easier to implement. We don't even need a cut-in interface: just assign priority based on arrival time first, so it is FCFS by default, and if a priority is set for a request, we can increase its priority. One take-away message from our discussion is that we don't need dynamic priority. As the priority of a request is static during its lifespan, per-iteration sorting can be dropped. cc @njhill
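A minimal sketch of that priority-queue idea (illustrative only, not vLLM internals; a larger priority value is served first, with arrival time as the tie-breaker so the default degenerates to FCFS):

```python
import heapq
import itertools
import time

_tiebreak = itertools.count()  # avoids comparing request objects on exact-time ties
waiting: list = []             # heap of (-priority, arrival_time, tiebreak, request)

def add_request(request, priority: int = 0) -> None:
    # heapq pops the smallest key, so the priority is negated: a larger priority
    # is served first; equal priorities fall back to arrival order (FCFS).
    heapq.heappush(waiting, (-priority, time.monotonic(), next(_tiebreak), request))

def pop_next_request():
    _, _, _, request = heapq.heappop(waiting)
    return request
```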
Thanks @youkaichao for the summary. I am wondering if we should wait for the new scheduler code or redo the PR with a priority queue?
@saurabhjha1 which PR do you refer to?
@saurabhjha1 is referring to the corresponding code PR for this design discussion, which is #5958.
I think you should wait for the new scheduler code. There will be a heavy refactor of the scheduler soon, for a more efficient scheduler.
I think sharing the "higher level" scheduling mechanism (as it was referred to earlier in this issue) that I implemented might be beneficial here. Basically, I implemented scheduling logic external to vLLM, which is user-aware and sends requests to vLLM in a calculated manner so each user gets its fair share and no single user can overload the vLLM instance by itself.

To improve on this existing layer, I want to implement "opportunistic" scheduling, so that if one user doesn't use its share of requests, another user can "borrow" it. This leads me to thoughts similar to other comments in this issue: I believe vLLM should support external hooks for preemption and continuation of requests, since I want to avoid aborting and resending requests, which causes redundant prefill computation. I think an interface for preemption and continuation of requests is the most flexible form of external scheduling control, and it should also be possible to implement priority scheduling using these hooks. (Note that by preemption and continuation I do not mean offloading the KV caches of the requests to the CPU; all I mean is that the request should not be scheduled until it is continued.)
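A purely hypothetical sketch of the kind of hooks being asked for (these methods do not exist in vLLM; the names are invented for illustration):

```python
class ExternalSchedulingHooks:
    def pause_request(self, request_id: str) -> None:
        """Stop scheduling this request while keeping its KV cache allocated,
        so no prefill is repeated when it resumes."""
        ...

    def resume_request(self, request_id: str) -> None:
        """Make the paused request eligible for scheduling again."""
        ...
```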
@apatke and @youkaichao, what is the schedule for this PR?
@tonyaw We are working on an updated implementation to work with the refactored scheduler. The implementation should be ready in a few days. Please stay tuned for updates!
That is great! Thanks for your effort! :-)
Motivation.
vLLM supports first-come-first-served scheduling based on the arrival time of requests.
A prioritization mechanism that gives certain requests higher preference in scheduling is useful because it enables:
Co-location of batch and interactive requests: batch and interactive requests can be served on the same vLLM instance. If interactive requests arrive while batch requests are executing, they preempt the batch requests for immediate execution. Once the interactive requests complete, the batch requests can resume. If the KV cache of the batch requests is preserved, the disruption to overall throughput is minimized.
Fairness between multiple interactive requests: recent papers (such as Andes, VTC, and QLM) have proposed mechanisms to maintain fairness between executing requests (e.g. by bounding inter-token latency). Most of these mechanisms can be implemented by dynamically changing the priority of requests in vLLM's waiting and running queues.
Proposed Change.
vLLM already has a pluggable scheduling policy class that can be used to implement priority scheduling, so the overall change is relatively small. The major changes required are:
1. Introduction of a priority field in the sequence group: the priority can be static or dynamic depending on the specific scheduling policy.
2. Sorting the waiting queue by priority: currently, only the running queue is sorted according to the scheduling policy, while the waiting queue is ordered FCFS. The waiting queue ordering can be made policy-dependent as well (a sketch of such a policy follows this list).
3. Jointly sorting the waiting and running queues: even if both queues are sorted independently, priority inversions can still exist between them, so they also need to be sorted jointly. This is not possible without forcefully preempting requests from the running queue and replacing them with requests from the waiting queue.
4. Forced preemption while preserving the KV cache: recomputation after forced preemption leads to repeated work; preserving the KV cache avoids this. It can piggyback on the existing implementation of KV cache swapping (_preempt_by_swap).
5. Dynamically changing priorities: to maintain fairness between executing requests, their priorities can be dynamically adjusted to prevent starvation and bound inter-token latency. For example, the Andes paper adjusts request priorities based on the estimated Quality-of-Experience gain (i.e. the most starved request, with the highest preempted time, gets higher priority). Other policies can similarly be implemented within the generic priority interface.
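As a rough sketch of how a priority policy could plug into the pluggable policy class mentioned above (class and attribute names such as `Policy`, `seq_group.priority`, and `metrics.arrival_time` are assumptions for illustration and may not match the code in the PR):

```python
from collections import deque

class Policy:
    def get_priority(self, now, seq_group):
        raise NotImplementedError

    def sort_by_priority(self, now, seq_groups):
        # Larger key = scheduled earlier; could be applied to both the waiting
        # and running queues (changes 2 and 3 above).
        return deque(sorted(seq_groups,
                            key=lambda sg: self.get_priority(now, sg),
                            reverse=True))

class FCFS(Policy):
    def get_priority(self, now, seq_group):
        # The longer a request has waited, the earlier it is scheduled.
        return now - seq_group.metrics.arrival_time

class StaticPriority(Policy):
    def get_priority(self, now, seq_group):
        # Primary key: the new per-request priority field (change 1);
        # ties are broken FCFS by waiting time.
        return (seq_group.priority, now - seq_group.metrics.arrival_time)
```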
PR #5958 implements changes 1, 2, and 3. In this PR, priority scheduling can be disabled if it is not required or its overhead is unacceptable.
The PR also adds a priority field to the generate function in the LLM engine. In the future, this could be replaced by a more general scheduling metadata structure that allows for other policy-specific variables. Any suggestions/comments on this metadata structure would be helpful.
Feedback Period.
No response
CC List.
@njhill @saurabhjha1
Any Other Things.
No response