[RFC]: Priority Scheduling #6077
Comments
I'm curious to learn more about the details of the user-facing and configuration API.
Currently, we are thinking of passing priority as a single optional variable in the generate function of the LLM engine and specifying the policy (e.g. fcfs, sp, etc.) in the scheduler config options. The priority variable could be replaced in the future with a more general data class if required to support other policies. Open to any other suggestions.
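For concreteness, a rough sketch of how that could look from the user side (the `scheduling_policy` option and the `priority` argument here are assumptions based on this proposal, not a finalized vLLM API):

```python
from vllm import LLM, SamplingParams

# Hypothetical engine configuration: the scheduling policy is chosen once at startup.
llm = LLM(
    model="facebook/opt-125m",
    scheduling_policy="priority",  # assumed option, e.g. "fcfs" (default) or "priority"
)

params = SamplingParams(max_tokens=64)

# Proposed optional per-request priority; in this sketch a higher value means
# more urgent, but the exact convention (and whether it stays a plain integer) is still open.
outputs = llm.generate(
    ["interactive chat prompt ...", "background batch prompt ..."],
    params,
    priority=[10, 0],
)
```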
Something I would like is the ability to set a lower maximum number of concurrent requests for batch requests than for API requests. I would like to maintain a high t/s for users while allowing for a high number of users if necessary.
Can you give more rationale on why you need this inside vLLM? vLLM is an inference engine; its goal is to return results as quickly as possible. In my opinion, it does not make sense to consider request priority here. Request prioritization should happen one layer above vLLM, e.g. in the load-balancing layer or the cross-instance scheduling layer.
Yes, it makes sense to have the majority of the prioritization logic outside vLLM, in a global orchestration framework for load balancing, autoscaling, etc. However, we still need some prioritization support inside vLLM to move high-priority requests from the waiting queue to the running queue. This cannot be done outside vLLM because there is no way to inform the scheduler. Otherwise, high-priority requests would sit in the waiting queue for several seconds due to head-of-line (HOL) blocking (especially with longer contexts).
We have an abort interface, and similarly maybe we can have some kind of cut-in interface. I don't think you need a full-fledged sort of all requests in every step to achieve your goal.
I'd like to know your answers to the questions listed in #6867 (comment) to help further discussion.
Yes, a cut-in interface would also work, so we could eliminate sorting. If we skip sorting the running queue, one of the choices to make with a cut-in interface is whether to maintain FCFS order within each priority. If FCFS order is maintained, high-priority requests need to be inserted into the middle of the running queue (a relatively expensive operation with Python lists); if FCFS is ignored, cut-ins become cheap additions to the front of the queue (see the sketch after this comment). Either option is faster than sorting. Regarding the other questions:
We only need to preempt the minimum number of requests so that sufficient budget is available to schedule the high-priority request.
No, we do not change the priority between preemptions; it is static. However, as you hinted earlier, priorities could be made dynamic by the global orchestration logic outside vLLM, by aborting and re-sending the request with a different priority.
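To make the two cut-in options above concrete, a small illustrative sketch (not vLLM code; `priority_of` is a stand-in for however a request's priority is looked up, and a larger value means more urgent here):

```python
from collections import deque

def cut_in_front(running: deque, seq_group) -> None:
    # Option A: ignore FCFS within a priority class; with a deque the cut-in
    # is an O(1) append to the front of the running queue.
    running.appendleft(seq_group)

def cut_in_keep_fcfs(running: deque, seq_group, priority_of) -> None:
    # Option B: keep FCFS order within each priority class by inserting after the
    # last request of equal or higher priority. O(n) in the worst case, but still
    # cheaper than re-sorting the whole queue every scheduling step.
    for idx, existing in enumerate(running):
        if priority_of(existing) < priority_of(seq_group):
            running.insert(idx, seq_group)
            return
    running.append(seq_group)
```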
Got it.
I appreciate this idea. vLLM should only see requests with static priority.
I have a somewhat different opinion about this. The priority should be a priority in the waiting queue, i.e. when we can schedule a new request, the request with the largest priority is selected. It should not apply to the running queue, i.e. it should not preempt existing running requests. Implementation-wise, I think we should still remove the sorting policy, but move the interface to when we add the request, with a priority argument.
It may still be beneficial to preempt the running queue, because high-priority requests could be blocked at the front of the waiting queue if no token blocks are available. This blocking time could be several seconds or longer if the high-priority request has a large token requirement. On the downside, it increases preemptions and drops throughput for the low-priority requests, but perhaps a configurable flag could be provided to decide between waiting-queue and running-queue cut-in? Then it is not a decision that vLLM needs to make and can be deferred to the global orchestration.
Yes, this idea sounds good, and sorting can be dropped in favor of the cut-in interface.
Thanks for the discussion. I think we might use a priority queue in the future, so that priority scheduling is easier to implement. We don't even need a cut-in interface: just assign priority based on arrival time first, so it is FCFS by default, and if a priority is set for a request, we can increase its priority. One take-away message from our discussion is that we don't need dynamic priority. As the priority of a request is static during its lifespan, per-iteration sorting can be dropped. cc @njhill
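A minimal sketch of that priority-queue idea (illustrative only, not vLLM internals; a larger priority value is served first, with arrival time as the tie-breaker so the default degenerates to FCFS):

```python
import heapq
import itertools
import time

_tiebreak = itertools.count()  # avoids comparing request objects on exact-time ties
waiting: list = []             # heap of (-priority, arrival_time, tiebreak, request)

def add_request(request, priority: int = 0) -> None:
    # heapq pops the smallest key, so the priority is negated: a larger priority
    # is served first; equal priorities fall back to arrival order (FCFS).
    heapq.heappush(waiting, (-priority, time.monotonic(), next(_tiebreak), request))

def pop_next_request():
    _, _, _, request = heapq.heappop(waiting)
    return request
```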
Thanks @youkaichao for the summary. I am wondering if we should wait for the new scheduler code or redo the PR with a priority queue?
@saurabhjha1 which PR do you refer to?
@saurabhjha1 is referring to the corresponding code PR for this design discussion, which is #5958.
I think you should wait for the new scheduler code. There will be a heavy refactor of the scheduler soon, for a more efficient scheduler.
I think sharing the "higher level" scheduling mechanism (as it was referred to earlier in this issue) that I implemented might be beneficial here. Basically, I implemented scheduling logic external to vLLM, which is user-aware and sends requests to vLLM in a calculated manner so each user gets its fair share and no single user can overload the vLLM instance by itself.

To improve on this existing layer, I want to implement "opportunistic" scheduling, so that if one user doesn't use its share of requests, another user can "borrow" it. This leads me to thoughts similar to other comments in this issue: I believe vLLM should support external hooks for preemption and continuation of requests, since I want to avoid aborting and resending requests, which causes redundant prefill computation. I think an interface for preemption and continuation of requests is the most flexible form of external scheduling control, and it should also be possible to implement priority scheduling using these hooks. (Note that by preemption and continuation I do not mean offloading the KV caches of the requests to the CPU; all I mean is that the request should not be scheduled until it is continued.)
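A purely hypothetical sketch of the kind of hooks being asked for (these methods do not exist in vLLM; the names are invented for illustration):

```python
class ExternalSchedulingHooks:
    def pause_request(self, request_id: str) -> None:
        """Stop scheduling this request while keeping its KV cache allocated,
        so no prefill is repeated when it resumes."""
        ...

    def resume_request(self, request_id: str) -> None:
        """Make the paused request eligible for scheduling again."""
        ...
```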
@apatke and @youkaichao, what is the schedule for this PR?
@tonyaw We are working on an updated implementation to work with the refactored scheduler. The implementation should be ready in a few days. Please stay tuned for updates!
That is great! Thanks for your effort! :-)
Motivation.
vLLM supports first-come-first-served scheduling based on the arrival time of requests.
A prioritization mechanism that gives certain requests higher preference in scheduling is useful because it enables:
Co-location of batch and interactive requests: batch and interactive requests can be served on the same vLLM instance. If interactive requests arrive while batch requests are executing, they preempt the batch requests for immediate execution. Once the interactive requests complete, the batch requests can resume. If the KV cache of the batch requests is preserved, the disruption to overall throughput is minimized.
Fairness between multiple interactive requests: recent papers (such as Andes, VTC, and QLM) have proposed mechanisms to maintain fairness between executing requests (e.g. by bounding inter-token latency). Most of these mechanisms can be implemented by dynamically changing the priority of requests in vLLM's waiting and running queues.
Proposed Change.
vLLM already has a pluggable scheduling policy class that can be used to implement priority scheduling, so the overall change is relatively small. The major changes required are:
1. Introduction of a priority field in the sequence group: the priority can be static or dynamic depending on the specific scheduling policy.
2. Sorting the waiting queue by priority: currently, only the running queue is sorted according to the scheduling policy, while the waiting queue is ordered FCFS. The waiting queue ordering can be made policy-dependent as well (a sketch of such a policy follows this list).
3. Jointly sorting the waiting and running queues: even if both queues are sorted independently, priority inversions can still exist between them, so they also need to be sorted jointly. This is not possible without forcefully preempting requests from the running queue and replacing them with requests from the waiting queue.
4. Forced preemption while preserving the KV cache: recomputation after forced preemption leads to repeated work; preserving the KV cache avoids this. It can piggyback on the existing implementation of KV cache swapping (_preempt_by_swap).
5. Dynamically changing priorities: to maintain fairness between executing requests, their priorities can be dynamically adjusted to prevent starvation and bound inter-token latency. For example, the Andes paper adjusts request priorities based on the estimated Quality-of-Experience gain (i.e. the most starved request, with the highest preempted time, gets higher priority). Other policies can similarly be implemented within the generic priority interface.
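As a rough sketch of how a priority policy could plug into the pluggable policy class mentioned above (class and attribute names such as `Policy`, `seq_group.priority`, and `metrics.arrival_time` are assumptions for illustration and may not match the code in the PR):

```python
from collections import deque

class Policy:
    def get_priority(self, now, seq_group):
        raise NotImplementedError

    def sort_by_priority(self, now, seq_groups):
        # Larger key = scheduled earlier; could be applied to both the waiting
        # and running queues (changes 2 and 3 above).
        return deque(sorted(seq_groups,
                            key=lambda sg: self.get_priority(now, sg),
                            reverse=True))

class FCFS(Policy):
    def get_priority(self, now, seq_group):
        # The longer a request has waited, the earlier it is scheduled.
        return now - seq_group.metrics.arrival_time

class StaticPriority(Policy):
    def get_priority(self, now, seq_group):
        # Primary key: the new per-request priority field (change 1);
        # ties are broken FCFS by waiting time.
        return (seq_group.priority, now - seq_group.metrics.arrival_time)
```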
PR #5958 implements changes 1, 2, and 3. In this PR, priority scheduling can be disabled if it is not required or its overhead is unacceptable.
The PR also adds a priority field to the generate function in the LLM engine. In the future, this could be replaced by a more general scheduling metadata structure that allows for other policy-specific variables. Any suggestions/comments on this metadata structure would be helpful.
Feedback Period.
No response
CC List.
@njhill @saurabhjha1
Any Other Things.
No response