[RFC]: Priority Scheduling #6077

Closed
apatke opened this issue Jul 2, 2024 · 20 comments · Fixed by #5958

@apatke
Contributor

apatke commented Jul 2, 2024

Motivation.

vLLM supports first-come-first-served scheduling based on the arrival time of requests.

A prioritization mechanism that gives certain requests higher preference in scheduling is useful because it enables:

  1. Co-location of batch and interactive requests: Batch and interactive requests can be served on the same vLLM instance. If interactive requests arrive while batch requests are executing, they preempt the batch requests for immediate execution. Once the interactive requests are completed, the batch requests can resume. If the KV cache for batch requests is preserved, disruption to the overall throughput can be minimized.

  2. Maintaining fairness between multiple interactive requests: Recent papers (such as Andes, VTC, and QLM) have proposed mechanisms to maintain fairness between multiple executing requests (e.g., by maintaining inter-token latency). Most of these mechanisms can be implemented by dynamically changing the priority of requests in the waiting and running queues of vLLM.

Proposed Change.

vLLM already has a pluggable scheduling policy class to implement priority scheduling. Hence, the overall change is relatively minimal. These are the major changes required:

  1. Introduction of a priority field in the sequence group: This priority can be static or dynamic based on the specific scheduling policy.

  2. Waiting queue sorting based on priority: Currently, only the running queue is sorted based on the scheduling policy, and the waiting queue is ordered by FCFS. The waiting queue ordering can be made dependent on the policy.

  3. Jointly sorting the waiting and running queues: Even if both queues are sorted independently, priority inversions can still exist between them. Therefore, they also need to be sorted jointly, which is only possible by forcefully preempting requests out of the running queue and replacing them with higher-priority requests from the waiting queue (see the sketch after this list).

  4. Enabling forced preemption while preserving the KV cache: Recomputation after forced preemption repeats work that was already done; preserving the KV cache prevents this. The preservation can piggyback on the existing KV cache swapping implementation (_preempt_by_swap).

  5. Dynamically changing priorities: To maintain fairness between executing requests, their priorities can be dynamically adjusted to prevent starvation and maintain inter-token latency. For example, the Andes paper adjusts request priorities based on estimated Quality of Experience gain (i.e. the most starved request with high preempted time gets higher priority). Similarly, other policies can be implemented within the generic priority interface.
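To make items 1–3 concrete, here is a minimal, self-contained sketch of priority-aware queue ordering and joint inversion resolution. The names (`Request`, `sort_by_priority`, `resolve_inversions`) and the convention that a lower value means higher priority are assumptions for illustration, not the actual vLLM classes or the implementation in PR #5958.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    request_id: str
    priority: int        # assumption: lower value = higher priority
    arrival_time: float


def _key(r: Request) -> Tuple[int, float]:
    # Priority first, arrival time second, so FCFS is kept within a priority level.
    return (r.priority, r.arrival_time)


def sort_by_priority(queue: List[Request]) -> List[Request]:
    """Items 1-2: order a queue by (priority, arrival_time)."""
    return sorted(queue, key=_key)


def resolve_inversions(waiting: List[Request],
                       running: List[Request]) -> Tuple[List[Request], List[Request]]:
    """Item 3: while the best waiting request outranks the worst running one,
    swap them (a forced preemption; item 4 would swap out the preempted
    request's KV cache rather than recompute it)."""
    waiting, running = sort_by_priority(waiting), sort_by_priority(running)
    while waiting and running and _key(waiting[0]) < _key(running[-1]):
        waiting[0], running[-1] = running[-1], waiting[0]
        waiting, running = sort_by_priority(waiting), sort_by_priority(running)
    return waiting, running
```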

PR #5958 implements 1, 2, and 3. In this PR, priority scheduling can be disabled if it is not required or its overhead is unacceptable.

The PR also adds a priority field to the generate function in the LLM engine. In the future, this could be replaced by a more general scheduling metadata structure to allow for other policy-specific variables. Any suggestions/comments on this metadata structure would be helpful.
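For the metadata discussion, one possible shape for such a structure is sketched below. This is purely hypothetical; none of these field names come from PR #5958.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SchedulingMetadata:
    """Hypothetical per-request scheduling metadata (illustrative only)."""
    priority: int = 0                  # static priority; the default tier behaves as FCFS
    deadline: Optional[float] = None   # optional SLO deadline for deadline-aware policies
    user_id: Optional[str] = None      # tenant identifier for fairness policies such as VTC
```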

Feedback Period.

No response

CC List.

@njhill @saurabhjha1

Any Other Things.

No response

@apatke apatke added the RFC label Jul 2, 2024
@simon-mo
Collaborator

simon-mo commented Jul 2, 2024

I'm curious to learn more about the details of user facing and configuration API

@apatke
Contributor Author

apatke commented Jul 2, 2024

Currently, we are thinking of passing priority as a single optional variable in the generate function of the LLM engine and specifying the policy (like fcfs, sp, etc.) in the scheduler config options. The priority variable could be replaced in the future with a more general data class if required to support other policies. Open to any other suggestions.
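For illustration, the user-facing API being considered could look roughly like the following; the exact argument names (`scheduling_policy`, `priority`) are assumptions based on this discussion, not a confirmed vLLM API.

```python
from vllm import LLM, SamplingParams

# Assumption: the policy is chosen in the scheduler config and a per-request
# priority is passed to generate(); lower value = higher priority.
llm = LLM(model="facebook/opt-125m", scheduling_policy="priority")
params = SamplingParams(max_tokens=64)

# Batch job: low priority.
llm.generate(["Summarize this document ..."], params, priority=[10])

# Interactive request: high priority, scheduled ahead of the batch job.
llm.generate(["What is the capital of France?"], params, priority=[0])
```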

@w013nad

w013nad commented Jul 4, 2024

Something I would like is the ability to set a lower maximum number of concurrent requests for batch requests vs. API requests. I would like to maintain a high t/s for users while allowing for high numbers of users if necessary.

@youkaichao
Member

Can you give more rationale on why you need this inside vLLM?

vLLM is an inference engine; its goal is to give results as quickly as possible. It does not make any sense to consider request priority here in my opinion.

Request prioritization should happen one layer above vLLM, e.g. in the load-balancing layer and the cross-instance scheduling layer.

@apatke
Contributor Author

apatke commented Jul 30, 2024

Yes, it makes sense to have the majority of the prioritization logic outside vLLM in a global orchestration framework for load balancing/autoscaling/etc.

However, we still need some prioritization support inside vLLM to move high-priority requests from the waiting queue to the running queue. This cannot be done outside vLLM, as there is no way to inform the scheduler; without it, high-priority requests would sit in the waiting queue for several seconds due to HOL blocking (especially with longer contexts).

@youkaichao
Member

We have an abort interface, and similarly maybe we can have a cut-in interface to add certain requests with high priority.

I don't think you need a full-fledged sorting of all requests, in every step, to achieve your goal.

@youkaichao
Member

I'd like to know your answer to the questions listed in #6867 (comment) to help further discussion.

For example, if a request joins with the highest priority, do you need to stop or preempt the running requests? When you swap out a request and swap it in again, do you change its priority? Is request priority a constant (determined when received), or can it be dynamic and change every iteration?

@apatke
Contributor Author

apatke commented Jul 31, 2024

Yes, a cut-in interface would also work, so we could eliminate sorting.

If we skip sorting the running queue, one choice to make with a cut-in interface is whether to maintain FCFS order within each priority level. If FCFS order is maintained, high-priority requests need to be inserted into the middle of the running queue (a relatively expensive operation with Python lists); if FCFS is ignored, cut-ins become cheap additions to the front of the queue (see the sketch below). Either way, both options are faster than sorting.
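A small sketch of the two cut-in options (illustrative, not vLLM code); the `Request` type and the lower-value-is-higher-priority convention are assumptions:

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    priority: int  # assumption: lower value = higher priority


def cut_in_keep_fcfs(running: deque, request: Request) -> None:
    """(a) Keep FCFS within each priority level: scan for the first queued
    request with lower priority and insert before it -- O(n)."""
    for i, queued in enumerate(running):
        if queued.priority > request.priority:
            running.insert(i, request)
            return
    running.append(request)


def cut_in_front(running: deque, request: Request) -> None:
    """(b) Ignore FCFS for cut-ins: O(1) push to the front of the queue."""
    running.appendleft(request)
```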

Regarding the other questions:

if a request joins, with highest priority, do you need to stop or preempt the running requests?

We only need to preempt the minimum number of requests such that sufficient budget is available to schedule the high-priority request.

When you swap out a request and swap it in again, do you change its priority?
Is request priority a constant (determined when received), or can it be dynamic and change every iteration?

No, we do not change the priority between preemptions; it is static. However, as you hinted earlier, the priorities could be made dynamic by the global orchestration logic sitting outside vLLM, by aborting and re-sending the request with a different priority.

@youkaichao
Member

Got it.

the priorities could be made dynamic by the global orchestration logic sitting outside vLLM, by aborting and re-sending the request with a different priority.

I appreciate this idea. vLLM should only see requests with static priority.

We only need to preempt the minimum number of requests such that sufficient budget is available to schedule the high-priority request.

I have a somewhat different opinion about this. The priority should be a priority within the waiting queue, i.e. when we can schedule a new request, the request with the highest priority will be selected. It should not apply to the running queue, i.e. it should not preempt existing running requests.

Implementation-wise, I think we should still remove the sorting policy, but move the interface to when we add the request, with a priority field somewhere. None means FCFS.
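A minimal sketch of this interface, assuming a heap-backed waiting queue; the names (`add_request`, `pop_next_request`) and the lower-value-is-higher-priority convention are hypothetical, not vLLM's actual API.

```python
import heapq
import itertools
from typing import List, Optional, Tuple

_DEFAULT_PRIORITY = 0             # priority=None falls into the default tier
_counter = itertools.count()      # arrival-order tie-breaker => FCFS within a tier
_waiting: List[Tuple[int, int, str]] = []   # min-heap of (priority, arrival_order, request_id)


def add_request(request_id: str, priority: Optional[int] = None) -> None:
    """Priority is fixed when the request is added; None means plain FCFS."""
    tier = _DEFAULT_PRIORITY if priority is None else priority
    heapq.heappush(_waiting, (tier, next(_counter), request_id))


def pop_next_request() -> str:
    """Called only when the scheduler has budget for a new request;
    running requests are never preempted under this scheme."""
    _, _, request_id = heapq.heappop(_waiting)
    return request_id
```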

@apatke
Contributor Author

apatke commented Jul 31, 2024

The priority should be a priority within the waiting queue, i.e. when we can schedule a new request, the request with the highest priority will be selected. It should not apply to the running queue, i.e. it should not preempt existing running requests.

It may still be beneficial to preempt the running queue, because high-priority requests could be blocked at the front of the waiting queue if no token blocks are available. This blocking time could be several seconds or longer if the high-priority request has a large token requirement.

On the downside, this increases preemptions and drops throughput for the low-priority requests, but perhaps a configurable flag could be provided to decide between waiting-queue and running-queue cut-in? That way it is not a decision that vLLM needs to make and can be deferred to the global orchestration layer.

Implementation-wise, I think we should still remove the sorting policy, but move the interface to when we add the request, with a priority field somewhere. None means FCFS.

Yes, this idea sounds good and sorting can be dropped in favor of the cut-in interface.

@youkaichao
Member

Thanks for the discussion. I think we might use a priority queue in the future, so that priority scheduling becomes easier. We don't even need a cut-in interface: just assign priority based on arrival time first, so it is FCFS by default, and if a priority is set for a request, we can increase its priority.

One take-away message from our discussion is that we don't need dynamic priority. Since the priority of a request is static during its lifespan, per-iteration sorting can be dropped.

cc @njhill

@saurabhjha1

Thanks @youkaichao for the summary. I am wondering if we should wait for the new scheduler code or redo the PR with a priority queue?

@youkaichao
Member

@saurabhjha1 which PR do you refer to?

@apatke
Contributor Author

apatke commented Aug 1, 2024

@saurabhjha1 is referring to the corresponding code PR for this design discussion, which is #5958.

@youkaichao
Member

I think you should wait for the new scheduler code. There will be a heavy refactor of the scheduler soon, to make it more efficient.

@NadavShmayo
Contributor

I think sharing the "higher level" scheduling mechanism (as it is referred to earlier in this issue) that I implemented might be beneficial here:

Basically, I implemented scheduling logic external to vLLM, which is user-aware and sends requests to vLLM in a calculated manner so that each user gets its fair share and no single user can overload the vLLM instance by itself.

Now to improve on this existing layer, I want to implement "opportunistic" scheduling, so that if one user doesn't use its share of requests, another user would be able to "borrow" it.
To implement this correctly, I have to consider the case in which one user "borrows" from another, but then the user that didn't use its requests now sends them, so we have to preempt the former user's requests.

This leads me to thoughts similar to other comments in this issue: I believe vLLM should support external hooks for preemption and continuation of requests, as I want to avoid aborting and resending the request, since that causes multiple redundant prefill calculations.

I think an interface for preemption and continuation of requests is the most flexible form of external scheduling control, and it should be possible to also implement priority scheduling using these hooks.

(Note that by preemption and continuation I do not mean offloading the KV caches of the requests to the CPU; all I mean is that the request shouldn't be scheduled until it is continued.)
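For concreteness, here is a rough sketch of what such preemption/continuation hooks could look like; this is purely hypothetical and not an existing vLLM interface.

```python
from typing import Set


class SchedulingHooks:
    """Hypothetical external control surface: a paused request is simply
    skipped by the scheduler; its KV cache stays wherever it already is."""

    def __init__(self) -> None:
        self._paused: Set[str] = set()

    def preempt(self, request_id: str) -> None:
        """Stop scheduling this request until continue_() is called."""
        self._paused.add(request_id)

    def continue_(self, request_id: str) -> None:
        """Make the request schedulable again."""
        self._paused.discard(request_id)

    def is_schedulable(self, request_id: str) -> bool:
        # The scheduler would consult this before picking a request to run.
        return request_id not in self._paused
```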

@tonyaw

tonyaw commented Aug 30, 2024

@apatke and @youkaichao, what is the planned schedule for this PR?
We need this feature very much.
If you can share your draft code for the scheduler refactoring, that would also be perfect!
Thanks in advance! :-)

@apatke
Contributor Author

apatke commented Aug 30, 2024

@tonyaw We are working on an updated implementation to work with the refactored scheduler. The implementation should be ready in a few days. Please stay tuned for updates!
@saurabhjha1 @njhill

@tonyaw

tonyaw commented Aug 30, 2024

That is great! Thanks for your effort! :-)

@tonyaw

tonyaw commented Sep 9, 2024

@apatke, may I ask the official schedule for your PR #5958? :-)
It is the PR for the refactored scheduler, right?
