Prefix Cache Aware Scheduling [1/n] #10128
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Force-pushed from d92de73 to 393c42e
    return num_uncached_new_tokens_seq, num_cached_new_tokens_seq

def _chunk_new_tokens_to_schedule(
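The hunk above returns a sequence's new tokens split into uncached and cached counts. As a minimal sketch of how such a split could work (the function name, parameters, and windowing logic here are assumptions for illustration, not the PR's actual implementation):

```python
def split_cached_uncached(num_new_tokens: int,
                          num_computed_tokens: int,
                          num_cached_tokens_total: int) -> tuple[int, int]:
    """Split a sequence's new tokens into (uncached, cached) counts.

    num_computed_tokens: tokens already computed in earlier scheduling steps.
    num_cached_tokens_total: tokens of this sequence (counted from the start
    of the prompt) whose KV blocks are already in the prefix cache.
    """
    # Cached tokens that fall inside the new-token window
    # [num_computed_tokens, num_computed_tokens + num_new_tokens).
    num_cached_new = max(
        0,
        min(num_cached_tokens_total, num_computed_tokens + num_new_tokens)
        - num_computed_tokens,
    )
    num_uncached_new = num_new_tokens - num_cached_new
    return num_uncached_new, num_cached_new
```

For example, with 100 new tokens, none computed yet, and 48 tokens cached, only 52 tokens would need to be computed while 48 are served from the cache.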
This is mainly a refactor of the logic from main.
Great work! Left some comments.
Also cc @zhuohan123 @alexm-neuralmagic
This pull request has merge conflicts that must be resolved before it can be merged.
Updates
Force-pushed from 49221fe to 6ada6fd
This pull request has merge conflicts that must be resolved before it can be merged.
lol - fml, i hate DCO.
Force-pushed from f2a9884 to b3fa9d6
LGTM. The remaining issue per offline discussion is to make sure each scheduled sequence has at least one token in the budget.
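One hedged sketch of the guard described above, i.e. making sure a fully cached sequence still claims at least one token of scheduling budget (the function and its signature are hypothetical, not the PR's code):

```python
def ensure_min_one_uncached(num_uncached_new: int,
                            num_cached_new: int) -> tuple[int, int]:
    """Guarantee a scheduled sequence gets at least one token in the budget.

    If every new token hit the prefix cache, recompute the last one so the
    scheduler still allocates a nonzero token budget to this sequence.
    """
    if num_uncached_new == 0 and num_cached_new > 0:
        return 1, num_cached_new - 1
    return num_uncached_new, num_cached_new
```

The design point is that a sequence whose entire prompt is cached would otherwise request zero tokens and could be starved or mishandled by the budget accounting.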
cc @zhuohan123 for another review.
Updates
Force-pushed from 4e11a24 to 8c38100
Signed-off-by: rickyx <rickyx@anyscale.com>
TL;DR
Background
With the current implementation in main, there are at least two places where scheduling is not optimal:
This results in under-utilization of the KV cache and suboptimal scheduling decisions for a batch.
For more details, see #7883.
High Level Approach
This PR addresses the issue by solving only (1):
ComputedBlocksTracker will track the block hashes for a sequence and tell the scheduler/block manager how many tokens of a given sequence have already been cached. At a high level, the major changes are below.
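To make the idea concrete, here is a minimal sketch of such a tracker, assuming a fixed block size and chained block hashing; the class shape, hashing scheme, and method names are assumptions for illustration, not vLLM's actual API:

```python
class ComputedBlocksTracker:
    """Sketch: hash a sequence's token ids block by block and count how many
    leading (prefix) blocks are already present in the prefix cache."""

    def __init__(self, cached_block_hashes: set[int], block_size: int = 16):
        self._cached = cached_block_hashes  # hashes of blocks already in cache
        self._block_size = block_size       # tokens per KV block (assumed)

    def _block_hashes(self, token_ids: list[int]):
        # Chain each block's hash with the previous one so a block hash
        # identifies the whole prefix up to and including that block.
        prev = None
        full_len = len(token_ids) - len(token_ids) % self._block_size
        for i in range(0, full_len, self._block_size):
            prev = hash((prev, tuple(token_ids[i:i + self._block_size])))
            yield prev

    def get_num_cached_tokens(self, token_ids: list[int]) -> int:
        # Count tokens covered by the longest cached prefix of full blocks.
        num_cached = 0
        for h in self._block_hashes(token_ids):
            if h not in self._cached:
                break
            num_cached += self._block_size
        return num_cached
```

With this, the scheduler would ask the tracker for `get_num_cached_tokens(seq)` and only budget compute for the remaining uncached tokens.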
Throughput Benchmark
Example command
More details in this doc
Serving Benchmark
Serving Results on QPS=10
Serving Results on QPS=15
With a higher request rate (QPS=15), the improvement in TTFT and request rate is more significant (25%).
Serving Results with No Prefix Sharing