[Performance]: Prefix-caching aware scheduling #7883

comaniac · 2024-08-26T21:30:38Z

Proposal to improve performance

The current execution flow with prefix caching is as follows:

Scheduler takes the next prefill sequence:
a. Calculate how many blocks it needs.
b. Check whether we have sufficient number of blocks in the block manager.
c. If so, determine the number of tokens to be prefilled in this batch (it is equal to the prompt length without chunked prefill, or at maximum the chunked size otherwise).
d. Update the batch token budget by subtracting the tokens to be prefilled.
e. Allocate all (regardless how many tokens to prefill in this batch) blocks.
f. Match allocated block IDs with prefix cache, and list them in computed_block_nums.
Prepare input:
a. Get the number of tokens to prefill for this sequence in this batch.
b. Setup input token IDs and positions.
c. If computed_block_nums is not none, then remove the cached tokens from input tokens, and adjust input positions, query length and context length accordingly.
Execute the model.

The inefficiencies are then:

In Step 1.b, we now don't consider prefix caching. Taking a sequence with 16 blocks in prompt as an example, it now requires block manager to have 16 free blocks to be scheduled. However, assuming 12 of 16 blocks are already cached, we actually only need free 4 blocks to schedule this sequence.
In Step 1.d, we now don't consider prefix caching. Assuming the number of batched tokens is set to 2048, and we scheduled 2 sequences with 1024 tokens each. However, if the first 512 prefix tokens are already cached, then the batch size is actually 1024 instead of 2048.

The above inefficiencies come from the fact that we know which blocks are cached starting from Step 1.f. Thus, we propose the following changes:

Improve can_allocate in block manager at Step 1.b to consider prefix caching. For example, although a sequence needs 16 blocks for its prompt, can_allocate could still return True even the block manager only has 4 blocks left when 12 of 16 blocks are already cached.
Improve Step 1.c in the scheduler to consider prefix caching. Specifically, this step should guarantee the number of new tokens to prefill are not cached. If an entire prompt of a sequence is cached, we should only compute the last token.

cc @rkooo567 @sighingnow @Juelianqvq @zhuohan123

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

The text was updated successfully, but these errors were encountered:

Juelianqvq · 2024-08-27T01:54:12Z

Overall it makes sense to me, since the two strategies PC + CP are combined together, a more powerful customized scheduler is on the road naturally. Will take some time dig into it.

rkooo567 · 2024-08-28T06:27:07Z

1.b is not specific to CP, is this correct? I think the proposal makes a lot of sense

comaniac · 2024-08-28T06:29:43Z

Yes 1.b is a general problem.

rickyyx · 2024-10-04T22:19:18Z

I’ve been looking into potential ways to implement this feature and wanted to share some notes and proposals for discussion. Please feel free to correct any misunderstandings.

Background

The current prefix cache reuse operates at the allocator/block level, specifically:

When an immutable block (i.e., a full block) is created or promoted in the allocator, its hash is computed based on two factors:
1. The token IDs, and
2. The block's index in the sequence.
Blocks with the same hash are reused during allocation/creation in a way that is transparent to the caller (block manager/block table).

Issues

While the existing system effectively handles cache reuse transparently for upper components, several issues arise with new features:

Additional Metadata for Cache Reuse:
To handle recent features, more metadata beyond the block index and token IDs should be considered for cache reuse:
- LoRA Request ID: To properly support LoRA, the request ID should also be included in the block hash calculation. This allows differentiation of key-value (KV) caches generated by different LoRA adaptors, even when sequences have the same token IDs.
- Encoder/Decoder Sequences: For encoder-decoder models, we likely need to differentiate between encoder and decoder sequences, even if they contain the same token IDs. It’s possible the current implementation assumes that decoder sequences will always differ from encoder sequences, but this might need further investigation.
Lack of Cache Awareness for Scheduling:
Currently, scheduling cannot determine whether block caches are being reused for a sequence. This lack of visibility makes it impossible to assess whether cache reuse is possible (which seems to be the root cause of the issue mentioned earlier).

Proposal

To make memory management more prefix-aware, I propose the following principle:

The block hash should be treated as metadata associated with the sequence (or block table), rather than being tied to the allocator or the block itself.

While the block could still store this information, it should be treated as an external property (e.g., passed through constructors/setters) rather than being computed within the block. In fact, this seems similar to how the v1 block manager worked.

Potential Implementation:

Compute the block hash once and store it as part of the Sequence.
Provide an API to set the block hash on a block, rather than computing it within PrefixCachingBlock.
When scheduling a sequence, use the block’s stored hash to look up any existing blocks, allowing for efficient scheduling based on cache reuse potential.

comaniac · 2024-10-04T22:23:48Z

Thanks for the write up and it's exactly what I'm thinking. Regarding (1) that stores in the sequence similar to block manager v1, I don't know why we didn't do that in v2. @youkaichao @rkooo567 @cadedaniel can any of you provide some design rationales? Thanks

cadedaniel · 2024-10-05T21:58:43Z

some thoughts that I hope are helpful!

storing the hash info in Sequence is a valid design choice. I was unable to create a design that worked with the requirements CoW x prefix caching x lookahead x swap x sliding window that didn't completely blow up the sequence class. But it could be doable if you drop requirements (seems you can remove CoW and Swapping?). The whole reason the block manager v2 has such design sophistication is because it became very difficult to work in v1 with the many different requirements, leading to bugs and impossible to extend code.
for LoRA, why not just add the Lora ID to the hash function? I think this will compose well with existing prefix caching system, but I might miss some edge cases
for EncDec, I think the cache characteristics are different enough to warrant a completely different block manager implementation.
if I were doing this proposal and didn't want to rewrite the block management system, I would add an API to the block manager which queries how many new blocks would be required for a given sequence to be scheduled. Then you can use this to improve scheduling to include prefix caching hints. I believe this shouldn't be that hard but I might be missing some edge case.

comaniac · 2024-10-06T02:11:10Z

Thanks for the useful information! We will take a deeper look at CoW and swapping.

For LoRA ID, it's doable but needs some refactoring, as LoRA ID currently cannot be directly obtained by block manager.

For the extended API that returns number of new blocks, yes that's one option. The concern here is that we need to use prompt IDs to query this API and it needs to compute the hashes, which becomes an additional hash computation. Maybe we could let the new API return hashes and pass to another new API such as allocate_by_hash().

rickyyx · 2024-12-20T01:57:28Z

Should we close this given the release is out?

I think there could be some further optimizations in v0:

we are still computing block hashes twice (once when adding the request to tracker, once when allocating the block) though -> this could be a different issue from this.
we are still not being prefix cache aware when allocating blocks so there's some under utilization of kv cache (this seems to be fine since it only happens when the KV cache runs out, and the under utilization actually sometimes prevents preemption to happen).

comaniac added help wanted Extra attention is needed performance Performance-related issues labels Aug 26, 2024

comaniac mentioned this issue Aug 26, 2024

[Performance] Enable chunked prefill and prefix caching together #7753

Merged

This was referenced Oct 31, 2024

Prefix Cache Aware Scheduling rickyyx/vllm#1

Draft

[Prototype][WIP] Prefix Cache Aware Scheduling for V0 #9862

Closed

rickyyx mentioned this issue Nov 12, 2024

Prefix Cache Aware Scheduling [1/n] #10128

Merged

comaniac closed this as completed Dec 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Performance]: Prefix-caching aware scheduling #7883

[Performance]: Prefix-caching aware scheduling #7883

comaniac commented Aug 26, 2024 •

edited

Loading

Juelianqvq commented Aug 27, 2024

rkooo567 commented Aug 28, 2024

comaniac commented Aug 28, 2024

rickyyx commented Oct 4, 2024

comaniac commented Oct 4, 2024

cadedaniel commented Oct 5, 2024

comaniac commented Oct 6, 2024 •

edited

Loading

rickyyx commented Dec 20, 2024 •

edited

Loading

[Performance]: Prefix-caching aware scheduling #7883

[Performance]: Prefix-caching aware scheduling #7883

Comments

comaniac commented Aug 26, 2024 • edited Loading

Proposal to improve performance

Before submitting a new issue...

Juelianqvq commented Aug 27, 2024

rkooo567 commented Aug 28, 2024

comaniac commented Aug 28, 2024

rickyyx commented Oct 4, 2024

Background

Issues

Proposal

Potential Implementation:

comaniac commented Oct 4, 2024

cadedaniel commented Oct 5, 2024

comaniac commented Oct 6, 2024 • edited Loading

rickyyx commented Dec 20, 2024 • edited Loading

comaniac commented Aug 26, 2024 •

edited

Loading

comaniac commented Oct 6, 2024 •

edited

Loading

rickyyx commented Dec 20, 2024 •

edited

Loading