[Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching #8240
Conversation
vllm/core/block_manager_v2.py (outdated)

@@ -149,7 +149,10 @@ def _allocate_sequence(self, seq: Sequence) -> BlockTable:
            block_allocator=self.block_allocator,
            max_block_sliding_window=self.max_block_sliding_window,
        )
        block_table.allocate(seq.get_token_ids())

        contextual_hash = hash((seq.prompt_adapter_id, seq.lora_int_id))
Inject the hash value computed from the prompt adapter ID and LoRA ID; it will be folded into the hash of the prefix caching block.
(outdated)

        Returns:
            int: The computed hash value for the block.
        """
        assert (prev_block_hash is None) == is_first_block
-       return hash((is_first_block, prev_block_hash, *cur_block_token_ids))
+       return hash((is_first_block, prev_block_hash, *cur_block_token_ids,
+                    contextual_hash))
This is the part where the final hash value is generated.
@alexm-neuralmagic @youkaichao
This pull request has merge conflicts that must be resolved before it can be merged.
I'm sorry for the incorrect labeling and the notification; I made a mistake while rebasing my code. @rickyyx Could you also review my code, please? This PR applies LoRA and prompt adapter while leveraging prefix caching, which is closely related to your recent PRs.
The approach looks good to me. I had tried something similar before, going even further by including the token IDs as part of the contextual_hash, which I think might yield potential perf benefits by avoiding appending token IDs to the blocks.

A few nits:
- The name contextual_hash is not too straightforward for me. Personally I feel something like aux_hash_metadata or extra_hash_data would read better, but that is just a nit.
- Could we have some tests?
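In the spirit of the tests nit above, a minimal sketch of what such a test could look like. This is plain Python with a stand-in for the PR's hashing logic — the `hash_block_tokens` name and signature mirror the diff above, but this is not vLLM's actual test fixture:

```python
from typing import List, Optional

def hash_block_tokens(is_first_block: bool,
                      prev_block_hash: Optional[int],
                      cur_block_token_ids: List[int],
                      contextual_hash: Optional[int] = None) -> int:
    # Stand-in for the PR's hashing logic: chain the previous block's
    # hash with this block's token IDs and the contextual hash derived
    # from the LoRA / prompt adapter IDs.
    assert (prev_block_hash is None) == is_first_block
    return hash((is_first_block, prev_block_hash, *cur_block_token_ids,
                 contextual_hash))

def test_different_lora_ids_do_not_share_blocks():
    tokens = list(range(16))
    h_a = hash_block_tokens(True, None, tokens,
                            contextual_hash=hash((None, 1)))  # lora_int_id=1
    h_b = hash_block_tokens(True, None, tokens,
                            contextual_hash=hash((None, 2)))  # lora_int_id=2
    assert h_a != h_b

def test_same_lora_id_shares_blocks():
    tokens = list(range(16))
    h1 = hash_block_tokens(True, None, tokens, contextual_hash=hash((None, 1)))
    h2 = hash_block_tokens(True, None, tokens, contextual_hash=hash((None, 1)))
    assert h1 == h2

test_different_lora_ids_do_not_share_blocks()
test_same_lora_id_shares_blocks()
```

The key property under test: identical token blocks must not share cached blocks across different adapters, while repeated requests with the same adapter still hit the cache.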
Also, cc @comaniac
Overall LGTM. A few things:
- Please fix the default values. We should use None instead of 0.
- Please add unit tests.
- For naming, I agree with @rickyyx that extra_hash might be more generic.
vllm/core/block/block_table.py (outdated)

@@ -80,7 +80,8 @@ def get_num_required_blocks(token_ids: List[int],

    def allocate(self,
                 token_ids: List[int],
-                device: Device = Device.GPU) -> None:
+                device: Device = Device.GPU,
+                contextual_hash: Optional[int] = 0) -> None:
The default value should be None as 0 is potentially a valid hash value.
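The concern here is concrete: in CPython, 0 is an ordinary hash value, so a default of 0 cannot be distinguished from a genuinely computed hash. A small illustration in plain Python, independent of vLLM:

```python
# In CPython, several everyday values hash to exactly 0:
assert hash(0) == 0
assert hash(0.0) == 0
assert hash(False) == 0

# With a default of 0, a contextual hash that legitimately equals 0 is
# indistinguishable from "no contextual hash was supplied". A None
# sentinel keeps the two cases apart:
def has_contextual_hash(contextual_hash=None):
    return contextual_hash is not None

assert has_contextual_hash() is False
assert has_contextual_hash(hash(False)) is True  # hash(False) == 0, still detected
```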
vllm/core/block/common.py (outdated)

@@ -177,7 +177,8 @@ def __init__(self, block_size: int, create_block: Block.Factory,
                token_ids=[],
                block_size=self._block_size,
                allocator=self._allocator,
-               block_id=None))
+               block_id=None,
+               contextual_hash=0))
Again, do not use 0 as the default value.
vllm/core/block/interfaces.py (outdated)

@@ -50,6 +50,11 @@ def is_full(self) -> bool:

    def prev_block(self) -> Optional["Block"]:
        pass

+   @property
+   @abstractmethod
+   def contextual_hash(self):
Please add a type annotation here.
vllm/core/block/interfaces.py (outdated)

@@ -99,18 +106,21 @@ def content_hash(self) -> Optional[int]:

class BlockAllocator(ABC):

    @abstractmethod
-   def allocate_mutable_block(self, prev_block: Optional[Block]) -> Block:
+   def allocate_mutable_block(self, prev_block: Optional[Block],
+                              contextual_hash: Optional[int]) -> Block:
Ditto in all similar places.

-   contextual_hash: Optional[int]) -> Block:
+   contextual_hash: Optional[int] = None) -> Block:
vllm/sequence.py (outdated)

@@ -527,6 +527,15 @@ def hash_of_block(self, logical_idx: int) -> int:
        hashed_tokens = self.data.get_prefix_token_ids(num_tokens)
        return hash((hashed_tokens, self.lora_int_id))

+   def contextual_hash_of_block(self) -> int:
Sequence should not have the concept of "block".
Yes, I agree. The hash_of_block function is used by Block Manager V1, so I followed that convention. I calculate an additional hash using only the LoRA ID or prompt adapter ID, so I plan to remove the concept of "block" in this context.
Otherwise LGTM
Summary
Block Manager v2, unlike v1, did not support LoRA and prompt adapter in the block hash for prefix caching mode. I added logic to inject the LoRA ID and prompt adapter ID into the block hash function, so that LoRA and prompt adapter are supported when using prefix caching mode with Block Manager v2.

Detail
Block Manager v1 uses the following hash_of_block function to generate a content hash in prefix caching mode:
vllm/sequence.py, lines 460 to 468 (at commit baa5467)
However, Block Manager v2 only uses token IDs, as shown here:
vllm/core/block/prefix_caching_block.py, line 855 (at commit baa5467)
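Putting the schemes side by side as a simplified sketch. The function names here are illustrative, not vLLM's actual API; the real v1 logic is Sequence.hash_of_block in vllm/sequence.py, and the v2 logic lives in vllm/core/block/prefix_caching_block.py:

```python
from typing import List, Optional

# Block Manager v1: the LoRA ID is already part of the content hash,
# per the hash_of_block function referenced above.
def v1_hash_of_block(prefix_token_ids: tuple, lora_int_id: int) -> int:
    return hash((prefix_token_ids, lora_int_id))

# Block Manager v2, before this PR: token IDs only, so sequences using
# different LoRA adapters could wrongly share cached blocks.
def v2_hash_before(is_first_block: bool, prev_block_hash: Optional[int],
                   cur_block_token_ids: List[int]) -> int:
    return hash((is_first_block, prev_block_hash, *cur_block_token_ids))

# Block Manager v2, after this PR: the contextual hash derived from the
# LoRA / prompt adapter IDs is folded in as well.
def v2_hash_after(is_first_block: bool, prev_block_hash: Optional[int],
                  cur_block_token_ids: List[int],
                  contextual_hash: Optional[int]) -> int:
    return hash((is_first_block, prev_block_hash, *cur_block_token_ids,
                 contextual_hash))
```

With v2_hash_before, two requests that share a prompt but use different adapters collide onto the same cached block; v2_hash_after separates them while remaining identical when the adapter IDs match.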