[Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching #8240
Conversation
vllm/core/block_manager_v2.py (outdated)

@@ -149,7 +149,10 @@ def _allocate_sequence(self, seq: Sequence) -> BlockTable:
            block_allocator=self.block_allocator,
            max_block_sliding_window=self.max_block_sliding_window,
        )
        block_table.allocate(seq.get_token_ids())

        contextual_hash = hash((seq.prompt_adapter_id, seq.lora_int_id))
Inject the hash value computed from the prompt adapter ID and LoRA ID; it will be folded into the hash of the prefix caching block.
(outdated)

        Returns:
            int: The computed hash value for the block.
        """
        assert (prev_block_hash is None) == is_first_block
-       return hash((is_first_block, prev_block_hash, *cur_block_token_ids))
+       return hash((is_first_block, prev_block_hash, *cur_block_token_ids,
+                    contextual_hash))
This is the part where the final hash value is generated.
@alexm-neuralmagic @youkaichao
This pull request has merge conflicts that must be resolved before it can be merged.
I'm sorry for the incorrect labeling and the notification; I made a mistake while rebasing my code. @rickyyx Could you also review my code, please? This PR applies LoRA and prompt adapter while leveraging prefix caching, which is closely related to your recent PRs.
The approach looks good to me. I had tried something similar before, going even further by including the token IDs as part of the contextual_hash, which I think might yield potential perf benefits by avoiding appending token IDs to the blocks.

A few nits:
- The name contextual_hash is not too straightforward for me. Personally I feel something like aux_hash_metadata or extra_hash_data would read better, but that is just a nit.
- Could we have some tests?
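In the spirit of the tests nit above, a minimal sketch of what such a test could look like. This is plain Python with a stand-in for the PR's hashing logic — the `hash_block_tokens` name and signature mirror the diff above, but this is not vLLM's actual test fixture:

```python
from typing import List, Optional

def hash_block_tokens(is_first_block: bool,
                      prev_block_hash: Optional[int],
                      cur_block_token_ids: List[int],
                      contextual_hash: Optional[int] = None) -> int:
    # Stand-in for the PR's hashing logic: chain the previous block's
    # hash with this block's token IDs and the contextual hash derived
    # from the LoRA / prompt adapter IDs.
    assert (prev_block_hash is None) == is_first_block
    return hash((is_first_block, prev_block_hash, *cur_block_token_ids,
                 contextual_hash))

def test_different_lora_ids_do_not_share_blocks():
    tokens = list(range(16))
    h_a = hash_block_tokens(True, None, tokens,
                            contextual_hash=hash((None, 1)))  # lora_int_id=1
    h_b = hash_block_tokens(True, None, tokens,
                            contextual_hash=hash((None, 2)))  # lora_int_id=2
    assert h_a != h_b

def test_same_lora_id_shares_blocks():
    tokens = list(range(16))
    h1 = hash_block_tokens(True, None, tokens, contextual_hash=hash((None, 1)))
    h2 = hash_block_tokens(True, None, tokens, contextual_hash=hash((None, 1)))
    assert h1 == h2

test_different_lora_ids_do_not_share_blocks()
test_same_lora_id_shares_blocks()
```

The key property under test: identical token blocks must not share cached blocks across different adapters, while repeated requests with the same adapter still hit the cache.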
Also, cc @comaniac
Overall LGTM. A few things:
- Please fix the default values. We should use None instead of 0.
- Please add unit tests.
- For naming, I agree with @rickyyx that extra_hash might be more generic.
vllm/core/block/block_table.py (outdated)

@@ -80,7 +80,8 @@ def get_num_required_blocks(token_ids: List[int],

    def allocate(self,
                 token_ids: List[int],
-                device: Device = Device.GPU) -> None:
+                device: Device = Device.GPU,
+                contextual_hash: Optional[int] = 0) -> None:
The default value should be None as 0 is potentially a valid hash value.
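The concern here is concrete: in CPython, 0 is an ordinary hash value, so a default of 0 cannot be distinguished from a genuinely computed hash. A small illustration in plain Python, independent of vLLM:

```python
# In CPython, several everyday values hash to exactly 0:
assert hash(0) == 0
assert hash(0.0) == 0
assert hash(False) == 0

# With a default of 0, a contextual hash that legitimately equals 0 is
# indistinguishable from "no contextual hash was supplied". A None
# sentinel keeps the two cases apart:
def has_contextual_hash(contextual_hash=None):
    return contextual_hash is not None

assert has_contextual_hash() is False
assert has_contextual_hash(hash(False)) is True  # hash(False) == 0, still detected
```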
vllm/core/block/common.py (outdated)

@@ -177,7 +177,8 @@ def __init__(self, block_size: int, create_block: Block.Factory,
                token_ids=[],
                block_size=self._block_size,
                allocator=self._allocator,
-               block_id=None))
+               block_id=None,
+               contextual_hash=0))
Again, do not use 0 as the default value.
vllm/core/block/interfaces.py (outdated)

@@ -50,6 +50,11 @@ def is_full(self) -> bool:

    def prev_block(self) -> Optional["Block"]:
        pass

+   @property
+   @abstractmethod
+   def contextual_hash(self):
Please add a type annotation here.
vllm/core/block/interfaces.py (outdated)

@@ -99,18 +106,21 @@ def content_hash(self) -> Optional[int]:

class BlockAllocator(ABC):

    @abstractmethod
-   def allocate_mutable_block(self, prev_block: Optional[Block]) -> Block:
+   def allocate_mutable_block(self, prev_block: Optional[Block],
+                              contextual_hash: Optional[int]) -> Block:
Ditto in all similar places.

-   contextual_hash: Optional[int]) -> Block:
+   contextual_hash: Optional[int] = None) -> Block:
vllm/sequence.py (outdated)

@@ -527,6 +527,15 @@ def hash_of_block(self, logical_idx: int) -> int:
        hashed_tokens = self.data.get_prefix_token_ids(num_tokens)
        return hash((hashed_tokens, self.lora_int_id))

+   def contextual_hash_of_block(self) -> int:
Sequence should not have the concept of "block".
Yes, I agree. The hash_of_block function is used by Block Manager V1, so I followed that convention. I calculate an additional hash using only the LoRA ID or prompt adapter ID, so I plan to remove the concept of "block" in this context.
Otherwise LGTM
Summary
Block Manager v2, unlike v1, did not support LoRA and prompt adapter in the block hash for prefix caching mode. I added logic to inject the LoRA ID and prompt adapter ID into the block hash function, so that LoRA and prompt adapter are supported when using prefix caching mode with Block Manager v2.

Detail
Block Manager v1 uses the following hash_of_block function to generate a content hash in prefix caching mode:
vllm/sequence.py, lines 460 to 468 (at commit baa5467)
However, Block Manager v2 only uses token IDs, as shown here:
vllm/core/block/prefix_caching_block.py, line 855 (at commit baa5467)
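Putting the schemes side by side as a simplified sketch. The function names here are illustrative, not vLLM's actual API; the real v1 logic is Sequence.hash_of_block in vllm/sequence.py, and the v2 logic lives in vllm/core/block/prefix_caching_block.py:

```python
from typing import List, Optional

# Block Manager v1: the LoRA ID is already part of the content hash,
# per the hash_of_block function referenced above.
def v1_hash_of_block(prefix_token_ids: tuple, lora_int_id: int) -> int:
    return hash((prefix_token_ids, lora_int_id))

# Block Manager v2, before this PR: token IDs only, so sequences using
# different LoRA adapters could wrongly share cached blocks.
def v2_hash_before(is_first_block: bool, prev_block_hash: Optional[int],
                   cur_block_token_ids: List[int]) -> int:
    return hash((is_first_block, prev_block_hash, *cur_block_token_ids))

# Block Manager v2, after this PR: the contextual hash derived from the
# LoRA / prompt adapter IDs is folded in as well.
def v2_hash_after(is_first_block: bool, prev_block_hash: Optional[int],
                  cur_block_token_ids: List[int],
                  contextual_hash: Optional[int]) -> int:
    return hash((is_first_block, prev_block_hash, *cur_block_token_ids,
                 contextual_hash))
```

With v2_hash_before, two requests that share a prompt but use different adapters collide onto the same cached block; v2_hash_after separates them while remaining identical when the adapter IDs match.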