[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) #4837

Merged

Changes from 9 commits

68 commits
7eb0e0d
added block manager tests
afeldman-nm May 15, 2024
6e41c39
passing block manager encoder/decoder test
afeldman-nm May 15, 2024
7bcc4ef
Merge branch 'upstream-main' into infra_enc_dec_block_manager
afeldman-nm May 15, 2024
f04ee73
block manager v2 changes to pass test_can_allocate_seq_group_encoder_…
afeldman-nm May 15, 2024
07bbd8a
block manager v2 support for encoder/decoder
afeldman-nm May 15, 2024
85e602b
Merge branch 'upstream-main' into infra_enc_dec_block_manager_merge
afeldman-nm May 15, 2024
3e95602
renamed encoder to cross in block manager v2, regarding block tables
afeldman-nm May 15, 2024
04f38a8
renamed encoder to cross where appropriate
afeldman-nm May 15, 2024
2dcd663
formatting
afeldman-nm May 15, 2024
22d4c17
Merge branch 'upstream-main' into infra_enc_dec_block_manager
afeldman-nm May 15, 2024
a6aba57
Merge branch 'upstream-main' into infra_enc_dec_block_manager_merge
afeldman-nm May 16, 2024
22d9dba
Merge branch 'upstream-main' into infra_enc_dec_block_manager
afeldman-nm May 17, 2024
954cd54
Merge branch 'upstream-main' into infra_enc_dec_block_manager
afeldman-nm May 17, 2024
2e245b3
Merge branch 'upstream-main' into infra_enc_dec_block_manager
afeldman-nm May 21, 2024
63dd42d
Merge branch 'upstream-main' into infra_enc_dec_block_manager
afeldman-nm May 21, 2024
2ced012
fix wording nits (ben->been, decoder->encoder/decoder)
afeldman-nm May 21, 2024
ed337e8
Merge branch 'upstream-main' into infra_enc_dec_block_manager_merge
afeldman-nm May 22, 2024
8286b4c
changed two block manager tests to construct fake prompts that are eq…
afeldman-nm May 22, 2024
eba551c
keyword args for dummy prompt construction in block manager encoder/d…
afeldman-nm May 22, 2024
a7c8b19
bugfix - decoder prompt kwarg repeated in lieu of encoder prompt kwarg
afeldman-nm May 22, 2024
9feb994
In block manager test which used with block to detect error - created…
afeldman-nm May 22, 2024
5eb0032
refactoring block manager v1/v2 swap in/swap out functions
afeldman-nm May 22, 2024
0644cde
formatting; changed blocktable type specifier from Dict to List[int]
afeldman-nm May 22, 2024
19ed741
prefixed internal method with _
afeldman-nm May 22, 2024
a557972
refactored self-/cross-attention allocation functions into a single h…
afeldman-nm May 22, 2024
e48bebf
Refactored block manager v2 self-/cross-block-table alloc functions t…
afeldman-nm May 22, 2024
18b415f
Merge branch 'upstream-main' into infra_enc_dec_block_manager_review
afeldman-nm May 22, 2024
c6842c8
Merge branch 'infra_enc_dec_block_manager_review' into infra_enc_dec_…
afeldman-nm May 22, 2024
ac2da97
formatting
afeldman-nm May 22, 2024
e985a2f
refactored out block manager v1 swap_in/swap_out helper functions
afeldman-nm May 22, 2024
98c5863
Helper function avoids prefix caching code in encoder/decoder scenarios…
afeldman-nm May 22, 2024
defa279
Merge branch 'upstream-main' into infra_enc_dec_block_manager_merge
afeldman-nm May 23, 2024
f3b1b94
Merge branch 'upstream-main' into infra_enc_dec_block_manage_reviews
afeldman-nm May 23, 2024
84f5510
block manager v1 NotImplementedError's for sliding window and automatic…
afeldman-nm May 23, 2024
cc61959
Fixes
afeldman-nm May 23, 2024
dcb9abe
formatting
afeldman-nm May 23, 2024
e8c40fc
explanatory comment
afeldman-nm May 23, 2024
5ccb70b
various fixes according to reviews
afeldman-nm May 23, 2024
dfcc28b
slight refactoring
afeldman-nm May 23, 2024
8d3ad05
small refactor
afeldman-nm May 23, 2024
5a76979
replaced all encoder_seq is not None with not decoder_only
afeldman-nm May 23, 2024
09ae4ad
added is_encoder_decoder() method to sequence group
afeldman-nm May 23, 2024
ecd1a99
tests for NotImplemented errors when encoder/decoder models are used …
afeldman-nm May 23, 2024
191a5b6
Merge branch 'upstream-main' into infra_enc_dec_block_manager_reviews
afeldman-nm May 23, 2024
d3935f7
rename tests
afeldman-nm May 23, 2024
e6a7125
spelling error
afeldman-nm May 23, 2024
68b4762
isort
afeldman-nm May 23, 2024
0c5fc61
Merge branch 'upstream-main' into infra_enc_dec_block_manager
afeldman-nm May 24, 2024
845f040
Merge branch 'upstream-main' into infra_enc_dec_block_manager
afeldman-nm May 24, 2024
849e49c
Merge branch 'upstream-main' into infra_enc_dec_block_manager_reviews
afeldman-nm May 26, 2024
a80325d
return output of SequenceGroup constructor
afeldman-nm May 26, 2024
8b38776
capitalize constants
afeldman-nm May 26, 2024
f39c313
refactored swap-block-table functionality
afeldman-nm May 26, 2024
90b5a0e
Refactored block manager + enc dec + unsupported feature checks into …
afeldman-nm May 26, 2024
9ee2582
removed circular import
afeldman-nm May 26, 2024
5d0ac23
apparently isort has to run last?
afeldman-nm May 26, 2024
1bcc949
slight name change
afeldman-nm May 26, 2024
5ae5969
merge
afeldman-nm May 28, 2024
1bece71
wip merge
afeldman-nm May 28, 2024
1d882ca
fixed utils to correctly handle encoder/decoder unsupported scenarios
afeldman-nm May 28, 2024
dfd9469
formatting
afeldman-nm May 28, 2024
611df43
yapf fix
afeldman-nm May 29, 2024
8ee49dd
yapf fix
afeldman-nm May 29, 2024
6f4b49e
Merge branch 'upstream-main' into infra_enc_dec_block_manager_reviews
afeldman-nm May 29, 2024
039c25e
upstream merge
afeldman-nm May 29, 2024
8e9ef5b
fix formatting issue
afeldman-nm May 29, 2024
2b59ddc
formatting
afeldman-nm May 29, 2024
471569f
Merge branch 'upstream-main' into infra_enc_dec_block_manager_reviews
afeldman-nm May 29, 2024
53 changes: 52 additions & 1 deletion tests/core/block/test_block_manager_v2.py
@@ -5,7 +5,7 @@
from vllm.sequence import Logprob, SequenceStatus
from vllm.utils import chunk_list

-from ..utils import create_seq_group
+from ..utils import create_seq_group, create_seq_group_encoder_decoder


@pytest.mark.parametrize("block_size", [16])
@@ -52,6 +52,57 @@ def test_can_allocate_seq_group(block_size: int, num_seqs_per_group: int,
    assert can_allocate_result == AllocStatus.LATER


@pytest.mark.parametrize("block_size", [16])
@pytest.mark.parametrize("num_gpu_blocks", [16, 80, 160])
@pytest.mark.parametrize("num_seqs_per_group", [1, 4])
@pytest.mark.parametrize("watermark", [0.0, 0.5])
def test_can_allocate_seq_group_encoder_decoder(block_size: int,
                                                num_seqs_per_group: int,
                                                num_gpu_blocks: int,
                                                watermark: float):
    block_manager = BlockSpaceManagerV2(
        block_size=block_size,
        num_gpu_blocks=num_gpu_blocks,
        num_cpu_blocks=1024,
        watermark=watermark,
    )
    num_watermark_blocks = int(watermark * num_gpu_blocks)

    num_output_blocks_per_seq = 1

    # NOTE: This should be num_output_blocks_per_seq * num_seqs_per_group, but
    # the current implementation assumes all seqs are new prompts / don't have
    # different output lens.
    num_output_blocks = num_output_blocks_per_seq

    for bdx, num_prompt_blocks in enumerate(
            range(1, num_gpu_blocks - num_output_blocks)):
        num_cross_blocks_per_seq = num_prompt_blocks

        seq_group = create_seq_group_encoder_decoder(
            seq_prompt_len=block_size * num_prompt_blocks,
            seq_output_lens=[
                block_size * num_output_blocks_per_seq
                for _ in range(num_seqs_per_group)
            ],
            request_id=str(bdx))

        assert num_prompt_blocks + num_output_blocks <= num_gpu_blocks

        can_allocate_result = block_manager.can_allocate(seq_group)

        num_required_blocks = num_prompt_blocks + \
                              num_output_blocks + \
                              num_cross_blocks_per_seq

        if num_gpu_blocks - num_required_blocks < num_watermark_blocks:
            assert can_allocate_result == AllocStatus.NEVER
        elif num_gpu_blocks >= num_required_blocks:
            assert can_allocate_result == AllocStatus.OK
        else:
            assert can_allocate_result == AllocStatus.LATER

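To make the branch logic above concrete, here is one hand-traced case from the test's own parameter grid (an illustrative sketch, not part of the diff):

# block_size=16, num_gpu_blocks=16, watermark=0.5, num_seqs_per_group=1,
# num_prompt_blocks=4:
#   num_watermark_blocks = int(0.5 * 16) = 8
#   num_required_blocks  = 4 (prompt) + 1 (output) + 4 (cross) = 9
#   16 - 9 = 7, which is below the 8 watermark blocks, so the test
#   expects AllocStatus.NEVER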

@pytest.mark.parametrize("block_size", [1, 8])
@pytest.mark.parametrize("prompt_len", [1, 7, 8])
@pytest.mark.parametrize("num_slots_to_append", [1, 8, 129])
147 changes: 146 additions & 1 deletion tests/core/test_block_manager.py
@@ -12,7 +12,7 @@
from vllm.sequence import Logprob, Sequence, SequenceGroup, SequenceStatus
from vllm.utils import Device

-from .utils import create_dummy_prompt
+from .utils import create_dummy_prompt, create_dummy_prompt_encoder_decoder


def test_block_allocator_allocate():
@@ -90,6 +90,38 @@ def test_allocate():
    assert block_manager.can_allocate(seq_group) != AllocStatus.OK


def test_allocate_encoder_decoder():
    block_size = 4
    num_cpu_blocks = 4
    num_gpu_blocks = 4
    block_req_per_seq_group = 2
    block_manager = BlockSpaceManagerV1(block_size,
                                        num_cpu_blocks,
                                        num_gpu_blocks,
                                        watermark=0)

    # Allocate same sequence group to all available gpu blocks.
    for i in range(num_gpu_blocks // block_req_per_seq_group):
        _, _, seq_group = create_dummy_prompt_encoder_decoder(
            str(i), block_size, block_size)
        assert block_manager.can_allocate(seq_group)
        block_manager.allocate(seq_group)
    assert block_manager.can_allocate(seq_group) != AllocStatus.OK

    # Allocate same sequence group to all available gpu blocks.
    # Use watermark to reserve one gpu block.
    block_manager = BlockSpaceManagerV1(block_size,
                                        num_cpu_blocks,
                                        num_gpu_blocks,
                                        watermark=1 / num_gpu_blocks)
    for i in range((num_gpu_blocks - 1) // block_req_per_seq_group):
        _, _, seq_group = create_dummy_prompt_encoder_decoder(
            str(i), block_size // 2, block_size // 2)
        assert block_manager.can_allocate(seq_group)
        block_manager.allocate(seq_group)
    assert block_manager.can_allocate(seq_group) != AllocStatus.OK

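The constant block_req_per_seq_group = 2 reflects the arithmetic this test relies on: with both prompts exactly one block long, each encoder/decoder sequence group consumes one decoder (self-attention) block plus one cross-attention block (a reading of the test above, not a statement from the diff):

# num_gpu_blocks = 4, block_req_per_seq_group = 2:
#   4 // 2 = 2 sequence groups fill the GPU pool exactly, after which
#   can_allocate() no longer returns AllocStatus.OK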

def test_append_slot_single_seq():
    block_size = 4
    num_cpu_blocks = 4
@@ -241,6 +273,62 @@ def test_swap():
    assert before_gpu_blocks == after_gpu_blocks + len(cpu_blocks)


def test_swap_encoder_decoder():
    block_size = 4
    num_cpu_blocks = 4
    num_gpu_blocks = 4
    block_manager = BlockSpaceManagerV1(block_size,
                                        num_cpu_blocks,
                                        num_gpu_blocks,
                                        watermark=0)

    decoder_prompt, encoder_prompt, seq_group = \
        create_dummy_prompt_encoder_decoder(
            "1",
            decoder_prompt_length=block_size,
            encoder_prompt_length=block_size)
    decoder_prompt.status = SequenceStatus.WAITING
    encoder_prompt.status = SequenceStatus.WAITING
    block_manager.allocate(seq_group)

    # Emulate a forward pass by appending a single token.
    # The block manager then knows how many unprocessed
    # tokens will be written in the next forward pass.
    token_id = 0
    decoder_prompt.status = SequenceStatus.RUNNING
    decoder_prompt.append_token_id(token_id, {token_id: Logprob(0.0)})

    # Swap encoder/decoder seq group from GPU -> CPU.
    decoder_gpu_blocks = block_manager.get_block_table(decoder_prompt)
    cross_gpu_blocks = block_manager.get_cross_block_table(seq_group)
    gpu_blocks = decoder_gpu_blocks + cross_gpu_blocks
    assert block_manager.can_swap_out(seq_group)
    before_cpu_blocks = block_manager.get_num_free_cpu_blocks()
    before_gpu_blocks = block_manager.get_num_free_gpu_blocks()
    mapping = block_manager.swap_out(seq_group)
    assert [x[0] for x in mapping] == gpu_blocks
    after_cpu_blocks = block_manager.get_num_free_cpu_blocks()
    after_gpu_blocks = block_manager.get_num_free_gpu_blocks()
    assert before_cpu_blocks == after_cpu_blocks + len(gpu_blocks)
    assert before_gpu_blocks + len(gpu_blocks) == after_gpu_blocks
    decoder_prompt.status = SequenceStatus.SWAPPED

    # Swap encoder/decoder seq group from CPU -> GPU.
    decoder_cpu_blocks = block_manager.get_block_table(decoder_prompt)
    cross_cpu_blocks = block_manager.get_cross_block_table(seq_group)
    cpu_blocks = decoder_cpu_blocks + cross_cpu_blocks
    assert block_manager.can_swap_in(seq_group) == AllocStatus.OK
    before_cpu_blocks = block_manager.get_num_free_cpu_blocks()
    before_gpu_blocks = block_manager.get_num_free_gpu_blocks()
    mapping = block_manager.swap_in(seq_group)
    assert [x[0] for x in mapping] == cpu_blocks
    after_cpu_blocks = block_manager.get_num_free_cpu_blocks()
    after_gpu_blocks = block_manager.get_num_free_gpu_blocks()
    assert before_cpu_blocks + len(cpu_blocks) == after_cpu_blocks
    assert before_gpu_blocks == after_gpu_blocks + len(cpu_blocks)

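The assertions above depend on the shape of the swap mapping (an assumed reading of the test, not an API guarantee quoted from the diff): swap_out and swap_in return (source_block, destination_block) pairs, with the decoder block table listed before the group's cross-attention block table:

# mapping = [(src_block_0, dst_block_0), (src_block_1, dst_block_1), ...]
# so [x[0] for x in mapping] recovers the source blocks in order:
# decoder blocks first, then cross-attention blocks.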

def test_free():
    block_size = 4
    num_cpu_blocks = 4
@@ -265,6 +353,38 @@ def test_free():
        block_manager.get_block_table(prompt)


def test_free_encoder_decoder():
    block_size = 4
    num_cpu_blocks = 4
    num_gpu_blocks = 4
    block_manager = BlockSpaceManagerV1(block_size,
                                        num_cpu_blocks,
                                        num_gpu_blocks,
                                        watermark=0)

    decoder_prompt, encoder_prompt, seq_group = \
        create_dummy_prompt_encoder_decoder(
            "1",
            decoder_prompt_length=block_size // 2,
            encoder_prompt_length=block_size // 2)
    block_manager.allocate(seq_group)

    # Free allocated seq.
    decoder_prompt_blocks = len(block_manager.get_block_table(decoder_prompt))
    encoder_prompt_blocks = len(block_manager.get_cross_block_table(seq_group))
    prompt_blocks = decoder_prompt_blocks + encoder_prompt_blocks
    before_blocks = block_manager.get_num_free_gpu_blocks()
    block_manager.free(decoder_prompt)
    block_manager.free_cross(seq_group)
    after_blocks = block_manager.get_num_free_gpu_blocks()
    assert after_blocks == before_blocks + prompt_blocks
    # Block tables for freed encoder & decoder seqs are deleted.
    with pytest.raises(KeyError):
        block_manager.get_block_table(decoder_prompt)
        block_manager.get_block_table(encoder_prompt)

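Note the two-call teardown the test exercises: per-sequence block tables and the group-level cross-attention block table are freed separately. A minimal recap of the pattern, taken directly from the test above:

block_manager.free(decoder_prompt)   # frees the decoder's self-attention blocks
block_manager.free_cross(seq_group)  # frees the group's cross-attention blocks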

def test_reset():
    block_size = 4
    num_cpu_blocks = 4
@@ -286,6 +406,31 @@
    assert block_manager.get_num_free_gpu_blocks() == original_blocks


def test_reset_encoder_decoder():
    block_size = 4
    num_cpu_blocks = 4
    num_gpu_blocks = 4
    block_req_per_seq_group = 2
    block_manager = BlockSpaceManagerV1(block_size,
                                        num_cpu_blocks,
                                        num_gpu_blocks,
                                        watermark=0)

    # Allocate same seq group on all available gpu blocks.
    original_blocks = block_manager.get_num_free_gpu_blocks()
    for i in range(num_gpu_blocks // block_req_per_seq_group):
        _, _, seq_group = create_dummy_prompt_encoder_decoder(
            f"{i}",
            decoder_prompt_length=block_size,
            encoder_prompt_length=block_size)
        block_manager.allocate(seq_group)
    assert block_manager.get_num_free_gpu_blocks() == 0

    # Resetting block manager frees all allocated blocks.
    block_manager.reset()
    assert block_manager.get_num_free_gpu_blocks() == original_blocks


def test_sliding_window_multi_seq():
"""
Tests that memory allocation and deallocation is handled
Expand Down
81 changes: 81 additions & 0 deletions tests/core/utils.py
@@ -33,6 +33,40 @@ def create_dummy_prompt(
    return prompt, seq_group


def create_dummy_prompt_encoder_decoder(
    request_id: str,
    decoder_prompt_length: int,
    encoder_prompt_length: int,
    block_size: Optional[int] = None,
    lora_request: Optional[LoRARequest] = None,
    use_beam_search: bool = False,
    best_of: int = 1,
) -> Tuple[Sequence, Sequence, SequenceGroup]:
    if not block_size:
        block_size = decoder_prompt_length

    # Create a dummy decoder prompt sequence with tokens
    # 0...decoder_prompt_length-1 and a dummy encoder prompt sequence
    # with tokens encoder_prompt_length-1...0.
    decoder_prompt_tokens = list(range(decoder_prompt_length))
    decoder_prompt_str = " ".join([str(t) for t in decoder_prompt_tokens])
    decoder_prompt = Sequence(int(request_id), decoder_prompt_str,
                              decoder_prompt_tokens, block_size)
    encoder_prompt_tokens = list(reversed(list(range(encoder_prompt_length))))
    encoder_prompt_str = " ".join([str(t) for t in encoder_prompt_tokens])
    encoder_prompt = Sequence(int(request_id), encoder_prompt_str,
                              encoder_prompt_tokens, block_size)
    seq_group = SequenceGroup(request_id=request_id,
                              seqs=[decoder_prompt],
                              sampling_params=SamplingParams(
                                  use_beam_search=use_beam_search,
                                  best_of=best_of),
                              arrival_time=time.time(),
                              lora_request=lora_request,
                              encoder_seq=encoder_prompt)

    return decoder_prompt, encoder_prompt, seq_group

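For reference, this is how the block manager tests above invoke the helper (a usage sketch mirroring test_swap_encoder_decoder, with block_size assumed to be 4):

decoder_prompt, encoder_prompt, seq_group = \
    create_dummy_prompt_encoder_decoder(
        "1",                       # request id
        decoder_prompt_length=4,   # one decoder block at block_size=4
        encoder_prompt_length=4)   # one cross-attention block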

def create_seq_group(
        seq_prompt_len: int = 1024,
        seq_output_lens: Iterable[int] = (128, ),
@@ -73,5 +107,52 @@ def create_seq_group(
    return seq_group


def create_seq_group_encoder_decoder(
        seq_prompt_len: int = 1024,
        seq_output_lens: Iterable[int] = (128, ),
        request_id: str = '0',
        seq_id_start: int = 0,
        sampling_params: Optional[SamplingParams] = None) -> SequenceGroup:

    assert len(seq_output_lens) > 0

    if sampling_params is None:
        sampling_params = SamplingParams()

    prompt_token_ids = [0] * seq_prompt_len

    seqs = []
    for seq_id_offset, output_len in enumerate(seq_output_lens):
        seq = Sequence(
            seq_id=seq_id_start + seq_id_offset,
            prompt="",
            prompt_token_ids=prompt_token_ids,
            block_size=16,
        )

        for i in range(output_len):
            seq.append_token_id(
                token_id=i,
                logprobs={i: Logprob(0.0)},
            )
        seqs.append(seq)

    # Encoder sequence
    encoder_seq = Sequence(
        seq_id=seq_id_start + len(seq_output_lens),
        prompt="",
        prompt_token_ids=prompt_token_ids,
        block_size=16,
    )

    seq_group = SequenceGroup(request_id=request_id,
                              seqs=seqs,
                              sampling_params=sampling_params,
                              arrival_time=time.time(),
                              encoder_seq=encoder_seq)

    return seq_group
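Together, these helpers capture the layout the PR's tests assume: the encoder prompt rides on the SequenceGroup as encoder_seq rather than being appended to seqs, so there is one cross-attention block table per group while each decoder sequence keeps its own block table (a reading of the tests, not text quoted from the diff):

# Block accounting per encoder/decoder sequence group:
#   self-attention KV : one block table per decoder sequence
#   cross-attention KV: one block table per group (see get_cross_block_table)
# which is why test_can_allocate_seq_group_encoder_decoder sums
#   num_prompt_blocks + num_output_blocks + num_cross_blocks_per_seq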


def round_up_to_next_block(seq_len: int, block_size: int) -> int:
    return (seq_len + block_size - 1) // block_size
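A quick worked example of this ceiling-division helper (illustrative):

# round_up_to_next_block(17, 16) == (17 + 15) // 16 == 2  # one full + one partial block
# round_up_to_next_block(16, 16) == (16 + 15) // 16 == 1  # exactly one block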