Implement PagedAttention V2 #1348
Conversation
Thanks for the great work! In general LGTM. Left some style comments.
block_size,
input_metadata.max_context_len,
None,  # alibi_slopes
)
Should we modify the ALiBi paged attention path so that it can also use PagedAttention V2?
Good catch! Fixed.
# sequences or heads is large, we use V1 since there is enough work
# to parallelize.
# TODO(woosuk): Tune this heuristic.
use_v1 = max_num_partitions == 1 or num_seqs * num_heads > 512
Why is the threshold 512? Is this number related to the number of SMs a GPU has?
Yes. As we discussed offline, this is a simple heuristic to make sure the V1 kernel is used when num_seqs * num_heads is roughly larger than 4 * (SM count) on A100 and H100 GPUs. This could be improved by querying the GPU's actual SM count; for now, I'm leaving that as future work.
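For reference, here is a minimal sketch of what an SM-aware version of the heuristic could look like. The function name `use_v1_kernel` and the exact 4x multiplier are illustrative assumptions, not the merged code:

```python
import torch

PARTITION_SIZE = 512  # tokens per partition in the V2 kernel


def use_v1_kernel(num_seqs: int, num_heads: int, max_context_len: int) -> bool:
    # Hypothetical SM-aware variant of the heuristic; the merged code simply
    # compares num_seqs * num_heads against a fixed threshold of 512
    # (roughly 4 * SM count on A100/H100).
    max_num_partitions = (max_context_len + PARTITION_SIZE - 1) // PARTITION_SIZE
    sm_count = torch.cuda.get_device_properties(
        torch.cuda.current_device()).multi_processor_count
    # Use V1 when a single partition suffices, or when there are already
    # enough (sequence, head) pairs to keep every SM busy.
    return max_num_partitions == 1 or num_seqs * num_heads > 4 * sm_count
```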
#define LAUNCH_PAGED_ATTENTION_V2(T, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS, PARTITION_SIZE) \
  vllm::paged_attention_v2_kernel<T, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS, PARTITION_SIZE> \
  <<<grid, block, shared_mem_size, stream>>>( \
Do we not need to set cudaFuncAttributeMaxDynamicSharedMemorySize here like we do for V1?
No, it's not necessary, because in V2 each thread block only handles PARTITION_SIZE (=512) tokens. So if we used V2 in all cases, we could remove the shared memory check and support (almost) arbitrary sequence lengths on all GPUs.
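To make that concrete, here is a back-of-the-envelope sketch in Python; the per-token byte counts are assumptions for illustration rather than the exact kernel math:

```python
# Rough sketch of the shared-memory budgets, assuming one float of attention
# logits per token in dynamic shared memory (the exact kernel layout differs).
FLOAT_BYTES = 4
PARTITION_SIZE = 512
DEFAULT_SMEM_LIMIT = 48 * 1024  # default per-block dynamic shared memory cap


def v1_logits_smem_bytes(max_context_len: int) -> int:
    # V1: one thread block scans the whole context, so the logits buffer
    # grows with the sequence length.
    return max_context_len * FLOAT_BYTES


def v2_logits_smem_bytes() -> int:
    # V2: each thread block only sees PARTITION_SIZE tokens, so the buffer
    # is a fixed 2 KB regardless of sequence length.
    return PARTITION_SIZE * FLOAT_BYTES


# Example: a 32K-token context needs ~128 KB in V1, well past the 48 KB
# default, which is why V1 raises cudaFuncAttributeMaxDynamicSharedMemorySize.
assert v1_logits_smem_bytes(32 * 1024) > DEFAULT_SMEM_LIMIT
assert v2_logits_smem_bytes() <= DEFAULT_SMEM_LIMIT
```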
Awesome, thanks for explaining! Should we then force V2 to be used when the check fails? It could be done in a follow-up PR.
@Yard1 That's a good idea! Let's do it in a follow-up PR.
@zhuohan123 I addressed your comments. PTAL.
LGTM! Thanks for the awesome work!
This PR implements the first part of the PagedAttention V2 kernel, which uses sequence-level parallelism for better work partitioning. Compared to V1, the V2 kernel achieves a large speedup when the batch size is small (e.g., <= 8). We will continue to optimize the kernel in follow-up work.
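To sketch the idea of sequence-level parallelism: each sequence's context is split into fixed-size partitions, each thread block computes attention over one partition, and a reduction step combines the partial results. The helper below is an illustrative assumption about the host-side buffers such a launch would need; the names `alloc_v2_buffers`, `tmp_output`, `exp_sums`, and `max_logits` are not necessarily the PR's exact names:

```python
import torch

PARTITION_SIZE = 512  # tokens handled by each thread block in V2


def alloc_v2_buffers(num_seqs: int, num_heads: int, head_size: int,
                     max_context_len: int, dtype: torch.dtype,
                     device: torch.device):
    # Hypothetical host-side sketch: every (sequence, head, partition) triple
    # produces a partial output plus softmax statistics (max logit and sum of
    # exponentials) that a final reduction combines into the real output.
    max_num_partitions = (max_context_len + PARTITION_SIZE - 1) // PARTITION_SIZE
    tmp_output = torch.empty(
        (num_seqs, num_heads, max_num_partitions, head_size),
        dtype=dtype, device=device)
    exp_sums = torch.empty(
        (num_seqs, num_heads, max_num_partitions),
        dtype=torch.float32, device=device)
    max_logits = torch.empty_like(exp_sums)
    return tmp_output, exp_sums, max_logits
```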