Implement PagedAttention V2 #1348
Conversation
Thanks for the great work! In general LGTM. Left some style comments.
block_size,
input_metadata.max_context_len,
None,  # alibi_slopes
)
Should we modify the ALiBi paged attention path so that it can also use PagedAttention V2?
Good catch! Fixed.
# sequences or heads is large, we use V1 since there is enough work
# to parallelize.
# TODO(woosuk): Tune this heuristic.
use_v1 = max_num_partitions == 1 or num_seqs * num_heads > 512
Why is the threshold 512? Is this number related to the number of SMs a GPU has?
Yes. As we discussed offline, this is a simple heuristic to make sure the V1 kernel is used when num_seqs * num_heads is roughly larger than 4 * (SM count) on A100 and H100 GPUs. This could be improved by querying the GPU's actual SM count; for now, I'm leaving that as future work.
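For reference, here is a minimal sketch of what an SM-aware version of the heuristic could look like. The function name `use_v1_kernel` and the exact 4x multiplier are illustrative assumptions, not the merged code:

```python
import torch

PARTITION_SIZE = 512  # tokens per partition in the V2 kernel


def use_v1_kernel(num_seqs: int, num_heads: int, max_context_len: int) -> bool:
    # Hypothetical SM-aware variant of the heuristic; the merged code simply
    # compares num_seqs * num_heads against a fixed threshold of 512
    # (roughly 4 * SM count on A100/H100).
    max_num_partitions = (max_context_len + PARTITION_SIZE - 1) // PARTITION_SIZE
    sm_count = torch.cuda.get_device_properties(
        torch.cuda.current_device()).multi_processor_count
    # Use V1 when a single partition suffices, or when there are already
    # enough (sequence, head) pairs to keep every SM busy.
    return max_num_partitions == 1 or num_seqs * num_heads > 4 * sm_count
```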
#define LAUNCH_PAGED_ATTENTION_V2(T, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS, PARTITION_SIZE) \
  vllm::paged_attention_v2_kernel<T, HEAD_SIZE, BLOCK_SIZE, NUM_THREADS, PARTITION_SIZE> \
  <<<grid, block, shared_mem_size, stream>>>( \
Do we not need to set cudaFuncAttributeMaxDynamicSharedMemorySize here like we do for V1?
No, it's not necessary, because in V2 each thread block only handles PARTITION_SIZE (=512) tokens. So if we used V2 in all cases, we could remove the shared memory check and support (almost) arbitrary sequence lengths on all GPUs.
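To make that concrete, here is a back-of-the-envelope sketch in Python; the per-token byte counts are assumptions for illustration rather than the exact kernel math:

```python
# Rough sketch of the shared-memory budgets, assuming one float of attention
# logits per token in dynamic shared memory (the exact kernel layout differs).
FLOAT_BYTES = 4
PARTITION_SIZE = 512
DEFAULT_SMEM_LIMIT = 48 * 1024  # default per-block dynamic shared memory cap


def v1_logits_smem_bytes(max_context_len: int) -> int:
    # V1: one thread block scans the whole context, so the logits buffer
    # grows with the sequence length.
    return max_context_len * FLOAT_BYTES


def v2_logits_smem_bytes() -> int:
    # V2: each thread block only sees PARTITION_SIZE tokens, so the buffer
    # is a fixed 2 KB regardless of sequence length.
    return PARTITION_SIZE * FLOAT_BYTES


# Example: a 32K-token context needs ~128 KB in V1, well past the 48 KB
# default, which is why V1 raises cudaFuncAttributeMaxDynamicSharedMemorySize.
assert v1_logits_smem_bytes(32 * 1024) > DEFAULT_SMEM_LIMIT
assert v2_logits_smem_bytes() <= DEFAULT_SMEM_LIMIT
```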
Awesome, thanks for explaining! Should we then force V2 to be used when the check fails? It could be done in a follow-up PR.
@Yard1 That's a good idea! Let's do it in a follow-up PR.
@zhuohan123 I addressed your comments. PTAL.
LGTM! Thanks for the awesome work!
This PR implements the first part of the PagedAttention V2 kernel, which uses sequence-level parallelism for better work partitioning. Compared to V1, the V2 kernel achieves a large speedup when the batch size is small (e.g., <= 8). We will continue to optimize the kernel in follow-up work.
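To sketch the idea of sequence-level parallelism: each sequence's context is split into fixed-size partitions, each thread block computes attention over one partition, and a reduction step combines the partial results. The helper below is an illustrative assumption about the host-side buffers such a launch would need; the names `alloc_v2_buffers`, `tmp_output`, `exp_sums`, and `max_logits` are not necessarily the PR's exact names:

```python
import torch

PARTITION_SIZE = 512  # tokens handled by each thread block in V2


def alloc_v2_buffers(num_seqs: int, num_heads: int, head_size: int,
                     max_context_len: int, dtype: torch.dtype,
                     device: torch.device):
    # Hypothetical host-side sketch: every (sequence, head, partition) triple
    # produces a partial output plus softmax statistics (max logit and sum of
    # exponentials) that a final reduction combines into the real output.
    max_num_partitions = (max_context_len + PARTITION_SIZE - 1) // PARTITION_SIZE
    tmp_output = torch.empty(
        (num_seqs, num_heads, max_num_partitions, head_size),
        dtype=dtype, device=device)
    exp_sums = torch.empty(
        (num_seqs, num_heads, max_num_partitions),
        dtype=torch.float32, device=device)
    max_logits = torch.empty_like(exp_sums)
    return tmp_output, exp_sums, max_logits
```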