[Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency #12921

lingfanyu · 2025-02-07T22:47:42Z

Summary

This PR is a follow-up to #11277. It improves code quality and efficiency, and enables the kernel to process larger inputs.

This PR makes the following changes:

fix two tiling issues triggered when seqlen_q (i.e. the chunk size in chunked-prefill) is larger than 128 and 512
remove the limit on how many KV cache blocks each tile can access (currently tile_size / block_size <= 128)
add unit tests with larger inputs (e.g. seqlen_q > 128, num_blocks_per_tile > 128)
skip computation in areas masked out by the causal mask
load the KV cache once for query heads sharing the same KV head
remove unused code and format the code with pre-commit
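
To illustrate the "skip computation in areas masked out by causal masks" item, here is a minimal pure-Python sketch of the tile-skipping idea. It is not the NKI kernel code from this PR. The function names, the `ctx_len` parameter, and the assumption that the query chunk sits at the end of the key sequence (query row `i` may attend to key `j` only if `j <= ctx_len + i`) are all hypothetical and chosen only for illustration.

```python
def visible_tiles(seqlen_q, ctx_len, q_tile, k_tile):
    """Return (q_start, k_start) for every tile with at least one unmasked entry.

    Assumption (not taken from the PR): the query chunk occupies absolute
    positions [ctx_len, ctx_len + seqlen_q), and keys occupy
    [0, ctx_len + seqlen_q).
    """
    seqlen_k = ctx_len + seqlen_q
    tiles = []
    for q0 in range(0, seqlen_q, q_tile):
        # Last query row in this tile sees the most keys.
        q_last = min(q0 + q_tile, seqlen_q) - 1
        max_visible_key = ctx_len + q_last
        for k0 in range(0, seqlen_k, k_tile):
            if k0 > max_visible_key:
                # This key tile and all later ones are fully masked: skip them.
                break
            tiles.append((q0, k0))
    return tiles


def visible_tiles_bruteforce(seqlen_q, ctx_len, q_tile, k_tile):
    """Reference version: test every (query, key) pair inside each tile."""
    seqlen_k = ctx_len + seqlen_q
    tiles = []
    for q0 in range(0, seqlen_q, q_tile):
        for k0 in range(0, seqlen_k, k_tile):
            rows = range(q0, min(q0 + q_tile, seqlen_q))
            cols = range(k0, min(k0 + k_tile, seqlen_k))
            if any(j <= ctx_len + i for i in rows for j in cols):
                tiles.append((q0, k0))
    return tiles


if __name__ == "__main__":
    kept = visible_tiles(seqlen_q=256, ctx_len=0, q_tile=128, k_tile=128)
    print(f"{len(kept)} of 4 tiles need computation: {kept}")
```

With no prior context, roughly half of the tile grid lies strictly above the causal diagonal, so skipping those tiles saves a comparable share of the matmul work. The saving shrinks as `ctx_len` grows relative to `seqlen_q`.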

Signed-off-by: Lingfan Yu <lingfany@amazon.com>

github-actions · 2025-02-07T22:47:56Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Lingfan Yu <lingfany@amazon.com>

liangfu

LGTM, but feel free to chime in.

…ention and Improve Efficiency (vllm-project#12921) Signed-off-by: Lingfan Yu <lingfany@amazon.com> Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com>

lingfanyu · 2025-02-12T18:50:40Z

Thanks!

lingfanyu added 3 commits February 7, 2025 22:02

remove unused code, skip areas masked out by causality, and format code

7205843

Signed-off-by: Lingfan Yu <lingfany@amazon.com>

remove num_block_per_tile limit and format code

a55de6f

Signed-off-by: Lingfan Yu <lingfany@amazon.com>

support seqlen_q larger than 128 and test with larger inputs

b81348d

Signed-off-by: Lingfan Yu <lingfany@amazon.com>

reformat using pre-commit

3bc5ffc

Signed-off-by: Lingfan Yu <lingfany@amazon.com>

lingfanyu changed the title ~~[Neuron][Kernel] Improve NKI-based flash PagedAttention~~ [Neuron][Kernel] Improve NKI-based Flash PagedAttention Feb 7, 2025

fix typing

0d0601d

Signed-off-by: Lingfan Yu <lingfany@amazon.com>

lingfanyu changed the title ~~[Neuron][Kernel] Improve NKI-based Flash PagedAttention~~ [Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency Feb 7, 2025

lingfanyu marked this pull request as ready for review February 7, 2025 23:24

lingfanyu added 2 commits February 9, 2025 23:03

Merge branch 'main' into nki_pa_improve

eff5f0b

fix missing testing args

cda0703

Signed-off-by: Lingfan Yu <lingfany@amazon.com>

liangfu approved these changes Feb 11, 2025

View reviewed changes

simon-mo approved these changes Feb 12, 2025

View reviewed changes

simon-mo merged commit e92694b into vllm-project:main Feb 12, 2025
17 of 19 checks passed

lingfanyu deleted the nki_pa_improve branch February 12, 2025 18:50

panf2333 pushed a commit to yottalabsai/vllm that referenced this pull request Feb 18, 2025

[Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAtt…

b701b9b

…ention and Improve Efficiency (vllm-project#12921) Signed-off-by: Lingfan Yu <lingfany@amazon.com>

kerthcet pushed a commit to kerthcet/vllm that referenced this pull request Feb 21, 2025

[Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAtt…

9c8a014

…ention and Improve Efficiency (vllm-project#12921) Signed-off-by: Lingfan Yu <lingfany@amazon.com>
