[V1] Optimize the CPU overheads in FlashAttention custom op #10733
Conversation
@@ -203,23 +209,31 @@ def unified_v1_flash_attention(
        v_scale,
    )

    attn_output = flash_attn_varlen_func(
can you also update the corresponding v0 code?
Looking at profile results on #9856, this saves about 60µs off of the CPU time spent in each flash attention call (approx 300µs -> 240µs)
Thanks!
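For reference, below is a minimal sketch of how a per-call CPU cost like the one quoted above could be estimated with torch.profiler. This is an assumption about the methodology, not the exact profiling setup behind the numbers in #9856, and `run_attention_once` is a hypothetical stand-in for one pass through the attention op.

```python
# Sketch: estimate CPU time per call of an op with torch.profiler.
# `run_attention_once` is a placeholder workload, not vLLM code.
import torch
from torch.profiler import ProfilerActivity, profile, record_function


def run_attention_once(q, k, v):
    # Placeholder for the real attention call (e.g. the FlashAttention custom op).
    return torch.softmax(q @ k.transpose(-1, -2), dim=-1) @ v


q = k = v = torch.randn(8, 128, 64)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(100):
        with record_function("attention_call"):
            run_attention_once(q, k, v)

# Dividing the aggregated CPU time of "attention_call" by the number of
# iterations gives an approximate per-call CPU cost.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```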
LGTM with Kaichao's comment, thanks for quickly improving this. The failing test is due to neuralmagic/Phi-3-medium-128k-instruct-quantized.w4a16 and is unrelated.
Force-pushed from e4f8b06 to 456980b
@youkaichao @mgoin As we merged vllm-project/flash-attention#30, we don't have to directly use torch.ops.vllm_flash_attn_c.varlen_fwd anymore.
One weird phenomenon I found is that V1 has a spike in latency:
This is highly reproducible on my dev machine. Can this be because of Python GC or something like that?
It’s probably the prefix caching …
Hmm, but benchmark_latency.py does sample each prompt separately: https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_latency.py#L36
Just found that it has a warmup phase. It could still be due to prefix caching if all prompts are cached by then. I suggest explicitly disabling prefix caching to double-check.
@comaniac @robertgshaw2-neuralmagic You're right. The latency becomes stable when prefix caching is turned off.
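A minimal sketch of that double-check, assuming vLLM's offline `LLM` API and its `enable_prefix_caching` engine argument (the argument name is an assumption here; the benchmark script may expose an equivalent option):

```python
# Sketch: repeatedly time the same request with prefix caching disabled
# to see whether the latency spike disappears.
import time

from vllm import LLM, SamplingParams

# enable_prefix_caching=False turns off the suspected source of the spike.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=False)
params = SamplingParams(temperature=0.0, max_tokens=128)
prompt = "Hello, my name is"

for i in range(5):
    start = time.perf_counter()
    llm.generate([prompt], params)
    print(f"iteration {i}: {time.perf_counter() - start:.3f}s")
```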
With piece-wise CUDA graphs, we have to make sure that the attention custom op causes minimal CPU overheads. This PR made a few changes to optimize the CPU overheads in the FlashAttention custom op:

We directly use `torch.ops.vllm_flash_attn_c.varlen_fwd` rather than `flash_attn_varlen_func`, since `FlashAttnFunc`, which inherits from `torch.autograd.Function`, causes unnecessary overheads.

Results of `python benchmarks/benchmark_latency.py` (opt-125m) on a single H100 GPU:

Next step: further reduce the unnecessary CPU ops inside the FlashAttention op.
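The description attributes the saved CPU time to bypassing the `torch.autograd.Function` wrapper. The toy benchmark below (not vLLM code; `WrappedOp` and the ReLU workload are stand-ins, and the real `varlen_fwd` argument list is intentionally not reproduced) sketches why calling the underlying computation directly can shave CPU time off each inference call: the `Function.apply` path does per-call autograd bookkeeping even when no gradients are needed.

```python
# Toy illustration of wrapper overhead: torch.autograd.Function.apply vs. a
# direct call to the underlying op, timed under no_grad (inference-style).
import time

import torch


class WrappedOp(torch.autograd.Function):
    """Stand-in for a FlashAttnFunc-style wrapper around a forward kernel."""

    @staticmethod
    def forward(ctx, x):
        return torch.nn.functional.relu(x)

    @staticmethod
    def backward(ctx, grad_out):
        raise NotImplementedError  # never reached in this inference-only sketch


def bench(fn, x, iters=10_000):
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters * 1e6  # microseconds per call


if __name__ == "__main__":
    x = torch.randn(64, 64)
    with torch.no_grad():
        wrapped_us = bench(lambda t: WrappedOp.apply(t), x)
        direct_us = bench(torch.nn.functional.relu, x)
    print(f"via autograd.Function: {wrapped_us:.1f} us/call")
    print(f"direct call:           {direct_us:.1f} us/call")
```

In the PR itself, the same idea is applied by calling the registered custom op `torch.ops.vllm_flash_attn_c.varlen_fwd` directly instead of going through `flash_attn_varlen_func`.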