[Feature]: Integrate flash-infer FP8 KV Cache Chunked-Prefill (Append Attention) #7450
Comments
Actually, @comaniac, I noticed that there are explicit asserts forbidding the use of flash-infer kernels for chunked prefill: vllm/vllm/attention/backends/flashinfer.py, line 195 in 774cd1d.
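For context, here is a minimal sketch of the kind of guard being referred to. The names below (`SchedulerConfig`, `validate_backend`) are made up for illustration; this is not the actual vLLM code at that line.

```python
# Hypothetical sketch of a backend guard that rejects chunked prefill.
# This is NOT the actual vLLM code; it only illustrates the kind of
# assert referenced above in flashinfer.py.
from dataclasses import dataclass


@dataclass
class SchedulerConfig:
    chunked_prefill_enabled: bool = False


def validate_backend(backend_name: str, scheduler_config: SchedulerConfig) -> None:
    """Reject configurations the flash-infer backend does not yet support."""
    if backend_name == "flashinfer":
        # Chunked prefill mixes prefill and decode tokens in one batch,
        # which a two-kernel (prefill + decode) code path cannot handle,
        # hence an explicit assert.
        assert not scheduler_config.chunked_prefill_enabled, (
            "Chunked prefill is not supported with the FlashInfer backend yet.")


# Example: this raises an AssertionError.
# validate_backend("flashinfer", SchedulerConfig(chunked_prefill_enabled=True))
```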
As pointed out in flashinfer-ai/flashinfer#392 (comment), my understanding is that this is because vLLM runs prefill and decode in two separate kernel invocations by default (as is the case for flash-attention; see #6052), and this applies to flash-infer as well. Perhaps the first step is to unify the flash-infer code path to use a single kernel, similar to #6052, or at least to clarify in which scenarios it is OK to run flash-infer kernels for chunked prefill, because according to @yzh119 in flashinfer-ai/flashinfer#392, flash-infer should already support this.
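To make the single-kernel idea more concrete, here is a rough sketch under the assumption that a unified append-attention call consumes packed per-request index pointers. `build_unified_metadata` and the exact metadata layout are hypothetical, not vLLM's or flash-infer's actual API.

```python
# Illustrative sketch (not vLLM/flash-infer code): building the metadata for a
# single "append attention" call that covers both chunked-prefill and decode
# requests in one batch. qo_indptr marks query-token boundaries per request;
# kv_lens is the number of cached KV tokens each request attends to.
import itertools

import torch


def build_unified_metadata(prefill_query_lens, prefill_kv_lens, decode_kv_lens):
    # Decode requests contribute exactly one new query token each.
    query_lens = list(prefill_query_lens) + [1] * len(decode_kv_lens)
    kv_lens = list(prefill_kv_lens) + list(decode_kv_lens)

    # Exclusive prefix sum over query lengths -> per-request offsets into the
    # packed [total_query_tokens, num_heads, head_dim] query tensor.
    qo_indptr = torch.tensor([0] + list(itertools.accumulate(query_lens)),
                             dtype=torch.int32)
    kv_lens_t = torch.tensor(kv_lens, dtype=torch.int32)
    return qo_indptr, kv_lens_t


# Two chunked-prefill requests (8 and 4 new tokens) plus three decode requests.
qo_indptr, kv_lens = build_unified_metadata(
    prefill_query_lens=[8, 4], prefill_kv_lens=[32, 16],
    decode_kv_lens=[64, 128, 40])
print(qo_indptr)  # tensor([0, 8, 12, 13, 14, 15], dtype=torch.int32)
print(kv_lens)    # tensor([32, 16, 64, 128, 40], dtype=torch.int32)
```

With metadata like this, one batched append-attention invocation could serve the whole mixed prefill/decode batch, which is roughly what #6052 did for the flash-attention backend.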
Anyway, please assign it to me; I will investigate further.
We are already working on this. cc @Yard1
@comaniac Any updates or open PRs on this that we can take a look at?
@comaniac Any updates?
🚀 The feature, motivation and pitch
From the new FlashInfer release: https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.1.4
cc @comaniac
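For reference, here is a rough sketch of the storage-side idea behind an FP8 KV cache, assuming simple per-tensor symmetric scaling; the scheme and helper names are illustrative, not flash-infer's or vLLM's actual implementation.

```python
# Sketch: per-tensor FP8 (e4m3) quantization of a KV-cache block, assuming
# simple symmetric scaling. This only illustrates what storing the KV cache in
# FP8 involves; it is not flash-infer's or vLLM's quantization code.
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def quantize_kv_fp8(kv: torch.Tensor):
    """Quantize a KV tensor to float8_e4m3fn with one per-tensor scale."""
    scale = kv.abs().max().clamp(min=1e-6) / FP8_E4M3_MAX
    kv_fp8 = (kv / scale).to(torch.float8_e4m3fn)
    return kv_fp8, scale


def dequantize_kv_fp8(kv_fp8: torch.Tensor, scale: torch.Tensor):
    """Recover an approximate fp16 KV tensor for use inside attention."""
    return kv_fp8.to(torch.float16) * scale


# [2 (K/V), tokens, heads, head_dim]
kv = torch.randn(2, 16, 8, 64, dtype=torch.float16)
kv_fp8, scale = quantize_kv_fp8(kv)
kv_approx = dequantize_kv_fp8(kv_fp8, scale)
print(kv_fp8.dtype, scale.item(), (kv - kv_approx).abs().max().item())
```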
Additional context
Follow-up to: #7208, #7185