[Feature]: Integrate flash-infer FP8 KV Cache Chunked-Prefill (Append Attention) #7450
Comments
Actually, @comaniac, I noticed that there are explicit asserts forbidding the use of flash-infer kernels for chunked prefill: vllm/vllm/attention/backends/flashinfer.py, line 195 in 774cd1d.
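For context, here is a minimal sketch of the kind of guard being referred to. The names below (`SchedulerConfig`, `validate_backend`) are made up for illustration; this is not the actual vLLM code at that line.

```python
# Hypothetical sketch of a backend guard that rejects chunked prefill.
# This is NOT the actual vLLM code; it only illustrates the kind of
# assert referenced above in flashinfer.py.
from dataclasses import dataclass


@dataclass
class SchedulerConfig:
    chunked_prefill_enabled: bool = False


def validate_backend(backend_name: str, scheduler_config: SchedulerConfig) -> None:
    """Reject configurations the flash-infer backend does not yet support."""
    if backend_name == "flashinfer":
        # Chunked prefill mixes prefill and decode tokens in one batch,
        # which a two-kernel (prefill + decode) code path cannot handle,
        # hence an explicit assert.
        assert not scheduler_config.chunked_prefill_enabled, (
            "Chunked prefill is not supported with the FlashInfer backend yet.")


# Example: this raises an AssertionError.
# validate_backend("flashinfer", SchedulerConfig(chunked_prefill_enabled=True))
```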
As pointed out in flashinfer-ai/flashinfer#392 (comment), my understanding is that this is because vLLM runs prefill and decode in two separate kernel invocations by default (as is the case for flash-attention; see #6052), and this applies to flash-infer as well. Perhaps the first step is to unify the flash-infer code path to use a single kernel, similar to #6052, or at least to clarify in which scenarios it is OK to run flash-infer kernels for chunked prefill, because according to @yzh119 in flashinfer-ai/flashinfer#392, flash-infer should already support this.
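To make the single-kernel idea more concrete, here is a rough sketch under the assumption that a unified append-attention call consumes packed per-request index pointers. `build_unified_metadata` and the exact metadata layout are hypothetical, not vLLM's or flash-infer's actual API.

```python
# Illustrative sketch (not vLLM/flash-infer code): building the metadata for a
# single "append attention" call that covers both chunked-prefill and decode
# requests in one batch. qo_indptr marks query-token boundaries per request;
# kv_lens is the number of cached KV tokens each request attends to.
import itertools

import torch


def build_unified_metadata(prefill_query_lens, prefill_kv_lens, decode_kv_lens):
    # Decode requests contribute exactly one new query token each.
    query_lens = list(prefill_query_lens) + [1] * len(decode_kv_lens)
    kv_lens = list(prefill_kv_lens) + list(decode_kv_lens)

    # Exclusive prefix sum over query lengths -> per-request offsets into the
    # packed [total_query_tokens, num_heads, head_dim] query tensor.
    qo_indptr = torch.tensor([0] + list(itertools.accumulate(query_lens)),
                             dtype=torch.int32)
    kv_lens_t = torch.tensor(kv_lens, dtype=torch.int32)
    return qo_indptr, kv_lens_t


# Two chunked-prefill requests (8 and 4 new tokens) plus three decode requests.
qo_indptr, kv_lens = build_unified_metadata(
    prefill_query_lens=[8, 4], prefill_kv_lens=[32, 16],
    decode_kv_lens=[64, 128, 40])
print(qo_indptr)  # tensor([0, 8, 12, 13, 14, 15], dtype=torch.int32)
print(kv_lens)    # tensor([32, 16, 64, 128, 40], dtype=torch.int32)
```

With metadata like this, one batched append-attention invocation could serve the whole mixed prefill/decode batch, which is roughly what #6052 did for the flash-attention backend.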
Anyway, please assign it to me; I will investigate further.
We are already working on this. cc @Yard1
@comaniac Any updates or open PRs on this that we can take a look at?
@comaniac Any updates?
🚀 The feature, motivation and pitch
From the new FlashInfer release: https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.1.4
cc @comaniac
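For reference, here is a rough sketch of the storage-side idea behind an FP8 KV cache, assuming simple per-tensor symmetric scaling; the scheme and helper names are illustrative, not flash-infer's or vLLM's actual implementation.

```python
# Sketch: per-tensor FP8 (e4m3) quantization of a KV-cache block, assuming
# simple symmetric scaling. This only illustrates what storing the KV cache in
# FP8 involves; it is not flash-infer's or vLLM's quantization code.
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def quantize_kv_fp8(kv: torch.Tensor):
    """Quantize a KV tensor to float8_e4m3fn with one per-tensor scale."""
    scale = kv.abs().max().clamp(min=1e-6) / FP8_E4M3_MAX
    kv_fp8 = (kv / scale).to(torch.float8_e4m3fn)
    return kv_fp8, scale


def dequantize_kv_fp8(kv_fp8: torch.Tensor, scale: torch.Tensor):
    """Recover an approximate fp16 KV tensor for use inside attention."""
    return kv_fp8.to(torch.float16) * scale


# [2 (K/V), tokens, heads, head_dim]
kv = torch.randn(2, 16, 8, 64, dtype=torch.float16)
kv_fp8, scale = quantize_kv_fp8(kv)
kv_approx = dequantize_kv_fp8(kv_fp8, scale)
print(kv_fp8.dtype, scale.item(), (kv - kv_approx).abs().max().item())
```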
Additional context
Follow-up to: #7208, #7185