[Q&A] Any plans for different dtypes for Q (query) and KV (kv-cache)? #285
Comments
Thanks for your suggestions. Sure, I think we can definitely support it, and using fp8 for kv-cache and fp16 for q sounds reasonable to me. I'll separate `DTypeIn` into `DTypeQ` and `DTypeKV` in the kernel implementations, and the Python APIs don't have to change.
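To illustrate the idea of separate query and KV-cache dtypes, here is a minimal NumPy sketch of single-query (decode) attention. It is not FlashInfer's kernel or API: `decode_attention` is a hypothetical helper, and because NumPy has no fp8 type, `float16` stands in for the narrower KV-cache dtype and `float32` for the query dtype. The only point it shows is that both inputs can be upcast to a common accumulation dtype, so they do not need to match in storage.

```python
import math

import numpy as np


def decode_attention(q, k_cache, v_cache):
    """Single-query attention where q and the KV cache may use different dtypes.

    Hypothetical helper for illustration only (not a FlashInfer function).
    q:        (head_dim,)          stored in the "query" dtype
    k_cache:  (seq_len, head_dim)  stored in the "KV" dtype
    v_cache:  (seq_len, head_dim)  stored in the "KV" dtype
    """
    # Upcast everything to float32 for accumulation, whatever the storage dtypes are.
    q32 = q.astype(np.float32)
    k32 = k_cache.astype(np.float32)
    v32 = v_cache.astype(np.float32)

    # Scaled dot-product scores, one per cached position.
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = (k32 @ q32) * scale

    # Numerically stable softmax over the sequence dimension.
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights = weights / weights.sum()

    # Weighted sum of values, returned in the query's dtype.
    return (weights @ v32).astype(q.dtype)


# Stand-in dtypes: float32 query, float16 KV cache.
q = np.ones(4, dtype=np.float32)
k_cache = np.zeros((2, 4), dtype=np.float16)
v_cache = np.array([[1, 1, 1, 1], [3, 3, 3, 3]], dtype=np.float16)
out = decode_attention(q, k_cache, v_cache)
```

With all-zero keys the attention weights are uniform, so `out` is the mean of the two value rows, and its dtype follows the query rather than the cache.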
Seconding this - I was actually thinking of submitting a PR myself. @yzh119 let me know if you need any help on this (from what I can tell, it should be quite straightforward). Semi-related, can we expect fp8 support for prefill any time soon? How complicated would it be to add that?
Sounds good, I would really appreciate your help!
Yes, we are in the last step of dealing with transposed ldmatrix for fp8 (for the V matrix). It should be available soon :)
Ok, let me see if I can get a PR going this week!
Hi, All! This is just a question of whether there are such plans or not...

Right now, Flashinfer lib requires Q (query) and KV (kv-cache) to have the same dtype.

Just an example from the code: `q` and `paged_kv` have the same `DTypeIn`.

Are there any plans to support different dtypes for KV-cache and Q (query)?

My personal interest is `fp8` for kv-cache and `fp16` for query.

Thank you in advance!
cc @yzh119