FP8 KVCache Quantization #575
Hi,
Thanks for the amazing project.
I noticed the vector-wise KVCache quantization technique used in this project, which may reduce memory usage and also increase throughput. I have tried FP8 KVCache quantization for Qwen on vLLM, which may cost fewer cycles, and the results look promising; they are posted here.

Comments

I think we can use it directly once it is merged into vLLM. What would you like us to do for this at the moment?

Hi @ZiyueHuang, are we able to use your cool work (FP8 K-V cache) inside vLLM now? Thanks

@leocnj Thanks for your interest. This feature is not merged into vLLM. You can use it by installing from source (…).
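For concreteness, here is a minimal PyTorch sketch of what vector-wise FP8 KVCache quantization means: one scale per key/value vector. This is not the fork's actual implementation (which would live inside the attention kernels); the function names are illustrative, and it assumes PyTorch >= 2.1 for the `float8_e5m2` dtype.

```python
# Minimal sketch of vector-wise FP8 KV-cache quantization.
# Assumptions (not from the thread's fork): PyTorch >= 2.1 for
# torch.float8_e5m2; all function and variable names are illustrative.
import torch

FP8_E5M2_MAX = 57344.0  # largest finite float8_e5m2 value

def quantize_kv_fp8(kv: torch.Tensor):
    """Quantize a [num_tokens, num_heads, head_dim] tensor along the last dim.

    One scale per (token, head) vector, so an outlier in one vector
    does not cost precision in any other vector.
    """
    amax = kv.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8)
    scale = FP8_E5M2_MAX / amax.float()
    kv_fp8 = (kv.float() * scale).to(torch.float8_e5m2)
    return kv_fp8, scale  # keep the scales to dequantize at attention time

def dequantize_kv_fp8(kv_fp8: torch.Tensor, scale: torch.Tensor,
                      dtype=torch.float16) -> torch.Tensor:
    # Cast back up and undo the per-vector scale before the attention matmul.
    return (kv_fp8.float() / scale).to(dtype)

# Round-trip a fake KV block and inspect the quantization error.
kv = torch.randn(4, 8, 128, dtype=torch.float16)
q, s = quantize_kv_fp8(kv)
err = (dequantize_kv_fp8(q, s) - kv).abs().max()
print(f"max abs round-trip error: {err.item():.4f}")
```

Keeping one scale per vector is what "vector-wise" buys over a single per-tensor scale: an outlier in one head's vector does not inflate the quantization step for every other vector.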
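As context for the install-from-source comment above: vLLM releases that later merged FP8 KV-cache support expose it through an engine argument. The snippet below is illustrative only and assumes a recent vLLM; the model name and the exact dtype string (`"fp8_e5m2"` here) depend on the version you have installed.

```python
# Illustrative only: in vLLM versions that later added FP8 KV-cache
# support, it is enabled via the kv_cache_dtype engine argument.
# Model name and dtype string are assumptions; check your vLLM
# version's documentation for the supported values.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen-7B", trust_remote_code=True,
          kv_cache_dtype="fp8_e5m2")
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```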