[Roadmap] vLLM Roadmap Q4 2024 #9006
Comments
Support for KV cache compression
|
Do we have plans to support #5540? We have a production-level use case and would really appreciate it if someone could look into it from Q4 onwards. |
Hi, is there a follow-up issue or Slack channel for the "KV cache offload to CPU and disk" task? Our team has previously explored KV cache offloading based on vLLM, and we’d be happy to join any relevant discussion or contribute to the development if there is a chance. Personally, I am also looking forward to learning more about the "More control in prefix caching, and scheduler policies" part 😊. |
@simon-mo Hi, regarding the “KV cache offload to CPU and disk” topic: I previously implemented a version that stores the KV cache in a local file (#8018). I also added the relevant abstractions, so other storage media can be added. Is there a Slack channel for this? We could discuss the specific design there. I am also quite interested in this feature. |
@sylviayangyy @zeroorhero thank you for your interest! Yes, @KuntaiDu has created the #feat-kvcache-offloading channel to discuss that. |
It looks like LoRA is now supported. Are you encountering any issues? |
Any plans on improving guided decoding? There's a long-standing RFC for it (#5423) and previous attempts have been made (e.g. #6273). Unfortunately, it seems to have been forgotten since. In particular, I'd love to see it become async (the logit mask or biases can be calculated while the GPU is working on the logits) and to see it fast-forward tokens when the next few tokens are deterministic. |
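The async idea in the comment above can be illustrated with a toy sketch: the allowed-token mask is computed on a CPU worker thread while the forward pass runs, and the two results are combined afterwards. Everything here is hypothetical (`fake_forward`, `compute_mask`, the tiny vocabulary); it is an illustration under those assumptions, not vLLM's API or implementation.

```python
from concurrent.futures import ThreadPoolExecutor
import math

VOCAB_SIZE = 8  # toy vocabulary size, for illustration only


def fake_forward(token_ids):
    """Hypothetical stand-in for the GPU forward pass: one logit per token id."""
    return [float((sum(token_ids) + i) % VOCAB_SIZE) for i in range(VOCAB_SIZE)]


def compute_mask(token_ids):
    """Hypothetical stand-in for CPU grammar work: only even token ids are allowed."""
    return [i % 2 == 0 for i in range(VOCAB_SIZE)]


def apply_mask(logits, mask):
    """Set disallowed logits to -inf so they can never be selected."""
    return [x if ok else -math.inf for x, ok in zip(logits, mask)]


def guided_step(executor, token_ids):
    # Submit the mask computation first so it can overlap with the forward pass.
    mask_future = executor.submit(compute_mask, token_ids)
    logits = fake_forward(token_ids)
    masked = apply_mask(logits, mask_future.result())
    # Greedy choice among the allowed tokens.
    return max(range(VOCAB_SIZE), key=lambda i: masked[i])


with ThreadPoolExecutor(max_workers=1) as pool:
    next_token = guided_step(pool, [1, 2, 3])
```

In this toy both functions are pure Python, so the overlap is limited by the interpreter lock; the pattern pays off when the forward pass runs on an accelerator and returns control to the CPU while it computes.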
Is there an opportunity to participate in changes related to speculative decoding? I'm working on some practices that should help. |
I second this. We are using vLLM to host our production inference servers, and all of our downstream applications rely on guided JSON decoding to ensure that output is parsable. There is a significant performance difference between guided and non-guided decoding, and any performance improvement would help increase throughput. |
Hey, I maintain the guidance project and we worked on the first proposal in #6273. It looks like vLLM has changed significantly since then, but if there is appetite from the maintainers for upgraded, more performant guided decoding work, we're happy to take another look and investigate a new PR. In particular, guidance (and our high-performance Rust implementation in llguidance) already does async computations on the CPU, calculates fast-forward tokens, etc., and is typically accelerative for JSON schema. |
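As a rough illustration of fast-forward tokens, the sketch below appends every token that the grammar fully determines without calling the model, and samples only at positions where a choice exists. The template-based "grammar", `forced_tokens`, and `generate` are hypothetical names made up for this example; they do not reflect how guidance, llguidance, or vLLM implement it.

```python
FREE = None  # marks a position where the model chooses the token


def forced_tokens(template, position):
    """Return the run of tokens starting at `position` that the template fixes."""
    run = []
    while position < len(template) and template[position] is not FREE:
        run.append(template[position])
        position += 1
    return run


def generate(template, sample_token):
    """Fill the template, calling `sample_token` only at FREE positions."""
    output, model_calls = [], 0
    while len(output) < len(template):
        run = forced_tokens(template, len(output))
        if run:
            output.extend(run)  # fast-forward: no model call needed
        else:
            output.append(sample_token(output))
            model_calls += 1
    return output, model_calls


# A JSON-like skeleton with two free value slots.
template = ["{", '"name"', ":", FREE, ",", '"age"', ":", FREE, "}"]
output, model_calls = generate(template, sample_token=lambda prefix: "<value>")
```

Here nine tokens are produced with only two sampling calls, which is the kind of saving fast-forwarding aims for on schema-heavy output.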
Yes. The class in mixtral_quant.py does not include SupportsLora, which means LoRA is not supported for quantized Mixtral, whereas MixtralForCausalLM in mixtral.py does include SupportsLora. I have a trained LoRA adapter that I want to use on top of a mixtral-awq model without merging, directly as a hot swap. Let me know if you know a better way to tackle this situation. |
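I'm guessing you explicitly set the

```python
from vllm import LLM

# Load the GPTQ-quantized Mixtral checkpoint with LoRA support enabled.
llm = LLM(
    model="Mixtral-8x7B-Instruct-v0.1-GPTQ",
    trust_remote_code=True,
    gpu_memory_utilization=0.6,
    enable_lora=True,
)
```
|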
Improvements in guided generation performance would be very welcome. There is a helpful comment by @stas00 from last month with a nice summary of where things currently stand. |
I tried this, but it does not work; I get the same error. Note that I am using an AWQ-quantized model. |
Which vLLM version are you using? According to the code in https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/model_executor/model_loader/utils.py#L30, both GPTQ and AWQ quantization methods should be compatible when using version 0.6.3.post1. |
Any interest in vAttention? |
More and more speech models use an LLM to predict non-text tokens. ChatTTS and FishTTS, for example, both use a Llama model to predict speech tokens. |
Do we have plans to improve concurrency performance for guided decoding? Enhancements in concurrency performance for guided decoding would greatly benefit high-volume, real-time applications. |
Quick update -- we've made an initial PR to support guidance as a backend, which does improve performance over current implementations (#10217). Of course, better support for concurrency in general would also help guidance get significantly faster. Happy to support there and help if we can too! |
I am interested in optimizations related to speculative decoding. Is there an opportunity to get involved? |
I have a somewhat similar question to @wanghongyu2001: if someone is interested in contributing to a specific aspect of vLLM, what’s the recommended path to get involved? Specifically, are there any suggested learning resources for systematically understanding the vLLM codebase and, in particular, the v1 architecture? Besides navigating the codebase, are there other structured ways to ramp up, such as design docs, suggested YouTube videos (in case I missed anything), or important PRs and files worth reading through? I would be thrilled to dive in and contribute to the project. Any guidance would be much appreciated! |
We could definitely use a thread pool to process the list of logits in parallel. Because vLLM can run a different number of logits processors for each sequence's logits in a batch, a batched logits processor seems complex to implement.
Also, I think vLLM is capable of providing multiple output tokens per sequence per step; we could leverage that for fast-forwarded tokens in JSON guided generation (super beneficial for performance). |
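The thread-pool suggestion above could look roughly like the following sketch, in which each sequence carries its own (possibly empty) list of logits processors and the per-sequence work is mapped over a pool. The processor factories `ban_token` and `add_bias` and the plain-list logits are assumptions made for illustration; this is not vLLM's logits-processor interface.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def ban_token(token_id):
    """Hypothetical processor factory: forbid one token id."""
    def processor(logits):
        out = list(logits)
        out[token_id] = -math.inf
        return out
    return processor


def add_bias(token_id, bias):
    """Hypothetical processor factory: add a bias to one token id."""
    def processor(logits):
        out = list(logits)
        out[token_id] += bias
        return out
    return processor


def run_processors(job):
    """Apply one sequence's processors, in order, to that sequence's logits."""
    logits, processors = job
    for processor in processors:
        logits = processor(logits)
    return logits


# Three sequences with one, zero, and two processors respectively.
batch_logits = [[0.1, 0.5, 0.2], [0.3, 0.1, 0.9], [0.4, 0.4, 0.4]]
batch_processors = [[ban_token(1)], [], [add_bias(2, 1.0), ban_token(0)]]

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves input order, so results line up with the batch.
    processed = list(pool.map(run_processors, zip(batch_logits, batch_processors)))
```

Threads only give real parallelism here if the processors spend their time in code that releases the interpreter lock (for example native extensions); pure-Python processors like these toys would still run mostly one at a time.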
Interested in thoughts/plan on EXL2 support: #3203 |
Integrating xgrammar could be a good choice: https://github.com/mlc-ai/xgrammar |
This page is accessible via roadmap.vllm.ai
Themes.
As before, we categorized our roadmap into 6 broad themes: broad model support, wide hardware coverage, state-of-the-art performance optimization, production-level engine, strong OSS community, and extensible architectures. As we are seeing more
Broad Model Support
Help wanted:
BERTModel (first encoder-only embedding model) #9056
Hardware Support
Help wanted:
Performance Optimizations
Help wanted:
Production Features
Help wanted
OSS Community
Help wanted
Extensible Architecture
If any of the items you wanted are not on the roadmap, your suggestions and contributions are still welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.
Historical Roadmap: #5805, #3861, #2681, #244