[WIP][Spec Decode] Add multi-proposer support for variable and flexible speculative decoding #7947
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
🚀
@cadedaniel Do you have time to look at this PR and give me some advice? Also, I am investigating how to integrate LoRA and multi-step scheduling with spec decode. At this point, vLLM supports many optimization options, but they cannot work with each other at the same time, which is frustrating to some degree when it comes to production development.
I am afk for a few weeks unfortunately. cc @LiuXiaoxuanPKU @sroy745 @njhill vLLM spec decode experts
Hi, thanks for the contribution. Will take a look by the end of today.
Thanks for the great PR description; it's well motivated. After scanning through the PR, I have some questions:
1. For multi-step scheduling and LoRA, I feel we can disable those two features when multiple proposers are used, because vLLM might use async scheduling in the next 1-2 months.
2. I'm also curious whether you have any workloads/numbers that demonstrate the benefits of the multi-proposer method?
Thank you very much for your time.
The original motivation is to support different types of proposers for various situations. However, I believe it is possible that some users may train multiple draft models of the same type for different applications to match their data distributions. So the answer to the second question is yes: it is a legitimate but niche choice. As for the first question, I haven't refactored the
Sure. I have tested all additional-weight-required implementations (including the typical draft model, MLP, Medusa, and Eagle) with Ngram. They are all compatible; you can consider this proposed class. Why Ngram? I thought a lot about what would be most practical for users. We both know that training a draft model is not trivial. For most users, Ngram can be set by default without too many prerequisites, as we can see in (#5805).
Therefore, as a first step, I implemented this demo of supporting Ngram as a backup to pair up with another slower but more accurate proposer.
There is currently no public dataset that can support our experiments, so we collected some real requests and ran several experiments in an internal RAG chatbot application. Consider processing each request one by one: if RAG hits, we use Ngram; if not, we use a 0.5B draft model. With this policy, the average latency of all requests with MultiProposerWorker drops by ~16% compared to a pure NgramWorker and by ~40% compared to a pure MultiStepWorker. However, if requests needing different proposers arrive in batches, things get more complicated due to continuous batching, as I mentioned in the PR statement. Since we currently have only one target model instance and need to run the whole batch together, 'divide and conquer' might be a problem. So I think it would be better to discuss this with the community before I turn the demo into a proper PR.
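To make the routing policy above concrete, here is a minimal, hypothetical sketch (not code from this PR or vLLM) of the per-request proposer choice described in the RAG experiment; the names `ProposerKind` and `choose_proposer` are illustrative assumptions:

```python
# Hypothetical sketch of the routing policy described above: pick the cheap
# n-gram proposer when retrieved context is present, otherwise fall back to a
# small draft model. These names are illustrative, not part of vLLM or this PR.
from enum import Enum


class ProposerKind(Enum):
    NGRAM = "ngram"
    DRAFT_MODEL = "draft_model"  # e.g. a 0.5B draft model


def choose_proposer(rag_hit: bool) -> ProposerKind:
    """Route a request to a proposer based on whether RAG retrieval hit."""
    return ProposerKind.NGRAM if rag_hit else ProposerKind.DRAFT_MODEL


# A prompt built from retrieved passages reuses them via n-gram lookup;
# a free-form prompt goes through the draft model instead.
assert choose_proposer(rag_hit=True) is ProposerKind.NGRAM
assert choose_proposer(rag_hit=False) is ProposerKind.DRAFT_MODEL
```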
Hello @LiuXiaoxuanPKU. Any news since our last conversation? I think there are two possible ways to support dynamic speculative decoding with various proposers at runtime. The first is to handle engine-level or request-level scheduling gracefully, as we discussed before. The only tricky part is that when we switch from a proposer that does not require KV cache to one that does, we need to rerun the prefill phase of the new spec model for the requests that are already running. The other way is to merely provide an online API to make speculative decoding switchable, like LoRA, and leave the decision to the engine maintainer.
That way we can control whether to use spec decode, and which proposer to use, without shutting down the engine with 'CTRL+C'. The clean switch can happen once all previous requests have completed. I know you have been busy removing Batch Expansion. Great work. Let me know what you think of this PR when you are available.
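As a discussion aid, here is a minimal sketch of the second option: an engine-level switch that only takes effect once in-flight requests drain. None of these names exist in vLLM; they are assumptions for illustration only.

```python
# Hypothetical sketch of a runtime-switchable spec-decode proposer.
# The switch is applied only at a "clean" point, i.e. when no requests are
# running, so no in-flight sequence depends on the old proposer's KV cache.
from typing import Optional


class SpecDecodeSwitch:
    def __init__(self) -> None:
        self.active_proposer: Optional[str] = None   # e.g. "[ngram]" or a draft model name
        self.pending_proposer: Optional[str] = None
        self.num_running_requests = 0

    def request_switch(self, proposer: Optional[str]) -> None:
        """Record the desired proposer (None disables spec decode)."""
        self.pending_proposer = proposer
        self._maybe_apply()

    def on_request_started(self) -> None:
        self.num_running_requests += 1

    def on_request_finished(self) -> None:
        self.num_running_requests -= 1
        self._maybe_apply()

    def _maybe_apply(self) -> None:
        # Clean switch point: nothing is running, so swapping proposers is safe.
        if self.num_running_requests == 0 and self.pending_proposer != self.active_proposer:
            self.active_proposer = self.pending_proposer
```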
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: ShangmingCai <csmthu@gmail.com>
This PR plans to add multi-proposer support for speculative decoding, as mentioned in (#6300). With this feature, varying scheduling policies could be applied:

- Each request can carry a `SpecDecodeParams`. The engine maintainer can detect and choose a suitable proposer to handle specific requests (see the sketch after the change list below).
- `SpecDecodeParams` is similar to `LoRARequest`; we need it if we want each request to have such flexibility when using spec decode. Maybe `num_speculative_tokens` can be moved from `SequenceGroupMetadata` to `SpecDecodeParams` so that we can provide speculation-length scheduling for each request in the future.
- I do not want `SpecDecodeParams` to cause more metadata overhead, though. I am not sure which scheduling granularity is best for spec decode yet; it could depend on the use case and whether one backend serves different apps.

Since NGram is a lightweight implementation that can be set by default without too many prerequisites, this PR uses it to implement a multi-proposer demo for now. More flexible choices will be added in the future.

Changed Code:

- Add `SpecDecodeParams`.
- Change the way the Ngram proposer is specified from `ngram_prompt_lookup_max > 0` to a model name.
- Add `MultiProposerWorker` and support the NGram proposer as a backup to pair up with another slower but more accurate proposer.

This PR is still a work in progress. Tests have not been added yet.
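For discussion purposes, here is a minimal sketch of what a per-request `SpecDecodeParams` could look like, mirroring how a `LoRARequest` travels with a request. The field names are illustrative assumptions, not the exact definitions in this PR:

```python
# Sketch of a per-request spec-decode parameter container, analogous to LoRARequest.
# Field names are illustrative guesses based on the PR description, not the PR's code.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpecDecodeParams:
    # Which proposer should handle this request, e.g. "[ngram]" or a draft model
    # name; None could mean "let the engine decide" or "no speculative decoding".
    proposer_name: Optional[str] = None
    # Moving this out of SequenceGroupMetadata would allow per-request
    # speculation-length scheduling later on.
    num_speculative_tokens: int = 0


# Usage idea: attach the params to a request, much like a LoRARequest is attached.
params = SpecDecodeParams(proposer_name="[ngram]", num_speculative_tokens=4)
```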
@cadedaniel Can you tell me your opinion when you have time to check on this? I think the design details should be determined after discussion.