[Speculative Decoding] EAGLE Implementation with Top-1 proposer #6830
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
🚀 |
/ready |
For some reason, EAGLE seems to have removed the input_layernorm: https://github.com/SafeAILab/EAGLE/blob/main/eagle/model/cnets.py#L419
Yes, I saw that recently. But making that change would mean either changing the decoder layer to make the input layernorm optional or rewriting the decoder layer just for EAGLE. Both options would reduce the freedom to use any decoder with EAGLE.
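To make the first option concrete, here is a minimal, framework-free sketch of a decoder layer whose input normalization can be switched off. It is not vLLM's decoder layer: `ToyDecoderLayer`, `rms_norm`, and the `use_input_norm` flag are hypothetical names chosen for illustration, and the "mixer" stands in for attention plus MLP.

```python
import math
from typing import Callable, List, Optional

Vector = List[float]


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Scale a vector by its root-mean-square (no learned weight in this toy)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]


class ToyDecoderLayer:
    """Hypothetical decoder layer with an optional input norm (illustration only)."""

    def __init__(self, mixer: Callable[[Vector], Vector],
                 use_input_norm: bool = True) -> None:
        self.mixer = mixer  # stands in for attention + MLP
        # When the flag is off, the layer skips normalization entirely.
        self.input_norm: Optional[Callable[[Vector], Vector]] = (
            rms_norm if use_input_norm else None
        )

    def forward(self, hidden: Vector) -> Vector:
        residual = hidden
        if self.input_norm is not None:
            hidden = self.input_norm(hidden)
        mixed = self.mixer(hidden)
        # Residual connection around the (optionally normalized) mixer.
        return [r + m for r, m in zip(residual, mixed)]


def identity(x: Vector) -> Vector:
    return list(x)


with_norm = ToyDecoderLayer(identity, use_input_norm=True)
without_norm = ToyDecoderLayer(identity, use_input_norm=False)
```

The trade-off raised in the comment still applies: every decoder class that EAGLE wraps would need to accept such a flag, which is what limits the choice of decoder.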
Thanks for the PR. Left some comments. PTAL
@sroy745 Thanks for the review. I've made the changes and responded to your comments. PTAL
Will review tomorrow!
…th `worker.worker_base.LocalOrDistributedWorkerBase`
sorry, need 1 more day to finish review. partial comments below.
self._seq_ids = seq_ids

def expand_with_bonus_tokens(
(not blocking for this PR): these data structures, which require torch operations, should not live in sequence; they should go under spec_decode
LGTM! let's merge once tests are passing
Thanks @cadedaniel for reviewing and approving. This is ready to merge!
Do you have the steps for creating the checkpoint, @abhigoyal1997?
python3 benchmarks/benchmark_latency.py --model Qwen/Qwen2-7B-Instruct --speculative-model yuhuili/EAGLE-Qwen2-7B-Instruct --num_speculative_tokens 4 --use-v2-block-manager --batch-size 1 --input-len 1024 --output-len 128 --max-model-len 2048
I used the command above to run the latest EAGLE speculative decoding PR with Qwen2-7B. The EAGLE model was converted with the conversion script from the comments. After running, I found that the output generated with speculative decoding is inconsistent with the output generated without it. After examining the source code, I suspect there may be an issue in how input_ids is handled, because the vLLM implementation differs from the EAGLE library implementation.
Even in vLLM, the first token of the target model is present in input_ids. The target model generates the first token in the prefill step, that token is added to input_ids, and EAGLE only starts generating tokens in the subsequent decode step. As for the masking in the forward pass, it masks the first input token, not any token output by the target model, so it makes no difference to the outputs. As for why you are seeing inconsistency: if you are using 16-bit precision, could it be related to this: #4978 (comment)?
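The flow described above (target prefill first, drafting only in later decode steps) can be sketched with a toy top-1 proposer and greedy verification. This is an illustration under strong assumptions, not vLLM's or EAGLE's code: `target_next` and `draft_next` are made-up deterministic stand-ins for the two models, and verification calls the toy target once per position, where a real system would score all draft tokens in a single forward pass.

```python
from typing import List

Token = int
VOCAB = 50  # toy vocabulary size (assumption)


def target_next(tokens: List[Token]) -> Token:
    """Toy 'target model': greedy next token as a function of the context."""
    return (sum(tokens) * 31 + len(tokens)) % VOCAB


def draft_next(tokens: List[Token]) -> Token:
    """Toy 'draft model': agrees with the target except on some contexts."""
    guess = target_next(tokens)
    return guess if len(tokens) % 3 else (guess + 1) % VOCAB


def greedy_decode(prompt: List[Token], num_new: int) -> List[Token]:
    """Reference: decode with the target model only."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_decode(prompt: List[Token], num_new: int, k: int) -> List[Token]:
    tokens = list(prompt)
    # Prefill: the target produces the first token, which joins the context
    # before the draft model proposes anything.
    tokens.append(target_next(tokens))
    while len(tokens) - len(prompt) < num_new:
        # Top-1 proposal: the draft model emits k tokens, one per step.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # Greedy verification: keep draft tokens while they match the target.
        accepted: List[Token] = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: use the target's token
                break
            accepted.append(tok)
        else:
            # All k draft tokens accepted: the target contributes a bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + num_new]


prompt = [1, 2, 3]
assert speculative_decode(prompt, 20, k=4) == greedy_decode(prompt, 20)
```

In this toy, every emitted token equals the target's greedy choice for its prefix, so the two outputs match exactly. With real models in 16-bit precision, small numerical differences between batched and unbatched computation can change an argmax, which is the kind of divergence the linked comment refers to.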
Is the gist in this comment helpful? vllm/vllm/model_executor/models/eagle.py Line 127 in 80c7b08
Sorry, I may not have expressed this issue clearly. I noticed that when using speculative decoding with EAGLE in vLLM with the same prompt and the same models, with top_k=1 and temperature=0.5, the output is inconsistent with the official EAGLE implementation. I will provide my test cases later to help reproduce this issue.
Great work. Do you have any plans to implement tree decoding? It seems that tree decoding will be very important for improving the results.
Do you have any plans to support scenarios where tp > 1? @abhigoyal1997
Is it possible to reproduce the ~2x-3x speedup reported in the EAGLE 1/2 papers with this PR in vLLM?
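As rough context for the speedup question, the standard speculative decoding estimate relates speedup to the draft acceptance rate. The sketch below assumes each draft token is accepted independently with probability `alpha`, that `k` tokens are proposed per step, and that one draft step costs `draft_cost_ratio` of a target step. All three values are hypothetical inputs, not measurements from this PR, and the model ignores scheduling and batching overheads.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with k draft tokens.

    Assumes i.i.d. acceptance with probability alpha; the sum
    1 + alpha + ... + alpha**k includes the target's own token.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Idealized speedup over plain decoding, ignoring all other overheads."""
    cost_per_step = 1.0 + k * draft_cost_ratio  # one target pass + k draft passes
    return expected_tokens_per_step(alpha, k) / cost_per_step


# Hypothetical acceptance rates, for illustration only.
for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(estimated_speedup(alpha, k=4, draft_cost_ratio=0.05), 2))
```

Under this model, a chain (top-1) proposer needs a fairly high acceptance rate to reach the 2x-3x range, which is consistent with the interest in tree decoding expressed earlier in the thread.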
…-project#6830) Signed-off-by: Alvant <alvasian@yandex.ru>
I tried to use EAGLE with Llama but failed. I would like to know how to use EAGLE in vLLM; it is very hard to use without a demo.
Hi @xiongqisong, which checkpoint are you using as the draft model? Is it one of the checkpoints available here? If so, it will not work, because the checkpoint vLLM needs is a bit different from what is available at https://huggingface.co/yuhuili. You need to convert the checkpoint from https://huggingface.co/yuhuili using the script here and use the converted checkpoint as the draft model in vLLM. Please let us know whether this works for you. I think @LiuXiaoxuanPKU recently used the script to convert the checkpoint for yuhuili/EAGLE-LLaMA3-Instruct-70B into a vLLM-compatible checkpoint, and it worked for her. I will add a section on how to use EAGLE to the sd documentation here shortly. cc: @LiuXiaoxuanPKU
Thanks for the reply. I already used the script to convert the EAGLE model weights to the vLLM format, but vLLM still can't run EAGLE properly. I shared the details in #11126; I hope you have time to help me, @sroy745.
Can someone help me understand the purpose of vllm/vllm/spec_decode/multi_step_worker.py Lines 86 to 91 in 289b519?
I can't understand why it copies a new model seq without the bonus token and sends it to execute_model. Finally, at vllm/vllm/spec_decode/multi_step_worker.py Lines 126 to 127 in 289b519
this function selects the original seq. What difference does the previously created model seq without the bonus token make? @abhigoyal1997
This PR adds support for the EAGLE draft model.
Fix SafeAILab/EAGLE#43