Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Speculative Decoding] EAGLE Implementation with Top-1 proposer #6830

Merged
merged 54 commits into from
Aug 22, 2024

Conversation

abhigoyal1997
Copy link
Contributor

@abhigoyal1997 abhigoyal1997 commented Jul 26, 2024

This PR adds support for the EAGLE draft model.

Fix SafeAILab/EAGLE#43

Copy link

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which consists a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of default ones by unblocking the steps in your fast-check build on Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@abhigoyal1997 abhigoyal1997 marked this pull request as ready for review July 26, 2024 13:33
@abhigoyal1997
Copy link
Contributor Author

/ready

@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 26, 2024
@abhigoyal1997 abhigoyal1997 changed the title [WIP] [Speculative Decoding] EAGLE Implementation with Top-1 proposer [Speculative Decoding] EAGLE Implementation with Top-1 proposer Jul 26, 2024
@caddfa31434
Copy link

For some reason, Eagle seems to have removed the input_layernorm : https://github.com/SafeAILab/EAGLE/blob/main/eagle/model/cnets.py#L419

@abhigoyal1997
Copy link
Contributor Author

For some reason, Eagle seems to have removed the input_layernorm : https://github.com/SafeAILab/EAGLE/blob/main/eagle/model/cnets.py#L419

Yes, I saw that recently. But making that change would mean either changing the decoder layer to have input layernorm as optional or rewriting the decoder layer just for EAGLE. Both these options would reduce the freedom to use any decoder with EAGLE.

@cadedaniel
Copy link
Collaborator

cc @comaniac @sroy745

Copy link
Collaborator

@sroy745 sroy745 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Left some comments. PTAL

tests/spec_decode/e2e/test_eagle_correctness.py Outdated Show resolved Hide resolved
tests/spec_decode/e2e/test_eagle_correctness.py Outdated Show resolved Hide resolved
tests/spec_decode/e2e/test_eagle_correctness.py Outdated Show resolved Hide resolved
tests/spec_decode/e2e/test_eagle_correctness.py Outdated Show resolved Hide resolved
vllm/spec_decode/spec_decode_worker.py Show resolved Hide resolved
vllm/spec_decode/spec_decode_worker.py Show resolved Hide resolved
vllm/worker/worker_base.py Outdated Show resolved Hide resolved
vllm/spec_decode/multi_step_worker.py Show resolved Hide resolved
vllm/model_executor/models/eagle.py Show resolved Hide resolved
@abhigoyal1997
Copy link
Contributor Author

@sroy745 Thanks for the review. I've made the changes and responded to your comments. PTAL

@cadedaniel
Copy link
Collaborator

Will review tomorrow !

Copy link
Collaborator

@cadedaniel cadedaniel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, need 1 more day to finish review. partial comments below.

vllm/model_executor/models/medusa.py Show resolved Hide resolved
self._seq_ids = seq_ids

def expand_with_bonus_tokens(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not blocking for this PR): these datastructures which require torch operations should not live in sequence // should go under spec_decode.

Copy link
Collaborator

@cadedaniel cadedaniel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! let's merge once tests are passing

vllm/worker/worker_base.py Outdated Show resolved Hide resolved
@abhigoyal1997
Copy link
Contributor Author

Thanks @cadedaniel for reviewing and approving. This is ready to merge!

@cadedaniel cadedaniel merged commit a3fce56 into vllm-project:main Aug 22, 2024
46 checks passed
@cadedaniel
Copy link
Collaborator

do you have the steps for creating the checkpoint @abhigoyal1997 ?

@jokmingwong
Copy link

jokmingwong commented Aug 30, 2024

python3 benchmarks/benchmark_latency.py --model Qwen/Qwen2-7B-Instruct --speculative-model yuhuili/EAGLE-Qwen2-7B-Instruct --num_speculative_tokens 4 --use-v2-block-manager --batch-size 1 --input-len 1024--output-len 128 --max-model-len 2048

I used the command above to run the latest Eagle speculative decoding PR with Qwen2-7B. And the eagle model has been converted according to the conversion script in the comments. After running, I found that the output generated by speculative decoding is inconsistent with that without speculative decoding. Upon examining the source code, I suspect that there may be an issue with the implementation about input_ids as the implementation of vLLM is inconsistent with the Eagle library implementation:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/eagle.py#L76-L99
https://github.com/SafeAILab/EAGLE/blob/main/eagle/modeling_eagle.py#L1273
Namely, when running the forward pass for the Eagle model, the first token of the original model should also be placed on the input_ids. Can you explain the reason for the difference in these two implementations?

@abhigoyal1997
Copy link
Contributor Author

abhigoyal1997 commented Aug 30, 2024

python3 benchmarks/benchmark_latency.py --model Qwen/Qwen2-7B-Instruct --speculative-model yuhuili/EAGLE-Qwen2-7B-Instruct --num_speculative_tokens 4 --use-v2-block-manager --batch-size 1 --input-len 1024--output-len 128 --max-model-len 2048

I used the command above to run the latest Eagle speculative decoding PR with Qwen2-7B. And the eagle model has been converted according to the conversion script in the comments. After running, I found that the output generated by speculative decoding is inconsistent with that without speculative decoding. Upon examining the source code, I suspect that there may be an issue with the implementation about input_ids as the implementation of vLLM is inconsistent with the Eagle library implementation: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/eagle.py#L76-L99 https://github.com/SafeAILab/EAGLE/blob/main/eagle/modeling_eagle.py#L1273 Namely, when running the forward pass for the Eagle model, the first token of the original model should also be placed on the input_ids. Can you explain the reason for the difference in these two implementations?

Even in vLLM, the first token of the target model is present in input_ids. This is because the first token is generated by the target model in the prefill step which is then added to input_ids and EAGLE only starts generating tokens in the subsequent decode step. As for the masking in the forward pass, that masks the first input token and not any token output by the target model. This didn't make any difference in the outputs.

As for why you are seeing inconsistency, if you are using 16-bit precision, could it be related to this: #4978 (comment) ?

@abhigoyal1997
Copy link
Contributor Author

abhigoyal1997 commented Aug 30, 2024

do you have the steps for creating the checkpoint @abhigoyal1997 ?

Is the gist in this comment helpful?:

# This implementation is incompitable with https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B

@jokmingwong
Copy link

python3 benchmarks/benchmark_latency.py --model Qwen/Qwen2-7B-Instruct --speculative-model yuhuili/EAGLE-Qwen2-7B-Instruct --num_speculative_tokens 4 --use-v2-block-manager --batch-size 1 --input-len 1024--output-len 128 --max-model-len 2048

I used the command above to run the latest Eagle speculative decoding PR with Qwen2-7B. And the eagle model has been converted according to the conversion script in the comments. After running, I found that the output generated by speculative decoding is inconsistent with that without speculative decoding. Upon examining the source code, I suspect that there may be an issue with the implementation about input_ids as the implementation of vLLM is inconsistent with the Eagle library implementation: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/eagle.py#L76-L99 https://github.com/SafeAILab/EAGLE/blob/main/eagle/modeling_eagle.py#L1273 Namely, when running the forward pass for the Eagle model, the first token of the original model should also be placed on the input_ids. Can you explain the reason for the difference in these two implementations?

Even in vLLM, the first token of the target model is present in input_ids. This is because the first token is generated by the target model in the prefill step which is then added to input_ids and EAGLE only starts generating tokens in the subsequent decode step. As for the masking in the forward pass, that masks the first input token and not any token output by the target model. This didn't make any difference in the outputs.

As for why you are seeing inconsistency, if you are using 16-bit precision, could it be related to this: #4978 (comment) ?

Sorry, I may not have expressed this issue clearly. I noticed that when using speculative decoding with Eagle in vLLM under the same prompt and same models, with top_k=1 and temperature=0.5, the output is inconsistent with the official Eagle implementation. I will provide my test cases later to help reproduce this issue.

@Siegfried-qgf
Copy link

Great work. Do you have any plans to implement tree decoding? It seems that tree decoding will be very important to improve the results.

@Sekri0
Copy link

Sekri0 commented Sep 20, 2024

Do you have any plans to support scenarios where tp > 1? @abhigoyal1997

@den-run-ai
Copy link

Is it possible to reproduce the ~2x-3x speedup reported in EAGLE 1/2 papers with this PR in vLLM?

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
@xiongqisong
Copy link

python3 benchmarks/benchmark_latency.py --model Qwen/Qwen2-7B-Instruct --speculative-model yuhuili/EAGLE-Qwen2-7B-Instruct --num_speculative_tokens 4 --use-v2-block-manager --batch-size 1 --input-len 1024--output-len 128 --max-model-len 2048

I used the command above to run the latest Eagle speculative decoding PR with Qwen2-7B. And the eagle model has been converted according to the conversion script in the comments. After running, I found that the output generated by speculative decoding is inconsistent with that without speculative decoding. Upon examining the source code, I suspect that there may be an issue with the implementation about input_ids as the implementation of vLLM is inconsistent with the Eagle library implementation: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/eagle.py#L76-L99 https://github.com/SafeAILab/EAGLE/blob/main/eagle/modeling_eagle.py#L1273 Namely, when running the forward pass for the Eagle model, the first token of the original model should also be placed on the input_ids. Can you explain the reason for the difference in these two implementations?

I try to use EAGLE on Llama but failed, i want to know how to use EAGLE on vLLM, it's so hard to use with no demo

@sroy745
Copy link
Collaborator

sroy745 commented Dec 14, 2024

Hi @xiongqisong, which checkpoint are you using as the draft model? Is it one of the checkpoints available here ? If so it will not work since the checkpoint needed in vLLM is a bit different from what is available at https://huggingface.co/yuhuili. You need to convert the checkpoint available in https://huggingface.co/yuhuili using the script here and use the converted checkpoint as the draft model in the vllm.

Please let us know if this works for you or not. I think @LiuXiaoxuanPKU recently used the script to convert the checkpoint for yuhuili/EAGLE-LLaMA3-Instruct-70B into the vLLM compatible checkpoint and it worked for her.

I will add a section on how to use Eagle to the sd documentation here shortly

cc: @LiuXiaoxuanPKU

@xiongqisong
Copy link

xiongqisong commented Dec 16, 2024

Hi @xiongqisong, which checkpoint are you using as the draft model? Is it one of the checkpoints available here ? If so it will not work since the checkpoint needed in vLLM is a bit different from what is available at https://huggingface.co/yuhuili. You need to convert the checkpoint available in https://huggingface.co/yuhuili using the script here and use the converted checkpoint as the draft model in the vllm.

Please let us know if this works for you or not. I think @LiuXiaoxuanPKU recently used the script to convert the checkpoint for yuhuili/EAGLE-LLaMA3-Instruct-70B into the vLLM compatible checkpoint and it worked for her.

I will add a section on how to use Eagle to the sd documentation here shortly

cc: @LiuXiaoxuanPKU

Thanks for reply, i already use the script to convert EAGLE model weight to vllm format weight, but vLLM can't run EAGLE normally, i share the clue in #11126 , hope you have time to help me @sroy745 ~
If you can add a section on how to use EAGLE to the sd documentation https://docs.vllm.ai/en/latest/usage/spec_decode.html, that will be great to let more and more people involve into develope good performance EAGLE speculative decoding on vLLM.

@simon-mo simon-mo mentioned this pull request Dec 27, 2024
10 tasks
@Kamichanw
Copy link

Can someone help me with the propose of _expand_execute_model_request here?

# Expand the batch for sequences with a bonus token.
# Perform a forward pass on the expanded batch and filter the
# response to retain only the original sequences' responses.
expanded_request, indices_of_seq_with_bonus_tokens =\
self._expand_execute_model_request(
execute_model_req, seq_ids_with_bonus_token_in_last_step)

I can't understand why it copies a new model seq without bonus token and sends to execute_model. Finally, at here
filtered_model_outputs = self._filter_model_output(
model_outputs, indices_of_seq_with_bonus_tokens)

this function select the original seq. What's the difference with or without the model seq without bonus token created before?
@abhigoyal1997

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support vLLM