[Performance]: The accept rate of typical acceptance sampling #8639

Closed
hustxiayang opened this issue Sep 19, 2024 · 1 comment
Labels: performance (Performance-related issues)

@hustxiayang

Proposal to improve performance

No response

Report of performance regression

I tested the accept length (number of tokens per step) with typical acceptance sampling. The accept length is even smaller than with the default rejection sampling method.
Here are my experimental details:

  1. The dataset I used was mt_bench.
  2. Speculative decoding model setups:
     - llama3.1 8b as the target model and Qwama-0.5B-Instruct as the draft model (number of speculative tokens: 2)
     - llama3.1 8b as the target model with an MLP-speculator
  3. Temperature was set to 0.9.
  4. posterior_threshold and posterior_alpha were set to their default values.

Do you have some experimental results on this? Or do I need to tune some parameters for typical acceptance sampling? Thanks a lot!
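
For reference, this is roughly how I configured the first setup. It is only a minimal sketch: the engine-arg names (`speculative_model`, `num_speculative_tokens`, `spec_decoding_acceptance_method`, and the two `typical_acceptance_sampler_*` knobs) and the model ids are my assumptions for a recent vLLM release and may differ across versions.

```python
# Minimal sketch of setup (1): draft-model speculative decoding with typical acceptance.
# Engine-arg names and model ids are assumptions; adjust for your vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",             # target model (llama3.1 8b)
    speculative_model="Qwama-0.5B-Instruct",               # draft model
    num_speculative_tokens=2,
    spec_decoding_acceptance_method="typical_acceptance_sampler",
    # Left at their defaults in my runs; these are the knobs in question:
    # typical_acceptance_sampler_posterior_threshold=0.09,
    # typical_acceptance_sampler_posterior_alpha=0.3,
)

outputs = llm.generate(
    ["<an mt_bench prompt>"],
    SamplingParams(temperature=0.9, max_tokens=256),
)
```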

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

hustxiayang added the performance (Performance-related issues) label on Sep 19, 2024
@LiuXiaoxuanPKU
Collaborator

Hi, thanks for the question! I did a quick benchmark for typical acceptance:

Settings:
Model: lmsys/vicuna-7b-v1.3
Draft model: abhigoyal/vllm-medusa-vicuna-7b-v1.3
Hardware: 1xH100
Dataset: ShareGPT
vllm version: v0.6.1.post2
Request rate: 1 req/s
Sampling method: greedy decoding
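
For completeness, here is a minimal sketch of how this setup can be switched between the two acceptance methods with the offline API (the benchmark itself went through the serving path; the flag names and the `num_speculative_tokens` value are my assumptions for this vLLM version):

```python
# Sketch: toggling rejection sampling vs. typical acceptance for the Medusa setup.
# Flag names and num_speculative_tokens are assumptions; check your vLLM version.
from vllm import LLM, SamplingParams

def build_llm(acceptance_method: str) -> LLM:
    return LLM(
        model="lmsys/vicuna-7b-v1.3",
        speculative_model="abhigoyal/vllm-medusa-vicuna-7b-v1.3",
        num_speculative_tokens=3,                      # assumed value, not stated above
        spec_decoding_acceptance_method=acceptance_method,
    )

# acceptance_method is either "rejection_sampler" (default) or "typical_acceptance_sampler"
llm = build_llm("typical_acceptance_sampler")
out = llm.generate(["Hello"], SamplingParams(temperature=0.0, max_tokens=64))  # greedy decoding
```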

Results

| | w/o SD | SD with rejection sampling | SD with typical acceptance before #8562 | SD with typical acceptance after #8562 |
| --- | --- | --- | --- | --- |
| median TTFT (ms) | 13.30 | 12.79 | 13.46 | 12.97 |
| median TPOT (ms) | 6.97 | 5.43 | 7.81 | 5.53 |
| median end-to-end (s) | 1.10 | 0.83 | 1.17 | 0.84 |

I also checked the token acceptance rate and system efficiency for each method:

- SD with rejection sampling: draft acceptance rate 0.287, system efficiency 0.422
- SD with typical acceptance before #8562: draft acceptance rate 0.283, system efficiency 0.317
- SD with typical acceptance after #8562: draft acceptance rate 0.293, system efficiency 0.427

Notice that in the results above, typical acceptance (after #8562) has performance similar to rejection sampling because we are doing greedy decoding.

Some comments here:

  1. After merging this PR (Fix typical acceptance sampler with correct recovered token ids #8562), the acceptance rate should be higher because we accept one more 'recovered' token.
  2. I checked the default values of posterior_threshold = 0.09 and posterior_alpha = 0.3; they are already small. You can reduce them further to see if you get any benefit (see the sketch of the acceptance criterion after this list), but that might affect generation quality. I have not tested this thoroughly, so feel free to give it a try and share the results here.
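
For context, here is a rough sketch of the typical acceptance criterion as described in the Medusa paper, showing how the two knobs gate acceptance. This is my paraphrase of the idea, not vLLM's exact implementation, so treat the details as assumptions.

```python
# Rough sketch of the typical acceptance criterion (Medusa-style); NOT vLLM's exact code.
import torch

def typical_accept(target_probs: torch.Tensor,          # [vocab] target distribution at this position
                   draft_token: int,
                   posterior_threshold: float = 0.09,    # hard floor (epsilon)
                   posterior_alpha: float = 0.3          # entropy-dependent scale (delta)
                   ) -> bool:
    # Entropy of the target-model distribution at this position.
    entropy = -(target_probs * torch.log(target_probs.clamp_min(1e-10))).sum()
    # Accept the draft token if its target probability clears the smaller of a fixed
    # threshold and an entropy-scaled threshold. Lowering either knob accepts more
    # draft tokens (higher acceptance rate, potentially lower generation quality).
    threshold = torch.minimum(torch.tensor(posterior_threshold),
                              posterior_alpha * torch.exp(-entropy))
    return bool(target_probs[draft_token] > threshold)
```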
