Proposal to improve performance

No response

Report of performance regression

I tested the accept length (number of tokens per step) with typical acceptance sampling. The accept length is even smaller than with the default rejection sampling method.
Here are my experimental details:
The dataset I used was mt_bench.
Speculative decoding setup (a rough launch sketch for setup 1 follows this list):
1. llama3.1 8b as the target model and Qwama-0.5B-Instruct as the draft model (number of speculative tokens = 2)
2. llama3.1 8b as the target model with an MLP-speculator
3. Temperature was set to 0.9
4. posterior_threshold and posterior_alpha were left at their default values
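For reference, this is roughly how I launch setup 1 with vLLM's offline API. Treat it as a sketch: the speculative/typical-acceptance keyword names are what I recall from the vLLM version around #8562, and the HF model paths are the ones I assume, so please check them against your version's engine arguments.

```python
from vllm import LLM, SamplingParams

# Sketch of setup 1 (draft-model speculative decoding with typical acceptance).
# NOTE: the speculative/typical-acceptance keyword names below are assumptions
# based on the vLLM release around PR #8562; verify against your EngineArgs.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",           # target model (assumed HF path)
    speculative_model="turboderp/Qwama-0.5B-Instruct",  # draft model (assumed HF path)
    num_speculative_tokens=2,
    # Switch from the default rejection sampler to typical acceptance.
    spec_decoding_acceptance_method="typical_acceptance_sampler",
    # vLLM defaults (set explicitly here): threshold 0.09, alpha 0.3.
    typical_acceptance_sampler_posterior_threshold=0.09,
    typical_acceptance_sampler_posterior_alpha=0.3,
)

sampling_params = SamplingParams(temperature=0.9, max_tokens=256)
outputs = llm.generate(["Summarize the rules of chess."], sampling_params)
print(outputs[0].outputs[0].text)
```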
Do you have some experimental results on this? Or do I need to tune some parameters for typical acceptance sampling? Thanks a lot!
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
I also checked the token acceptance rate for rejection sampling:
SD with rejection sampling: Draft acceptance rate: 0.287, System efficiency: 0.422
SD with typical acceptance before #8562: Draft acceptance rate: 0.283, System efficiency: 0.317
SD with typical acceptance after #8562: Draft acceptance rate: 0.293, System efficiency: 0.427
Notice that in the results above, typical acceptance (after #8562) has performance similar to rejection sampling because we are doing greedy decoding.
I checked the default values of posterior_threshold = 0.09 and posterior_alpha = 0.3; they are already small. You can reduce them further to see if you get any benefit, but that might affect generation quality. I have not tested this thoroughly, so feel free to give it a try and share the results here.
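For anyone landing here, these knobs map onto the Medusa-style typical-acceptance rule that (as far as I understand) vLLM's typical acceptance sampler follows: a draft token is accepted when the target model's probability for it exceeds an entropy-dependent bar. A minimal sketch of that rule, my own illustration rather than the vLLM source:

```python
import torch

def typical_accept(target_probs: torch.Tensor,
                   draft_token_ids: torch.Tensor,
                   posterior_threshold: float = 0.09,
                   posterior_alpha: float = 0.3) -> torch.Tensor:
    """Medusa-style typical acceptance (illustrative sketch).

    target_probs:    [batch, num_spec_tokens, vocab] target-model probabilities
    draft_token_ids: [batch, num_spec_tokens] tokens proposed by the draft model
    Returns a boolean mask of which draft tokens would be accepted.
    """
    # Shannon entropy of the target distribution at each draft position.
    entropy = -(target_probs * torch.log(target_probs + 1e-10)).sum(dim=-1)
    # Acceptance bar: min(threshold, alpha * exp(-entropy)). Lowering either
    # knob lowers the bar, so more draft tokens get accepted, at some risk
    # to generation quality.
    bar = torch.minimum(
        torch.full_like(entropy, posterior_threshold),
        posterior_alpha * torch.exp(-entropy),
    )
    # Probability the target model assigns to each proposed draft token.
    p_draft = target_probs.gather(-1, draft_token_ids.unsqueeze(-1)).squeeze(-1)
    return p_draft > bar

# Tiny usage example with random numbers.
probs = torch.softmax(torch.randn(1, 2, 32000), dim=-1)
draft = torch.randint(0, 32000, (1, 2))
print(typical_accept(probs, draft))
```

Intuition: when the target distribution is peaked (low entropy), the bar sits near posterior_threshold and only high-probability draft tokens pass; when the distribution is flat (high entropy), the exp(-entropy) term lowers the bar so more tokens count as "typical" and get accepted.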