[Misc] Speculative Decoding: Adding Mean Accept Length Metric #11552
Signed-off-by: Mohd Muzzammil <me.muzzammil@samsung.com>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Thanks for the PR. Can you clarify why this metric is important, given that we already have many metrics to show the efficiency of speculative decoding?
@comaniac Thanks for the question. This metric is more closely related to the |
IIUC, mean accepted tokens can be represented as follows:
If this is true, then you can easily derive mean accepted tokens from the number of total accepted tokens, right?
I think there is a slight typo in your explanation. Basically, I took a careful look at the existing metrics implementation after your comment and could see: So, yes I think with the current implementation: However, as discussed in Milestone-2 (#4565 (comment)), if the value |
It's not a typo, but it's not precise. I should've just written But yes, like you said, the current implementation doesn't need an additional mean accept length when k is fixed. However, I'm not sure whether we will proceed to Milestone-2 in vLLM v0 or implement it directly in vLLM v1. Thus, IMO, we don't need this metric atm, but I'll let @sroy745 and @LiuXiaoxuanPKU chime in to get different opinions.
This PR adds the "mean accept length metric" for speculative decoding.
Mean Accept Length: The average number of tokens proposed by the draft model that the target model accepts per speculative step, measured while the server runs with speculative decoding enabled.
0 <= mean_accept_length <= k, where k is the number of tokens the draft model proposes per step.
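
As a rough illustration of the definition above and of the point raised in review (with a fixed k, the metric can be derived from totals that are already tracked), here is a minimal standalone sketch. It is not the code in this PR: the function names and the per-step list of accepted counts are hypothetical, and it assumes every step proposes exactly k draft tokens.

```python
from typing import Sequence


def mean_accept_length(accepted_per_step: Sequence[int], k: int) -> float:
    """Average number of accepted draft tokens per speculative step.

    Hypothetical helper: `accepted_per_step[i]` is the number of draft
    tokens the target model accepted at step i, so each entry is in [0, k].
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if not accepted_per_step:
        return 0.0
    for n in accepted_per_step:
        if not 0 <= n <= k:
            raise ValueError(f"accepted count {n} is outside [0, {k}]")
    return sum(accepted_per_step) / len(accepted_per_step)


def mean_accept_length_from_totals(
    num_accepted_tokens: int, num_draft_tokens: int, k: int
) -> float:
    """Fixed-k shortcut: acceptance rate multiplied by k.

    Assumes num_draft_tokens == k * number_of_steps, which only holds
    when k does not change between steps.
    """
    if num_draft_tokens == 0:
        return 0.0
    return k * num_accepted_tokens / num_draft_tokens


# Four speculative steps with k = 4 proposed tokens each.
k = 4
accepted = [3, 0, 4, 1]
direct = mean_accept_length(accepted, k)
derived = mean_accept_length_from_totals(sum(accepted), k * len(accepted), k)
print(direct, derived)
```

Both calls agree here because k is constant. If k varied per step, the totals-based shortcut would no longer apply and the per-step average would have to be tracked separately.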