[Misc] Speculative Decoding: Adding Mean Accept Length Metric #11552

Open
wants to merge 1 commit into base: main

Conversation


@MMuzzammil1 commented Dec 27, 2024

This PR adds the "mean accept length metric" for speculative decoding.

Mean Accept Length: the average length of the token sequence proposed by the draft model and accepted by the target model per speculative-decoding iteration, over the run of the server. It satisfies 0 <= mean_accept_length <= k.
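For illustration only (this is not the implementation in this PR; the function and argument names below are hypothetical), a minimal sketch of how such a metric can be computed from per-step, per-sequence acceptance counts:

```python
from typing import List

def mean_accept_length(accepted_counts: List[int], k: int) -> float:
    """Average number of draft tokens accepted by the target model,
    taken over every (step, sequence) pair. Each entry of
    accepted_counts lies between 0 and k (num_speculative_tokens).
    Illustrative sketch only, not vLLM code.
    """
    if not accepted_counts:
        return 0.0
    mean = sum(accepted_counts) / len(accepted_counts)
    assert 0.0 <= mean <= k  # matches the bound 0 <= mean_accept_length <= k
    return mean
```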

Signed-off-by: Mohd Muzzammil <me.muzzammil@samsung.com>

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@comaniac
Collaborator

Thanks for the PR. Can you clarify why this metric is important, given that we already have many metrics showing the efficiency of speculative decoding?

@MMuzzammil1
Author

@comaniac Thanks for the question. This metric is more closely tied to num_speculative_tokens (k) than the other speculative metrics we have right now. While draft_acceptance_rate and num_accepted_tokens_total capture the aggregate number of tokens the target model has accepted, neither is a sufficient signal of the gap between the number of proposed tokens (k) and the average number of tokens the target accepts in a single spec-decode iteration. Let me know if you have further questions.

@comaniac
Collaborator

> @comaniac Thanks for the question. This metric is more closely tied to num_speculative_tokens (k) than the other speculative metrics we have right now. While draft_acceptance_rate and num_accepted_tokens_total capture the aggregate number of tokens the target model has accepted, neither is a sufficient signal of the gap between the number of proposed tokens (k) and the average number of tokens the target accepts in a single spec-decode iteration. Let me know if you have further questions.

IIUC, mean accepted tokens can be represented as follows:

mean_accept_tokens = mean(accept_tokens_per_step / k) = num_accepted_tokens_total / (k * n_step)

If this is true, then you can easily derive mean accepted tokens from the number of total accepted tokens, right?

@MMuzzammil1
Author

MMuzzammil1 commented Dec 27, 2024

> IIUC, mean accepted tokens can be represented as follows:
>
> mean_accept_tokens = mean(accept_tokens_per_step / k) = num_accepted_tokens_total / (k * n_step)
>
> If this is true, then you can easily derive mean accepted tokens from the number of total accepted tokens, right?

I think there is a slight typo in your explanation. Basically,
draft_acceptance_rate = num_accepted_tokens_total / num_proposed_tokens
On the other hand, if we assume batch_size = 1 in each proposal step:
mean_acceptance_length = mean(accept_tokens_per_step_per_sequence) = num_accepted_tokens_total / n_steps
Since vLLM can have a different batch_size in each step, in general:
mean_acceptance_length = num_accepted_tokens_total / (bsz_1 + bsz_2 + ... + bsz_n)

I took a careful look at the existing metrics implementation after your comment and saw that
num_proposed_tokens += bsz * k (after every proposal step).

So yes, with the current implementation:
mean_accept_length = k * num_accepted_tokens_total / num_proposed_tokens = k * draft_acceptance_rate.

However, as discussed in Milestone-2 (#4565 (comment)), if the value of k itself becomes variable per proposal step, then the current PR would become more relevant!
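To make the relationship above concrete, here is a minimal illustrative sketch (not vLLM code; all numbers and variable names are made up) that checks the identity mean_accept_length = k * draft_acceptance_rate when k is fixed and num_proposed_tokens accumulates bsz * k per proposal step:

```python
# Illustrative sketch only: with a fixed k, mean accept length equals
# k * draft_acceptance_rate. The numbers below are arbitrary.
k = 4                                # num_speculative_tokens, fixed per step
batch_sizes = [8, 6, 8]              # bsz per proposal step
accepted_per_step = [20, 15, 22]     # total accepted tokens per step

num_accepted_tokens_total = sum(accepted_per_step)          # 57
num_proposed_tokens = sum(k * bsz for bsz in batch_sizes)   # k * 22 = 88

draft_acceptance_rate = num_accepted_tokens_total / num_proposed_tokens
mean_accept_length = num_accepted_tokens_total / sum(batch_sizes)

# With fixed k, the two quantities differ only by the constant factor k.
assert abs(mean_accept_length - k * draft_acceptance_rate) < 1e-9
print(mean_accept_length)            # ~2.59 accepted tokens per step per sequence
```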

@comaniac
Collaborator

It's not a typo, but it's not precise. I should have written sum(...) / sum(...) instead of mean(...).

But yes, as you said, the current implementation doesn't need an additional mean accept length metric when k is fixed. However, I'm not sure whether we will proceed to Milestone-2 in vLLM v0 or implement it directly in vLLM v1. Thus IMO we don't need this metric at the moment, but I'll let @sroy745 and @LiuXiaoxuanPKU chime in for other opinions.
