[Misc] Speculative Decoding: Adding Mean Accept Length Metric #11552

Open
wants to merge 1 commit into base: main

Conversation


@MMuzzammil1 commented Dec 27, 2024

This PR adds the "mean accept length metric" for speculative decoding.

Mean Accept Length: the average length of the token sequence proposed by the draft model and accepted by the target model per speculative-decoding iteration, over the run of the server. It satisfies 0 <= mean_accept_length <= k.
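For illustration only (this is not the implementation in this PR; the function and argument names below are hypothetical), a minimal sketch of how such a metric can be computed from per-step, per-sequence acceptance counts:

```python
from typing import List

def mean_accept_length(accepted_counts: List[int], k: int) -> float:
    """Average number of draft tokens accepted by the target model,
    taken over every (step, sequence) pair. Each entry of
    accepted_counts lies between 0 and k (num_speculative_tokens).
    Illustrative sketch only, not vLLM code.
    """
    if not accepted_counts:
        return 0.0
    mean = sum(accepted_counts) / len(accepted_counts)
    assert 0.0 <= mean <= k  # matches the bound 0 <= mean_accept_length <= k
    return mean
```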

Signed-off-by: Mohd Muzzammil <me.muzzammil@samsung.com>

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@comaniac
Collaborator

Thanks for the PR. Can you clarify why this metric is important, given that we already have many metrics showing the efficiency of speculative decoding?

@MMuzzammil1
Author

@comaniac Thanks for the question. This metric is more closely tied to num_speculative_tokens (k) than the other speculative metrics we have right now. While draft_acceptance_rate and num_accepted_tokens_total capture the aggregate number of tokens the target model has accepted, neither is a sufficient signal of the gap between the number of proposed tokens (k) and the average number of tokens the target accepts in a single spec-decode iteration. Let me know if you have further questions.

@comaniac
Collaborator

> @comaniac Thanks for the question. This metric is more closely tied to num_speculative_tokens (k) than the other speculative metrics we have right now. While draft_acceptance_rate and num_accepted_tokens_total capture the aggregate number of tokens the target model has accepted, neither is a sufficient signal of the gap between the number of proposed tokens (k) and the average number of tokens the target accepts in a single spec-decode iteration. Let me know if you have further questions.

IIUC, mean accepted tokens can be represented as follows:

mean_accept_tokens = mean(accept_tokens_per_step / k) = num_accepted_tokens_total / (k * n_step)

If this is true, then you can easily derive mean accepted tokens from the number of total accepted tokens, right?

@MMuzzammil1
Author

MMuzzammil1 commented Dec 27, 2024

> IIUC, mean accepted tokens can be represented as follows:
>
> mean_accept_tokens = mean(accept_tokens_per_step / k) = num_accepted_tokens_total / (k * n_step)
>
> If this is true, then you can easily derive mean accepted tokens from the number of total accepted tokens, right?

I think there is a slight typo in your explanation. Basically,
draft_acceptance_rate = num_accepted_tokens_total / num_proposed_tokens
On the other hand, if we assume batch_size = 1 in each proposal step:
mean_acceptance_length = mean(accept_tokens_per_step_per_sequence) = num_accepted_tokens_total / n_steps
Since vLLM can have a different batch_size in each step, in general:
mean_acceptance_length = num_accepted_tokens_total / (bsz_1 + bsz_2 + ... + bsz_n)

I took a careful look at the existing metrics implementation after your comment and saw that
num_proposed_tokens += bsz * k (after every proposal step).

So yes, with the current implementation:
mean_accept_length = k * num_accepted_tokens_total / num_proposed_tokens = k * draft_acceptance_rate.

However, as discussed in Milestone-2 (#4565 (comment)), if the value of k itself becomes variable per proposal step, then the current PR would become more relevant!
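To make the relationship above concrete, here is a minimal illustrative sketch (not vLLM code; all numbers and variable names are made up) that checks the identity mean_accept_length = k * draft_acceptance_rate when k is fixed and num_proposed_tokens accumulates bsz * k per proposal step:

```python
# Illustrative sketch only: with a fixed k, mean accept length equals
# k * draft_acceptance_rate. The numbers below are arbitrary.
k = 4                                # num_speculative_tokens, fixed per step
batch_sizes = [8, 6, 8]              # bsz per proposal step
accepted_per_step = [20, 15, 22]     # total accepted tokens per step

num_accepted_tokens_total = sum(accepted_per_step)          # 57
num_proposed_tokens = sum(k * bsz for bsz in batch_sizes)   # k * 22 = 88

draft_acceptance_rate = num_accepted_tokens_total / num_proposed_tokens
mean_accept_length = num_accepted_tokens_total / sum(batch_sizes)

# With fixed k, the two quantities differ only by the constant factor k.
assert abs(mean_accept_length - k * draft_acceptance_rate) < 1e-9
print(mean_accept_length)            # ~2.59 accepted tokens per step per sequence
```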

@comaniac
Collaborator

It's not a typo, but it's not precise. I should have written sum(...) / sum(...) instead of mean(...).

But yes, as you said, the current implementation doesn't need an additional mean accept length metric when k is fixed. However, I'm not sure whether we will proceed to Milestone-2 in vLLM v0 or implement it directly in vLLM v1. Thus IMO we don't need this metric at the moment, but I'll let @sroy745 and @LiuXiaoxuanPKU chime in for other opinions.
