[Performance]: Speculative Performance almost same or lower #5239
Comments
Hi @tolry418, thanks for raising the issue! Yeah, the performance is expected. Please check out this issue for the full list of potential tasks for improving speculative decoding performance; basically, all P0 issues are important for performance. From what I see for the 70B model, this matches my impression. That said, we do test prompt lookup decoding and see performance improvements on some workloads. Please check the results here.
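For context, prompt lookup (n-gram) decoding draws proposals from the prompt itself, so there is no separate draft-model forward pass to pay for. Below is a minimal sketch of how it might be enabled on a vLLM build from this period; the model name, tensor-parallel size, and lookup window are placeholder assumptions, not the benchmarked setup:

```python
from vllm import LLM, SamplingParams

# Prompt lookup (n-gram) speculation: candidate tokens are copied from
# earlier occurrences in the prompt, so no draft model is run.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # placeholder target model
    tensor_parallel_size=4,                  # placeholder; match your GPUs
    speculative_model="[ngram]",             # enables prompt lookup decoding
    num_speculative_tokens=5,                # tokens proposed per step
    ngram_prompt_lookup_max=4,               # max n-gram size to match against
    use_v2_block_manager=True,               # required for spec decode in this era
)

outputs = llm.generate(
    ["Summarize the following report: ..."],
    SamplingParams(temperature=0.0, max_tokens=100),
)
print(outputs[0].outputs[0].text)
```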
@LiuXiaoxuanPKU Thanks for your reply!
I set the proposal length to 5, and I ran one more experiment to check the effect of GPU resources.
Here is the result: SD is still worse than the target model alone.
@tolry418 As batch size increases, the throughput of speculative decoding becomes worse than that of the baseline (which uses only the target model).
I believe the focus of speculative decoding is latency per prompt rather than overall system throughput. I ran a similar experiment. I think the reason the current speculative decoding implementation is slow is that the draft model's proposal process is still time-consuming sequential autoregressive decoding. Even if the draft model is 100x smaller, its inference overhead is not reduced by 100x, or even to 1/10 of the target model's. So the proposal time is the hidden bottleneck when a 0.5B-7B model is used as the draft model, and the larger k is, the greater the proposal time. Therefore, I expect implementations that do not require a draft model, such as Medusa/Eagle, to have greater potential for performance gains.
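A rough way to see why the proposal pass dominates: combine the standard expected-tokens-per-step estimate with a simple per-step cost model. The sketch below is a back-of-the-envelope latency model only (it ignores batching effects and any verification overhead beyond one target step; the acceptance rate and cost ratios are illustrative, not measured):

```python
# Simplified speculative decoding latency model.
def expected_speedup(alpha: float, k: int, c: float) -> float:
    """alpha: per-token acceptance probability,
    k: tokens proposed per step,
    c: draft step cost relative to one target step."""
    # Expected tokens emitted per verification step: (1 - alpha^(k+1)) / (1 - alpha)
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    # One step costs k sequential draft forwards plus one target forward;
    # the baseline emits 1 token per target forward.
    step_cost = k * c + 1
    return expected_tokens / step_cost

for c in (0.05, 0.15, 0.30):       # draft cost as a fraction of target cost
    for k in (3, 5):
        print(f"c={c:.2f}, k={k}: {expected_speedup(0.7, k, c):.2f}x")
```

In this model, even with a 70% acceptance rate, a draft step costing 15-30% of a target step eats much of the theoretical gain at k=5, which matches the observation above that larger k inflates proposal time.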
Do you know your acceptance rate? In my experience, the choice of draft model has a pretty big impact on performance, and 5 speculative tokens with a quantized TinyLlama draft model might just lead to a lot of rejected tokens, which means the larger model still has to do most of the work, with added overhead. You can set the number of speculative tokens lower; reducing it might benefit overall performance as well.
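A hedged sketch of what that might look like on a vLLM build from this period; the model identifiers and tensor-parallel size are assumptions standing in for the setup described in the issue body, not a verified configuration. Depending on the version, the engine's periodic stats also report speculative-decoding metrics such as the draft acceptance rate when stats logging is enabled:

```python
from vllm import LLM, SamplingParams

# Draft-model speculation with a smaller proposal length: with a low
# acceptance rate, a smaller k wastes fewer draft and verification slots.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",                      # assumed target model
    speculative_model="TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ",  # assumed draft model path
    num_speculative_tokens=2,        # try 1-3 instead of 5
    tensor_parallel_size=4,          # placeholder; match your GPUs
    use_v2_block_manager=True,       # required for spec decode in this era
)

outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0.0, max_tokens=100),
)
print(outputs[0].outputs[0].text)
```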
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
Proposal to improve performance
@LiuXiaoxuanPKU Good to see you again. Thank you for your work.
I guess your working group is releasing SD piece by piece.
I'm wondering about the current SD version.
In my experiments, speculative decoding performs about the same as or worse than normal decoding (using only the target model), even at a low queries-per-second rate. Is that because SD is still a work in progress?
I attached the results below.
Could you tell me your thoughts on the result?
Report of performance regression
No response
Misc discussion on performance
Case: 300 prompt examples (average input length 158), max output set to 100 tokens.
Target model: "Llama-2-70B-chat"; draft model: "TinyLlama 1.1B-chat-GPTQ".
Results attached below.