Support min-p sampling #1167
Conversation
Is there a real case where the three filters (top-p/top-k/min-p) would be applied together? If so, what should the order be?
Generally, there's little reason to use all three at the same time, but users may still choose to do that, and either way, during batching there can be requests using min-p processed together with others using top-p/top-k. Regarding the order, I copied the order used in vLLM and HF Transformers (top-k -> top-p -> min-p).
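For illustration, here is a minimal PyTorch sketch of that filter order on a single probability vector. It is not the kernel code from this PR; the function name and the renormalization details are my own.

```python
import torch

def top_k_top_p_min_p_filter(probs: torch.Tensor, top_k: int,
                             top_p: float, min_p: float) -> torch.Tensor:
    """Filter a 1-D probability vector in the order top-k -> top-p -> min-p."""
    sorted_probs, sorted_idx = probs.sort(descending=True)

    # top-k: zero out everything past the k most probable tokens
    sorted_probs[top_k:] = 0.0

    # top-p: drop tokens whose preceding cumulative mass already exceeds top_p
    cumulative = sorted_probs.cumsum(dim=-1)
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0

    # min-p: drop tokens below min_p times the top token's probability
    sorted_probs[sorted_probs < min_p * sorted_probs[0]] = 0.0

    # restore the original token order and renormalize
    filtered = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)
    return filtered / filtered.sum()

probs = torch.softmax(torch.randn(32000), dim=-1)
token = torch.multinomial(top_k_top_p_min_p_filter(probs, 40, 0.9, 0.05), 1)
```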
I'm also interested: in what use-case scenarios would it be used? Are there any specific examples?
Min-p, coupled with a higher temperature, is generally used for creative writing (a significant chunk of LLM usage), since it allows for more varied and creative responses while still remaining coherent. It is also a good replacement for top-k/top-p in general LLM usage. You can read the explanation and benchmarks in the paper.
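As a usage illustration, a hypothetical request combining a high temperature with min-p, assuming SGLang's HTTP /generate endpoint and the min_p sampling parameter this PR adds; the port, prompt, and exact field names are assumptions:

```python
import requests

# Creative-writing style request: high temperature for variety,
# min_p to cut off the incoherent tail of the distribution.
resp = requests.post("http://localhost:30000/generate", json={
    "text": "Write a short story about a lighthouse keeper.",
    "sampling_params": {
        "temperature": 1.5,
        "min_p": 0.05,
        "max_new_tokens": 256,
    },
})
print(resp.json())
```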
Awesome, I've been looking forward to this for a long time |
@intervitens |
Motivation
#1071
Modifications
Implemented the min-p sampling algorithm using both flashinfer kernels and the native PyTorch sampling implementation. There is a slight slowdown when min-p is used, due to the current lack of a fused min-p/top-p/top-k kernel in flashinfer. To avoid this slowdown when min-p is not used, I implemented a fallback to the `top_k_top_p_sampling_from_probs` kernel.
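A rough sketch of that dispatch logic, under stated assumptions: `top_k_top_p_sampling_from_probs` is the flashinfer kernel named above, but its exact signature and return value differ across flashinfer versions, and the surrounding function and parameter names here are mine, not the PR's.

```python
import torch
import flashinfer.sampling

def sample_from_probs(probs, uniform_samples, top_ks, top_ps, min_ps,
                      torch_filter_fn):
    """Dispatch between the fused flashinfer kernel and a PyTorch fallback.

    `torch_filter_fn` stands in for a batched version of the
    top-k -> top-p -> min-p filter sketched earlier in this thread.
    """
    if torch.all(min_ps == 0.0):
        # Fast path: no request in the batch uses min-p, so the fused
        # flashinfer top-k/top-p kernel can be used directly.
        # (Treat this call signature as a version-dependent sketch.)
        return flashinfer.sampling.top_k_top_p_sampling_from_probs(
            probs, uniform_samples, top_ks, top_ps)
    # Slow path: no fused min-p/top-p/top-k kernel exists yet, so filter
    # in PyTorch and sample with torch.multinomial.
    return torch.multinomial(
        torch_filter_fn(probs, top_ks, top_ps, min_ps), num_samples=1)
```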
Checklist