
Support min-p sampling #1167

Merged: 3 commits into sgl-project:main from intervitens:min_p on Aug 21, 2024
Conversation

@intervitens (Contributor) commented Aug 20, 2024

Motivation

#1071

Modifications

Implemented the min-p sampling algorithm in both the FlashInfer kernel path and the native PyTorch sampling path. There is a slight slowdown when min-p is enabled because FlashInfer currently lacks a fused min-p/top-p/top-k kernel. To avoid this slowdown when min-p is not used, requests fall back to the existing top_k_top_p_sampling_from_probs kernel.
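For readers unfamiliar with the algorithm, here is a minimal PyTorch sketch of the min-p filter (illustrative only, not the PR's actual kernel code): tokens whose probability falls below min_p times the top token's probability are masked out, and the survivors are renormalized before sampling.

```python
import torch

def min_p_filter(probs: torch.Tensor, min_p: torch.Tensor) -> torch.Tensor:
    """Zero out tokens whose probability is below min_p * max_prob.

    probs:  [batch, vocab] softmax probabilities.
    min_p:  [batch] per-request min-p values in [0, 1] (0 disables the filter).
    """
    # Scale each request's threshold by its most likely token's probability.
    max_probs = probs.max(dim=-1, keepdim=True).values      # [batch, 1]
    threshold = max_probs * min_p.unsqueeze(-1)             # [batch, 1]
    filtered = torch.where(probs >= threshold, probs, torch.zeros_like(probs))
    # Renormalize so the surviving mass sums to 1 before sampling.
    return filtered / filtered.sum(dim=-1, keepdim=True)

# Sampling from the filtered distribution:
# next_token = torch.multinomial(min_p_filter(probs, min_p), num_samples=1)
```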

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@yzh119 (Collaborator) commented Aug 21, 2024

Is there a real case where the three filters (top-p/top-k/min-p) will be applied altogether? If so, what should be the order?

@intervitens (Contributor, Author) commented:
> Is there a real case where the three filters (top-p/top-k/min-p) will be applied altogether? If so, what should be the order?

Generally, there's little reason to use all three at the same time, but users may still choose to do so, and, either way, batches can mix requests that use min-p with requests that use top-p/top-k, so they end up processed together. Regarding the order, I copied the order used in vLLM and HF Transformers (top-k -> top-p -> min-p).
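For concreteness, here is a hedged, unbatched sketch of how the three filters compose in that order (top-k on logits, then top-p and min-p on probabilities). It mirrors the conventional ordering mentioned above, not the exact kernels in this PR:

```python
import torch

def apply_filters(logits: torch.Tensor, top_k: int, top_p: float, min_p: float) -> torch.Tensor:
    """Illustrative top-k -> top-p -> min-p pipeline over a [vocab] logits tensor."""
    # top-k: keep only the k highest-scoring tokens.
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    # top-p (nucleus): keep the smallest prefix whose cumulative mass covers top_p.
    if top_p < 1.0:
        sorted_probs, idx = probs.sort(dim=-1, descending=True)
        cum = sorted_probs.cumsum(dim=-1)
        drop = cum - sorted_probs > top_p  # tokens entirely outside the nucleus
        sorted_probs[drop] = 0.0
        probs = torch.zeros_like(probs).scatter(-1, idx, sorted_probs)
    # min-p: drop tokens below min_p * (max surviving probability).
    if min_p > 0.0:
        threshold = min_p * probs.max(dim=-1, keepdim=True).values
        probs = probs.masked_fill(probs < threshold, 0.0)
    return probs / probs.sum(dim=-1, keepdim=True)
```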

@zhyncs (Member) commented Aug 21, 2024

I'm also interested: in what use cases would it be used? Are there any specific examples?

@intervitens (Contributor, Author) commented:
> I'm also interested: in what use cases would it be used? Are there any specific examples?

Min-p, coupled with a higher temperature, is generally used for creative writing (a significant chunk of LLM usage), because it allows more varied and creative responses while still remaining coherent. But it is also a good replacement for top-k/top-p in general LLM usage. You can read the explanation and benchmarks in the paper.
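As a usage illustration, here is a hypothetical request pairing min-p with a high temperature against a locally running sglang server's native /generate endpoint. The endpoint shape and the exposure of min_p under sampling_params are assumptions based on this PR, and the parameter values are illustrative:

```python
import requests

# Hypothetical request to a local sglang server started after this PR;
# the min_p field in sampling_params is assumed to be exposed by this change.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Write the opening paragraph of a mystery novel.",
        "sampling_params": {
            "temperature": 1.5,   # higher temperature for more varied output...
            "min_p": 0.1,         # ...while min-p prunes the incoherent tail
            "max_new_tokens": 256,
        },
    },
)
print(response.json()["text"])
```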

@hnyls2002 enabled auto-merge (squash) on August 21, 2024 at 21:41
@hnyls2002 merged commit 068e9ea into sgl-project:main on Aug 21, 2024 (5 checks passed)
@intervitens deleted the min_p branch on August 22, 2024 at 02:52
@81549361 commented:
Awesome, I've been looking forward to this for a long time.

@81549361 commented:
@intervitens
oobabooga/text-generation-webui#5677
Are you interested in implementing this sampler? It can address the problem that some models, such as Nemo 12B, easily become repetitive in long chats.
