Support min-p sampling #1167
Conversation
Is there a real case where the three filters (top-p/top-k/min-p) would be applied together? If so, what should the order be?
Generally, there's little reason to use all three at the same time, but users may still choose to do that, and either way, during batching there can be requests using min-p processed together with others using top-p/top-k. Regarding the order, I copied the order used in vLLM and HF Transformers (top-k -> top-p -> min-p).
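For illustration, here is a minimal PyTorch sketch of that filter order on a single probability vector. It is not the kernel code from this PR; the function name and the renormalization details are my own.

```python
import torch

def top_k_top_p_min_p_filter(probs: torch.Tensor, top_k: int,
                             top_p: float, min_p: float) -> torch.Tensor:
    """Filter a 1-D probability vector in the order top-k -> top-p -> min-p."""
    sorted_probs, sorted_idx = probs.sort(descending=True)

    # top-k: zero out everything past the k most probable tokens
    sorted_probs[top_k:] = 0.0

    # top-p: drop tokens whose preceding cumulative mass already exceeds top_p
    cumulative = sorted_probs.cumsum(dim=-1)
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0

    # min-p: drop tokens below min_p times the top token's probability
    sorted_probs[sorted_probs < min_p * sorted_probs[0]] = 0.0

    # restore the original token order and renormalize
    filtered = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)
    return filtered / filtered.sum()

probs = torch.softmax(torch.randn(32000), dim=-1)
token = torch.multinomial(top_k_top_p_min_p_filter(probs, 40, 0.9, 0.05), 1)
```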
I'm also interested: in what use-case scenarios would it be used? Are there any specific examples?
Min-p, coupled with a higher temperature, is generally used for creative writing (a significant chunk of LLM usage), since it allows for more varied and creative responses while still remaining coherent. It is also a good replacement for top-k/top-p in general LLM usage. You can read the explanation and benchmarks in the paper.
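As a usage illustration, a hypothetical request combining a high temperature with min-p, assuming SGLang's HTTP /generate endpoint and the min_p sampling parameter this PR adds; the port, prompt, and exact field names are assumptions:

```python
import requests

# Creative-writing style request: high temperature for variety,
# min_p to cut off the incoherent tail of the distribution.
resp = requests.post("http://localhost:30000/generate", json={
    "text": "Write a short story about a lighthouse keeper.",
    "sampling_params": {
        "temperature": 1.5,
        "min_p": 0.05,
        "max_new_tokens": 256,
    },
})
print(resp.json())
```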
Awesome, I've been looking forward to this for a long time |
@intervitens |
Motivation
#1071
Modifications
Implemented the min-p sampling algorithm using both flashinfer kernels and the native PyTorch sampling implementation. There is a slight slowdown when min-p is used, due to the current lack of a fused min-p/top-p/top-k kernel in flashinfer. To avoid this slowdown when min-p is not used, I implemented a fallback to the `top_k_top_p_sampling_from_probs` kernel.
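A rough sketch of that dispatch logic, under stated assumptions: `top_k_top_p_sampling_from_probs` is the flashinfer kernel named above, but its exact signature and return value differ across flashinfer versions, and the surrounding function and parameter names here are mine, not the PR's.

```python
import torch
import flashinfer.sampling

def sample_from_probs(probs, uniform_samples, top_ks, top_ps, min_ps,
                      torch_filter_fn):
    """Dispatch between the fused flashinfer kernel and a PyTorch fallback.

    `torch_filter_fn` stands in for a batched version of the
    top-k -> top-p -> min-p filter sketched earlier in this thread.
    """
    if torch.all(min_ps == 0.0):
        # Fast path: no request in the batch uses min-p, so the fused
        # flashinfer top-k/top-p kernel can be used directly.
        # (Treat this call signature as a version-dependent sketch.)
        return flashinfer.sampling.top_k_top_p_sampling_from_probs(
            probs, uniform_samples, top_ks, top_ps)
    # Slow path: no fused min-p/top-p/top-k kernel exists yet, so filter
    # in PyTorch and sample with torch.multinomial.
    return torch.multinomial(
        torch_filter_fn(probs, top_ks, top_ps, min_ps), num_samples=1)
```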
Checklist