P-Step Truncation Sampling #5675

Open · wants to merge 1 commit into base: master
Conversation

@p-e-w commented Feb 23, 2024

This PR introduces a new truncation sampler called P-Step. The sampler discards all tokens after the first "step" in the probability distribution, that is, the first position where the following tokens are substantially less probable than the preceding ones.

[Image: p_step illustration]

The step is defined as the point where a token's probability drops below p_step times the probability of the preceding, more likely token. That is, the first occurrence of p[i+1] < p_step * p[i] leads to all tokens after i being discarded.
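
For illustration, here is a minimal sketch of that rule (this is not the actual diff in this PR; it assumes the candidate probabilities are already softmaxed and sorted in descending order, and it renormalizes whatever survives):

```cpp
// Sketch of P-Step truncation over sorted, normalized probabilities.
#include <cstddef>
#include <vector>

std::vector<float> p_step_truncate(const std::vector<float> & probs, float p_step) {
    size_t keep = probs.size();
    for (size_t i = 0; i + 1 < probs.size(); ++i) {
        // First "step": the next token is less than p_step times as probable as this one.
        if (probs[i + 1] < p_step * probs[i]) {
            keep = i + 1;   // keep tokens 0..i, discard everything after
            break;
        }
    }
    std::vector<float> kept(probs.begin(), probs.begin() + keep);
    // Renormalize the surviving probabilities before sampling from them.
    float sum = 0.0f;
    for (float p : kept) sum += p;
    for (float & p : kept) p /= sum;
    return kept;
}
```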

I claim that this strategy can offer advantages over existing truncation samplers, namely:

  • Top-K, because it adapts to the specific probability distribution.
  • Top-P, because it doesn't discard tokens just because an arbitrary summation threshold is reached.
  • Min-P, because Min-P can discard tokens that are almost exactly as probable as other tokens it retains, if the threshold implied by min_p * (p of most probable token) happens to fall between them.

(This isn't to say that P-Step obsoletes those samplers; in fact, it works great in combination with them.)

By construction, P-Step truncation has an attractive theoretical property that no other currently implemented sampler provides: Any retained token is at least 1/p_step times more probable than any discarded token. In other words, P-Step splits tokens into two sets with a pairwise classification criterion that is independent of any global properties of the distribution.
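
A quick sketch of why this holds, in the p[i] notation above (tokens sorted by descending probability): if truncation happens at index i, then for any retained token j <= i and any discarded token k > i,

$$
p_k \le p_{i+1} < p_{\text{step}} \cdot p_i \le p_{\text{step}} \cdot p_j
\quad\Longrightarrow\quad
\frac{p_j}{p_k} > \frac{1}{p_{\text{step}}}.
$$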

P-Step will follow slowly decreasing distributions for long distances. I therefore recommend pairing P-Step with one of the other truncation samplers when using a p_step value of less than 0.5. To get a feeling for how the sampler behaves in practice, try the following parameters:

  • p_step = 0.5 with all other samplers disabled. This gives high coherence within semantic units of the text, and high creativity at the start of new units where the distribution tends to be smooth.
  • p_step = 0.3 plus min_p = 0.02. In my tests, this gives improved coherence compared to the Min-P sampler alone, while maintaining the same level of creativity.

Philosophically, P-Step is the complement of Tail Free Sampling: While TFS cuts off the "tail" of the distribution, that is, the part where probabilities stop getting worse, P-Step cuts off everything except the "head" by identifying where probabilities start to get worse (of course, the exact behavior depends on the chosen parameter values in both cases). P-Step is much simpler than TFS and can be applied by hand when inspecting token distributions. It is therefore easy to find a suitable p_step value for the desired truncation outcome.

As stated previously, this isn't really my field, so if someone had this idea before me, please let me know.

/cc @kalomaze @oobabooga Maybe you find this interesting.

Commit: Discards all tokens after the first occurrence of p[i+1] < step * p[i]
@bullno1 (Contributor) commented Feb 23, 2024

Interesting. AFAIK, the difference in probability between the top choices has been used as a measure of "confidence": https://arxiv.org/pdf/2402.10200.pdf.

In the above paper, they just set a fixed top-k for the first inference step and then follow all the paths greedily.
Finally, the result is decided based on the above confidence measure.

I wonder if we can do p-step filtering in the first step instead of a fixed top-k.
So for cut-and-dry cases, there would be fewer paths.
For more ambiguous cases, it would explore more paths.

@kalomaze (Contributor) commented Feb 24, 2024

I remember implementing this back in the day and it didn't work well for me, because the largest ratio between token probabilities isn't typically a cutoff point that makes sense. So I didn't PR it anywhere.
Did you notice it was better than just increasing Min P to compensate for your particular use case or are you just experimenting? I'm maybe misinterpreting how you implemented it.

@kalomaze (Contributor) commented Feb 24, 2024

[Image: graphs of token probability distributions]

Also, the jump / rate of change between probabilities seems to be exceptionally non-linear in modern LLMs; the distribution does not have a consistently expected "shape". This is all from the same model on different tokens. (Maybe I graphed it wrong, though.)

@p-e-w (Author) commented Feb 24, 2024

@kalomaze

Did you notice it was better than just increasing Min P to compensate for your particular use case or are you just experimenting?

I have indeed found P-Step (in combination with a low Min-P) to be better than any value of Min-P alone in two common situations:

  • At the start of a new text unit (such as a sentence or paragraph), the distribution tends to be rather flat. This reflects the fact that while the next text unit is semantically linked to the preceding one, that doesn't translate to a preference for a specific starting token. Any reasonable value of Min-P (if Min-P is the only truncation sampler used) will cut that distribution off after a few dozen tokens at most. P-Step won't truncate such a distribution at all, because no two successive tokens meet the ratio threshold.
  • When one or two tokens are much more probable than all others, P-Step discards all other tokens. For example, with p_step = 0.5, if the most probable token has p > 2/3 (which is fairly common), it is the only token that will be retained, since the second token's probability is at most 1 - p < 1/3 < 0.5 * p. By contrast, Min-P will usually pick up a few more tokens in such situations, which are sometimes low-quality ones reflecting common grammar mistakes (such as continuing a sentence with a word at a position where a comma is required by convention).

To put it simply: If the distribution suggests that "anything goes", P-Step will not truncate at all, and thus allow anything. If the distribution suggests that only one or two tokens are really acceptable, P-Step will only retain those. Compared to Min-P, P-Step allows more creativity where the distribution suggests that it is safe, and enforces more coherence where the distribution suggests it is appropriate. I don't believe quite the same effect can be achieved with any combination of Min-P and Top-P.

Of course, this is all just based on my impressions from looking at a few hundred before/after distributions, and I mostly test with 2-3 models only (Mistral, Mixtral, and Yi-34b). So in a sense, this absolutely is an experiment, and I very much appreciate your feedback and insights! If only there was a way to objectively evaluate whether a sampler improves the output or not...

@cebtenzzre added the "need feedback" label on Feb 24, 2024
@HiroseKoichi commented Mar 6, 2024

Assuming I did the math right: if a token's probability × p_step is higher than the probability of the next token down, then the rest of the tokens are truncated. If this is correct, then I think this is possibly the best sampler available at the moment.

Here is a comparison between min_p = 0.1 and p_step = 0.5, where the lowest probability token after truncation was selected, as well as top-k = 1 for reference:

What are some possible countries of origin for someone with natural blonde hair?

min_p = 0.1

Caucasians and Native Americans have blonde hair in their genetic makeup, but individuals of various nationalities could possess it through hereditary mutation, as blonde hair is not limited to certain groups of people. Would you like a list of countries?

p_step = 0.5

Russia, Finland, Sweden, Denmark, Norway, Germany, Austria, the Netherlands, Belgium, Switzerland, Ireland, the United Kingdom, the United States, and Australia are all countries where natural blonde hair is not uncommon.

Deterministic; top-k = 1

Natural blonde hair is most commonly found in people of Northern European descent. Some countries with significant populations of people with natural blonde hair include:

  1. Sweden
  2. Norway
  3. Denmark
  4. Finland
  5. Iceland
  6. Netherlands
  7. Germany
  8. Austria
  9. Switzerland
  10. Belgium

These are just a few examples, as natural blonde hair can also be found in other countries with significant populations of people of Northern European descent.


On the very first token, min_p had 11 possible choices, while p_step had 96 possible choices. Keep in mind that this is the absolute worst-case scenario for both samplers, yet p_step was much more coherent despite having significantly more options to choose from, which was quite common. When using min_p = 0.1, I saw the top token probability be as high as 81%, yet a 9% token was kept as a valid choice (0.1 × 81% ≈ 8.1%, so a 9% token clears the min_p threshold)! P_step, on the other hand, would only keep the top token in that scenario.

Results might be different in practice, but if this is p_step's worst-case scenario, then I think it absolutely deserves more research.

@MaggotHATE (Contributor) commented Mar 6, 2024

I've been testing p_step for some time now, and it's really interesting. As expected, it gives more variety in answers - but I'd say it also improves creativity and coherence at the same time.

Previously, I had been using noisy min_p with top_p and repetition penalties turned off, as Kalomaze suggested. Even without dynatemp (but especially with it), noisy min_p is a massive game changer and almost essential, especially for 7b models. However, using p_step showed that there's variety that might have been missed.

The only problem I've encountered with p_step is repetition: min_p is quite good at eliminating repetitive structures (as in, entire sentence constructions repeated across the generated text), so I had almost forgotten about it. p_step doesn't do anything in that regard, so it requires either a specific position in the sampler queue, or an additional sampler after it (like the suggested combination with min_p).

In my tests, repetitions are noticeably less significant once noisy min_p is added. The problem is that min_p is strong and cuts into the variety p_step provides. In fact, I saw slightly more typical answers even with min_p = 0.01, which was not enough to combat repetitions.

Alternatively, it might be a good idea to introduce noise into p_step the same way it is used in noisy min_p (given that it's better at reducing repetitions than standard min_p). I tried doing it in a crude way, but it looks like it requires a better approach.

@HiroseKoichi commented:

If I recall correctly, all models are affected by sentence structure repetition to some degree, but Mistral-7b was the most affected. Are you able to test other or larger models to see if p_step still has that issue with them? Either way, if adding noise after p_step fixes it, then raising the temperature should also work since they both add randomness, and p_step should be able to handle a very large temperature without issue.

In any case, if this is its only downside, then I think it would make a great addition to the standard sampler set.

@MaggotHATE (Contributor) commented:

Indeed, it seems to be a matter of temperature, but I've noticed a significant effect from quantization. I previously tested with a q6_k model only (a 7b Mistral finetune), but once I switched to mistral-7b-instruct-v0.2.Q8_0, repetitions almost disappeared.

For comparison, I've also tested mistral-7b-instruct-v0.2.Q4_K_S; repetitions are more noticeable there, but again, higher temperature helps.

So far it works quite well with:

"temp": 1.9,
"dynatemp_range": 1.8,
"min_p": 0.02,
"p_step": 0.4,

However, it seems like adding noise into p_step helps even more, as now it is more self-sufficient:

"temp": 1.5,
"dynatemp_range": 1.4,
"min_p": 0.00,
"p_step": 0.6,

works pretty well for me now. Tested with the ksmt sampler order, on both q6 and q8 models. As before, I'm not sure whether I've added noise properly, but it seems to work quite well now.

@p-e-w (Author) commented Mar 9, 2024

@HiroseKoichi

Assuming I did the math right: if a token's probability × p_step is higher than the probability of the next token down, then the rest of the tokens are truncated.

That is correct, and an alternative and equally valid way to describe what P-Step does.

When using min_p = 0.1, I saw the top token probability be as high as 81%, yet a 9% token was kept as a valid choice!

Indeed, that is precisely Min-P's weakness. Min-P fails to account for the fact that a large gap in probability between two successive tokens often indicates that the less probable token is an inferior choice, such as a common spelling or grammar mistake.

A high value of P-Step basically behaves like deterministic sampling, except at the start of new text units, where it behaves as if no truncation was applied at all. Thus it gives maximum coherence inside text units, and maximum creativity between them. Which I think makes a lot of sense philosophically.

@HiroseKoichi commented:

@p-e-w Is it possible for you to make a pull request to add this sampler to the HF loaders in Text-Gen Webui? It's more likely that it would be added there than here, and if it does turn out to be good, then there's a much higher chance it makes its way here officially. I'd hate for this sampler to not be given a real chance.

@p-e-w (Author) commented Apr 18, 2024

@HiroseKoichi

I have several other novel samplers that I'm experimenting with, but I probably won't try to upstream any more of them. Getting something new merged in these big projects is very difficult and I just don't have the time to pursue it.

@mofosyne added the "generation quality", "refactoring", and "Review Complexity : High" labels on May 10, 2024