P-Step Truncation Sampling #5675
base: master
Conversation
Discards all tokens after the first occurrence of `p[i+1] < p_step * p[i]`
Interesting. AFAIK, the difference in probability between top choices has been used as a measure of "confidence": https://arxiv.org/pdf/2402.10200.pdf. In that paper, they just set a fixed top-k for the first inference step and then follow all the paths greedily. I wonder if we can do p-step filtering in the first step instead of a fixed top-k.
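As a toy illustration of that idea (my own sketch, not code from the linked paper or this PR), the number of first-step branches under p-step filtering would adapt to the shape of the distribution instead of being a fixed k:

```python
def num_branches(probs, p_step):
    # Count tokens kept by the p-step rule: cut at the first position where
    # the next probability drops below p_step times the current one.
    p = sorted(probs, reverse=True)
    for i in range(len(p) - 1):
        if p[i + 1] < p_step * p[i]:
            return i + 1
    return len(p)

# A flat first-step distribution spawns many branches, a peaked one very few,
# unlike a fixed top-k, which always explores the same number of paths.
print(num_branches([0.12, 0.11, 0.10, 0.10, 0.09], p_step=0.5))  # 5
print(num_branches([0.70, 0.20, 0.06, 0.03, 0.01], p_step=0.5))  # 1
```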
I remember implementing this back in the day and it didn't work well for me, because the largest ratio between token probabilities isn't typically a cutoff point that makes sense. So I didn't PR it anywhere.
I have indeed found P-Step (in combination with a low Min-P) to be better than any value of Min-P alone in two common situations.

To put it simply: if the distribution suggests that "anything goes", P-Step will not truncate at all, and thus allow anything. If the distribution suggests that only one or two tokens are really acceptable, P-Step will retain only those. Compared to Min-P, P-Step allows more creativity where the distribution suggests that it is safe, and enforces more coherence where the distribution suggests it is appropriate. I don't believe quite the same effect can be achieved with any combination of Min-P and Top-P.

Of course, this is all just based on my impressions from looking at a few hundred before/after distributions, and I mostly test with 2-3 models only (Mistral, Mixtral, and Yi-34b). So in a sense, this absolutely is an experiment, and I very much appreciate your feedback and insights! If only there were a way to objectively evaluate whether a sampler improves the output or not...
Assuming I did the math right, to my knowledge, if a token's probability times p_step is higher than the next token down, then you truncate the rest of the tokens. If this is correct, then I think this is possibly the best sampler available at the moment.

Here is a comparison between min_p = 0.1 and p_step = 0.5, where the lowest-probability token after truncation was selected, as well as top-k = 1 for reference:

Prompt: What are some possible countries of origin for someone with natural blonde hair?

min_p = 0.1:
Caucasians and Native Americans have blonde hair in their genetic makeup, but individuals of various nationalities could possess it through hereditary mutation, as blonde hair is not limited to certain groups of people. Would you like a list of countries?

p_step = 0.5:
Russia, Finland, Sweden, Denmark, Norway, Germany, Austria, the Netherlands, Belgium, Switzerland, Ireland, the United Kingdom, the United States, and Australia are all countries where natural blonde hair is not uncommon.

Deterministic; top-k = 1:
Natural blonde hair is most commonly found in people of Northern European descent. Some countries with significant populations of people with natural blonde hair include:

These are just a few examples, as natural blonde hair can also be found in other countries with significant populations of people of Northern European descent.

On the very first token, min_p had 11 possible choices, while p_step had 96 possible choices. Keep in mind that this is the absolute worst-case scenario for both samplers, yet p_step was much more coherent despite having significantly more options to choose from, which was quite common. When using min_p = 0.1, I saw the top token probability be as high as 81%, yet a 9% token was kept as a valid choice! P_step, on the other hand, would only keep the top token in that scenario. Results might be different in practice, but if this is p_step's worst-case scenario, then I think it absolutely deserves more research.
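To make the 81% / 9% scenario above concrete, here is the arithmetic as a quick check (my own numbers plugged into the two rules, not the comment's actual logs):

```python
top, runner_up = 0.81, 0.09

# min_p = 0.1 keeps every token whose probability is at least min_p * top:
print(runner_up >= 0.1 * top)  # True: 0.09 >= 0.081, so the 9% token survives

# p_step = 0.5 discards everything after the first step p[i+1] < p_step * p[i]:
print(runner_up < 0.5 * top)   # True: 0.09 < 0.405, so only the top token is kept
```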
I've been testing p_step for some time now, and it's really interesting. As expected, it gives more variety in answers, but I'd say it also improves creativity and coherence at the same time. Previously, I had been using noisy min_p with top_p and repetition penalties turned off, as Kalomaze was suggesting. Even without dynatemp (but especially with it), noisy min_p is a massive game changer and is almost essential, especially for 7b models. However, using p_step showed that there's variety that might've been missed.

The only problem I've encountered with p_step is repetition: min_p is quite good at eliminating repetitive structures (as in, entire sentence constructions repeated across the generated text), so I had almost forgotten about it. p_step doesn't do anything in that regard, so it requires either a specific position in the sampler queue, or at least an additional sampler after it (like the suggested combination with min_p). In my tests, repetitions are noticeably less significant once noisy min_p is added. The problem is, min_p is strong and affects the variety gained from p_step. In fact, I saw slightly more typical answers even with min_p = 0.01, which was not enough to combat repetitions.

Alternatively, it might be a good idea to introduce noise into p_step the same way it is used in noisy min_p (given that it's better at reducing repetitions than standard min_p). I tried to do it the crude way, but it looks like it requires a better approach.
If I recall correctly, all models are affected by sentence-structure repetition to some degree, but Mistral-7b was the most affected. Are you able to test other or larger models to see if p_step still has that issue with them? Either way, if adding noise after p_step fixes it, then raising the temperature should also work, since they both add randomness, and p_step should be able to handle a very large temperature without issue. In any case, if this is its only downside, then I think it would make a great addition to the standard sampler set.
Indeed, it seems to be a matter of high/low temperature, but I've noticed a significant effect of quantization. I previously tested with a q6_k model only (a 7b Mistral finetune), but once I switched to mistral-7b-instruct-v0.2.Q8_0, repetitions almost disappeared. To compare, I've also tested mistral-7b-instruct-v0.2.Q4_K_S, and repetitions are more noticeable, but still, a higher temperature helps in this case. So far it's quite good at:

    "temp": 1.9,
    "dynatemp_range": 1.8,
    "min_p": 0.02,
    "p_step": 0.4,

However, it seems like adding noise into p_step helps even more, as it is now more self-sufficient:

    "temp": 1.5,
    "dynatemp_range": 1.4,
    "min_p": 0.00,
    "p_step": 0.6,

works pretty well for me now. Tested with
That is correct, and an alternative and equally valid way to describe what P-Step does.
Indeed, that is precisely Min-P's weakness. Min-P fails to account for the fact that a large gap in probability between two successive tokens often indicates that the less probable token is an inferior choice, such as a common spelling or grammar mistake. A high value of P-Step basically behaves like deterministic sampling, except at the start of new text units, where it behaves as if no truncation was applied at all. Thus it gives maximum coherence inside text units and maximum creativity between them, which I think makes a lot of sense philosophically.
@p-e-w Is it possible for you to make a pull request to add this sampler to the HF loaders in Text-Gen Webui? It's more likely that it would be added there than here, and if it does turn out to be good, then there's a much higher chance it makes its way here officially. I'd hate for this sampler to not be given a real chance.
I have several other novel samplers that I'm experimenting with, but I probably won't try to upstream any more of them. Getting something new merged in these big projects is very difficult and I just don't have the time to pursue it.
This PR introduces a new truncation sampler called P-Step. The sampler discards all tokens after the first "step" in the probability distribution, that is, the first position where the following tokens are substantially less probable than the preceding ones.
The step is defined by the probability of a token being less than `p_step` times the probability of the preceding, more likely token. That is, the first occurrence of `p[i+1] < p_step * p[i]` leads to all tokens after `i` being discarded.

I claim that this strategy can offer advantages over existing truncation samplers; in particular, unlike Min-P, it never splits two tokens of nearly equal probability just because the threshold `min_p * (p of most probable token)` happens to fall between them. (This isn't to say that P-Step obsoletes those samplers; in fact, it works great in combination with them.)
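For readers who want to experiment outside of llama.cpp, here is a minimal, self-contained sketch of the rule as described above (my own illustration; the names and the NumPy formulation are not taken from this PR's implementation):

```python
import numpy as np

def p_step_truncate(probs: np.ndarray, p_step: float) -> np.ndarray:
    """Zero out all tokens after the first "step": the first position i, over
    probabilities sorted in descending order, where p[i+1] < p_step * p[i]."""
    order = np.argsort(probs)[::-1]      # token ids by descending probability
    p = probs[order]
    keep = len(p)
    for i in range(len(p) - 1):
        if p[i + 1] < p_step * p[i]:     # first step found
            keep = i + 1
            break
    out = np.zeros_like(probs)
    out[order[:keep]] = probs[order[:keep]]
    return out / out.sum()               # renormalize the surviving tokens

probs = np.array([0.40, 0.35, 0.10, 0.09, 0.06])
print(p_step_truncate(probs, p_step=0.5))  # only the two leading tokens survive
```

With `p_step = 0.5`, the 0.35 token survives because it is within half of 0.40, while the drop from 0.35 to 0.10 triggers the cut.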
By construction, P-Step truncation has an attractive theoretical property that no other currently implemented sampler provides: any retained token is at least `1/p_step` times more probable than any discarded token. In other words, P-Step splits tokens into two sets using a pairwise classification criterion that is independent of any global properties of the distribution.

P-Step will follow slowly decreasing distributions for long distances. I therefore recommend pairing P-Step with one of the other truncation samplers when using a `p_step` value of less than 0.5.

To get a feeling for how the sampler behaves in practice, try the following parameters:

- `p_step = 0.5` with all other samplers disabled. This gives high coherence within semantic units of the text, and high creativity at the start of new units, where the distribution tends to be smooth.
- `p_step = 0.3` plus `min_p = 0.02`. In my tests, this gives improved coherence compared to the Min-P sampler alone, while maintaining the same level of creativity.

Philosophically, P-Step is the complement of Tail Free Sampling: while TFS cuts off the "tail" of the distribution, that is, the part where probabilities stop getting worse, P-Step cuts off everything except the "head" by identifying where probabilities start to get worse (of course, the exact behavior depends on the chosen parameter values in both cases). P-Step is much simpler than TFS and can be applied by hand when inspecting token distributions. It is therefore easy to find a suitable `p_step` value for the desired truncation outcome.

As stated previously, this isn't really my field, so if someone had this idea before me, please let me know.
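A quick numerical check of the two claims above (again my own sketch, not the PR's code): the kept-to-discarded probability ratio always exceeds `1/p_step`, and a slowly decreasing distribution is followed all the way down, which is why pairing with another truncation sampler is recommended for low `p_step` values.

```python
def p_step_cut(probs, p_step):
    # Split a distribution into (kept, discarded) at the first step
    # p[i+1] < p_step * p[i], over descending probabilities.
    p = sorted(probs, reverse=True)
    for i in range(len(p) - 1):
        if p[i + 1] < p_step * p[i]:
            return p[:i + 1], p[i + 1:]
    return p, []

# Slowly decreasing distribution: each probability is 0.9x the previous one,
# so the ratio never drops below p_step = 0.5 and nothing is truncated.
raw = [0.9 ** i for i in range(20)]
slow = [x / sum(raw) for x in raw]
kept, dropped = p_step_cut(slow, p_step=0.5)
print(len(kept), len(dropped))             # 20 0

# With a clear step, every kept token is more than 1/p_step times as
# probable as every discarded token.
kept, dropped = p_step_cut([0.5, 0.3, 0.1, 0.06, 0.04], p_step=0.5)
print(min(kept) / max(dropped) > 1 / 0.5)  # True: 0.3 / 0.1 = 3.0 > 2
```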
/cc @kalomaze @oobabooga Maybe you find this interesting.