
Dynatemp and min_p upgrade? #9178

Closed
drazdra opened this issue Aug 25, 2024 · 3 comments


drazdra commented Aug 25, 2024

I've stumbled upon dynatemp and have a question/proposal.

I believe the thing that was missed during the dynatemp implementation is the underlying concept of what it is needed for.

Prompts may require two types of replies: deterministic replies and creative replies. These are opposites in terms of sampling approach.

A deterministic approach would be required, for example, for programming and for answering knowledge-related questions. There you want the LLM to provide the most probable tokens.

A creative approach would be required for writing stories and for general conversations with LLMs.

For example, we all know the parasite phrases of LLMs, like the "Maniacally laughing" and "Ahahahaha" that Llama 3 inserts into nearly every reply. The tokens forming these are super probable. So if we use dynatemp here, we will only increase the chances of getting "ahahahahahahaha" instead of "ahaha", and that's what I saw in my tests :).

Meanwhile, for creative tasks the situation is the opposite of the deterministic one. We need to skip "overfitted" tokens and instead flatten the rest of the distribution to walk around the dead ends.

So we need the exact opposite of min_p and dynatemp. Actually, I thought I could use negative values for dynatemp, but it turned out that in the code we have:

case llama_sampler_type::TEMPERATURE:
    if (dynatemp_range > 0) {
        float dynatemp_min = std::max(0.0f, temp - dynatemp_range);
        float dynatemp_max = std::max(0.0f, temp + dynatemp_range);
        llama_sample_entropy(ctx_main, &cur_p, dynatemp_min, dynatemp_max, dynatemp_exponent);
    } else {
        llama_sample_temp(ctx_main, &cur_p, temp);
    }

which makes it impossible, despite the fact that it actually could be possible :).
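As far as I understand it, llama_sample_entropy maps the distribution's normalized entropy to a temperature between dynatemp_min and dynatemp_max, roughly like this simplified sketch (my own reading of the idea, not the actual llama.cpp code):

// Simplified sketch of entropy-based dynamic temperature (not the actual
// llama.cpp code): a confident, low-entropy distribution gets a temperature
// near dynatemp_min, a flat, high-entropy one gets a temperature near dynatemp_max.
#include <algorithm>
#include <cmath>
#include <vector>

void dynatemp_sketch(std::vector<float> & probs,   // normalized probabilities
                     float dynatemp_min, float dynatemp_max, float exponent) {
    // entropy of the current distribution
    float entropy = 0.0f;
    for (float p : probs) {
        if (p > 0.0f) entropy -= p * std::log(p);
    }
    // normalize by the maximum possible entropy (uniform distribution)
    float max_entropy = std::log((float) probs.size());
    float norm = max_entropy > 0.0f ? entropy / max_entropy : 0.0f;

    // map normalized entropy into [dynatemp_min, dynatemp_max]
    float temp = dynatemp_min + (dynatemp_max - dynatemp_min) * std::pow(norm, exponent);
    temp = std::max(temp, 1e-3f);  // guard against division by zero

    // apply the temperature and renormalize
    float sum = 0.0f;
    for (float & p : probs) { p = std::pow(p, 1.0f / temp); sum += p; }
    for (float & p : probs) { p /= sum; }
}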

The question is obvious: shouldn't we patch it to allow negative dynatemp? It would make perfect sense and would help get more creative replies, just as positive values create more deterministic replies.

And we need something like max_p to exclude the super-probable tokens that otherwise get chosen every time with no alternatives.
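
Just to illustrate what I mean by max_p (a made-up name and a rough sketch of the idea only, not an existing llama.cpp sampler): drop the candidates whose probability is above some threshold, as long as at least one candidate survives:

// Hypothetical "max_p" filter (a sketch of the idea only, not an existing
// llama.cpp sampler): remove candidates whose probability is above max_p,
// unless that would remove everything, in which case keep the single most
// probable token so generation can continue.
#include <algorithm>
#include <vector>

struct candidate { int id; float p; };  // token id and its probability

void max_p_sketch(std::vector<candidate> & cands, float max_p) {
    std::vector<candidate> kept;
    for (const auto & c : cands) {
        if (c.p <= max_p) kept.push_back(c);
    }
    if (kept.empty() && !cands.empty()) {
        // nothing below the threshold: fall back to the single best token
        kept.push_back(*std::max_element(cands.begin(), cands.end(),
            [](const candidate & a, const candidate & b) { return a.p < b.p; }));
    }
    cands = std::move(kept);
    // probabilities should be renormalized after filtering
}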


jeroen-mostert commented Aug 26, 2024

> The question is obvious: shouldn't we patch it to allow negative dynatemp? It would make perfect sense and would help get more creative replies, just as positive values create more deterministic replies.

A negative dynatemp means you can end up with a temperature that goes below the minimum in the sampling function. In particular it can go negative, resulting in negative probability values for tokens, which makes no mathematical sense and probably breaks other things.

Since probability is always non-negative, what you want for creative output is not "inverted" sampling, but rather ways to boost the tail end of the probability distribution to increase the chances of the improbable appearing (without losing coherence). A higher temperature is one way to achieve this, in that it reduces the distance between token probabilities.
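
As a concrete illustration (plain softmax with temperature, nothing llama.cpp-specific): dividing the logits by a T > 1 before the softmax compresses the differences between them, which raises the relative weight of the tail without ever inverting the ordering:

// Plain softmax with temperature (illustration only): a higher T flattens the
// distribution, giving tail tokens a better chance without inverting anything.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> softmax_with_temp(const std::vector<float> & logits, float T) {
    std::vector<float> probs(logits.size());
    if (logits.empty()) return probs;
    float max_logit = logits[0];
    for (float l : logits) max_logit = std::max(max_logit, l);  // numerical stability
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / T);
        sum += probs[i];
    }
    for (float & p : probs) p /= sum;
    return probs;
}

// e.g. logits {4, 2, 1} give roughly {0.84, 0.11, 0.04} at T = 1
// and roughly {0.63, 0.23, 0.14} at T = 2: same ranking, fatter tail.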

> And we need something like max_p to exclude the super-probable tokens that otherwise get chosen every time with no alternatives.

That would just encourage incoherent text, which is "creative" in one sense but probably not what you want. If a token literally has no alternatives no setting will help (since the only way out is to stop generating completely), but if you have a super probable token (>99%), picking from the improbable ones is more likely to produce something ungrammatical than creative. The only real solution there is to improve the model, or avoid the sequence by changing the tokens that came before. Conversely, if you have a more or less healthy mix of probabilities other samplers can ensure the top one isn't always taken, including but not limited to min P and typical P. These combine well with high and/or dynamic temperature. Mirostat is another way to balance this, though less intuitive. For preventing looping output specifically, DRY helps (this is not yet merged in llama.cpp itself, but koboldcpp has it).
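
For reference, min P is conceptually very simple; a rough sketch of the idea as I understand it (not the llama.cpp implementation): a candidate survives if its probability is at least min_p times the probability of the top candidate, so the cutoff adapts to how confident the model is:

// Sketch of the min-P idea (not the llama.cpp implementation): keep candidates
// whose probability is at least min_p times the top candidate's probability.
// When the model is confident the cutoff is strict; when the distribution is
// flat, many candidates survive, which is where (dynamic) temperature can act.
#include <algorithm>
#include <vector>

struct candidate { int id; float p; };  // token id and its probability

void min_p_sketch(std::vector<candidate> & cands, float min_p) {
    if (cands.empty()) return;
    const float p_max = std::max_element(cands.begin(), cands.end(),
        [](const candidate & a, const candidate & b) { return a.p < b.p; })->p;
    const float cutoff = min_p * p_max;
    cands.erase(std::remove_if(cands.begin(), cands.end(),
        [cutoff](const candidate & c) { return c.p < cutoff; }), cands.end());
    // renormalize afterwards if downstream code expects probabilities summing to 1
}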

It's important to realize that what we consider "creative" output does not correlate directly to selecting improbable tokens. Even highly original stories have sequences of text with considerable length that aren't "creative" at all, in the sense of containing unlikely turns of phrase; what makes them original occurs on a higher level (and what makes them interesting and memorable at an even higher level, the "conscious thought" part that LLMs aren't emulating, at least not quite yet). Simply chopping off probable tokens is a little like suggesting that authors should avoid using the letter "e" in their text, since it's so common and unoriginal. You need a considerable amount of skill to operate under such constraints and still produce something interesting.

As an aside, I've found vanilla Llama models to be almost useless for purposes of creative text generation; they're understandably trained much more towards deterministic ("correct") output. Rather than fiddling with samplers you need things like fine-tuning and merges to get models with any appreciable quality of output.


jeroen-mostert commented Aug 29, 2024

You may also be interested in P-step sampling, which is basically the opposite of what you suggested and exemplifies what I wrote above: if there are many tokens with comparable probabilities, there is an opportunity for a creative choice, but if there are only a few tokens with big probabilities, any alternatives beyond those are most likely mistakes. Combined with DRY sampling to stop loops, it seems well placed to get the best out of the models I've tried. However, my final remark also stands: small-parameter models, and especially unmodified Llama, just don't have much in the way of generating interesting output to begin with, producing the literary equivalent of plain crackers with water; samplers can arrange this meager offering in an attractive display, but not turn it into an actual meal.
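
A rough sketch of that principle (this is not the actual P-step implementation, just an illustration of "cut the candidate list at the first big drop in probability"):

// Sketch of the "cut at the first big step" idea described above (not the
// actual P-step sampler, just the principle): candidates are sorted by
// probability and the list is truncated where a token is much less probable
// than the one before it.
#include <algorithm>
#include <vector>

struct candidate { int id; float p; };  // token id and its probability

void step_cutoff_sketch(std::vector<candidate> & cands, float step /* e.g. 0.3 */) {
    std::sort(cands.begin(), cands.end(),
        [](const candidate & a, const candidate & b) { return a.p > b.p; });
    size_t keep = cands.size();
    for (size_t i = 1; i < cands.size(); ++i) {
        if (cands[i].p < step * cands[i - 1].p) { keep = i; break; }
    }
    cands.resize(keep);  // a flat head keeps many options, a dominant token keeps few
}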

Last but not least, from the same author: XTC (what's in a name), which comes closest in spirit to your max_p idea by excluding the top choices (though this is only viable if it's not done all the time but only with a configurable probability, and if it's carefully combined with other samplers to make sure we don't pick garbage).
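
Again only a sketch of the spirit, not the actual XTC code: with some configurable trigger probability, remove the most probable candidates above a threshold, but keep the least probable of them so a reasonable option always remains:

// Sketch in the spirit of XTC (not the actual implementation): only with a
// configurable trigger probability, remove the top candidates that exceed a
// threshold, keeping the least probable of them so a coherent option remains.
#include <algorithm>
#include <random>
#include <vector>

struct candidate { int id; float p; };  // token id and its probability

void xtc_like_sketch(std::vector<candidate> & cands, float threshold,
                     float trigger_prob, std::mt19937 & rng) {
    std::uniform_real_distribution<float> coin(0.0f, 1.0f);
    if (coin(rng) >= trigger_prob) return;  // most of the time, do nothing

    std::sort(cands.begin(), cands.end(),
        [](const candidate & a, const candidate & b) { return a.p > b.p; });

    // count how many candidates sit above the threshold
    size_t above = 0;
    while (above < cands.size() && cands[above].p >= threshold) ++above;

    // remove all of them except the last (least probable) one above the threshold
    if (above >= 2) {
        cands.erase(cands.begin(), cands.begin() + (above - 1));
    }
}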

Do not count on any of these things making it into llama.cpp itself, by the way, or at least not any time soon; there is no end to the number and kinds of samplers people can invent, and the authors are understandably reluctant to throw everything into llama.cpp as soon as it's invented (there's also a big refactor going on, and users can already extend sampling support without having to put it inside the core).

github-actions bot added the stale label on Sep 30, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
