Add the DRY dynamic N-gram anti-repetition sampler #982
Conversation
The DRY (Do not Repeat Yourself) sampler is a dynamic N-gram repetition penalty that negatively scores tokens that would extend sequences that already appear in the context. See this discussion for the motivation and an explanation of the sampler: oobabooga/text-generation-webui#5677

This implementation of DRY mostly aligns with the oobabooga version, with a few modifications. It uses a more efficient linear scanning algorithm to identify repetitions, and it supports multi-token sequence breakers. As a limitation, this implementation reuses the rep pen range parameter rather than introducing a new range just for the DRY sampler.

There is a separate change to lite.koboldai.net that exposes the DRY sampler parameters to KoboldAI Lite, so none of the embed files have been changed as part of this commit.
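For context, here is a minimal sketch of the scoring rule described in the linked discussion. The parameter names (`dry_multiplier`, `dry_base`, `dry_allowed_length`) follow the oobabooga convention; the function shape is illustrative, not the actual koboldcpp code:

```cpp
#include <cmath>
#include <cstddef>
#include <unordered_map>

// Illustrative sketch of the DRY scoring rule: a candidate token that would
// extend a sequence of `match_len` tokens already present in the context is
// penalized exponentially in the match length once it reaches the allowed
// length.
void apply_dry_penalty(float* logits,
                       const std::unordered_map<int, size_t>& max_match_len,
                       float dry_multiplier, float dry_base,
                       size_t dry_allowed_length) {
    for (const auto& [token_id, match_len] : max_match_len) {
        if (match_len >= dry_allowed_length) {
            logits[token_id] -= dry_multiplier *
                std::pow(dry_base, (float)(match_len - dry_allowed_length));
        }
    }
}
```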
Merge upstream
Note, I haven't tested building or running these changes on Windows yet. I'm mostly putting the changes up now for preliminary review.
Little-known fact: the C++98 standard defines `and` as an alternative token for the `&&` operator (along with a number of other alternative tokens). MSVC does not allow these without using the /Za option or including the <iso646.h> header. Change to the conventional operator to make this code more portable.
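As a quick illustration of the portability issue (not the actual diff):

```cpp
// `and` is a standard alternative token for `&&`, but MSVC only accepts it
// with /Za or after including <iso646.h>.
bool both(bool a, bool b) {
    // return a and b;  // compiles with GCC/Clang, fails on default MSVC
    return a && b;      // portable spelling used after this commit
}
```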
Replace the compile-time computation with a floating-point approximation of log(std::numeric_limits<float>::max()).
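One way to express that fix; the constant name is illustrative, and ln(FLT_MAX) ≈ 88.7228:

```cpp
#include <cmath>
#include <limits>

// std::log is not constexpr, so this fails to compile as a constant on MSVC:
// constexpr float max_exp = std::log(std::numeric_limits<float>::max());

// Precomputed floating-point approximation of
// log(std::numeric_limits<float>::max()), i.e. ln(3.40282347e38f) ~= 88.7228.
static const float kMaxLogExponent = 88.7228391f;
```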
After the last two commits it should compile for Windows.
The DRY sampler is effectively a repetition penalty, and there are very few reasons to apply it at a different place in the sampler order than the standard single-token penalty. There are also multiple projects that depend on the existing sampler IDs, including KoboldAI, KoboldAI Lite, and Silly Tavern. To minimize the impact on those dependencies when adding the DRY sampler to koboldcpp, it makes the most sense not to add a new ID for now, and instead to piggyback on KCPP_SAMPLER_REP_PEN. If we later find a use case for splitting the application of rep pen and DRY, we can introduce a new enum entry then.
This parameter follows the oobabooga semantics: it's optional, with a default value of zero. Zero means that DRY should sample the entire context. Otherwise, it's the number of tokens from the end of the context that are scanned for repetitions.
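In sketch form, with a hypothetical helper name, the range semantics look like this:

```cpp
#include <algorithm>
#include <cstddef>

// dry_penalty_last_n == 0 means "scan the entire context"; otherwise only
// the last N tokens of the context are scanned for repetitions.
size_t dry_scan_start(size_t context_len, size_t dry_penalty_last_n) {
    if (dry_penalty_last_n == 0) {
        return 0; // entire context
    }
    return context_len - std::min(context_len, dry_penalty_last_n);
}
```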
With the latest changes this now works in KoboldAI Lite and Silly Tavern.
Enable DRY repetition penalty parameters for koboldcpp. This should only be merged after: LostRuins/koboldcpp#982
i'm a little busy but i hope to review and merge this by the weekend
Let me know if there are any changes you want me to make or if anything is unclear. I tried to document what is going on in this code and to follow existing patterns, but I'm sure there's room for improvements.
Hi @pi6am, thanks for the PR. I've taken a brief read through, and although I don't fully understand the implementation, it seems usable. I do have one concern though: this sampler contains a lot of nested searches that are heavily dependent on sequence length, and it seems potentially exploitable when KoboldCpp is run as a server, where adversarial generation inputs could act as a denial-of-service attack by locking up the server. Since you are more familiar with this implementation, can we add some sanity limits or checks to prevent this from happening? I'm not sure of all the places this needs to be added, but perhaps some sort of clamped limitation on sequence count, sequence length, or word length would ensure that a shared server does not get taken down by a malicious actor.
The core DRY sampler algorithm is linear in the context length, but several parts of the sampler related to multi-token sequence breakers are potentially quadratic. Without any restrictions, a suitably crafted context and sequence breaker could result in a denial-of-service attack on a server running koboldcpp. This change caps the maximum number of characters and the maximum token length of a sequence breaker in order to bound the worst-case overhead of the sampler. It also improves some comments, adding more detail and changing the wording to increase clarity.
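A sketch of the kind of clamping this describes; the limit values and the `tokenize` helper are placeholders, not the constants chosen in the PR:

```cpp
#include <string>
#include <vector>

std::vector<int> tokenize(const std::string& text); // hypothetical helper

// Placeholder limits; the actual values in the PR may differ.
constexpr size_t kMaxBreakerChars  = 40;
constexpr size_t kMaxBreakerTokens = 10;

std::vector<int> prepare_sequence_breaker(std::string breaker) {
    if (breaker.size() > kMaxBreakerChars) {
        breaker.resize(kMaxBreakerChars); // cap character length up front
    }
    std::vector<int> tokens = tokenize(breaker);
    if (tokens.size() > kMaxBreakerTokens) {
        tokens.resize(kMaxBreakerTokens); // cap token length after encoding
    }
    return tokens;
}
```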
I added a change to restrict the maximum length of a sequence breaker in characters and in tokens. Before making this change I was able to trigger a long (20s+) hang during sequence breaker pre-processing using a suitably crafted sequence breaker string. After the commit, the worst-case pre-processing time is negligible with the same inputs.
lgtm, thanks!
* Add the DRY dynamic N-gram anti-repetition sampler
* Update default DRY parameters to match lite
* Improve DRY token debug logging
* Replace `and` with `&&` to fix MSVC compile error
* Fix MSVC compile error because log is not constexpr
* Remove unused llama sampler variables and clean up sequence breakers
* Remove KCPP_SAMPLER_DRY as a separate enum entry
* Add the dry_penalty_last_n to independently control DRY penalty range
* Limit sequence breaker lengths in tokens and characters
Creator of DRY here. Great to see this happening; I only stumbled upon this PR this morning and it's already merged! This appears to be a brand new, highly sophisticated implementation. Out of curiosity, why didn't you use the llama.cpp PR as a base? Kobold is ultimately a llama.cpp fork, right?

FWIW, testing in the text-generation-webui repo has shown that using the Z-algorithm is unnecessary, and quadratic runtime can be avoided by simply capping the maximum sequence length being matched (to 50, in the case of text-generation-webui). The penalty is exponential in the match length, so this makes no difference in practice, as the value will be astronomically large regardless. This can simplify the algorithm considerably, and it has been tested to perform well even in the worst-case scenario where the input consists entirely of repetitions of a single token.
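A sketch of the capping idea, with hypothetical names: because the penalty grows exponentially in the match length, clamping the length at 50 leaves practical behavior unchanged while bounding the work per match.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

float dry_penalty(size_t match_len, float multiplier, float base,
                  size_t allowed_len) {
    match_len = std::min<size_t>(match_len, 50); // cap bounds the scan cost
    if (match_len < allowed_len) {
        return 0.0f; // sequence too short to penalize
    }
    // base^(match_len - allowed_len) is already astronomically large well
    // before a match length of 50, so the cap changes nothing in practice.
    return multiplier * std::pow(base, (float)(match_len - allowed_len));
}
```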
@p-e-w Thanks for taking a look at this implementation.
koboldcpp is a fork of llama.cpp but it has its own implementation for some samplers, in particular
That's a fair point. On the other hand, the core of the z-algorithm is only about 35 lines of code, and it's not necessary to understand exactly how it works in order to understand the rest of the DRY scoring process. I used length caps to limit the amount of time koboldcpp spends processing excessively long sequence breakers, but for the DRY sampler it seemed worth using an asymptotically faster algorithm.
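For reference, the core of the standard Z-algorithm is indeed compact; here is a generic sketch over a token sequence (not the exact code from the PR):

```cpp
#include <algorithm>
#include <vector>

// z[i] = length of the longest prefix of `s` that matches the substring of
// `s` starting at position i. Runs in O(n) via the [l, r) match window.
std::vector<int> z_array(const std::vector<int>& s) {
    const int n = (int)s.size();
    std::vector<int> z(n, 0);
    for (int i = 1, l = 0, r = 0; i < n; ++i) {
        if (i < r) {
            z[i] = std::min(r - i, z[i - l]); // reuse the previous window
        }
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) {
            ++z[i]; // extend the match naively past the window
        }
        if (i + z[i] > r) {
            l = i;          // a new rightmost match window starts here
            r = i + z[i];
        }
    }
    return z;
}
```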
I had some time to look at your code in detail this morning, and I must praise your handling of sequence breakers. /cc @belladoreai: You might be interested in this!
Most tokenizers encode punctuation tokens differently depending on where they occur in the input, and which tokens surround them. With the default sequence breakers, the appropriate encoding usually corresponds to the encoding produced when the token occurs after a word, rather than by itself. To emulate this, prefix the token with "a" before encoding, and extract the final token of the result. See LostRuins/koboldcpp#982 for a correct solution to this problem.
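A sketch of the workaround described in this message; the `tokenize` helper is hypothetical:

```cpp
#include <string>
#include <vector>

std::vector<int> tokenize(const std::string& text); // hypothetical helper

// Encode a punctuation sequence breaker the way it typically appears after a
// word: prefix it with "a", tokenize, and keep only the final token.
int breaker_tail_token(const std::string& breaker) {
    std::vector<int> tokens = tokenize("a" + breaker);
    return tokens.empty() ? -1 : tokens.back(); // post-word encoding
}
```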
* Silence bogus Clippy warning: Clippy's suggestion cannot be implemented because of borrowing issues
* Get rid of unnecessary type annotations: interesting that Clippy doesn't catch this
* Store default sequence breakers in a slice: it's nicer when the length is not hardcoded
* Make default sequence breakers private: no need to leak this as it's not used elsewhere
* Limit match length: avoids quadratic runtime and potential DoS with adversarial inputs (ref oobabooga/text-generation-webui#6047)
* "Fix" sequence breaker tokenization: see LostRuins/koboldcpp#982 for a correct solution to this problem
* Implement dry penalty
* Add dry sampling params to requests
* Handle it
* Clippy
* Review: "Implement DRY penalty" (#645)
* Nicer
* Even better
* Complete merge
* Fix saturating sub
* Handle when no context
* Make context the entire sequence and refactor
* Remove slicing for all
* Fix the bug with penalty (credit to @p-e-w for finding this!)
* Add custom logits processor API (#702)
* Add Gemma 2 PagedAttention support (#704)
* Faster RmsNorm in gemma/gemma2 (#703)
* Fix bug in metal isq (#706)
* Support GGUF BF16 tensors (#691)
* Softcapping, real batching + sliding window support for Flash Attention (#707)
* Remove some usages of 'pub' in models (#708)
* Support the Phi 3.5 V model (#710)
* Implement the Phi 3.5 MoE model (#709)

Co-authored-by: Philipp Emanuel Weidmann <pew@worldwidemann.com>
Just FYI, I ported your impressive DRY implementation to work with llama.cpp and submitted it recently as a PR here. Due to licensing differences between the two projects (llama.cpp being MIT and koboldcpp being AGPL), my understanding is that in order to actually incorporate your great work, we would need your express permission for it to be dual-licensed (that is, assuming you did not sign some kind of separate copyright agreement for your contribution here; and @LostRuins, if I am misunderstanding something about how things are structured licensing-wise, please correct me, as I don't want to step on any toes and have great respect for your project). But @pi6am, if that is something you are open to, I was wondering whether you might join our discussion over at that PR's page. Thank you!