
Add the DRY dynamic N-gram anti-repetition sampler #982

Merged: 11 commits into LostRuins:concedo_experimental on Jul 13, 2024

Conversation

@pi6am commented Jul 9, 2024

The DRY (Do not Repeat Yourself) sampler is a dynamic N-gram
repetition penalty that negatively scores tokens that would extend
sequences that already appear in the context.

See this discussion for a motivation and explanation of the sampler:
oobabooga/text-generation-webui#5677

This implementation of DRY mostly aligns with the oobabooga version
with a few modifications. It uses a more efficient linear scanning
algorithm to identify repetitions. It also supports multi-token
sequence breakers. As a limitation, this implementation reuses
the rep pen range parameter, rather than introducing a new range
just for the DRY sampler.

There is a separate change to lite.koboldai.net that exposes the DRY
sampler parameters to KoboldAI Lite, so none of the embed files have
been changed as part of this commit.
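For illustration, a minimal sketch of the scoring rule described above (the helper is not the PR's code, and the constants are commonly recommended oobabooga values, not values taken from this PR):

```cpp
#include <cmath>

// A token that would extend a repeated sequence of length match_len
// (>= allowed_length) is penalized exponentially in the match length.
float dry_penalty(int match_len, float multiplier, float base, int allowed_length) {
    return multiplier * std::pow(base, (float)(match_len - allowed_length));
}

// Typical usage with commonly recommended values (multiplier 0.8, base 1.75,
// allowed_length 2): logits[token] -= dry_penalty(match_len, 0.8f, 1.75f, 2);
```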

pi6am added 5 commits July 8, 2024 02:04
@pi6am commented Jul 9, 2024

Note: I haven't tested building or running these changes on Windows yet. I'm mostly putting the changes up now for preliminary review.

pi6am added 2 commits July 9, 2024 01:45
Little known fact: The C++98 standard defines `and` as an
alternative token for the `&&` operator (along with a bunch
of other digraphs). MSVC does not allow these without using
the /Za option or including the <iso646.h> header. Change to
the more standard operator to make this code more portable.
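For illustration (not code from the PR), the portability issue looks like this:

```cpp
// Both returns below are valid ISO C++, but MSVC rejects the keyword form
// unless /Za is used or <iso646.h> is included, as noted above.
bool both(bool a, bool b) {
    // return a and b;  // fails to compile on MSVC with default flags
    return a && b;      // portable spelling
}
```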
Replace the compile-time computation with a floating-point
approximation of log(std::numeric_limits<float>::max()).
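A minimal illustration of the issue and the fix (the constant name is hypothetical):

```cpp
#include <cmath>
#include <limits>

// std::log is not constexpr under MSVC, so this fails to compile there:
//   constexpr float MAX_EXPONENT = std::log(std::numeric_limits<float>::max());
// The commit substitutes a precomputed approximation instead:
static const float MAX_EXPONENT = 88.7228391f;  // ~= log(FLT_MAX)
```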
@pi6am commented Jul 9, 2024

After the last two commits it should compile for Windows.

pi6am added 3 commits July 9, 2024 09:38
The DRY sampler is effectively a repetition penalty and there
are very few reasons to apply it at a different place in sampler
order than the standard single-token penalty. There are also
multiple projects that have dependencies on the existing sampler
IDs, including KoboldAI, KoboldAI Lite, and Silly Tavern. To
minimize the impact on those dependencies when adding the DRY sampler
to koboldcpp, it makes the most sense not to add a new ID for now,
and instead to piggyback on KCPP_SAMPLER_REP_PEN. In the future
if we find a use case for splitting the application of rep pen and DRY
we can introduce a new enum entry then.
This parameter follows the oobabooga semantics: it's optional, with a
default value of zero. Zero means that DRY should sample the entire
context. Otherwise, it's the number of tokens from the end of the
context that are scanned for repetitions.
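A sketch of those semantics (function and variable names are illustrative, not the PR's):

```cpp
#include <algorithm>
#include <cstddef>

// How many tokens at the end of the context DRY should scan.
size_t dry_scan_range(size_t context_len, int dry_penalty_last_n) {
    if (dry_penalty_last_n <= 0) return context_len;  // 0 means the whole context
    return std::min(static_cast<size_t>(dry_penalty_last_n), context_len);
}
```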
@pi6am commented Jul 10, 2024

With the latest changes this now works in KoboldAI Lite and Silly Tavern.

pi6am added a commit to pi6am/SillyTavern that referenced this pull request Jul 10, 2024
Enable DRY repetition penalty parameters for koboldcpp. This should
only be merged after: LostRuins/koboldcpp#982
@LostRuins added the enhancement (New feature or request) label Jul 10, 2024
@LostRuins (Owner) commented

I'm a little busy, but I hope to review and merge this by the weekend.

@pi6am commented Jul 12, 2024

Let me know if there are any changes you want me to make or if anything is unclear. I tried to document what is going on in this code and to follow existing patterns, but I'm sure there's room for improvement.

@LostRuins (Owner) commented

Hi @pi6am, thanks for the PR. I've taken a brief read through; although I don't fully understand the implementation, it seems usable.

I do have one concern, though: this sampler contains a lot of nested searches that depend heavily on sequence length, so it might be exploitable when KoboldCpp is run as a server. Adversarial generation inputs could act as a denial-of-service attack by locking up the server.

Since you are more familiar with this implementation, can we add some sanity limits or checks to prevent this from happening?

I'm not sure of all the places this needs to be added; perhaps some sort of clamp on sequence count, sequence length, or word length, or something similar, to ensure that a shared server cannot be taken down by a malicious actor?

The core DRY sampler algorithm is linear in the context length, but
there are several parts of the sampler related to multi-token
sequence breakers that are potentially quadratic. Without any
restrictions, a suitably crafted context and sequence breaker could
result in a denial-of-service attack on a server running koboldcpp.
This change limits the maximum number of characters and the maximum
token length of a sequence breaker in order to limit the maximum
overhead associated with the sampler.

This change also improves some comments, adding more detail and
changing the wording to increase clarity.
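A hedged sketch of this kind of clamping (the constants and the tokenize callback are illustrative; the PR's actual limits may differ):

```cpp
#include <cstddef>
#include <string>
#include <vector>

constexpr size_t MAX_BREAKER_CHARS  = 40;  // assumed limit, for illustration
constexpr size_t MAX_BREAKER_TOKENS = 10;  // assumed limit, for illustration

// Clamp a sequence breaker before pre-processing so a crafted input
// cannot trigger quadratic work.
std::vector<int> clamp_breaker(std::string breaker,
                               std::vector<int> (*tokenize)(const std::string&)) {
    if (breaker.size() > MAX_BREAKER_CHARS) breaker.resize(MAX_BREAKER_CHARS);
    std::vector<int> tokens = tokenize(breaker);
    if (tokens.size() > MAX_BREAKER_TOKENS) tokens.resize(MAX_BREAKER_TOKENS);
    return tokens;
}
```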
@pi6am commented Jul 13, 2024

I added a change to restrict the maximum length of a sequence breaker in characters and in tokens. Before making this change I was able to trigger a long (20s+) hang during sequence breaker pre-processing using a suitably crafted sequence breaker string. After the commit, the worst-case pre-processing time is negligible with the same inputs.

@LostRuins changed the base branch from concedo to concedo_experimental July 13, 2024 11:07
@LostRuins (Owner) left a review comment

lgtm, thanks!

@LostRuins merged commit 2645754 into LostRuins:concedo_experimental Jul 13, 2024
@pi6am deleted the feat/dry-sampler branch July 13, 2024 19:59
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 15, 2024
* Add the DRY dynamic N-gram anti-repetition sampler

* Update default DRY parameters to match lite

* Improve DRY token debug logging

* Replace `and` with `&&` to fix MSVC compile error

* Fix MSVC compile error because log is not constexpr

* Remove unused llama sampler variables and clean up sequence breakers.

* Remove KCPP_SAMPLER_DRY as a separate enum entry

* Add the dry_penalty_last_n to independently control DRY penalty range

* Limit sequence breaker lengths in tokens and characters
@p-e-w commented Jul 21, 2024

Creator of DRY here. Great to see this happening; I only stumbled upon this PR this morning and it's already merged!

This appears to be a brand new, highly sophisticated implementation. Out of curiosity, why didn't you use the llama.cpp PR as a base? Kobold is ultimately a llama.cpp fork, right?

FWIW, testing in the text-generation-webui repo has shown that using the Z-algorithm is unnecessary, and quadratic runtime can be avoided by simply capping the maximum sequence length being matched (to 50, in the case of text-generation-webui). The penalty is exponential in the match length, so this makes no difference in practice as the value will be astronomically large regardless. This can simplify the algorithm considerably, and has been tested to perform well even in the worst-case scenario where the input consists entirely of repetitions of a single token.
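To make the exponential growth concrete (the numbers assume commonly recommended values of multiplier 0.8, base 1.75, and allowed length 2, which this thread does not pin down):

```cpp
#include <cmath>

// penalty(len) = multiplier * base^(len - allowed_length)
// penalty(10)  = 0.8 * pow(1.75, 8)  ~= 70.4
// penalty(50)  = 0.8 * pow(1.75, 48) ~= 3.7e11  (already dwarfs any logit)
double penalty(int len) { return 0.8 * std::pow(1.75, len - 2); }
```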

@pi6am commented Jul 21, 2024

@p-e-w Thanks for taking a look at this implementation.

> Out of curiosity, why didn't you use the llama.cpp PR as a base? Kobold is ultimately a llama.cpp fork, right?

koboldcpp is a fork of llama.cpp but it has its own implementation for some samplers, in particular sample_rep_pen in gpttype_adapter.cpp which has diverged significantly from llama_sample_repetition_penalties in llama.cpp. The DRY sampler needs to operate on context tokens in a similar way, which means koboldcpp would need to implement its own version of the sampler even if the llama.cpp PR were merged. The other factor is that I wanted to support multi-token sequence breakers and use a linear algorithm for repetition scanning.

> FWIW, testing in the text-generation-webui repo has shown that using the Z-algorithm is unnecessary, and quadratic runtime can be avoided by simply capping the maximum sequence length being matched.

That's a fair point. On the other hand, the core of the z-algorithm is only about 35 lines of code, and it's not necessary to understand exactly how it works in order to understand the rest of the DRY scoring process. I used length caps to limit the amount of time koboldcpp spends processing excessively long sequence breakers, but for the DRY sampler it seemed worth using an asymptotically faster algorithm.
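For reference, the core of a Z-array computation is indeed compact; a generic sketch (not the PR's exact code):

```cpp
#include <algorithm>
#include <vector>

// z[i] = length of the longest substring starting at i that matches a
// prefix of s. Linear time: the match window [l, r) only moves forward.
std::vector<int> z_array(const std::vector<int>& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> z(n, 0);
    int l = 0, r = 0;
    for (int i = 1; i < n; ++i) {
        if (i < r) z[i] = std::min(r - i, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) ++z[i];
        if (i + z[i] > r) { l = i; r = i + z[i]; }
    }
    return z;
}
```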

@p-e-w commented Jul 27, 2024

@pi6am

I had some time to look at your code in detail this morning, and I must praise your handling of sequence breakers with GetOverlappingTokenSequences. This approach is clearly better than the tokenization hack from my original implementation: it will work correctly even with future tokenizers that might do strange things, and it can handle compound breakers like complex character names, etc. Any subsequent implementations of DRY should follow your approach.

/cc @belladoreai: You might be interested in this!

p-e-w added a commit to p-e-w/mistral.rs that referenced this pull request Jul 29, 2024
Most tokenizers encode punctuation tokens differently depending on where they occur in the input, and which tokens surround them. With the default sequence breakers, the appropriate encoding usually corresponds to the encoding produced when the token occurs after a word, rather than by itself. To emulate this, prefix the token with "a" before encoding, and extract the final token of the result.

See LostRuins/koboldcpp#982 for a correct solution to this problem.
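A sketch of that workaround (the tokenize callback stands in for the real tokenizer; this is not the mistral.rs code, which is in Rust):

```cpp
#include <string>
#include <vector>

// Punctuation often encodes differently after a word than standalone, so
// encode "a" + breaker and keep only the final token as an approximation of
// how the breaker appears mid-text. Assumes the tokenizer returns >= 1 token.
int breaker_tail_token(const std::string& breaker,
                       std::vector<int> (*tokenize)(const std::string&)) {
    std::vector<int> tokens = tokenize("a" + breaker);
    return tokens.back();
}
```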
EricLBuehler pushed a commit to EricLBuehler/mistral.rs that referenced this pull request Jul 29, 2024
* Silence bogus Clippy warning

Clippy's suggestion cannot be implemented because of borrowing issues

* Get rid of unnecessary type annotations

Interesting that Clippy doesn't catch this

* Store default sequence breakers in a slice

It's nicer when the length is not hardcoded

* Make default sequence breakers private

No need to leak this as it's not used elsewhere

* Limit match length

Avoids quadratic runtime and potential DoS with adversarial inputs

Ref oobabooga/text-generation-webui#6047

* "Fix" sequence breaker tokenization

See LostRuins/koboldcpp#982 for a correct solution to this problem.
EricLBuehler added a commit to EricLBuehler/mistral.rs that referenced this pull request Aug 27, 2024
* Implement dry penalty

* Add dry sampling params to requests

* Handle it

* Clippy

* Review: "Implement DRY penalty" (#645)

* Limit match length (ref oobabooga/text-generation-webui#6047)

* "Fix" sequence breaker tokenization (see LostRuins/koboldcpp#982 for a correct solution)

* Fix the bug with penalty

Credit to @p-e-w for finding this!

* Add custom logits processor API (#702)

* Add Gemma 2 PagedAttention support (#704)

* Faster RmsNorm in gemma/gemma2 (#703)

* Fix bug in metal isq (#706)

* Support GGUF BF16 tensors (#691)

* Softcapping, real batching + sliding window support for Flash Attention (#707)

* Remove some usages of 'pub' in models (#708)

* Support the Phi 3.5 V model (#710)

* Implement the Phi 3.5 MoE model (#709)

Co-authored-by: Philipp Emanuel Weidmann <pew@worldwidemann.com>
@wwoodsTM commented Oct 15, 2024

@pi6am

Just FYI: I ported your impressive DRY implementation code to work with llama.cpp and submitted it recently as a PR.

Due to licensing differences between the two projects (llama.cpp being MIT and koboldcpp being AGPL), my understanding is that in order to actually incorporate your great work, we would need your express permission for it to be dual-licensed (assuming you did not sign some kind of separate copyright agreement for your contribution here). @LostRuins, if I am misunderstanding something about how things are structured licensing-wise, please correct me; I don't want to step on any toes, and I have great respect for your project.

But @pi6am, if that is something you are open to, I was wondering whether you might join our discussion over at that PR's page.

Thank you!

@LostRuins (Owner) commented

@wwoodsTM you are correct; for DRY you'd need to get permission from @pi6am, as they are the author and hold the copyright to it.

fuchinoko pushed a commit to fuchinoko/SillyTavern that referenced this pull request Oct 29, 2024
Enable DRY repetition penalty parameters for koboldcpp. This should
only be merged after: LostRuins/koboldcpp#982