
added implementation of DRY sampler (post-refactor) #9702

Merged: 4 commits into ggerganov:master on Oct 25, 2024

Conversation

@wwoodsTM (Contributor) commented Oct 1, 2024

This is a continuation of the previous PR #6839 to implement DRY sampling, now overhauled to use the new sampler API. I think it is a pretty faithful port of Koboldcpp's impressive implementation, authored by @pi6am (who is also credited as a co-author of this PR), which in turn is based on @p-e-w's original design and ideas as explained here. I added @l3utterfly as a co-author in a recent commit of this PR.

Here are some important notes (EDIT: all have been addressed at this point; the first point was moved to a separate PR, #10019):

  • I may have also fixed Bug: uncached prompt is not used for penalty #8971 (or it may turn out to be more of a bandaid): by moving/forcing sampling to happen over the entire prompt at the conclusion of prompt processing in server.cpp, I was able to get consistent sampling of the prompt, regardless of caching and regardless of whether it is the first generation (see the sketch just after this list).
  • I also fixed a recent breaking change, though I am not sure when it happened. For quite a while it was possible to submit a mixed array of strings and token IDs as a singleton prompt, and I have relied heavily on that feature in my own software for a long time. I believe it was originally implemented in PR server: allow json array in prompt or content for direct token input #2306. Apologies if this is considered out of scope here; I can remove it or move it to a separate PR if need be, but to be honest this feature was the main reason I converted to using llama.cpp's server. This fix restores the feature without breaking the other prompting options, but if I missed something, please let me know.
  • I added a context size parameter to gpt_sampler_init(). If there is a smarter way to handle this, I am happy to do it in a less drastic way, but I couldn't immediately figure that out.
  • The classic issue I have dealt with from day one is still here: I can't figure out how to get the vector of strings for the sequence breakers from server.cpp/common/API land to src/llama-sampling land. I again "cheated" with a C++ section at the bottom of llama.h and made a separate function for setting the sequence breakers, apart from the regular init function. There is probably some clever way to avoid this, or a way to use a void pointer or extern that I am not experienced enough with. I welcome any help or suggestions with this or any of the other issues.
  • Finally, I did not implement a CLI argument for sequence breakers. If anybody wants to take a whack at that, please feel free. My thought was that it could be done similarly to the reverse/anti-prompt (otherwise known as stopping strings) or the JSON schema argument in the CLI, but I wasn't really sure how to pull it off correctly.
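For anyone curious, here is a minimal sketch of the shape of the first fix. This is my illustration rather than the exact diff; the helper and its call site are assumptions, though gpt_sampler_accept is the regular common-sampler call:

```cpp
#include <vector>

// Sketch (illustrative, not the exact diff): at the end of prompt processing
// in server.cpp, replay the whole prompt through the sampler so that penalty
// samplers (including DRY) always see the full prompt, whether or not the
// tokens were served from the prompt cache.
static void resample_prompt(struct gpt_sampler * smpl, const std::vector<llama_token> & prompt_tokens) {
    for (const llama_token tok : prompt_tokens) {
        gpt_sampler_accept(smpl, tok, /* accept_grammar = */ false);
    }
}
```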

@wwoodsTM (Contributor Author) commented Oct 2, 2024

@compilade You had kindly offered to do a code review for the previous iteration of this PR, before the sampler API overhaul. If you still have availability for that, I would definitely love to get your feedback and insights on this newest iteration.

@compilade self-requested a review on October 3, 2024 at 00:20.
@wwoodsTM (Contributor Author) commented Oct 4, 2024

@compilade Thanks for taking a look! If it's any help, the koboldcpp version of this has a lot of extra comments that go more in-depth on the intricacies of the Z-algorithm and the clever way they implement it. Most of the core implementation here is exactly the same, except for the bits I had to change due to the ring_buffer being accessed from the end.

@p-e-w commented Oct 5, 2024

KoboldCPP's handling of sequence breakers is excellent, and it's great to see it ported here.

However, the Z-algorithm is actually unnecessary complexity, as I remarked in the Kobold PR. Quadratic runtime can be avoided by simply capping the maximum sequence length being matched (to 50, in the case of text-generation-webui). The penalty is exponential in the match length, so this makes no difference in practice, as the value will be astronomically large regardless. This simplifies the algorithm considerably, and it has been tested to perform well even in the worst-case scenario where the input consists entirely of repetitions of a single token.
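To illustrate the capping idea, here is a minimal sketch (my illustration, not code from any of the implementations discussed; `tokens` holds the context with the newest token last):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Longest match between the context suffix and the text preceding earlier
// occurrences of the last token, capped at max_len. The cap keeps the scan
// at O(n * max_len) instead of quadratic; since the penalty grows
// exponentially with match length, anything past the cap is astronomically
// large anyway.
static int dry_max_repeat_len(const std::vector<int32_t> & tokens, int max_len) {
    const int n = (int) tokens.size();
    int best = 0;
    for (int i = n - 2; i >= 0; --i) {
        if (tokens[i] != tokens[n - 1]) {
            continue;
        }
        int len = 1; // tokens[i] already matches the last token
        while (len < max_len && i - len >= 0 && tokens[i - len] == tokens[n - 1 - len]) {
            ++len;
        }
        best = std::max(best, len);
    }
    return best;
}
```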

Personally, I prefer simplicity over notions of algorithmic elegance if performance is demonstrably equivalent as in this case, although I accept and respect that not everyone shares that opinion.

Either way, great to see this happening! Make sure you run some basic smoke tests before merging this in, e.g. using the input from the "Demonstration" section of my original PR, to verify the system is actually doing what it is supposed to, since the implementation is fairly complex at this point.

@wwoodsTM (Contributor Author) commented Oct 5, 2024

> Personally, I prefer simplicity over notions of algorithmic elegance if performance is demonstrably equivalent as in this case, although I accept and respect that not everyone shares that opinion.

@p-e-w I really appreciate your feedback. I am a fan not just of this sampler but also of XTC, which I was happy to see at least two people beat me to the punch on implementing with this new sampler API (albeit with one of the submissions made through Reddit rather than as a PR).

As for the issues you raise about complexity, you bring up very fair points. I will admit that I tend to be a sucker for "elegant" algorithms, or ones that seem particularly clever to me. Having said that, I would not have been opposed to implementing it the less complex way; I just didn't realize you weren't a fan of the Z-algorithm portion until I was already pretty far into the process and saw your comments while digging deeper into all the PRs.

> Either way, great to see this happening! Make sure you run some basic smoke tests before merging this in, e.g. using the input from the "Demonstration" section of my original PR, to verify the system is actually doing what it is supposed to, since the implementation is fairly complex at this point.

In regard to your demonstration text, that was indeed one of the first things I tested! For anybody else looking at this, here is the first part of it that is fed to the LLM (in this case, Nemo):

Detective: Where were you last night at 6 PM?

Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.

Detective: Did you know the victim personally?

Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.

Detective: Do you have money problems?

Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.

Detective: Do you have a criminal record?

Suspect: On the advice of my attorneys, I invoke

Here is what Nemo gave me with DRY off when I ran it just now:

 my Fifth Amendment right to not answer that question.

Detective: What did you do for a living?

Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.

Detective: Have you ever been arrested?

Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.

Detective: Are you married?

Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.

...plus 12 more successive repetitions of this pattern as I had banned EOS...

Here is what Nemo gave me with this implementation of DRY turned on (multiplier: 0.8, base: 1.75, allowed_length: 2, default sequence breakers which include ":"):

 —

Detective: Wait, you're not going to invoke your Fifth Amendment rights on that question?

Suspect: No. I don't have a criminal record.

...plus many more lines, none having that pattern...

Here is some relevant debug output showing the final tokens of the prompt, as well as the penalty itself and a comparison of logits before and after:

109 | 72859  | Sus        
110 | 2117   | pect       
111 | 1058   | :          
112 | 3249   |  On        
113 | 1278   |  the       
114 | 19199  |  advice    
115 | 1307   |  of        
116 | 2036   |  my        
117 | 101642 |  attorneys 
118 | 1044   | ,          
119 | 1362   |  I         
120 | 45589  |  invoke    

Applied penalty 40.2121 to token 2036 ( my) (repeat length 9)

Top 10 candidates before penalties:
Token 2036 ( my): logit = 23.2354
Token 1674 (—): logit = 14.6016
Token 9283 (...\n\n): logit = 14.5702
Token 1278 ( the): logit = 14.3120
Token 38659 ( ...\n\n): logit = 13.9730
Token 2251 ( —): logit = 13.8171
Token 2880 (...): logit = 12.9741
Token 57725 ( Fifth): logit = 12.7038
Token 2247 ( .): logit = 12.5666
Token 10260 (…\n\n): logit = 12.5319

Top 10 candidates after penalties:
Token 1674 (—): logit = 14.6016
Token 9283 (...\n\n): logit = 14.5702
Token 1278 ( the): logit = 14.3120
Token 38659 ( ...\n\n): logit = 13.9730
Token 2251 ( —): logit = 13.8171
Token 2880 (...): logit = 12.9741
Token 57725 ( Fifth): logit = 12.7038
Token 2247 ( .): logit = 12.5666
Token 10260 (…\n\n): logit = 12.5319
Token 1536 ( by): logit = 12.3046
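As a sanity check on that penalty value, assuming the standard DRY formula:

penalty = multiplier * base^(repeat_len - allowed_length)
        = 0.8 * 1.75^(9 - 2)
        ≈ 40.2121

which matches the penalty applied to token 2036 ( my) in the log above.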

@p-e-w commented Oct 6, 2024

@wwoodsTM

Fair enough, as long as everything works it should be fine, and since you've already implemented it I guess there's little point in changing it now. The elephant in the room with such decisions is usually maintainability, but to be frank I doubt this code will need to be touched very often.

BTW, I love the output you got with the demonstration example. It shows not only that DRY is capable of stopping repetition (which is a mathematical certainty in principle), but that the model recovers organically from the intervention, creating a natural transition to a completely different train of thought. Many people have told me that they expect DRY to just make the model paraphrase the previous output instead of repeating it verbatim, or introduce spelling mistakes, but your output demonstrates that this usually doesn't happen with current-gen models. And it's even less likely to happen in actual conversations, because in such scenarios, DRY gradually increases the penalty rather than just slamming the door because the user already imposed a 9-token repetition like in this example.

@wwoodsTM (Contributor Author) commented Oct 7, 2024

@compilade I went ahead and made several small fixes that should hopefully address the strict compile warnings/errors I saw in the workflows. Incidentally, I also removed a "cstring" include in sampler.cpp that I had added earlier but that is no longer needed, and removed the redundant line in server.cpp I had commented on earlier (and resolved that comment).

Thanks again and let me know if I can do anything else to help the process.

@wwoodsTM (Contributor Author) commented Oct 7, 2024

It looks like @compilade has been showing a busy status since the weekend, and I'm guessing he has a pretty full plate at the moment. I look forward to seeing any feedback he may have, but I also wanted to extend an invitation for feedback to any other maintainers who may have a moment.

@slaren or @ggerganov, with your recent sampling-focused work/review, would either of you have a chance to take a look? I am excited about this being so close to the finish line and I have seen great results from all the testing I have done. Thank you!

@slaren (Collaborator) commented Oct 11, 2024

There is a lot of code here that makes this change hard to review. It would be easier to do so if all the debug code was removed. Functions like llama_tokenize in llama-sampling.cpp should probably be removed as well, or moved elsewhere.

@wwoodsTM (Contributor Author) commented Oct 11, 2024

@slaren Thank you for the feedback!

  • I just removed all of the debugging code that had been between the #ifdefs.
  • I also moved the tokenize/detokenize functions to llama-impl.h.

Despite it not being a strict interface function, I left get_overlapping_token_sequences in llama-sampling.cpp, since it's an integral part of processing the sequence breakers. If you think I should move it out or rename it to make the connection to DRY more obvious, I can do that too.

About an hour ago I resolved the recent conflicts that arose from #9805. Prior to that, all of the CI workflows that @compilade triggered were showing as passing.

@compilade (Collaborator) left a comment

> It looks like @compilade has been showing a busy status since the weekend, and I'm guessing he has a pretty full plate at the moment.

@wwoodsTM Correct, I've had much less free time than I'd like in the last few weeks. But I still want to give you feedback.

I really appreciate the efforts you've put into documenting your changes in the PR description.

@wwoodsTM (Contributor Author)

We now have the explicit permission from @pi6am to base this version on his implementation from koboldcpp. I also added him as a co-author on earlier commits for this PR.

I believe I have addressed all of the concerns from @compilade's code review in my latest commit, but I of course welcome any and all additional feedback. @ggerganov @slaren

Thanks.

@wwoodsTM (Contributor Author) commented Oct 19, 2024

@slaren I implemented your suggestions and fixed a conflict with the latest version of master.

Continuing the conversation on the sequence breaker functions: while I had only meant these to be a "part 2" of the init process, I can see how this was not clear and how, regardless of intention, they don't mesh well with the design.

To address this, I came up with a way of getting rid of the sequence breaker functions altogether, which I just implemented in the latest commit. This new approach uses one main init function that includes the sequence breakers as a parameter, plus two overloaded wrapper versions: one for the C-interface and another for test-sampling.cpp.
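To make that concrete, here is the rough shape of the approach (illustrative names and simplified signatures; the real init also takes the vocab and the DRY parameters):

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct llama_sampler; // opaque, as in llama.h

// Main init: takes the sequence breakers directly as C++ strings
// (the other parameters are omitted here for brevity).
llama_sampler * sampler_init_dry(const std::vector<std::string> & seq_breakers);

// Overloaded wrapper for the C-interface: convert, then forward.
llama_sampler * sampler_init_dry(const char ** seq_breakers, size_t num_breakers) {
    std::vector<std::string> breakers(seq_breakers, seq_breakers + num_breakers);
    return sampler_init_dry(breakers);
}
```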

Does this approach seem workable?

@slaren (Collaborator) commented Oct 19, 2024

Why not take char ** instead and make it part of the public API?

@wwoodsTM (Contributor Author)
@slaren

> Why not take char ** instead and make it part of the public API?

As @compilade mentioned before, I preferred the flexibility of C++ strings because it allows for the possibility of NUL characters; I wasn't sure whether there might be some contexts where NUL, the literal "sequence breaker" for C-strings, might itself be wanted in a sequence breaker for DRY. I also wasn't sure whether any other weird encoding issues might crop up with C-strings. But maybe the chances of this are quite remote?

Also, in my particular use case I use llama-server and pass sequence breakers through JSON, which probably biases me toward thinking of that as the principal interface rather than the C-interface.

As a side note, I just made an additional commit to address warnings-as-errors from the pedantic runs that I had missed.

@slaren (Collaborator) commented Oct 19, 2024

I am not familiar with this sampler, and I am not sure what kinds of strings are used as sequence breakers, but it seems very unlikely that allowing NUL characters is important. If it is really necessary, it can be done by passing an additional array with the sizes of each string.

@wwoodsTM (Contributor Author)

@slaren

Do you feel this is a deal-breaker in its current form? My feeling is that, at present, it is a decent compromise that keeps things in pure C++ string form where possible, such as when the breakers are submitted through JSON and tokenized in get_overlapping_token_sequences, as opposed to converting to C-strings and then back to C++ strings every time. In the current implementation, sequence breakers are only converted from C-strings on an as-needed basis, when they are specifically submitted through the C-interface.

@slaren (Collaborator) commented Oct 19, 2024

My opinion is that it is not worth the maintenance cost of having "secret" API functions, especially for something as minor as this. It takes two lines of code to convert a vector of std::string to a vector of const char *, and it only needs to be done once in sampling.cpp.
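The conversion slaren describes looks something like this (a sketch; `to_c_strs` is a hypothetical helper, and the init call in the usage comment is abbreviated):

```cpp
#include <string>
#include <vector>

// Borrow C pointers from the std::string storage; the pointers stay valid
// as long as the strings themselves outlive the call.
static std::vector<const char *> to_c_strs(const std::vector<std::string> & v) {
    std::vector<const char *> out;
    out.reserve(v.size());
    for (const auto & s : v) {
        out.push_back(s.c_str());
    }
    return out;
}

// usage at the init call site (sketch):
//   auto breakers = to_c_strs(params.dry_sequence_breakers);
//   llama_sampler_init_dry(/* ... */, breakers.data(), breakers.size());
```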

@wwoodsTM (Contributor Author)

I hear you; it's just that C-strings have a history of sending shivers down my spine and making me curse in repetitive ways that would probably be penalized by this sampler. I was hoping to avoid converting the overlap code from kobold to use them, but honestly it shouldn't be hard since we're not really manipulating the strings or doing anything complicated with them, so I will get over my fear of them and make the change a bit later today when I get home.

Thanks again for your feedback.

@slaren (Collaborator) commented Oct 19, 2024

I am not sure if I understand correctly, but it is ok to store the strings in the sampler state as std::string; there is no cost to doing that, since you need to make a copy of the strings anyway to store them (you can't just steal the pointers from the application). So hopefully changing this does not require converting the overlap code.
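In other words, a sketch of the resulting ownership (illustrative names, not the actual struct):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// The sampler state owns its own std::string copies: a const char ** coming
// in through the C API is copied once at init time, and nothing borrows the
// application's pointers afterwards.
struct dry_sampler_state {
    std::vector<std::string> seq_breakers; // owned copies

    dry_sampler_state(const char ** breakers, size_t n)
        : seq_breakers(breakers, breakers + n) {}
};
```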

@wwoodsTM (Contributor Author)

> I am not sure if I understand correctly, but it is ok to store the strings in the sampler state as std::string; there is no cost to doing that, since you need to make a copy of the strings anyway to store them (you can't just steal the pointers from the application). So hopefully changing this does not require converting the overlap code.

Ok, I have gone ahead and eliminated the C++ version of llama_sampler_init_dry. @slaren, as you suggested, I left the overlap function alone and basically did a std::string --> C-string --> std::string conversion process to smuggle the breakers through the C-interface.

@wwoodsTM (Contributor Author) commented Oct 20, 2024

I have implemented all suggestions so far in the latest commit, including the use of llama_tokenize_internal and the elimination of support for JSON-within-JSON for passing sequence breakers. I also posted a message on SillyTavern's code-contribution channel on Discord, and @Cohee1207 confirmed what @ngxson also pointed out here concerning TextGen WebUI being the origin of this format. @Cohee1207 expressed openness to a llama.cpp-specific conversion so that ST would be compatible with this stricter implementation.

* Implementation of DRY Sampling (post-sampling-refactor)

* Better merge of latest changes to master in server.cpp

* Added co-authors, testing, fix for consistent prompt sampling

* Misc. cleanup

* Remove debug define

* Fixing strict compile errors, cleaning up

* Removed unnecessary include

* Removed debugging code, moved (de)tokenizing functions

* Bug fixes to z-alg implementation

* small fix for string_format

* restoring Koboldcpp comments, adding attribution in llama.h and llama-sampling.cpp

* fixes, adjustments from code review

* Removing sequence breaker functions

* Addressed warnings from pedantic compile

* Trying alternative ways to pull in external tokenize/detoneize functions

* Removed C++ version of llama_sampler_init_dry

* Implemented testing code

* Switched to using llama_tokenize_internal

* Removed support for JSON-within-JSON for seq breakers

* Removed old llama_sampler_init_dry declaration

* Trying ggerganov's suggestion with llama_n_ctx_train

* Using vocab instead of model

* Restored common_sampler_init signature, using llama_n_ctx_train for context, updated testing code

* Minor cleanup

* Added --dry-sequence-breaker arg for single breaker, fixed docs and default values

* Removed mixed arrays for prompt
Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com>
Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>
@wwoodsTM (Contributor Author)

Not sure what I did wrong with the "co-authored" part in the first force-pushes, as GitHub didn't seem to want to recognize it; maybe it handles force-pushes differently in that respect? Anyway, I removed an extra empty line that I probably should have removed originally and did a normal commit to make sure that @l3utterfly and @pi6am were restored as co-authors on this PR.

@wwoodsTM (Contributor Author) commented Oct 23, 2024

@ggerganov @slaren @ngxson

Is it possible that my force-pushes and commit, all in quick succession, threw off the tests, and that's why we're getting failures?

@slaren (Collaborator) commented Oct 24, 2024

Some of the builds have warnings as errors, it fails to build here:

examples/server/server.cpp:908:18: error: unused variable 'dry_sequence_breakers' [-Werror,-Wunused-variable]

@wwoodsTM (Contributor Author)

> Some of the builds have warnings as errors, it fails to build here:
>
> examples/server/server.cpp:908:18: error: unused variable 'dry_sequence_breakers' [-Werror,-Wunused-variable]

Thank you! Just recommitted.

I thought I had my warnings and pedantic settings all cranked to the max, but somehow that one didn't show up for me.

@wwoodsTM (Contributor Author)

I am able to reproduce the crash on my own machine by running this:
./bin/llama-cli -m stories260K.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256

I am now trying to track down the issue.

@wwoodsTM (Contributor Author)

@ggerganov @ngxson I resolved the crash issue, which was due to DRY being in the default sampler chain but not being initialized when disabled. Now DRY will properly initialize with empty data structures even when disabled.
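For the record, the fix boils down to something like this sketch (illustrative names, not the exact diff):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// The state is always constructed, with empty buffers when DRY is disabled,
// so keeping the sampler in the default chain is safe; apply() simply no-ops
// when the multiplier disables it. The crash came from the state being left
// uninitialized in the disabled case.
struct dry_state {
    float dry_multiplier = 0.0f;               // <= 0 means disabled
    float dry_base       = 1.75f;
    std::vector<int32_t>     last_tokens;      // empty when disabled
    std::vector<std::string> seq_breakers;     // empty when disabled
};

static void dry_apply(const dry_state & st, std::vector<float> & logits) {
    if (st.dry_multiplier <= 0.0f) {
        return; // disabled: valid (empty) state, nothing to do
    }
    (void) logits; // placeholder for the real penalty computation
}
```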

@wwoodsTM (Contributor Author)

I believe this thing may be finally ready! 😅 @ngxson Would you be ok with removing the block for merging at this point?

@ngxson (Collaborator) left a comment

Yup thanks, the code looks much cleaner now.

@wwoodsTM (Contributor Author)

> Yup thanks, the code looks much cleaner now.

Thank you, it was greatly helped by your guidance and that of everyone else who has offered feedback.

@ggerganov Do you feel this is ready to merge?

@ggerganov (Owner) left a comment

@wwoodsTM Good job!

@ExtReMLapin (Contributor)

damn, what a ride

thanks @wwoodsTM for your work!

@wwoodsTM (Contributor Author) commented Oct 25, 2024

> @wwoodsTM Good job!

@ggerganov Thank you, I learned a lot throughout this process and have appreciated your feedback throughout.

I am really excited to see this merged; let me know if I need to do anything else on my end for that step. I also hope that, if there is a squash, the "co-authored" sections of the commits survive better than they did with my force-pushes.

@ggerganov ggerganov merged commit ff252ea into ggerganov:master Oct 25, 2024
53 checks passed
@MaggotHATE (Contributor)

Sorry for barging in, but what's the reason for DRY being a sampler in the queue rather than a sampler before the queue, like penalties? It went right after penalties and before the queue in the first PR, and I'm not sure there is a reason to have DRY anywhere except at the beginning of the queue.

@wwoodsTM (Contributor Author) commented Nov 6, 2024

> Sorry for barging in, but what's the reason for DRY being a sampler in the queue rather than a sampler before the queue, like penalties? It went right after penalties and before the queue in the first PR, and I'm not sure there is a reason to have DRY anywhere except at the beginning of the queue.

@MaggotHATE You are right that I originally implemented it as occurring between the traditional penalties and the sampler chain. The change to add it into the chain stemmed from @compilade's request in this review comment. Like you, I also wonder whether there is much applicability in other places, but I felt that I could address compilade's request and expand flexibility a bit without hurting anything by putting it first in the chain.

@MaggotHATE (Contributor)

> The change to add it into the chain stemmed from @compilade's request in this review comment.

Oh, sorry, it looks like I missed that comment. If it was by request, then it's fine. I'm still not sure where else it might be used, but more flexibility is definitely for the better.

Cohee1207 pushed a commit to SillyTavern/SillyTavern that referenced this pull request Nov 6, 2024
The new PR that was merged needs an array instead of a str

ggerganov/llama.cpp#9702
Cohee1207 added a commit to SillyTavern/SillyTavern that referenced this pull request Nov 6, 2024
* llama.cpp Enable dry w/ array convert

The new PR that was merged needs an array instead of a str

ggerganov/llama.cpp#9702

* Safe sequence breakers parse

* Support comma-separated list of llama.cpp sequence breakers #3026

---------

Co-authored-by: Beinsezii <beinsezii@gmail.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* sampling : add DRY sampler (post-refactor)

* DRY: Trying to fix coauthors, removed unneeded line

* DRY: Fixed redundant code

* DRY: Fixed crash issue due to DRY being in chain but uninitialized

---------

Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com>
Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024, with the same commit message and co-authors as above.
Labels: examples, server, testing

Successfully merging this pull request may close these issues:

Bug: uncached prompt is not used for penalty (#8971)