llama : combined beam search + grammar sampling strategy #2923

Open
ggerganov opened this issue Aug 31, 2023 · 9 comments
Labels
generation quality (Quality of model output), good first issue (Good for newcomers), research 🔬

Comments

ggerganov (Owner) commented Aug 31, 2023

This feature was proposed by @spion in #2813 (comment)

In some cases, it's useful to do constrained evaluation of logits based on a union of possible text values, then pick the candidate whose sum of log-probabilities (i.e. product of probabilities) gives the most probable outcome overall.

E.g. this template (using MS guidance):

{{#select 'armor'}}leather{{or}}chainmail{{or}}plate{{/select}}

To reliably make the best choice, we'd need to calculate the probability of all 3 token sequences. That is easy if every choice maps to a single token, but with multi-token choices we'd need not just parallel generation but parallel logit evaluation of multiple possible paths.
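
A minimal sketch of the exhaustive version of this idea, assuming hypothetical `tokenize` and `logprob` callbacks (they stand in for the tokenizer and for a model evaluation that returns log P(token | context); neither is an actual llama.cpp API call): score each allowed choice by the sum of its token log-probabilities and keep the best one.

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <string>
#include <vector>

using token = int;

// Sketch only: "tokenize" and "logprob" are hypothetical stand-ins, not
// existing llama.cpp functions.
int pick_best_choice(
        const std::vector<token> & prompt,
        const std::vector<std::string> & choices,
        const std::function<std::vector<token>(const std::string &)> & tokenize,
        const std::function<double(const std::vector<token> &, token)> & logprob) {
    int    best    = -1;
    double best_lp = -std::numeric_limits<double>::infinity();

    for (size_t i = 0; i < choices.size(); ++i) {
        std::vector<token> ctx = prompt; // shared prompt, e.g. the text before {{#select}}
        double lp = 0.0;
        for (token tok : tokenize(choices[i])) {
            lp += logprob(ctx, tok); // conditional on everything generated so far
            ctx.push_back(tok);
        }
        if (lp > best_lp) {
            best_lp = lp;
            best    = (int) i;
        }
    }
    return best; // index into e.g. {"leather", "chainmail", "plate"}
}
```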

If we go greedy, we might get suboptimal results in cases where multiple choices start with the same token.

It should be possible to implement this by combining the existing beam search and grammar sampling features. See the discussion in the referenced comment for more info.
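
A rough sketch of how that combination might look, again with hypothetical `step_logprobs` and `grammar_allows` callbacks rather than the real llama.cpp beam-search or grammar APIs: at every step, each beam is extended only with grammar-legal tokens, and the top candidates by cumulative log-probability are kept.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

using token = int;

struct beam {
    std::vector<token> tokens;     // tokens generated so far
    double             logp = 0.0; // cumulative log-probability
};

// Grammar-constrained beam search sketch. Hypothetical callbacks:
//   step_logprobs(prefix)      -> log-probabilities for every vocab token
//   grammar_allows(prefix, t)  -> whether the grammar accepts t after prefix
std::vector<beam> beam_search_with_grammar(
        size_t n_beams, size_t n_steps, size_t n_vocab,
        const std::function<std::vector<double>(const std::vector<token> &)> & step_logprobs,
        const std::function<bool(const std::vector<token> &, token)> & grammar_allows) {
    std::vector<beam> beams = { beam{} };

    for (size_t step = 0; step < n_steps; ++step) {
        std::vector<beam> candidates;
        for (const beam & b : beams) {
            const std::vector<double> lp = step_logprobs(b.tokens);
            for (size_t t = 0; t < n_vocab; ++t) {
                if (!grammar_allows(b.tokens, (token) t)) {
                    continue; // prune grammar-illegal continuations up front
                }
                beam nb = b;
                nb.tokens.push_back((token) t);
                nb.logp += lp[t];
                candidates.push_back(std::move(nb));
            }
        }
        if (candidates.empty()) {
            break; // grammar allows no continuation (e.g. all beams finished)
        }
        // keep the n_beams most probable partial sequences
        const size_t keep = std::min(n_beams, candidates.size());
        std::partial_sort(candidates.begin(), candidates.begin() + keep, candidates.end(),
                          [](const beam & x, const beam & y) { return x.logp > y.logp; });
        candidates.resize(keep);
        beams = std::move(candidates);
    }
    return beams; // beams.front() holds the most probable grammar-legal sequence
}
```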

ggerganov added the good first issue (Good for newcomers), generation quality (Quality of model output), and research 🔬 labels on Aug 31, 2023
ggerganov changed the title from "llama : combined beam_search + grammar sampling strategy" to "llama : combined beam search + grammar sampling strategy" on Aug 31, 2023
@nhhung1810 commented:

Hi team, I have some experience with beam search and I think I can help. Can I work on this issue, @ggerganov?

@ggerganov (Owner, Author) commented:

Sure, let us know if you make any progress, and make sure to check the comments in the referenced issue.

@nhhung1810 commented:

Sure @ggerganov. Besides that, is there anything I should be aware of? I'm using an Apple Silicon machine for development.

@ggerganov (Owner, Author) commented:

Nothing specific comes to mind. I recommend writing a separate C++ example, similar to #2926, with extensive logging so we can debug what is being generated. If you open a draft PR, we can give recommendations during development, but you don't have to if you prefer to work another way. Also, get familiar with the existing beam-search example and the grammar-related sampling options.

@nhhung1810 commented:

Hi @ggerganov, to be honest, it's quite hard to get started configuring and debugging the project. Can we get in touch on some channel to discuss how to start? If such a guide does not exist yet, I'd like to write that document down so it benefits new contributors.

FYI, I know how to code in C++, but I don't have much experience building and shipping C++ projects, which may be part of the problem too.


kiratp commented Sep 8, 2023

@ggerganov a bunch of these cool new toys (speculative execution, beam search) seem to be landing either in main or in separate executables under examples.

Do you intend to push for consolidating all of this functionality into main at some point?


viantirreau commented Oct 25, 2023

Hi!
What do you think I should do (implementation-wise) to speed up grammar-constrained generation? I am currently generating a specific JSON object with fixed keys using Mistral, and was wondering if I could somehow use the model to predict only the unknown values, based on some prompt + input, exactly like MS guidance. In my specific use case, this would roughly halve the number of tokens that need to be generated.

I was thinking about looking ahead one character at a time and, as long as there is exactly one option, accepting that character and continuing forward. That is to say, give control back to the model as soon as we need to branch (which tends to happen when filling the values of this specific JSON).

I didn't feel like opening another issue, as this one seemed closely related. Also, this has already been tangentially discussed, e.g. in #1773 (comment) by Tobias Lütke:

"I wonder if there is an optimization where, if the next token is already known from the grammar, we could skip the inference and just add it? In many types of grammars, like JSON or HTML, that could really speed up generation."

The thing with "the next token is already known" is that some tokens share prefixes, so several of them could be valid simultaneously under a given grammar. I therefore think it would be better to iterate one character at a time until there is a branching point, and only then tokenize and batch-decode.
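
For illustration, a minimal sketch of that character-stepping idea, assuming hypothetical `next_chars`, `tokenize` and `decode_batch` callbacks (none of these are existing llama.cpp functions): while the grammar allows exactly one next character, append it without touching the model, then evaluate the whole forced span in one batch before handing control back to regular sampling.

```cpp
#include <functional>
#include <set>
#include <string>
#include <vector>

using token = int;

// Fast-forward through the deterministic part of the grammar.
// Hypothetical callbacks (not real llama.cpp API):
//   next_chars(text)   -> characters the grammar accepts after `text`
//   tokenize(text)     -> token ids for `text`
//   decode_batch(toks) -> evaluate the forced tokens in one batch
std::string fast_forward_deterministic(
        std::string text,
        const std::function<std::set<char>(const std::string &)> & next_chars,
        const std::function<std::vector<token>(const std::string &)> & tokenize,
        const std::function<void(const std::vector<token> &)> & decode_batch) {
    std::string forced;
    for (;;) {
        const std::set<char> allowed = next_chars(text + forced);
        if (allowed.size() != 1) {
            break; // branching point (or end): the model has to decide from here
        }
        forced.push_back(*allowed.begin()); // only one legal character: take it
    }
    if (!forced.empty()) {
        decode_batch(tokenize(forced)); // one batched eval instead of N sampling steps
    }
    return text + forced; // caller resumes normal (grammar-constrained) sampling
}
```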

Thanks in advance for any thoughts or suggestions!

@ggerganov (Owner, Author) commented:

One simple workaround is to use the speculative example with a very tiny draft model (< 1B) and constrain it with the grammar. This is not optimal, but if the cost of drafting with the tiny model is negligible, it will serve as a way to "queue" the deterministic tokens from the grammar.
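
A rough sketch of that workaround's control flow, with hypothetical `draft` and `verify` callbacks rather than the actual examples/speculative code: the grammar-constrained tiny model proposes a run of tokens (the grammar-forced ones are effectively free), and the main model verifies the run in a single batched evaluation, keeping the longest accepted prefix.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using token = int;

// Speculative-decoding sketch with a grammar-constrained draft model.
// Hypothetical callbacks (not the actual examples/speculative code):
//   draft(ctx, n)        -> up to n grammar-constrained tokens from the tiny model
//   verify(ctx, drafted) -> number of drafted tokens the main model accepts
std::vector<token> speculative_step(
        std::vector<token> ctx, int n_draft,
        const std::function<std::vector<token>(const std::vector<token> &, int)> & draft,
        const std::function<size_t(const std::vector<token> &, const std::vector<token> &)> & verify) {
    const std::vector<token> drafted  = draft(ctx, n_draft);
    const size_t             n_accept = verify(ctx, drafted);
    // append only the accepted prefix of the draft
    ctx.insert(ctx.end(), drafted.begin(), drafted.begin() + n_accept);
    return ctx; // on a rejection, the main model samples the next token itself
}
```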

@ExtReMLapin (Contributor) commented:

Any news on this? Also, what @viantirreau suggested would be top notch.
