Contrastive search: A feature to strive for alongside speculative decoding? #3450
Replies: 3 comments
-
Why not do DoLa instead? It should be easier to implement and we wouldn't need 2 different models. "A similar concept to ours is Contrastive Decoding (CD) (Li et al., 2022), aimed at enhancing fluency and coherence by contrasting expert (strong) and amateur (weak) LMs. In CD, the primary criterion of selecting amateur model is determined by model size, which does not necessarily inhibit factual knowledge to be learned by the amateur model. Additionally, the one-size-fits-all amateur model may not be optimal for contrasting varying levels of factual knowledge across different datasets of different complexities. Unlike CD, which uses a static amateur LM, our DoLa dynamically selects early layers for less factual predictions based on token difficulty, as outlined in Section 2.2. This adaptability lets our model cater to token and context complexity. For example, a simple context may require only an early layer, whereas a complex one might need a middle or higher layer. Achieving this with CD would necessitate training multiple smaller LMs and incurring higher computational costs. In contrast, DoLa requires just one forward pass with efficient early exiting, adding minimal latency from ×1.01 to ×1.08." Source: https://arxiv.org/abs/2309.03883
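For intuition, here is a minimal PyTorch sketch of the layer contrast the quote describes (illustrative only; the function name, the JSD-based layer selection shown here, and the alpha value are assumptions on my part, not the paper's reference code):

```python
import torch
import torch.nn.functional as F

def dola_logits(layer_logits, candidate_layers, alpha=0.1):
    """layer_logits[i]: vocab logits from early-exiting at layer i (1D tensors);
    layer_logits[-1] is the final ("mature") layer; candidate_layers indexes
    the premature layers that may be contrasted against."""
    mature = F.log_softmax(layer_logits[-1], dim=-1)

    def jsd(p_log, q_log):
        # Jensen-Shannon divergence between two log-distributions
        m_log = (0.5 * (p_log.exp() + q_log.exp())).log()
        return 0.5 * (F.kl_div(m_log, p_log, log_target=True, reduction="sum")
                      + F.kl_div(m_log, q_log, log_target=True, reduction="sum"))

    # Dynamically pick the premature layer whose distribution diverges most from
    # the final layer -- the token-difficulty-adaptive choice the quote mentions.
    premature_idx = max(candidate_layers,
                        key=lambda i: jsd(mature, F.log_softmax(layer_logits[i], dim=-1)))
    premature = F.log_softmax(layer_logits[premature_idx], dim=-1)

    # Contrast mature vs. premature, keeping only tokens the final layer finds plausible.
    cutoff = alpha * mature.exp().max()
    return (mature - premature).masked_fill(mature.exp() < cutoff, float("-inf"))
```

The point being: it is still one model and one forward pass, with the "amateur" coming from an early-exit head rather than a second network.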
-
I keep thinking about this: DoLa might not work great for storytelling, but it should enhance the accuracy of grammar-constrained outputs. Any thoughts? I noticed that nobody has requested (or talked about) integrating it into llama.cpp (even though DoLa is pretty popular) - how come?
-
Contrastive Search worked very well for me in a couple of cases when I was using Transformers or CTransformers with small full-rank or quantized models, respectively. Contrastive search made the models much more "coherent" and made the difference between "yeah, it runs, sort of" and "wow!". With Transformers, the main thing was access to the hyperparameters penalty_alpha and top_k. With llama.cpp, would it be as easy as exposing penalty_alpha? Or would there be a lot of other work behind the scenes?
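For reference, in Transformers this is just two extra arguments to generate(): contrastive search is selected when penalty_alpha is set together with a small top_k (the model below is only an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The llama.cpp project is", return_tensors="pt")
# penalty_alpha > 0 with a small top_k switches generate() to contrastive search
out = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```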
-
#3278
logikstate originally wrote this issue, and @KerfuffleV2 suggested moving it over to a discussion.
I'll paste some of the previous discussion back in here, but the gist is:
- Could Llama.cpp support contrastive search alongside speculative decoding? It seems like this technique has the potential to radically improve model performance.
Previous Discussion
logikstate
This paper has a method, similar to speculative sampling, that improves models by sampling the lower-quality model for tokens to avoid, thus increasing the quality of the output of the higher-quality model. Allegedly this leads to LLaMA-65B outperforming LLaMA 2, GPT-3.5 and PaLM 2-L on the HellaSwag commonsense reasoning benchmark.
https://arxiv.org/abs/2309.09117
"We demonstrate that Contrastive Decoding -- a simple, computationally light, and training-free text generation method proposed by Li et al 2022 -- achieves large out-of-the-box improvements over greedy decoding on a variety of reasoning tasks. Originally shown to improve the perceived quality of long-form text generation, Contrastive Decoding searches for strings that maximize a weighted difference in likelihood between strong and weak models. We show that Contrastive Decoding leads LLaMA-65B to outperform LLaMA 2, GPT-3.5 and PaLM 2-L on the HellaSwag commonsense reasoning benchmark, and to outperform LLaMA 2, GPT-3.5 and PaLM-540B on the GSM8K math word reasoning benchmark, in addition to improvements on a collection of other tasks. Analysis suggests that Contrastive Decoding improves over existing methods by preventing some abstract reasoning errors, as well as by avoiding simpler modes such as copying sections of the input during chain-of-thought. Overall, Contrastive Decoding outperforms nucleus sampling for long-form generation and greedy decoding for reasoning tasks, making it a powerful general purpose method for generating text from language models."
KerfuffleV2 commented Sep 22, 2023
I was looking at this a few days ago, but it seems pretty complicated. Unlike the other samplers that you can just give the last tokens + current logits to, it seems like contrastive decoding requires a different approach. (Correct me if I'm wrong.)
I tried to find a simple example of implementing it but wasn't successful.
__
IridiumMaster
Here's what they list in their appendix:
A.1 CODE IMPLEMENTATION
We include PyTorch implementations of contrastive decoding in Algorithm 1 and Algorithm 2.

Algorithm 1: Original formulation

```python
# expert_logits - unnormalized scores from the expert model
# amateur_logits - unnormalized scores from the amateur model
# amateur_temp - temperature to normalize amateur distribution
# alpha - masking threshold
expert_probs = softmax(expert_logits, dim=-1)
amateur_probs = softmax(amateur_logits / amateur_temp, dim=-1)
cutoff = alpha * expert_probs.max(dim=-1, keepdim=True).values
diffs = log(expert_probs) - log(amateur_probs)
cd_logits = diffs.masked_fill(expert_probs < cutoff, -float('inf'))
```

Algorithm 2: Our formulation

```python
# expert_logits - unnormalized scores from the expert model
# amateur_logits - unnormalized scores from the amateur model
# alpha - masking threshold
# beta - expert-amateur tradeoff parameter
cutoff = log(alpha) + expert_logits.max(dim=-1, keepdim=True).values
diffs = (1 + beta) * expert_logits - beta * amateur_logits
cd_logits = diffs.masked_fill(expert_logits < cutoff, -float('inf'))
```
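As a rough illustration of how the second formulation would slot into generation, here is a sketch of a greedy decoding loop with two hypothetical Hugging Face-style model objects (expert and amateur); this is not an llama.cpp design, just the per-token flow:

```python
import math
import torch

@torch.no_grad()
def contrastive_greedy_generate(expert, amateur, input_ids,
                                max_new_tokens=64, alpha=0.1, beta=0.5):
    for _ in range(max_new_tokens):
        # one forward pass per model per step (this is the extra cost vs. plain greedy)
        expert_logits = expert(input_ids).logits[:, -1, :]     # [batch, vocab]
        amateur_logits = amateur(input_ids).logits[:, -1, :]
        # Algorithm 2 from the appendix above
        cutoff = math.log(alpha) + expert_logits.max(dim=-1, keepdim=True).values
        diffs = (1 + beta) * expert_logits - beta * amateur_logits
        cd_logits = diffs.masked_fill(expert_logits < cutoff, float("-inf"))
        next_token = cd_logits.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```

The structure is close to speculative decoding's "run a small model alongside a big one each step", which is why reusing that machinery comes up below.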
And here is GPT 3.5 16k Turbo's take on the approach required, along with what it has to say about your statement (the referenced output is not reproduced here).
It seems like something that could be enabled once speculative decoding with smaller models is implemented, @KerfuffleV2?
KerfuffleV2 commented Oct 1, 2023
Yes, it does kind of sound like something that could at least reuse parts of the existing speculative stuff. You might not even need a completely separate model: https://arxiv.org/abs/2309.08168
By the way, you might get more responses if you created this as a discussion rather than an issue.