Contrastive search: A feature to strive for alongside speculative decoding? #3450
Replies: 3 comments
-
Why not do DoLa instead? It should be easier to implement and we wouldn't need 2 different models. "A similar concept to ours is Contrastive Decoding (CD) (Li et al., 2022), aimed at enhancing fluency and coherence by contrasting expert (strong) and amateur (weak) LMs. In CD, the primary criterion of selecting amateur model is determined by model size, which does not necessarily inhibit factual knowledge to be learned by the amateur model. Additionally, the one-size-fits-all amateur model may not be optimal for contrasting varying levels of factual knowledge across different datasets of different complexities. Unlike CD, which uses a static amateur LM, our DoLa dynamically selects early layers for less factual predictions based on token difficulty, as outlined in Section 2.2. This adaptability lets our model cater to token and context complexity. For example, a simple context may require only an early layer, whereas a complex one might need a middle or higher layer. Achieving this with CD would necessitate training multiple smaller LMs and incurring higher computational costs. In contrast, DoLa requires just one forward pass with efficient early exiting, adding minimal latency from ×1.01 to ×1.08." Source: https://arxiv.org/abs/2309.03883
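For intuition, here is a minimal PyTorch sketch of the layer contrast the quote describes (illustrative only; the function name, the JSD-based layer selection shown here, and the alpha value are assumptions on my part, not the paper's reference code):

```python
import torch
import torch.nn.functional as F

def dola_logits(layer_logits, candidate_layers, alpha=0.1):
    """layer_logits[i]: vocab logits from early-exiting at layer i (1D tensors);
    layer_logits[-1] is the final ("mature") layer; candidate_layers indexes
    the premature layers that may be contrasted against."""
    mature = F.log_softmax(layer_logits[-1], dim=-1)

    def jsd(p_log, q_log):
        # Jensen-Shannon divergence between two log-distributions
        m_log = (0.5 * (p_log.exp() + q_log.exp())).log()
        return 0.5 * (F.kl_div(m_log, p_log, log_target=True, reduction="sum")
                      + F.kl_div(m_log, q_log, log_target=True, reduction="sum"))

    # Dynamically pick the premature layer whose distribution diverges most from
    # the final layer -- the token-difficulty-adaptive choice the quote mentions.
    premature_idx = max(candidate_layers,
                        key=lambda i: jsd(mature, F.log_softmax(layer_logits[i], dim=-1)))
    premature = F.log_softmax(layer_logits[premature_idx], dim=-1)

    # Contrast mature vs. premature, keeping only tokens the final layer finds plausible.
    cutoff = alpha * mature.exp().max()
    return (mature - premature).masked_fill(mature.exp() < cutoff, float("-inf"))
```

The point being: it is still one model and one forward pass, with the "amateur" coming from an early-exit head rather than a second network.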
-
I keep thinking about this: DoLa might not work great for storytelling, but it should enhance the accuracy of grammar-constrained outputs. Any thoughts? I noticed that nobody has requested (or talked about) integrating it into llama.cpp (even though DoLa is pretty popular) - how come?
-
Contrastive Search worked very well for me in a couple of cases when I was using Transformers or CTransformers with small full-rank or quantized models, respectively. Contrastive search made the models much more "coherent" and made the difference between "yeah, it runs, sort of" and "wow!". With Transformers, the main thing was access to the hyperparameters penalty_alpha and top_k. With llama.cpp, would it be as easy as exposing penalty_alpha? Or would there be a lot of other work behind the scenes?
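For reference, in Transformers this is just two extra arguments to generate(): contrastive search is selected when penalty_alpha is set together with a small top_k (the model below is only an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The llama.cpp project is", return_tensors="pt")
# penalty_alpha > 0 with a small top_k switches generate() to contrastive search
out = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```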
-
#3278
logikstate originally wrote this issue, and @KerfuffleV2 suggested moving it over to a discussion.
I'll paste some of the previous discussion back in here, but the gist is:
- Could Llama.cpp support contrastive search alongside speculative decoding? It seems like this technique has the potential to radically improve model performance.
Previous Discussion
logikstate
This paper has a method, similar to speculative sampling, that improves models by sampling the lower-quality model for tokens to avoid, thus increasing the quality of the output of the higher-quality model. Allegedly this leads to LLaMA-65B outperforming LLaMA 2, GPT-3.5 and PaLM 2-L on the HellaSwag commonsense reasoning benchmark.
https://arxiv.org/abs/2309.09117
"We demonstrate that Contrastive Decoding -- a simple, computationally light, and training-free text generation method proposed by Li et al 2022 -- achieves large out-of-the-box improvements over greedy decoding on a variety of reasoning tasks. Originally shown to improve the perceived quality of long-form text generation, Contrastive Decoding searches for strings that maximize a weighted difference in likelihood between strong and weak models. We show that Contrastive Decoding leads LLaMA-65B to outperform LLaMA 2, GPT-3.5 and PaLM 2-L on the HellaSwag commonsense reasoning benchmark, and to outperform LLaMA 2, GPT-3.5 and PaLM-540B on the GSM8K math word reasoning benchmark, in addition to improvements on a collection of other tasks. Analysis suggests that Contrastive Decoding improves over existing methods by preventing some abstract reasoning errors, as well as by avoiding simpler modes such as copying sections of the input during chain-of-thought. Overall, Contrastive Decoding outperforms nucleus sampling for long-form generation and greedy decoding for reasoning tasks, making it a powerful general purpose method for generating text from language models."
KerfuffleV2 commented Sep 22, 2023
I was looking at this a few days ago, but it seems pretty complicated. Unlike the other samplers that you can just give the last tokens + current logits to, it seems like contrastive decoding requires a different approach. (Correct me if I'm wrong.)
I tried to find a simple example of implementing it but wasn't successful.
__
IridiumMaster
Here's what they list in their appendix:
A.1 CODE IMPLEMENTATION
We include PyTorch implementations of contrastive decoding in Algorithm 1 and Algorithm 2.

Algorithm 1: Original formulation

```python
# expert_logits - unnormalized scores from the expert model
# amateur_logits - unnormalized scores from the amateur model
# amateur_temp - temperature to normalize amateur distribution
# alpha - masking threshold
expert_probs = softmax(expert_logits, dim=-1)
amateur_probs = softmax(amateur_logits / amateur_temp, dim=-1)
cutoff = alpha * expert_probs.max(dim=-1, keepdim=True).values
diffs = log(expert_probs) - log(amateur_probs)
cd_logits = diffs.masked_fill(expert_probs < cutoff, -float('inf'))
```

Algorithm 2: Our formulation

```python
# expert_logits - unnormalized scores from the expert model
# amateur_logits - unnormalized scores from the amateur model
# alpha - masking threshold
# beta - expert-amateur tradeoff parameter
cutoff = log(alpha) + expert_logits.max(dim=-1, keepdim=True).values
diffs = (1 + beta) * expert_logits - beta * amateur_logits
cd_logits = diffs.masked_fill(expert_logits < cutoff, -float('inf'))
```
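As a rough illustration of how the second formulation would slot into generation, here is a sketch of a greedy decoding loop with two hypothetical Hugging Face-style model objects (expert and amateur); this is not an llama.cpp design, just the per-token flow:

```python
import math
import torch

@torch.no_grad()
def contrastive_greedy_generate(expert, amateur, input_ids,
                                max_new_tokens=64, alpha=0.1, beta=0.5):
    for _ in range(max_new_tokens):
        # one forward pass per model per step (this is the extra cost vs. plain greedy)
        expert_logits = expert(input_ids).logits[:, -1, :]     # [batch, vocab]
        amateur_logits = amateur(input_ids).logits[:, -1, :]
        # Algorithm 2 from the appendix above
        cutoff = math.log(alpha) + expert_logits.max(dim=-1, keepdim=True).values
        diffs = (1 + beta) * expert_logits - beta * amateur_logits
        cd_logits = diffs.masked_fill(expert_logits < cutoff, float("-inf"))
        next_token = cd_logits.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```

The structure is close to speculative decoding's "run a small model alongside a big one each step", which is why reusing that machinery comes up below.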
And here is GPT 3.5 16k Turbo's take on the approach required, along with what it has to say about your statement (the referenced output is not reproduced here).
It seems like something that could be enabled once speculative decoding with smaller models is implemented, @KerfuffleV2?
KerfuffleV2 commented Oct 1, 2023
Yes, it does kind of sound like something that could at least reuse parts of the existing speculative stuff. You might not even need a completely separate model: https://arxiv.org/abs/2309.08168
By the way, you might get more responses if you created this as a discussion rather than an issue.