
Implement DRY penalty #637

Merged: 35 commits into master, Aug 27, 2024
Conversation

EricLBuehler (Owner)

@p-e-w, could you please give the implementation a quick check? I'm not sure if you are familiar with Rust, but I ported the algorithm from the oobabooga implementation you linked.

Refs #635.
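For reference, here is a rough sketch of the matching loop as I understand the algorithm (illustrative only, not the exact sampler.rs code; the function name, signature, and the match-length cap are assumptions):

```rust
// Rough sketch of a DRY penalty pass. For every earlier occurrence of the most
// recent token, measure how long the context suffix matches the tokens
// preceding that occurrence, then penalize the token that previously continued
// the repeated sequence.
fn dry_penalty_sketch(
    logits: &mut [f32],
    toks: &[u32],              // penalty context (token history)
    sequence_breakers: &[u32], // token IDs of the configured breakers
    multiplier: f32,
    base: f32,
    allowed_length: usize,
    max_match_len: usize,      // cap on match length (see the commit list below)
) {
    if multiplier == 0.0 || toks.len() < 2 {
        return;
    }
    let last = *toks.last().unwrap();
    for i in 0..toks.len() - 1 {
        if toks[i] != last {
            continue;
        }
        // Walk backwards while the suffix keeps matching and no breaker is hit.
        let mut len = 1usize;
        while len <= i
            && len < toks.len() - 1
            && len < max_match_len
            && toks[i - len] == toks[toks.len() - 1 - len]
            && !sequence_breakers.contains(&toks[i - len])
        {
            len += 1;
        }
        if len >= allowed_length {
            // The token that followed this earlier occurrence would extend the
            // repetition, so it receives the exponential penalty.
            let candidate = toks[i + 1] as usize;
            if candidate < logits.len() {
                logits[candidate] -= multiplier * base.powf((len - allowed_length) as f32);
            }
        }
    }
}
```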

EricLBuehler added the "new feature" (New feature or request) label on Jul 27, 2024

github-actions bot commented Jul 27, 2024

Code Metrics Report
  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                   11          102          101            0            1
 Python                 46         2018         1718           62          238
 TOML                   20          619          546           11           62
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          196          169            1           26
 (Total)                            273          201           32           40
-------------------------------------------------------------------------------
 Markdown               29         2063            0         1568          495
 |- BASH                 5          101           98            0            3
 |- JSON                 1           12           12            0            0
 |- Python               5           92           82            0           10
 |- Rust                 6          408          365           19           24
 |- TOML                 2           75           63            0           12
 (Total)                           2751          620         1587          544
-------------------------------------------------------------------------------
 Rust                  198        61913        56251         1123         4539
 |- Markdown           102          946           13          881           52
 (Total)                          62859        56264         2004         4591
===============================================================================
 Total                 315        67247        59057         2766         5424
===============================================================================
  

p-e-w (Contributor) commented Jul 29, 2024

Thank you for implementing this so quickly! I have submitted my review in the form of a pull request into this branch: #645.

* Silence bogus Clippy warning

Clippy's suggestion cannot be implemented because of borrowing issues

* Get rid of unnecessary type annotations

Interesting that Clippy doesn't catch this

* Store default sequence breakers in a slice

It's nicer when the length is not hardcoded

* Make default sequence breakers private

No need to leak this as it's not used elsewhere

* Limit match length

Avoids quadratic runtime and potential DoS with adversarial inputs

Ref oobabooga/text-generation-webui#6047

* "Fix" sequence breaker tokenization

Most tokenizers encode punctuation tokens differently depending on where they occur in the input, and which tokens surround them. With the default sequence breakers, the appropriate encoding usually corresponds to the encoding produced when the token occurs after a word, rather than by itself. To emulate this, prefix the token with "a" before encoding, and extract the final token of the result.

See LostRuins/koboldcpp#982 for a correct solution to this problem.
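To illustrate the workaround described in that last commit message, a hedged sketch of deriving sequence-breaker token IDs (it assumes the `tokenizers` crate; the helper name is illustrative and error handling is simplified):

```rust
use tokenizers::Tokenizer;

// Encode each sequence breaker as if it appeared after a word, then keep only
// the final token ID of the result.
fn sequence_breaker_ids(tokenizer: &Tokenizer, breakers: &[&str]) -> Vec<u32> {
    breakers
        .iter()
        .filter_map(|s| {
            // Prefix with "a" so punctuation gets its "after a word" encoding.
            let encoded = tokenizer.encode(format!("a{s}"), false).ok()?;
            encoded.get_ids().last().copied()
        })
        .collect()
}
```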
EricLBuehler (Owner, Author) commented Jul 29, 2024

Hey @p-e-w! Do you think this is ready to merge (that is, the implementation is done & correct)?

p-e-w (Contributor) commented Jul 30, 2024

@EricLBuehler

Give me a few days to test and verify this, I will let you know once I'm sure. Looks fine at first glance though!

polarathene (Contributor) left a comment

Just some optional suggestions; I haven't really done a proper review.

(Inline review comments on mistralrs-core/src/sampler.rs, marked resolved.)
EricLBuehler (Owner, Author)

@polarathene, thanks for the review; nice to see you back! I've merged your suggestions.

p-e-w (Contributor) commented Aug 3, 2024

@EricLBuehler

I'm having a hard time testing this properly because of #666.

@@ -488,7 +488,7 @@ impl Sampler {
         let match_indices = toks
             .par_iter()
             .enumerate()
-            .take(toks.len() - 1)
+            .take(toks.len().saturating_sub(1))
Contributor

Just ran into this problem before I noticed your commit. But why can toks.len() be 0 here in the first place? That doesn't make sense to me. If I enter a prompt (e.g. "test"), then apply_dry_penalty is called with an empty context. Why?

Owner Author

So this can only be caused when sampling the token resulting from prompt processing. We discard all prompt tokens from the penalty context:

https://github.com/EricLBuehler/mistral.rs/blob/master/mistralrs-core/src/pipeline/sampling.rs#L290

Does DRY require those tokens to be included?

Contributor

I'm afraid I don't understand. Why is sampling needed during prompt processing?

Sure, the transformer will by construction always output logits for each token position. But samplers are only getting involved once a token is actually drawn from the resulting distribution. And that only happens when new tokens are generated, right? In which case the context should always be non-empty after a prompt has been entered.

Owner Author

> Why is sampling needed during prompt processing?

After the model processes the prompt, it produces a token distribution as a result and we sample that. The Sampler::apply_penalties method has inherent support for the case where the generated tokens do not exist yet, as we just iterate over that context. Perhaps we need a case here to handle this?

Contributor

I feel bad for taking so much of your time with this, but I just don't get it. Why does sampling ever happen with an empty context? The only purpose of sampling (i.e., drawing from the probability distribution, rather than merely generating the distribution) is to generate a new token, no?

Let's say the user enters the prompt "Hello", and then runs generation. Why is Sampler::apply_penalties called with an empty context, rather than with context "Hello"? The program doesn't need to sample during prompt processing (since no tokens are being generated at previous positions), so why is this happening?

Owner Author

No problem, sorry for any confusion.

> Let's say the user enters the prompt "Hello", and then runs generation. Why is Sampler::apply_penalties called with an empty context, rather than with context "Hello"? The program doesn't need to sample during prompt processing (since no tokens are being generated at previous positions), so why is this happening?

We process the prompt and then sample the distribution of the last token to get the next token. So if we don't exclude the prompt, the context would be the prompt here. This is intentional, but do you think we should remove the part where we exclude the prompt (below)?

https://github.com/EricLBuehler/mistral.rs/blob/master/mistralrs-core/src/pipeline/sampling.rs#L290

Contributor

Thank you, I understand now. So this is intended to be a feature?

It seems that this breaks many assumptions made by frontends (and users) about how things work. The most problematic effect is that it makes sampling a stateful affair, where the original prompt range is somehow "remembered" during generation, and the context given to the samplers doesn't match the input given to the transformer.

In a chat interface, this means repetition penalties cannot take into account previous messages while generating a new one... which is the exact opposite of what we want, since models tend to repeat previous messages. But it also seems wrong philosophically. If you cancel the generation process midway, and then restart it from that position, you might get different tokens than you would have if you had allowed it to complete, because now the prompt includes the partially generated output.

I don't believe other loaders do this, and IMO this mechanism should indeed be removed completely, for all samplers.

Owner Author

Thanks for pointing this out, I agree that the statefulness makes this incorrect!

> I don't believe other loaders do this, and IMO this mechanism should indeed be removed completely, for all samplers.

Sounds good! I've removed this functionality completely now.
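For clarity, the practical effect of that change is that the penalty context is now simply the full token history. An illustrative sketch (the struct and field names are made up, not the actual sequence types):

```rust
// Illustrative only: after removing the prompt exclusion, the penalty context
// covers the whole history (prompt + generated tokens), not just the completion.
struct PenaltyCtx {
    prompt_toks: Vec<u32>,
    generated_toks: Vec<u32>,
}

impl PenaltyCtx {
    fn penalty_context(&self) -> Vec<u32> {
        self.prompt_toks
            .iter()
            .chain(self.generated_toks.iter())
            .copied()
            .collect()
    }
}
```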

EricLBuehler (Owner, Author)

@p-e-w, does this look correct?

cargo run --features cuda -- -i --isq q4k plain -m microsoft/Phi-3-mini-128k-instruct -a phi3

2024-08-20T13:42:48.556788Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-08-20T13:42:48.556837Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-20T13:42:48.556848Z  INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-08-20T13:42:48.556957Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `microsoft/Phi-3-mini-128k-instruct`
2024-08-20T13:42:48.556993Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `microsoft/Phi-3-mini-128k-instruct`
2024-08-20T13:42:48.722918Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
2024-08-20T13:42:48.767432Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `microsoft/Phi-3-mini-128k-instruct`
2024-08-20T13:42:48.873727Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `microsoft/Phi-3-mini-128k-instruct`
2024-08-20T13:42:48.874018Z  INFO mistralrs_core::pipeline::normal: Loading model `microsoft/Phi-3-mini-128k-instruct` on cuda[0].
2024-08-20T13:42:48.935684Z  INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.9
2024-08-20T13:42:49.023637Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2024-08-20T13:42:49.023743Z  INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 32064, hidden_act: Silu, hidden_size: 3072, intermediate_size: 8192, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 32, rms_norm_eps: 1e-5, rope_theta: 10000.0, bos_token_id: Some(1), eos_token_id: Some(32000), rope_scaling: Some({"type": Phi3RopeScaling(Right("longrope")), "long_factor": Phi3RopeScaling(Left([1.0700000524520874, 1.1200000047683716, 1.149999976158142, 1.4199999570846558, 1.569999933242798, 1.7999999523162842, 2.129999876022339, 2.129999876022339, 3.009999990463257, 5.910000324249268, 6.950000286102295, 9.070000648498535, 9.93000030517578, 10.710000038146973, 11.130000114440918, 14.609999656677246, 15.409998893737791, 19.809999465942383, 37.279998779296875, 38.279998779296875, 38.599998474121094, 40.12000274658203, 46.20000457763672, 50.94000625610352, 53.66000747680664, 54.9373893737793, 56.89738845825195, 57.28738784790039, 59.98738479614258, 60.86738586425781, 60.88738632202149, 61.71739196777344, 62.91739273071289, 62.957393646240234, 63.41739273071289, 63.8173942565918, 63.83739471435547, 63.89739608764648, 63.93739700317383, 64.06739807128906, 64.11434936523438, 64.12435150146484, 64.15435028076172, 64.19435119628906, 64.24435424804688, 64.57435607910156, 64.69000244140625, 64.76000213623047])), "short_factor": Phi3RopeScaling(Left([1.1, 1.1, 1.1, 1.3000000000000005, 1.3500000000000003, 1.3500000000000003, 1.4000000000000004, 1.5500000000000005, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.0500000000000007, 2.0500000000000007, 2.0500000000000007, 2.0500000000000007, 2.0500000000000007, 2.0500000000000007, 2.1000000000000005, 2.1000000000000005, 2.1500000000000004, 2.25, 2.25, 2.25, 2.25, 2.25, 2.3999999999999995, 2.4499999999999993, 2.499999999999999, 2.6999999999999984, 2.6999999999999984, 2.7499999999999982, 2.799999999999998, 2.8999999999999977, 3.049999999999997]))}), max_position_embeddings: 131072, use_flash_attn: false, sliding_window: Some(262144), original_max_position_embeddings: 4096, quantization_config: None }
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:01<00:00, 77.49it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:02<00:00, 23.66it/s]
2024-08-20T13:42:52.401371Z  INFO mistralrs_core::pipeline::isq: Applying in-situ quantization into Q4K to 129 tensors.
2024-08-20T13:42:52.403947Z  INFO mistralrs_core::pipeline::isq: Applying ISQ on 22 threads.
2024-08-20T13:43:02.482772Z  INFO mistralrs_core::pipeline::isq: Applied in-situ quantization into Q4K to 129 tensors out of 129 total tensors. Took 10.08s
2024-08-20T13:43:02.483047Z  INFO mistralrs_core::paged_attention: Allocating 2379 MB for PagedAttention KV cache
2024-08-20T13:43:02.483083Z  INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 198 GPU blocks: available context length is 6336 tokens
2024-08-20T13:43:02.570913Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "<|endoftext|>", "<|assistant|>", "<|end|>", unk_tok = <unk>
2024-08-20T13:43:02.575197Z  INFO mistralrs_server: Model loaded.
2024-08-20T13:43:02.577222Z  INFO mistralrs_core: Enabling GEMM reduced precision in BF16.
2024-08-20T13:43:02.842321Z  INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2024-08-20T13:43:02.844652Z  INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2024-08-20T13:43:02.844801Z  INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1, dry_params: Some(DrySamplingParams { sequence_breakers: ["\n", ":", "\\", "*"], multiplier: 1.0, base: 1.75, allowed_length: 2 }) }
> hi
Hello! How can I assist you today?
> what is graphene
Graphene is a single layer of carbon atoms arranged in a two-dimensional honeycomb lattice. It is renowned for its exceptional strength, flexibility, electrical conductivity, and transparency. Discovered in 2004 by Andre Geim and Konstantin Novoselov, who later won the Nobel Prize in Physics for their work, graphene has potential applications in various fields such as electronics, energy storage, and materials science.
> write an essay
Title: The Revolutionary Potential of Graphene


Introduction:

Graphene, a material composed of a single atomic layer of sp2-bonded carbon atoms, has emerged as one of the most promising materials of the 21st century. Its discovery has sparked a revolution in material science, with potential applications that span across multiple industries. This essay explores the unique properties of graphene and its transformative potential in revolutionizing technology and industry.

Body:
1. Historical Context:
   - Discuss the discovery of graphite and its limitations.
   – Highlight the groundbreaking work of Andre Geib and Konstantin Novosolov.
   
2. Unique Properties:
    - Elaborate on the exceptional mechanical strength, which is over 100 times stronger than steel.
     - Discus the remarkable electrical and thermal conductivity that surpasses copper.
      - Explain the transparence and flexibility that make it suitable for transparent conductive films.
      
3. Applications:
     a. Electronics:
        - Describe how graphene's high electron mobility could lead to faster and more efficient transistors.
        – Explore potential uses in flexible displays and touchscreens.
        
     b. Energy Storage:
       - Discover how graphite's surface area can enhance battery performance in energy storage devices.
          - Predict future advancements in supercapacitors and lithium-ion batteries.
          
     c. Material Science:
      – Examine how grapheme's strength could lead the development of ultra-lightweight materials for aerospace applications.
            - Consider its role in improving the durability and performance of sports equipment.
            
4. Challenges and Future Prospects:
  - Address the current challenges in mass production and integration into existing technologies.
  – Speculate on future research directions that could overcome these obstacles.
- Conclude by emphasizing the transformative impact that graphene could have on technology and society at large, urging continued investment and research in this field.
>

EricLBuehler and others added 13 commits August 24, 2024 08:09
Credit to @p-e-w for finding this!

Co-authored-by: Philipp Emanuel Weidmann <pew@worldwidemann.com>
* Add custom logits processor api

* Typos

* Nicer interface and update example

* Fix doctest

* Update docs
* Add gemma2 paged attn support

* Non cuda support?

* Remove error

* It works
* Support GGUF bf16 tensors

* Fix loading of bf16 ggml tensor

* Fix dequant of bf16

* Use merged rev
…on (#707)

* Flash attention varlen kind of works

* Seems to work

* Now it's nice

* Sliding window support and clippy

* Remove warning

* Support smollm

* Update rev to match merged
* Update image_seq_len

* Update the examples

* Format
* Copy the model

* Add most of it

* Add the blocksparse moe parts

* Clippy

* Fix mscales

* A batch of fixes

* Correctly cast it

* Handle isq on gate

* Even more progress

* Runs now

* Clippy

* Fix to use layernorm

* Remove unused

* Add docs
p-e-w (Contributor) left a comment

@EricLBuehler

  • Verified that the code is equivalent to the original Python implementation.
  • Instrumented apply_dry_penalty and checked that it produces the expected penalties.
  • Tested with repetitive output and validated that this version of DRY actually prevents repetition.

As far as I'm concerned, this is now ready to be merged (after applying the two changes above).

One thing you might consider is to disable DRY completely if multiplier is 0, which is what other implementations are doing. Currently, matching is still performed in this case, but has no effect because the resulting penalty is zero. That's a lot of unnecessary work that could be skipped by just not invoking apply_dry_penalty in the first place (and the same optimization could be applied for apply_freq_presc_penalty I think).
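A minimal sketch of that short-circuit (the names here are illustrative, not the actual sampler API):

```rust
// Skip DRY matching entirely when the multiplier is zero, since every
// resulting penalty would be zero anyway.
fn maybe_apply_dry(logits: &mut [f32], context: &[u32], dry_multiplier: f32) {
    if dry_multiplier == 0.0 || context.is_empty() {
        return; // no-op: avoids the whole matching pass
    }
    apply_dry_penalty_sketch(logits, context, dry_multiplier);
}

fn apply_dry_penalty_sketch(_logits: &mut [f32], _context: &[u32], _multiplier: f32) {
    // the actual matching / penalty application would run here
}
```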

(Inline review comments on mistralrs-core/src/sampler.rs, marked resolved.)
EricLBuehler linked an issue on Aug 27, 2024 that may be closed by this pull request
EricLBuehler (Owner, Author)

@p-e-w @polarathene thank you for your reviews! I'll merge this PR as it looks good and generation is great with it!

EricLBuehler merged commit d35f62e into master on Aug 27, 2024
17 checks passed
EricLBuehler deleted the dry_penalty branch on August 27, 2024 at 22:32
Labels: new feature (New feature or request)

Merging this pull request may close: Add DRY repetition penalty

3 participants