
Implement DRY penalty #637

Merged: 35 commits into master, Aug 27, 2024
Conversation

EricLBuehler (Owner)

@p-e-w, could you please give the implementation a quick check? I'm not sure if you are familiar with Rust, but I ported the algorithm from the oobabooga implementation you linked.

Refs #635.
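For reference, here is a rough sketch of the matching loop as I understand the algorithm (illustrative only, not the exact sampler.rs code; the function name, signature, and the match-length cap are assumptions):

```rust
// Rough sketch of a DRY penalty pass. For every earlier occurrence of the most
// recent token, measure how long the context suffix matches the tokens
// preceding that occurrence, then penalize the token that previously continued
// the repeated sequence.
fn dry_penalty_sketch(
    logits: &mut [f32],
    toks: &[u32],              // penalty context (token history)
    sequence_breakers: &[u32], // token IDs of the configured breakers
    multiplier: f32,
    base: f32,
    allowed_length: usize,
    max_match_len: usize,      // cap on match length (see the commit list below)
) {
    if multiplier == 0.0 || toks.len() < 2 {
        return;
    }
    let last = *toks.last().unwrap();
    for i in 0..toks.len() - 1 {
        if toks[i] != last {
            continue;
        }
        // Walk backwards while the suffix keeps matching and no breaker is hit.
        let mut len = 1usize;
        while len <= i
            && len < toks.len() - 1
            && len < max_match_len
            && toks[i - len] == toks[toks.len() - 1 - len]
            && !sequence_breakers.contains(&toks[i - len])
        {
            len += 1;
        }
        if len >= allowed_length {
            // The token that followed this earlier occurrence would extend the
            // repetition, so it receives the exponential penalty.
            let candidate = toks[i + 1] as usize;
            if candidate < logits.len() {
                logits[candidate] -= multiplier * base.powf((len - allowed_length) as f32);
            }
        }
    }
}
```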

EricLBuehler added the "new feature" (New feature or request) label on Jul 27, 2024

github-actions bot commented Jul 27, 2024

Code Metrics Report
  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                   11          102          101            0            1
 Python                 46         2018         1718           62          238
 TOML                   20          619          546           11           62
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          196          169            1           26
 (Total)                            273          201           32           40
-------------------------------------------------------------------------------
 Markdown               29         2063            0         1568          495
 |- BASH                 5          101           98            0            3
 |- JSON                 1           12           12            0            0
 |- Python               5           92           82            0           10
 |- Rust                 6          408          365           19           24
 |- TOML                 2           75           63            0           12
 (Total)                           2751          620         1587          544
-------------------------------------------------------------------------------
 Rust                  198        61913        56251         1123         4539
 |- Markdown           102          946           13          881           52
 (Total)                          62859        56264         2004         4591
===============================================================================
 Total                 315        67247        59057         2766         5424
===============================================================================
  

p-e-w (Contributor) commented Jul 29, 2024

Thank you for implementing this so quickly! I have submitted my review in the form of a pull request into this branch: #645.

* Silence bogus Clippy warning

Clippy's suggestion cannot be implemented because of borrowing issues

* Get rid of unnecessary type annotations

Interesting that Clippy doesn't catch this

* Store default sequence breakers in a slice

It's nicer when the length is not hardcoded

* Make default sequence breakers private

No need to leak this as it's not used elsewhere

* Limit match length

Avoids quadratic runtime and potential DoS with adversarial inputs

Ref oobabooga/text-generation-webui#6047

* "Fix" sequence breaker tokenization

Most tokenizers encode punctuation tokens differently depending on where they occur in the input, and which tokens surround them. With the default sequence breakers, the appropriate encoding usually corresponds to the encoding produced when the token occurs after a word, rather than by itself. To emulate this, prefix the token with "a" before encoding, and extract the final token of the result.

See LostRuins/koboldcpp#982 for a correct solution to this problem.
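To illustrate the workaround described in that last commit message, a hedged sketch of deriving sequence-breaker token IDs (it assumes the `tokenizers` crate; the helper name is illustrative and error handling is simplified):

```rust
use tokenizers::Tokenizer;

// Encode each sequence breaker as if it appeared after a word, then keep only
// the final token ID of the result.
fn sequence_breaker_ids(tokenizer: &Tokenizer, breakers: &[&str]) -> Vec<u32> {
    breakers
        .iter()
        .filter_map(|s| {
            // Prefix with "a" so punctuation gets its "after a word" encoding.
            let encoded = tokenizer.encode(format!("a{s}"), false).ok()?;
            encoded.get_ids().last().copied()
        })
        .collect()
}
```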
EricLBuehler (Owner, Author) commented Jul 29, 2024

Hey @p-e-w! Do you think this is ready to merge (that is, the implementation is done & correct)?

p-e-w (Contributor) commented Jul 30, 2024

@EricLBuehler

Give me a few days to test and verify this, I will let you know once I'm sure. Looks fine at first glance though!

polarathene (Contributor) left a comment

Just some optional suggestions; I haven't really done a proper review.

(Inline review comments on mistralrs-core/src/sampler.rs, marked resolved.)
EricLBuehler (Owner, Author)

@polarathene, thanks for the review; nice to see you back! I've merged your suggestions.

p-e-w (Contributor) commented Aug 3, 2024

@EricLBuehler

I'm having a hard time testing this properly because of #666.

@@ -488,7 +488,7 @@ impl Sampler {
         let match_indices = toks
             .par_iter()
             .enumerate()
-            .take(toks.len() - 1)
+            .take(toks.len().saturating_sub(1))
Contributor

Just ran into this problem before I noticed your commit. But why can toks.len() be 0 here in the first place? That doesn't make sense to me. If I enter a prompt (e.g. "test"), then apply_dry_penalty is called with an empty context. Why?

Owner Author

So this can only be caused when sampling the token resulting from prompt processing. We discard all prompt tokens from the penalty context:

https://github.com/EricLBuehler/mistral.rs/blob/master/mistralrs-core/src/pipeline/sampling.rs#L290

Does DRY require those tokens to be included?

Contributor

I'm afraid I don't understand. Why is sampling needed during prompt processing?

Sure, the transformer will by construction always output logits for each token position. But samplers are only getting involved once a token is actually drawn from the resulting distribution. And that only happens when new tokens are generated, right? In which case the context should always be non-empty after a prompt has been entered.

Owner Author

> Why is sampling needed during prompt processing?

After the model processes the prompt, it produces a token distribution as a result and we sample that. The Sampler::apply_penalties method has inherent support for the case where the generated tokens do not exist yet, as we just iterate over that context. Perhaps we need a case here to handle this?

Contributor

I feel bad for taking so much of your time with this, but I just don't get it. Why does sampling ever happen with an empty context? The only purpose of sampling (i.e., drawing from the probability distribution, rather than merely generating the distribution) is to generate a new token, no?

Let's say the user enters the prompt "Hello", and then runs generation. Why is Sampler::apply_penalties called with an empty context, rather than with context "Hello"? The program doesn't need to sample during prompt processing (since no tokens are being generated at previous positions), so why is this happening?

Owner Author

No problem, sorry for any confusion.

> Let's say the user enters the prompt "Hello", and then runs generation. Why is Sampler::apply_penalties called with an empty context, rather than with context "Hello"? The program doesn't need to sample during prompt processing (since no tokens are being generated at previous positions), so why is this happening?

We process the prompt and then sample the distribution of the last token to get the next token. So if we don't exclude the prompt, the context would be the prompt here. This is intentional, but do you think we should remove the part where we exclude the prompt (below)?

https://github.com/EricLBuehler/mistral.rs/blob/master/mistralrs-core/src/pipeline/sampling.rs#L290

Contributor

Thank you, I understand now. So this is intended to be a feature?

It seems that this breaks many assumptions made by frontends (and users) about how things work. The most problematic effect is that it makes sampling a stateful affair, where the original prompt range is somehow "remembered" during generation, and the context given to the samplers doesn't match the input given to the transformer.

In a chat interface, this means repetition penalties cannot take into account previous messages while generating a new one... which is the exact opposite of what we want, since models tend to repeat previous messages. But it also seems wrong philosophically. If you cancel the generation process midway, and then restart it from that position, you might get different tokens than you would have if you had allowed it to complete, because now the prompt includes the partially generated output.

I don't believe other loaders do this, and IMO this mechanism should indeed be removed completely, for all samplers.

Owner Author

Thanks for pointing this out, I agree that the statefulness makes this incorrect!

> I don't believe other loaders do this, and IMO this mechanism should indeed be removed completely, for all samplers.

Sounds good! I've removed this functionality completely now.
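For clarity, the practical effect of that change is that the penalty context is now simply the full token history. An illustrative sketch (the struct and field names are made up, not the actual sequence types):

```rust
// Illustrative only: after removing the prompt exclusion, the penalty context
// covers the whole history (prompt + generated tokens), not just the completion.
struct PenaltyCtx {
    prompt_toks: Vec<u32>,
    generated_toks: Vec<u32>,
}

impl PenaltyCtx {
    fn penalty_context(&self) -> Vec<u32> {
        self.prompt_toks
            .iter()
            .chain(self.generated_toks.iter())
            .copied()
            .collect()
    }
}
```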

EricLBuehler (Owner, Author)

@p-e-w, does this look correct?

cargo run --features cuda -- -i --isq q4k plain -m microsoft/Phi-3-mini-128k-instruct -a phi3

2024-08-20T13:42:48.556788Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-08-20T13:42:48.556837Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-08-20T13:42:48.556848Z  INFO mistralrs_server: Model kind is: normal (no quant, no adapters)
2024-08-20T13:42:48.556957Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `microsoft/Phi-3-mini-128k-instruct`
2024-08-20T13:42:48.556993Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `microsoft/Phi-3-mini-128k-instruct`
2024-08-20T13:42:48.722918Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
2024-08-20T13:42:48.767432Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `microsoft/Phi-3-mini-128k-instruct`
2024-08-20T13:42:48.873727Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `microsoft/Phi-3-mini-128k-instruct`
2024-08-20T13:42:48.874018Z  INFO mistralrs_core::pipeline::normal: Loading model `microsoft/Phi-3-mini-128k-instruct` on cuda[0].
2024-08-20T13:42:48.935684Z  INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.9
2024-08-20T13:42:49.023637Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2024-08-20T13:42:49.023743Z  INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 32064, hidden_act: Silu, hidden_size: 3072, intermediate_size: 8192, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 32, rms_norm_eps: 1e-5, rope_theta: 10000.0, bos_token_id: Some(1), eos_token_id: Some(32000), rope_scaling: Some({"type": Phi3RopeScaling(Right("longrope")), "long_factor": Phi3RopeScaling(Left([1.0700000524520874, 1.1200000047683716, 1.149999976158142, 1.4199999570846558, 1.569999933242798, 1.7999999523162842, 2.129999876022339, 2.129999876022339, 3.009999990463257, 5.910000324249268, 6.950000286102295, 9.070000648498535, 9.93000030517578, 10.710000038146973, 11.130000114440918, 14.609999656677246, 15.409998893737791, 19.809999465942383, 37.279998779296875, 38.279998779296875, 38.599998474121094, 40.12000274658203, 46.20000457763672, 50.94000625610352, 53.66000747680664, 54.9373893737793, 56.89738845825195, 57.28738784790039, 59.98738479614258, 60.86738586425781, 60.88738632202149, 61.71739196777344, 62.91739273071289, 62.957393646240234, 63.41739273071289, 63.8173942565918, 63.83739471435547, 63.89739608764648, 63.93739700317383, 64.06739807128906, 64.11434936523438, 64.12435150146484, 64.15435028076172, 64.19435119628906, 64.24435424804688, 64.57435607910156, 64.69000244140625, 64.76000213623047])), "short_factor": Phi3RopeScaling(Left([1.1, 1.1, 1.1, 1.3000000000000005, 1.3500000000000003, 1.3500000000000003, 1.4000000000000004, 1.5500000000000005, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.0500000000000007, 2.0500000000000007, 2.0500000000000007, 2.0500000000000007, 2.0500000000000007, 2.0500000000000007, 2.1000000000000005, 2.1000000000000005, 2.1500000000000004, 2.25, 2.25, 2.25, 2.25, 2.25, 2.3999999999999995, 2.4499999999999993, 2.499999999999999, 2.6999999999999984, 2.6999999999999984, 2.7499999999999982, 2.799999999999998, 2.8999999999999977, 3.049999999999997]))}), max_position_embeddings: 131072, use_flash_attn: false, sliding_window: Some(262144), original_max_position_embeddings: 4096, quantization_config: None }
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:01<00:00, 77.49it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:02<00:00, 23.66it/s]
2024-08-20T13:42:52.401371Z  INFO mistralrs_core::pipeline::isq: Applying in-situ quantization into Q4K to 129 tensors.
2024-08-20T13:42:52.403947Z  INFO mistralrs_core::pipeline::isq: Applying ISQ on 22 threads.
2024-08-20T13:43:02.482772Z  INFO mistralrs_core::pipeline::isq: Applied in-situ quantization into Q4K to 129 tensors out of 129 total tensors. Took 10.08s
2024-08-20T13:43:02.483047Z  INFO mistralrs_core::paged_attention: Allocating 2379 MB for PagedAttention KV cache
2024-08-20T13:43:02.483083Z  INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 198 GPU blocks: available context length is 6336 tokens
2024-08-20T13:43:02.570913Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "<|endoftext|>", "<|assistant|>", "<|end|>", unk_tok = <unk>
2024-08-20T13:43:02.575197Z  INFO mistralrs_server: Model loaded.
2024-08-20T13:43:02.577222Z  INFO mistralrs_core: Enabling GEMM reduced precision in BF16.
2024-08-20T13:43:02.842321Z  INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2024-08-20T13:43:02.844652Z  INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2024-08-20T13:43:02.844801Z  INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1, dry_params: Some(DrySamplingParams { sequence_breakers: ["\n", ":", "\\", "*"], multiplier: 1.0, base: 1.75, allowed_length: 2 }) }
> hi
Hello! How can I assist you today?
> what is graphene
Graphene is a single layer of carbon atoms arranged in a two-dimensional honeycomb lattice. It is renowned for its exceptional strength, flexibility, electrical conductivity, and transparency. Discovered in 2004 by Andre Geim and Konstantin Novoselov, who later won the Nobel Prize in Physics for their work, graphene has potential applications in various fields such as electronics, energy storage, and materials science.
> write an essay
Title: The Revolutionary Potential of Graphene


Introduction:

Graphene, a material composed of a single atomic layer of sp2-bonded carbon atoms, has emerged as one of the most promising materials of the 21st century. Its discovery has sparked a revolution in material science, with potential applications that span across multiple industries. This essay explores the unique properties of graphene and its transformative potential in revolutionizing technology and industry.

Body:
1. Historical Context:
   - Discuss the discovery of graphite and its limitations.
   – Highlight the groundbreaking work of Andre Geib and Konstantin Novosolov.
   
2. Unique Properties:
    - Elaborate on the exceptional mechanical strength, which is over 100 times stronger than steel.
     - Discus the remarkable electrical and thermal conductivity that surpasses copper.
      - Explain the transparence and flexibility that make it suitable for transparent conductive films.
      
3. Applications:
     a. Electronics:
        - Describe how graphene's high electron mobility could lead to faster and more efficient transistors.
        – Explore potential uses in flexible displays and touchscreens.
        
     b. Energy Storage:
       - Discover how graphite's surface area can enhance battery performance in energy storage devices.
          - Predict future advancements in supercapacitors and lithium-ion batteries.
          
     c. Material Science:
      – Examine how grapheme's strength could lead the development of ultra-lightweight materials for aerospace applications.
            - Consider its role in improving the durability and performance of sports equipment.
            
4. Challenges and Future Prospects:
  - Address the current challenges in mass production and integration into existing technologies.
  – Speculate on future research directions that could overcome these obstacles.
- Conclude by emphasizing the transformative impact that graphene could have on technology and society at large, urging continued investment and research in this field.
>

EricLBuehler and others added 13 commits August 24, 2024 08:09
Credit to @p-e-w for finding this!

Co-authored-by: Philipp Emanuel Weidmann <pew@worldwidemann.com>
* Add custom logits processor api

* Typos

* Nicer interface and update example

* Fix doctest

* Update docs
* Add gemma2 paged attn support

* Non cuda support?

* Remove error

* It works
* Support GGUF bf16 tensors

* Fix loading of bf16 ggml tensor

* Fix dequant of bf16

* Use merged rev
…on (#707)

* Flash attention varlen kind of works

* Seems to work

* Now it's nice

* Sliding window support and clippy

* Remove warning

* Support smollm

* Update rev to match merged
* Update image_seq_len

* Update the examples

* Format
* Copy the model

* Add most of it

* Add the blocksparse moe parts

* Clippy

* Fix mscales

* A batch of fixes

* Correctly cast it

* Handle isq on gate

* Even more progress

* Runs now

* Clippy

* Fix to use layernorm

* Remove unused

* Add docs
p-e-w (Contributor) left a comment

@EricLBuehler

  • Verified that the code is equivalent to the original Python implementation.
  • Instrumented apply_dry_penalty and checked that it produces the expected penalties.
  • Tested with repetitive output and validated that this version of DRY actually prevents repetition.

As far as I'm concerned, this is now ready to be merged (after applying the two changes above).

One thing you might consider is to disable DRY completely if multiplier is 0, which is what other implementations are doing. Currently, matching is still performed in this case, but has no effect because the resulting penalty is zero. That's a lot of unnecessary work that could be skipped by just not invoking apply_dry_penalty in the first place (and the same optimization could be applied for apply_freq_presc_penalty I think).
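A minimal sketch of that short-circuit (the names here are illustrative, not the actual sampler API):

```rust
// Skip DRY matching entirely when the multiplier is zero, since every
// resulting penalty would be zero anyway.
fn maybe_apply_dry(logits: &mut [f32], context: &[u32], dry_multiplier: f32) {
    if dry_multiplier == 0.0 || context.is_empty() {
        return; // no-op: avoids the whole matching pass
    }
    apply_dry_penalty_sketch(logits, context, dry_multiplier);
}

fn apply_dry_penalty_sketch(_logits: &mut [f32], _context: &[u32], _multiplier: f32) {
    // the actual matching / penalty application would run here
}
```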

(Inline review comments on mistralrs-core/src/sampler.rs, marked resolved.)
EricLBuehler linked an issue on Aug 27, 2024 that may be closed by this pull request
EricLBuehler (Owner, Author)

@p-e-w @polarathene thank you for your reviews! I'll merge this PR as it looks good and generation is great with it!

EricLBuehler merged commit d35f62e into master on Aug 27, 2024
17 checks passed
EricLBuehler deleted the dry_penalty branch on August 27, 2024 at 22:32
Labels: new feature (New feature or request)

Merging this pull request may close: Add DRY repetition penalty

3 participants