Merges cannot handle tokens containing spaces. #909
Conversation
Hi. I found this PR through #566. How can I use it? I wasn't able to see the Rust files in my tokenizers site-packages, and the README on the repo doesn't give instructions on building from source.
(force-pushed from 74a4c9e to 0f617cc)
Here are the docs for installing from source: https://huggingface.co/docs/tokenizers/installation#installation-from-sources
I rebased again.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks for the speedy response, Narsil. To clarify: this PR allows the merges in our tokenizer.json file to contain multiple whitespace characters per merge? I'll play around with it tomorrow and let you know 👍
Yes, exactly. This allows any number of spaces, or any other character, in the merges.
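For illustration only (the values below are made up, not taken from this PR), the difference shows up in the `model.merges` field of `tokenizer.json`. Roughly:

```python
# Legacy serialization: each merge is a single space-joined string, so a
# token that itself contains a space cannot be represented unambiguously.
legacy_merges = ["Ġ t", "Ġt he"]

# Serialization enabled by this PR: each merge is a length-2 list of strings,
# so either side may contain spaces (or any other character).
new_merges = [["Ġ", "t"], ["Ġt", "he"]]
```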
Wow, that installation process was painless. I guess there is a first time for everything! Anyway, I'm still facing the error from the original issue: `data did not match any variant of untagged enum ModelWrapper at line 24420 column 3`. The merges in tokenizer.json are now a length-2 list of strings instead of a single string, so I'm pretty sure your branch was used successfully. The error is getting thrown at this line:
Could you share your serialized file somehow? I could take a look.
Hey man, did you get a chance to take a look?
Thanks for the reminder. I just did; it seems the merges you are referring to in your file are not present in the vocabulary, making the merges unsound. How did you create that file?
Here is how I'm testing this in Rust to get a better error message:

```rust
#[test]
// Ensure `BPE::from_file` works as expected.
fn test_bpe_from_file2() {
    // Make sure we can instantiate a BPE model from the files.
    let data = std::fs::read_to_string("test.json").unwrap();
    let bpe: BPE = serde_json::from_str(&data).unwrap();
}
```

Just add this to the Rust source code and run it. Making better error messages all the way into Python is trickier because of the libs we use and how this library is structured (though it would be really cool to have).
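(For reference, and assuming the test is added inside the `tokenizers` crate, something like `cargo test test_bpe_from_file2` should run just this test, since `cargo test` filters tests by name.)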
I'm trying to get BPE to learn the ideal character sequence to split on, instead of splitting on whitespace first. Something like this:

```python
import os

from tokenizers import Tokenizer, normalizers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(BPE())  # type: ignore

# The data has been preprocessed to add Ġ to the beginning of every word in the corpus.
# Lowercase everything, then restore the uppercase Ġ word marker.
tokenizer.normalizer = normalizers.Sequence([Lowercase(), normalizers.Replace("ġ", "Ġ")])

# Customize training.
trainer = BpeTrainer(vocab_size=5000, min_frequency=2,
                     show_progress=True,
                     special_tokens=[
                         "<pad>",
                         "<s>",
                         "</s>",
                         "<unk>",
                         "<mask>",
                     ])

# `file_list_for_language` and `language` are defined elsewhere in my script.
tokenizer.train(files=file_list_for_language, trainer=trainer)

tokenizer_path = f"tokenizers/{language}/"
os.makedirs(tokenizer_path, exist_ok=True)
new_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
new_tokenizer.save_pretrained(tokenizer_path)
```

I'm also doing some postprocessing which modifies the vocabulary directly. This may be where the discrepancy with the merges is coming from:

```python
import json
from typing import Dict

with open(tokenizer_path + "tokenizer.json", "r") as f:
    data = json.load(f)

model_vocab: Dict[str, int] = data["model"]["vocab"]
postprocessed_vocab = {}
for tok, idx in model_vocab.items():
    # Leave the lowest-id tokens (idx < 6) untouched.
    if idx < 6:
        postprocessed_vocab[tok] = idx
        continue
    # Strip whitespace and trailing Ġ markers from regular tokens.
    tok = tok.strip()
    if tok.endswith("Ġ.##"):
        tok = tok[:-4].strip()
    if tok.endswith("Ġ"):
        tok = tok[:-1].strip()
    postprocessed_vocab[tok] = idx
```

I'm not entirely sure how I can "clean" my vocabulary and make the relevant changes in the merges file?
Yes, it's your postprocessing that is most likely causing the issue.
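As a hedged sketch (not from this thread): one way to see why this breaks is that every merge references two vocabulary entries plus their concatenation, so renaming vocab entries without rewriting the merges leaves dangling references. Assuming the list-of-pairs merge format from this PR, a quick consistency check could look like this:

```python
import json

# Hypothetical sanity check: flag merges that reference tokens which are no
# longer present in the (postprocessed) vocabulary.
with open("tokenizer.json", "r") as f:
    data = json.load(f)

vocab = data["model"]["vocab"]
for left, right in data["model"]["merges"]:
    for part in (left, right, left + right):
        if part not in vocab:
            print(f"merge {left!r} + {right!r}: missing token {part!r}")
```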
Hi, can this please be merged as a different (de)serialization format? Several issues over the years have run into this problem and have used this PR to fix it. Making this a separate serialization format would be fully backwards compatible and would explicitly alert people to use the different functions to (de)serialize the tokenizer.
Hey @ashutoshbsathe, if there are more issues that are not linked, or any particular use cases, could you link them here? I'll see what I can do!
This fixes the issue while keeping backward support. We don't want to merge that blindly.
(force-pushed from 11f84e7 to 5026d82)
Would just add a non-legacy test with spaces inside the merges.
Also, this would potentially mean supporting spaces in added tokens 🍭
Done.
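As a rough illustration of what a non-legacy merge containing spaces looks like from the Python side (a sketch with made-up vocab and merges, not the test added in this PR): the `BPE` model already accepts merges as pairs, so a pair whose members contain spaces can round-trip once the serializer writes pairs rather than space-joined strings.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Toy vocab and merges where tokens themselves contain spaces (made up).
vocab = {"a": 0, " ": 1, "a ": 2, "b": 3, "a b": 4}
merges = [("a", " "), ("a ", "b")]

tokenizer = Tokenizer(BPE(vocab, merges))
serialized = tokenizer.to_str()             # merges are written as pairs
roundtrip = Tokenizer.from_str(serialized)  # loads back without ambiguity
```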