
"the" token is splitted to "t" "h" "e" in large scale corpus #437

Closed
guolinke opened this issue Sep 25, 2020 · 15 comments · Fixed by #1433

Comments

@guolinke

I trained the BPE (BertWordPiece) tokenizer myself, using the RoBERTa 160GB data. However, I found that every "the" is broken apart. I checked the learned vocab.txt, and "the" is not in there either.

The parameters used to train the tokenizer:

bpe.train(args.inputs, vocab_size=32768, min_frequency=20)

I guess the count of "the" may overflow in such a large-scale corpus (see the rough estimate after the example below).

An example of the broken output:

john . who . like me . is a writer with an interest in t ##h ##e interp ##lay of words and images . assumed i ' d hidden a cryptic message in t ##h ##e random sequence of little pictures . " write a package . spider earth cutting ? "
he asked . i only responded with five more random emoji . " t ##h ##e cheese running bear got caught in a rains ##how ##er looking for diamonds . " he muse ##d .
and thus began a game that has continued to this day .
one of us texts five random emoji to t ##h ##e other . and t ##h ##e recipient has to write a story ( roughly of tweet length ) to explain each of t ##h ##e emoji in order .
below are eight of our collaborative creations . 
james hannah ##am ' s most recent novel . delicious foods . won t ##h ##e pen / faulk ##ner and hurst ##on / wright legacy awards . and was a new york times notable book .
he teaches in t ##h ##e writing program at t ##h ##e pratt institute
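
To make the overflow guess concrete, here is a rough back-of-the-envelope estimate in Python. The corpus character count and the "th" digram frequency below are illustrative assumptions, not measurements from this corpus:

# Could a single pair count exceed the u32 range on ~160GB of English text?
U32_MAX = 2**32 - 1           # 4_294_967_295

corpus_chars = 160 * 10**9    # ~160 GB of mostly-ASCII text => ~1.6e11 characters (assumed)
th_digram_freq = 0.035        # "th" is roughly 3-4% of English letter pairs (assumed)

th_pair_count = int(corpus_chars * th_digram_freq)
print(f"estimated ('t', 'h') pair count ~ {th_pair_count:,}")   # ~5,600,000,000
print(f"u32 max                         = {U32_MAX:,}")         # 4,294,967,295
print("can overflow a u32 counter:", th_pair_count > U32_MAX)   # True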
Narsil added a commit that referenced this issue Sep 25, 2020
@Narsil
Collaborator

Narsil commented Sep 25, 2020

It's possible. I made a temporary branch that uses u64 everywhere instead of u32:

https://github.com/huggingface/tokenizers/tree/u64_branch

It's a temporary fix; we probably need to think a bit more about how to handle this in the general case.

You can install it from source by going into bindings/python and running `pip install -e .`.

@guolinke
Author

Thanks for the quick response. Actually, I made "the" a special token as a workaround, since other high-frequency tokens (like "and") seem OK.
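
For reference, here is a minimal sketch of that workaround. The file name is a placeholder; the special_tokens argument matches the train() signature that appears later in this thread, with "the" simply appended to the usual BERT special tokens:

from tokenizers import BertWordPieceTokenizer

# Force "the" into the vocabulary by declaring it as a special token, so it
# survives even if its pair counts misbehave during training.
bpe = BertWordPieceTokenizer(clean_text=True, strip_accents=True, lowercase=True)
bpe.train(
    ["corpus.txt"],  # placeholder input file(s)
    vocab_size=32768,
    min_frequency=20,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "the"],
)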

@Narsil
Collaborator

Narsil commented Sep 25, 2020

If "th" and "he" are not tokens either you're probably going to have a subpar tokenizer

Narsil closed this as completed Sep 25, 2020
@guolinke
Author

@Narsil Thanks, I just checked, and it seems both "th" and "he" are there.
I think it would be better to fix the overflow problem, in case we have an even larger corpus.

Narsil reopened this Sep 25, 2020
@Narsil
Collaborator

Narsil commented Sep 25, 2020

I didn't mean to close.

@guolinke
Author

It seems the u64 branch cannot fix the problem. The error:

thread '<unnamed>' panicked at 'Trainer should know how to build BPE: MergeTokenOutOfVocabulary("##h##e")', /home/xxx/tokenizers/tokenizers/src/models/bpe/trainer.rs:573:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "train_wp.py", line 55, in <module>
    main()
  File "train_wp.py", line 50, in main
    bpe.train(args.inputs, vocab_size=args.vocab_size, special_tokens=special_tokens,
  File "/home/xxx/tokenizers/bindings/python/py_src/tokenizers/implementations/bert_wordpiece.py", line 117, in train
    self._tokenizer.train(trainer, files)
pyo3_runtime.PanicException: Trainer should know how to build BPE: MergeTokenOutOfVocabulary("##h##e")

@guolinke
Author

guolinke commented Sep 26, 2020

It seems the master branch also fails with the same error.
The same code runs fine on 0.8.1:

    bpe = BertWordPieceTokenizer(clean_text=True, strip_accents=True, lowercase=True)
    bpe.train(args.inputs)

Update: it seems this commit (95cc8c4) causes the problem.

@guolinke
Author

The master branch also seems much slower at reading files.

0.8.1

[00:03:13] Reading files (19521 Mo)                 ███████████████████████████████████████████████████████████████████████████████████████████████████████████                 100
[00:00:17] Tokenize words                           ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 6593316  /  6593316
[00:00:14] Count pairs                              ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 6593316  /  6593316
[00:00:58] Compute merges                           ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 31116    /    31116

master

[00:17:46] Reading files (19521 Mo)                 ███████████████████████████████████████████████████████████████████████████████████████████████████████████                 100
[00:00:18] Tokenize words                           ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 6593316  /  6593316
[00:00:15] Count pairs                              ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 6593316  /  6593316
[00:01:13] Compute merges                           ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 31117    /    31117

@Narsil
Collaborator

Narsil commented Sep 28, 2020

So there is definitely a slowdown with master that is expected; it's linked to better (and correct) offset tracking. However, it should not be that bad, so I'm going to try to replicate this. (It does not seem to happen on smaller files of ~500 Mo, where the slowdown is only ~20%.)

@Narsil
Collaborator

Narsil commented Sep 28, 2020

This is the sort of variance I'm getting:

master:

[00:01:37] Reading files (1048 Mo)                  █████████████████████████████████████████████████████████████████████                 100
[00:00:02] Tokenize words                           █████████████████████████████████████████████████████████████████████ 1882027  /  1882027
[00:00:09] Count pairs                              █████████████████████████████████████████████████████████████████████ 1882027  /  1882027
[00:00:10] Compute merges                           █████████████████████████████████████████████████████████████████████ 28375    /    28375

v0.8.1

[00:01:20] Reading files (1048 Mo)                  █████████████████████████████████████████████████████████████████████                 100
[00:00:03] Tokenize words                           █████████████████████████████████████████████████████████████████████ 1882027  /  1882027
[00:00:10] Count pairs                              █████████████████████████████████████████████████████████████████████ 1882027  /  1882027
[00:00:11] Compute merges                           █████████████████████████████████████████████████████████████████████ 28374    /    28374

Are you sure your master is up to date? The kind of slowdown you're experiencing looks like a debug-build kind of problem.

Code used:

from tokenizers import BertWordPieceTokenizer

bpe = BertWordPieceTokenizer(clean_text=True, strip_accents=True, lowercase=False)
bpe.train(["out1.txt"])

print(bpe.encode("I love this [SEP]").ids)
bpe.save("tokenizer.json", pretty=True)

@guolinke
Author

I used this commit, 36832bf, as the commits after it cannot run.
I also set RAYON_RS_NUM_CPUS=16; maybe the multi-threading performance is degraded?
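
As an aside, one minimal way to set that variable from Python (this is just an illustration; the original run may have set it in the shell instead):

import os

# RAYON_RS_NUM_CPUS caps the number of threads used by the Rust side of
# tokenizers; set it before training starts so the thread pool picks it up.
os.environ["RAYON_RS_NUM_CPUS"] = "16"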

@gpauloski

Has there been any progress on integrating this into main? We are running into the same issue: with a sufficiently large text corpus, the counts of many two-character tokens overflow and those tokens do not end up in the final vocab.

@Narsil
Collaborator

Narsil commented Apr 1, 2022

@gpauloski Not yet; the tentative branch proposed here is quite old.

You're welcome to attempt a PR, but making everything templated for both u32 and u64 is not that trivial in Rust (and we need the logic to use u64 only when necessary).

So far the recommendation has been to use something like spm, which can handle u64 for training, and then convert the result to tokenizers (this is what happened for bigscience).
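
For anyone landing here later, a minimal sketch of that route, assuming the sentencepiece Python package and placeholder file names (the conversion back to a tokenizers object is not shown here and depends on your setup):

import sentencepiece as spm

# Train a BPE model with SentencePiece, which can handle u64 counts during
# training; the resulting spm_bpe.model can then be converted for use with
# the tokenizers library, as described above.
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # placeholder path to the large training corpus
    model_prefix="spm_bpe",  # writes spm_bpe.model and spm_bpe.vocab
    model_type="bpe",
    vocab_size=32768,
)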

@gpauloski

Thanks, @Narsil! I'll try spm.

@Narsil
Collaborator

Narsil commented Apr 4, 2022

Actually, keeping the tokenizer training data small enough to stay under u32 is the final route that was taken for bigscience (but the spm u64 route was also used when trying to figure out the best tokenizer).
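
One simple way to keep the training data under that limit is to subsample the corpus before training. This is only an illustration with an assumed file name and an arbitrary sampling rate, not necessarily what was done for bigscience:

import random

# Subsample the raw corpus so that even the most frequent pair counts stay
# comfortably below 2**32; adjust SAMPLE_RATE to the size of your data.
SAMPLE_RATE = 0.25
random.seed(0)

with open("corpus.txt", encoding="utf-8") as src, \
     open("corpus_sampled.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if random.random() < SAMPLE_RATE:
            dst.write(line)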
