
"the" token is splitted to "t" "h" "e" in large scale corpus #437

Closed
guolinke opened this issue Sep 25, 2020 · 15 comments · Fixed by #1433

Comments

@guolinke

I trained the BPE (BertWordPiece) tokenizer myself, using the RoBERTa 160GB data. However, I found that every "the" is broken apart. I checked the learned vocab.txt, and "the" is not in there either.

The parameters used to train the tokenizer:

bpe.train(args.inputs, vocab_size=32768, min_frequency=20)

I guess the count of "the" may overflow in such a large-scale corpus (see the rough estimate after the example below).

An example of the broken output:

john . who . like me . is a writer with an interest in t ##h ##e interp ##lay of words and images . assumed i ' d hidden a cryptic message in t ##h ##e random sequence of little pictures . " write a package . spider earth cutting ? "
he asked . i only responded with five more random emoji . " t ##h ##e cheese running bear got caught in a rains ##how ##er looking for diamonds . " he muse ##d .
and thus began a game that has continued to this day .
one of us texts five random emoji to t ##h ##e other . and t ##h ##e recipient has to write a story ( roughly of tweet length ) to explain each of t ##h ##e emoji in order .
below are eight of our collaborative creations . 
james hannah ##am ' s most recent novel . delicious foods . won t ##h ##e pen / faulk ##ner and hurst ##on / wright legacy awards . and was a new york times notable book .
he teaches in t ##h ##e writing program at t ##h ##e pratt institute
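
To make the overflow guess concrete, here is a rough back-of-the-envelope estimate in Python. The corpus character count and the "th" digram frequency below are illustrative assumptions, not measurements from this corpus:

# Could a single pair count exceed the u32 range on ~160GB of English text?
U32_MAX = 2**32 - 1           # 4_294_967_295

corpus_chars = 160 * 10**9    # ~160 GB of mostly-ASCII text => ~1.6e11 characters (assumed)
th_digram_freq = 0.035        # "th" is roughly 3-4% of English letter pairs (assumed)

th_pair_count = int(corpus_chars * th_digram_freq)
print(f"estimated ('t', 'h') pair count ~ {th_pair_count:,}")   # ~5,600,000,000
print(f"u32 max                         = {U32_MAX:,}")         # 4,294,967,295
print("can overflow a u32 counter:", th_pair_count > U32_MAX)   # True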
Narsil added a commit that referenced this issue Sep 25, 2020
@Narsil
Collaborator

Narsil commented Sep 25, 2020

It's possible. I made a temporary branch that uses u64 everywhere instead of u32:

https://github.com/huggingface/tokenizers/tree/u64_branch

It's a temporary fix; we probably need to think a bit more about how to handle this in the general case.

You can install it from source by going into bindings/python and running `pip install -e .`.

@guolinke
Author

Thanks for the quick response. Actually, I made "the" a special token as a workaround, since other high-frequency tokens (like "and") seem OK.
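
For reference, here is a minimal sketch of that workaround. The file name is a placeholder; the special_tokens argument matches the train() signature that appears later in this thread, with "the" simply appended to the usual BERT special tokens:

from tokenizers import BertWordPieceTokenizer

# Force "the" into the vocabulary by declaring it as a special token, so it
# survives even if its pair counts misbehave during training.
bpe = BertWordPieceTokenizer(clean_text=True, strip_accents=True, lowercase=True)
bpe.train(
    ["corpus.txt"],  # placeholder input file(s)
    vocab_size=32768,
    min_frequency=20,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "the"],
)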

@Narsil
Collaborator

Narsil commented Sep 25, 2020

If "th" and "he" are not tokens either you're probably going to have a subpar tokenizer

Narsil closed this as completed Sep 25, 2020
@guolinke
Author

@Narsil Thanks, I just checked, and it seems both "th" and "he" are there.
I think it would be better to fix the overflow problem, in case we have an even larger corpus.

Narsil reopened this Sep 25, 2020
@Narsil
Collaborator

Narsil commented Sep 25, 2020

I didn't mean to close.

@guolinke
Author

It seems the u64 branch cannot fix the problem. The error:

thread '<unnamed>' panicked at 'Trainer should know how to build BPE: MergeTokenOutOfVocabulary("##h##e")', /home/xxx/tokenizers/tokenizers/src/models/bpe/trainer.rs:573:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "train_wp.py", line 55, in <module>
    main()
  File "train_wp.py", line 50, in main
    bpe.train(args.inputs, vocab_size=args.vocab_size, special_tokens=special_tokens,
  File "/home/xxx/tokenizers/bindings/python/py_src/tokenizers/implementations/bert_wordpiece.py", line 117, in train
    self._tokenizer.train(trainer, files)
pyo3_runtime.PanicException: Trainer should know how to build BPE: MergeTokenOutOfVocabulary("##h##e")

@guolinke
Author

guolinke commented Sep 26, 2020

It seems the master branch also fails with the same error.
The same code runs fine on 0.8.1:

    bpe = BertWordPieceTokenizer(clean_text=True, strip_accents=True, lowercase=True)
    bpe.train(args.inputs)

Update: it seems this commit (95cc8c4) causes the problem.

@guolinke
Author

The master branch also seems much slower at reading files.

0.8.1

[00:03:13] Reading files (19521 Mo)                 ███████████████████████████████████████████████████████████████████████████████████████████████████████████                 100
[00:00:17] Tokenize words                           ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 6593316  /  6593316
[00:00:14] Count pairs                              ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 6593316  /  6593316
[00:00:58] Compute merges                           ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 31116    /    31116

master

[00:17:46] Reading files (19521 Mo)                 ███████████████████████████████████████████████████████████████████████████████████████████████████████████                 100
[00:00:18] Tokenize words                           ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 6593316  /  6593316
[00:00:15] Count pairs                              ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 6593316  /  6593316
[00:01:13] Compute merges                           ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 31117    /    31117

@Narsil
Collaborator

Narsil commented Sep 28, 2020

So there is definitely a slowdown with master that is expected; it's linked to better (and correct) offset tracking. However, it should not be that bad, so I'm going to try to replicate this. (It does not seem to happen on smaller files of ~500 Mo, where the slowdown is only ~20%.)

@Narsil
Collaborator

Narsil commented Sep 28, 2020

This is the sort of variance I'm getting:

master:

[00:01:37] Reading files (1048 Mo)                  █████████████████████████████████████████████████████████████████████                 100
[00:00:02] Tokenize words                           █████████████████████████████████████████████████████████████████████ 1882027  /  1882027
[00:00:09] Count pairs                              █████████████████████████████████████████████████████████████████████ 1882027  /  1882027
[00:00:10] Compute merges                           █████████████████████████████████████████████████████████████████████ 28375    /    28375

v0.8.1

[00:01:20] Reading files (1048 Mo)                  █████████████████████████████████████████████████████████████████████                 100
[00:00:03] Tokenize words                           █████████████████████████████████████████████████████████████████████ 1882027  /  1882027
[00:00:10] Count pairs                              █████████████████████████████████████████████████████████████████████ 1882027  /  1882027
[00:00:11] Compute merges                           █████████████████████████████████████████████████████████████████████ 28374    /    28374

Are you sure your master is up to date? The kind of slowdown you're experiencing looks like a debug-build kind of problem.

Code used:

from tokenizers import BertWordPieceTokenizer

bpe = BertWordPieceTokenizer(clean_text=True, strip_accents=True, lowercase=False)
bpe.train(["out1.txt"])

print(bpe.encode("I love this [SEP]").ids)
bpe.save("tokenizer.json", pretty=True)

@guolinke
Author

I used this commit, 36832bf, as the commits after it cannot run.
I also set RAYON_RS_NUM_CPUS=16; maybe the multi-threading performance is degraded?
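
As an aside, one minimal way to set that variable from Python (this is just an illustration; the original run may have set it in the shell instead):

import os

# RAYON_RS_NUM_CPUS caps the number of threads used by the Rust side of
# tokenizers; set it before training starts so the thread pool picks it up.
os.environ["RAYON_RS_NUM_CPUS"] = "16"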

@gpauloski

Has there been any progress on integrating this into main? We are running into the same issue: with a sufficiently large text corpus, the counts of many two-character tokens overflow and those tokens do not end up in the final vocab.

@Narsil
Collaborator

Narsil commented Apr 1, 2022

@gpauloski Not yet; the tentative branch proposed here is quite old.

You're welcome to attempt a PR, but making everything templated for both u32 and u64 is not that trivial in Rust (and we need the logic to use u64 only when necessary).

So far the recommendation has been to use something like spm, which can handle u64 for training, and then convert the result to tokenizers (this is what happened for bigscience).
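
For anyone landing here later, a minimal sketch of that route, assuming the sentencepiece Python package and placeholder file names (the conversion back to a tokenizers object is not shown here and depends on your setup):

import sentencepiece as spm

# Train a BPE model with SentencePiece, which can handle u64 counts during
# training; the resulting spm_bpe.model can then be converted for use with
# the tokenizers library, as described above.
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # placeholder path to the large training corpus
    model_prefix="spm_bpe",  # writes spm_bpe.model and spm_bpe.vocab
    model_type="bpe",
    vocab_size=32768,
)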

@gpauloski

Thanks, @Narsil! I'll try spm.

@Narsil
Collaborator

Narsil commented Apr 4, 2022

Actually, keeping the tokenizer training data small enough to stay under u32 is the final route that was taken for bigscience (but the spm u64 route was also used when trying to figure out the best tokenizer).
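
One simple way to keep the training data under that limit is to subsample the corpus before training. This is only an illustration with an assumed file name and an arbitrary sampling rate, not necessarily what was done for bigscience:

import random

# Subsample the raw corpus so that even the most frequent pair counts stay
# comfortably below 2**32; adjust SAMPLE_RATE to the size of your data.
SAMPLE_RATE = 0.25
random.seed(0)

with open("corpus.txt", encoding="utf-8") as src, \
     open("corpus_sampled.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if random.random() < SAMPLE_RATE:
            dst.write(line)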
