"the" token is splitted to "t" "h" "e" in large scale corpus #437
Comments
It's possible, I made a temporary branch with u64 everywhere instead of u32: https://github.com/huggingface/tokenizers/tree/u64_branch It's a temporary fix; we probably need to think a bit more about how to go about it in the general case. You can install from source by going into…
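For context on why switching to u64 matters: with 32-bit counters, a pair count just above u32::MAX silently wraps around to a small number, so an extremely frequent pair can end up ranked far below where it belongs and never get merged. A minimal, hypothetical illustration (not code from the library), emulating u32 arithmetic in Python:

```python
# Hypothetical sketch: emulate a 32-bit pair counter wrapping around.
U32_MAX = 2**32 - 1  # 4,294,967,295

def add_u32(count: int, increment: int) -> int:
    """Add with u32 wraparound, mimicking an overflowing 32-bit counter."""
    return (count + increment) % (U32_MAX + 1)

true_count = 4_300_000_000   # assumed count of a very common pair in a ~160GB corpus
stored = add_u32(0, true_count)

print(f"true count   : {true_count:,}")   # 4,300,000,000
print(f"stored (u32) : {stored:,}")       # 5,032,704 -- now looks like a rare pair
```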
Thanks for the quick response. Actually, I made…
If "th" and "he" are not tokens either you're probably going to have a subpar tokenizer |
@Narsil Thanks, I just checked it, and it seems both…
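For anyone wanting to reproduce the check, something along these lines should show which pieces actually made it into the trained vocabulary (the file name is hypothetical; note that WordPiece stores word-internal pieces with a "##" prefix):

```python
from tokenizers import BertWordPieceTokenizer

# Load the vocab produced by training (hypothetical path).
bpe = BertWordPieceTokenizer("vocab.txt")

for piece in ["the", "th", "##he", "t", "##h", "##e"]:
    # token_to_id returns None when the piece is not in the vocabulary.
    print(piece, "->", bpe.token_to_id(piece))
```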
I didn't mean to close.
It seems the master branch also failed with the same error.
Updated: the master branch seems to be much slower in file reading than 0.8.1.
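A simple way to quantify that comparison is to time the same training call once per installed version (this reuses the training script from later in the thread; the corpus file name is a placeholder):

```python
import time
from tokenizers import BertWordPieceTokenizer

start = time.time()
bpe = BertWordPieceTokenizer(clean_text=True, strip_accents=True, lowercase=False)
bpe.train(["out1.txt"])  # run once with 0.8.1 installed and once with a master build
print(f"training took {time.time() - start:.1f}s")
```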
So there is definitely a slowdown with master that is expected; it's linked to better (and correct) offset tracking. However, it should not be that bad, so I'm going to try and replicate this. (It does not seem to be the case on smaller files of ~500 MB, where the slowdown is only ~20%.)
This is the sort of variance I'm getting:
master: …
v0.8.1: …
Are you sure your master is up to date? The kind of slowdown you're experiencing looks like a debug-build problem. Code used:

from tokenizers import BertWordPieceTokenizer

bpe = BertWordPieceTokenizer(clean_text=True, strip_accents=True, lowercase=False)
bpe.train(["out1.txt"])
print(bpe.encode("I love this [SEP]").ids)
bpe.save("tokenizer.json", pretty=True)
I used this commit 36832bf, as the later ones could not run.
Has there been any progress on integrating this into main? We are running into the same issue with the count of many two-character tokens overflowing and not ending up in the final vocab with a sufficiently large text corpus. |
@gpauloski Not yet, the tentative branch proposed here is super old. You're welcome to attempt a PR, but making everything templated for both u32 and u64 is not trivial. So far the recommendation has been to use something like…
Thanks, @Narsil! I'll try…
Actually, keeping the tokenizer training data under…
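The exact recommendation is cut off above, but if the idea is simply to cap how much text the trainer sees, a rough sketch (hypothetical file names and size budget, not a tokenizers API) could look like this:

```python
# Hypothetical helper: keep an evenly spaced subset of lines from a huge corpus,
# stopping at a byte budget so pair counts stay comfortably below 2**32.
def subsample(src: str, dst: str, keep_every: int = 10,
              max_bytes: int = 20 * 1024**3) -> None:
    written = 0
    with open(src, encoding="utf-8", errors="ignore") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for i, line in enumerate(fin):
            if i % keep_every == 0:
                fout.write(line)
                written += len(line.encode("utf-8"))
                if written >= max_bytes:
                    break

subsample("corpus_160gb.txt", "corpus_subset.txt")
```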
I trained the BPE (BertWordPiece) tokenizer myself, using the 160GB RoBERTa data. However, I found that every "the" is broken.
I checked the learned vocab.txt and found that "the" is not in there either.
The parameters used in training: …
I guess the count of "the" may overflow in the large-scale corpus. An example: …
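A back-of-envelope estimate (my own assumed numbers, not from the thread) of why a 32-bit counter can overflow here:

```python
# Rough estimate: occurrences of a very common character pair such as ("t", "h")
# in ~160GB of mostly-ASCII English text, compared to the u32 limit.
corpus_bytes = 160 * 10**9
th_share = 0.03                      # assumed: "th" is roughly 3% of character bigrams
occurrences = int(corpus_bytes * th_share)

print(f"estimated ('t','h') count: {occurrences:,}")   # ~4.8 billion
print(f"u32 max                  : {2**32 - 1:,}")     # ~4.29 billion
print("overflows u32" if occurrences > 2**32 - 1 else "fits in u32")
```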