llama : tokenizer unicode codepoint categories #8606
- Compare tokenizer vocab tokens.
- Bruteforce byte token generator.
- Find minimal mismatched substring.
Nice! This should also help fix (at least part of) Falcon's tokenization, because the (ref: https://github.com/huggingface/tokenizers/blob/4ea2f235b0430f5db09f867b65306d6c0a5ec7ed/tokenizers/src/pre_tokenizers/punctuation.rs#L8, which uses Rust's |
TODO: Implement unicode regex collapse trick for all subcategories.
Do you expect any problems with this? We will probably run out of ASCII characters for the k_ucat_cpt map:
Lines 647 to 651 in 50e0535
static const std::map<int, int> k_ucat_cpt = {
    { codepoint_flags::NUMBER,      0xD1 },
    { codepoint_flags::LETTER,      0xD2 },
    { codepoint_flags::PUNCTUATION, 0xD3 },
};
Though we could generate the map dynamically, based only on the subcategories used in the current regex.
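A minimal sketch of that idea, not the code in this PR: scan the regex for `\p{Xx}` classes and hand out one placeholder byte per distinct name, starting at 0xD1 as in k_ucat_cpt. The helper name `assign_placeholders` is hypothetical, and the scan is deliberately naive (it ignores `\P{...}` negation and escaped backslashes).

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of unicode.cpp): give each distinct \p{Xx}
// class found in the regex its own single-byte placeholder.
// Only categories that actually occur in the regex consume a placeholder.
static std::map<std::string, int> assign_placeholders(const std::string & regex_expr) {
    std::map<std::string, int> placeholders;
    int next = 0xD1;  // first placeholder value, as in k_ucat_cpt
    std::size_t pos = 0;
    while ((pos = regex_expr.find("\\p{", pos)) != std::string::npos) {
        const std::size_t end = regex_expr.find('}', pos);
        if (end == std::string::npos) {
            throw std::invalid_argument("unterminated \\p{...} class");
        }
        // category name between "\p{" and "}"
        const std::string name = regex_expr.substr(pos + 3, end - (pos + 3));
        if (placeholders.find(name) == placeholders.end()) {
            if (next > 0xFF) {
                throw std::length_error("out of single-byte placeholders");
            }
            placeholders[name] = next++;
        }
        pos = end + 1;
    }
    return placeholders;
}
```

With this scheme a regex that mentions only `\p{Lu}`, `\p{Nd}` and `\p{Ll}` needs three placeholders, no matter how many subcategories exist in total.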
- Add all unicode categories.
- Fix \s with non-ASCII problem.
More problems than I thought:
I tested (subset of the brute-force tests) all available BPE models, including
The reimplementation is not very understandable without context.
Most of the assert calls in unicode.cpp should be changed to GGML_ABORT or GGML_ASSERT.
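One likely motivation is that the standard `assert` is compiled out when `NDEBUG` is defined, so release builds would skip the check. The sketch below shows the difference with a hypothetical always-on macro, `ALWAYS_ASSERT`; it is a stand-in for illustration only and is not the real GGML_ASSERT definition. The function `utf8_encoded_length` is also made up for the example.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for an always-on check (NOT the real GGML_ASSERT).
// Unlike assert(), it does not depend on NDEBUG, so it also fires in release builds.
#define ALWAYS_ASSERT(cond)                                          \
    do {                                                             \
        if (!(cond)) {                                               \
            std::fprintf(stderr, "%s:%d: check failed: %s\n",        \
                         __FILE__, __LINE__, #cond);                 \
            std::abort();                                            \
        }                                                            \
    } while (0)

// Example function: number of UTF-8 bytes needed to encode a codepoint.
static int utf8_encoded_length(uint32_t cpt) {
    // With plain assert() this range check would vanish under -DNDEBUG.
    ALWAYS_ASSERT(cpt <= 0x10FFFF);
    if (cpt < 0x80)    { return 1; }
    if (cpt < 0x800)   { return 2; }
    if (cpt < 0x10000) { return 3; }
    return 4;
}
```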
- Faster failing text range selection.
- Show unique failing texts differences.
- Add more recent models.

- Reorganize category/subcategory bits.
- Regex flags for \s \w \d.

- Using std::basic_regex.
- Custom std::ctype specialization for 32bits codepoints.
- Custom std::regex_traits specialization for 32bits codepoints.
- Implementing custom 'character class expression' for \p{Xx}.
- Single pass regex preparation.
Add all unicode categories to unicode-data.cpp.
Currently we are limited to high categories:
This PR allows access to subcategories:
Related PR: #8579, regex using Lu, Lt, Lm, Lo, etc.
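To illustrate how a two-letter name such as Lu can carry both a category and a subcategory, here is a small sketch. The bit packing, `encode_category` and `category_matches` are hypothetical and are not the layout used in unicode-data.cpp; only the category names themselves come from the Unicode general category list.

```cpp
#include <cstddef>
#include <string>

// Unicode general categories: the first letter is the major category,
// the second letter is the subcategory within it.
static const std::string k_majors = "LMNPSZC";
static const std::string k_subs[] = {
    "ultmo",    // Lu Ll Lt Lm Lo
    "nce",      // Mn Mc Me
    "dlo",      // Nd Nl No
    "cdseifo",  // Pc Pd Ps Pe Pi Pf Po
    "mcko",     // Sm Sc Sk So
    "slp",      // Zs Zl Zp
    "cfson",    // Cc Cf Cs Co Cn
};

// Hypothetical packing: bits 3..5 hold (major index + 1),
// bits 0..2 hold (subcategory index + 1), with 0 meaning "whole category".
// Returns -1 for an unknown name.
static int encode_category(const std::string & name) {
    if (name.empty() || name.size() > 2) { return -1; }
    const std::size_t major = k_majors.find(name[0]);
    if (major == std::string::npos) { return -1; }
    int sub = 0;
    if (name.size() == 2) {
        const std::size_t idx = k_subs[major].find(name[1]);
        if (idx == std::string::npos) { return -1; }
        sub = (int) idx + 1;
    }
    return (int) ((major + 1) << 3) | sub;
}

// A pattern like \p{L} matches every subcategory of L; \p{Lu} matches only Lu.
static bool category_matches(int cpt_cat, int pattern_cat) {
    if ((pattern_cat & 7) == 0) {
        return (cpt_cat >> 3) == (pattern_cat >> 3);
    }
    return cpt_cat == pattern_cat;
}
```

Under this packing, a high category test is a shift and compare, while a subcategory test is a plain equality, so both kinds of `\p{...}` class can share one lookup table.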
TODO: Add more comments to explain the unicode regex collapse trick for all subcategories.