llama : tokenizer unicode codepoint categories #8606

jaime-m-p · 2024-07-20T21:52:33Z

Add all unicode categories to unicode-data.cpp.

Currently we are limited to the top-level categories:

C, L, M, N, P, S, Z.

This PR allows access to subcategories:

Cn, Cc, Cf, Co, Cs, Ll, Lm, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc, Pd, Pe, Pf, Pi, Po, Ps, Sc, Sk, Sm, So, Zl, Zp, Zs.

Related PR: #8579, regex using Lu, Lt, Lm, Lo, etc.

TODO: Add more comments to explain the unicode regex collapse trick for all subcategories.

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High


compilade · 2024-07-21T01:27:54Z

Nice! This should also help fix (at least part of) Falcon's tokenization, because the Punctuation pre-tokenizer type uses the Po category and not the broader P one.

(ref: https://github.com/huggingface/tokenizers/blob/4ea2f235b0430f5db09f867b65306d6c0a5ec7ed/tokenizers/src/pre_tokenizers/punctuation.rs#L8, which uses Rust's is_ascii_punctuation and is_punctuation)
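For reference, Rust documents `char::is_ascii_punctuation` as covering four ASCII ranges. Below is a minimal C++ sketch of the same check. The function name `is_ascii_punctuation_cpt` is made up for illustration and is not part of llama.cpp or of the tokenizers crate.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the ranges documented for Rust's
// char::is_ascii_punctuation:
//   U+0021..=U+002F, U+003A..=U+0040, U+005B..=U+0060, U+007B..=U+007E
static bool is_ascii_punctuation_cpt(uint32_t cpt) {
    return (cpt >= 0x21 && cpt <= 0x2F) ||  // ! " # $ % & ' ( ) * + , - . /
           (cpt >= 0x3A && cpt <= 0x40) ||  // : ; < = > ? @
           (cpt >= 0x5B && cpt <= 0x60) ||  // [ \ ] ^ _ `
           (cpt >= 0x7B && cpt <= 0x7E);    // { | } ~
}
```

These ranges include characters such as `$` (category Sc) and `+` (category Sm), so the ASCII check is not a subset of the P categories. This is one reason the pre-tokenizer cannot be reproduced with top-level categories alone.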

ggerganov

> TODO: Implement unicode regex collapse trick for all subcategories.

Do you expect any problems with this? Probably we will run out of ASCII characters for the k_ucat_cpt map:

llama.cpp/src/unicode.cpp

Lines 647 to 651 in 50e0535

    
    static const std::map<int, int> k_ucat_cpt = {
        { codepoint_flags::NUMBER,        0xD1 },
        { codepoint_flags::LETTER,        0xD2 },
        { codepoint_flags::PUNCTUATION,   0xD3 },
    };

Though we could dynamically generate the map, based only on the subcategories used in the current regex.
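A minimal sketch of that dynamic idea, assuming categories appear in the regex as `\p{X}` or `\p{Xx}`. The helper name `build_collapse_map`, the string-keyed map, and the placeholder pool `0xD1..0xFF` are assumptions for illustration (only the starting value `0xD1` comes from the snippet above). This is not the code in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical helper: scan a regex for \p{..} / \P{..} classes and give each
// distinct category name its own placeholder byte, starting at `first`.
// Only categories that actually appear in the regex consume a placeholder.
// Limitation of this sketch: an escaped backslash before "p{" is not detected.
static std::map<std::string, uint8_t> build_collapse_map(const std::string & regex,
                                                         uint8_t first = 0xD1,
                                                         uint8_t last  = 0xFF) {
    std::map<std::string, uint8_t> result;
    int next = first;
    for (size_t i = 0; i + 3 < regex.size(); ++i) {
        const bool is_class = regex[i] == '\\' &&
                              (regex[i + 1] == 'p' || regex[i + 1] == 'P') &&
                              regex[i + 2] == '{';
        if (!is_class) {
            continue;
        }
        const size_t close = regex.find('}', i + 3);
        if (close == std::string::npos) {
            throw std::invalid_argument("unterminated \\p{...} in regex");
        }
        const std::string name = regex.substr(i + 3, close - (i + 3));
        if (result.find(name) == result.end()) {
            if (next > last) {
                throw std::length_error("out of placeholder bytes");
            }
            result[name] = static_cast<uint8_t>(next++);
        }
        i = close;  // resume scanning after the closing brace
    }
    return result;
}
```

For a regex such as `\p{Lu}+|\p{N}{1,3}|\p{Lu}\p{Ll}*`, only three placeholders are consumed (for `Lu`, `N`, and `Ll`), no matter how many categories exist in total.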

ggerganov · 2024-07-23T10:11:32Z

The src/llama.cpp conflict should be easy to resolve - just accept the new src/llama.cpp and apply the same changes to src/llama-vocab.cpp instead


jaime-m-p · 2024-07-25T23:05:14Z

> TODO: Implement unicode regex collapse trick for all subcategories.
>
> Do you expect any problems with this?

More problems than I thought:

- Need +29 collapse codepoints for the subcategories.
- Need ranges of collapse codepoints, e.g. \p{L} --> \p{Ll} to \p{Lu} (Ll, Lm, Lo, Lt, Lu).
- Need a collapse codepoint for unicode whitespaces to fix the \s problem (std::regex ignores non-ASCII \s).
  - Take care of \S and regex lookaheads, e.g. (?!\S).
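The range idea can be sketched as follows, assuming each subcategory gets one placeholder codepoint and that subcategories of the same top-level category are numbered consecutively. The names `k_subcategs`, `collapse_range`, and `collapse_range_for`, as well as the caller-supplied `base`, are hypothetical and do not reflect the actual layout used in unicode.cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Subcategories in the order listed in the PR description. Entries that share
// a top-level letter are adjacent, so each top-level category is one range.
static const std::vector<std::string> k_subcategs = {
    "Cn", "Cc", "Cf", "Co", "Cs", "Ll", "Lm", "Lo", "Lt", "Lu",
    "Mc", "Me", "Mn", "Nd", "Nl", "No", "Pc", "Pd", "Pe", "Pf",
    "Pi", "Po", "Ps", "Sc", "Sk", "Sm", "So", "Zl", "Zp", "Zs",
};

struct collapse_range {
    uint32_t first;  // inclusive
    uint32_t last;   // inclusive
};

// Hypothetical helper: map "L" to the whole Ll..Lu placeholder range and "Lu"
// to a single placeholder. Returns {0, 0} when the name is unknown or empty.
static collapse_range collapse_range_for(const std::string & name, uint32_t base) {
    collapse_range r = { 0, 0 };
    if (name.empty()) {
        return r;
    }
    bool found = false;
    for (size_t i = 0; i < k_subcategs.size(); ++i) {
        // prefix match: "L" matches all L* entries, "Lu" matches only itself
        if (k_subcategs[i].compare(0, name.size(), name) == 0) {
            const uint32_t cpt = base + static_cast<uint32_t>(i);
            if (!found) {
                r.first = cpt;
                found   = true;
            }
            r.last = cpt;
        }
    }
    return r;
}
```

With such a layout, a class like `\p{L}` could be rewritten as a bracket range over `[first, last]` in the collapsed text, while `\p{Lu}` becomes a single placeholder character.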

jaime-m-p · 2024-07-25T23:19:30Z

I tested all available BPE models, including tekken, with a subset of the brute-force tests. The results are the same as before this PR.
I also tested the original tekken regex, and it seems correct too.

The reimplementation is not very understandable without context.
I want to add more comments and try to explain all steps/blocks of code.

ggerganov

Most of the asserts in unicode.cpp should be changed to GGML_ABORT or GGML_ASSERT.

src/unicode.cpp
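As background for this suggestion: the standard `assert` macro expands to nothing when `NDEBUG` is defined, so checks written with it disappear from release builds. The sketch below shows a check that stays active in every build. `CHECK_ALWAYS` and `checked_codepoint` are hypothetical stand-ins written for this example. They are not the definitions of GGML_ASSERT or GGML_ABORT.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical always-on check: reports the failing condition and aborts,
// regardless of whether NDEBUG is defined.
#define CHECK_ALWAYS(cond)                                                  \
    do {                                                                    \
        if (!(cond)) {                                                      \
            std::fprintf(stderr, "%s:%d: check failed: %s\n",               \
                         __FILE__, __LINE__, #cond);                        \
            std::abort();                                                   \
        }                                                                   \
    } while (0)

// Example use: reject values outside the Unicode codepoint range.
static uint32_t checked_codepoint(uint32_t cpt) {
    CHECK_ALWAYS(cpt <= 0x10FFFF);
    return cpt;
}
```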


jaime-m-p added 6 commits July 20, 2024 22:57

Update bruteforce test:

3d16f64

- Compare tokenizer vocab tokens. - Bruteforce byte token generator. - Find minimal mismatched substring.

Store all unicode codepoint categories

5ceab90

Reimplement 'codepoint_flags' as 'codepoint_categ'

ba4bbbd

Update unicode data

8f9f05b

Decode unicode data categories

2636cb6

Replace 'codepoint_flags' with 'codepoint_categ'

23cf064

github-actions bot added the script, testing, and python labels Jul 20, 2024

compilade mentioned this pull request Jul 21, 2024

Add support for Chameleon #8543

Merged

4 tasks

ggerganov mentioned this pull request Jul 22, 2024

llama : move vocab, grammar and sampling into separate files #8508

Merged

7 tasks

ggerganov approved these changes Jul 22, 2024


mofosyne added the Review Complexity : Medium label Jul 22, 2024

jaime-m-p added 4 commits July 26, 2024 00:16

Update unicode data: sorted whitespaces

ecebfc0

Fix codepoint_categ return types

8c8e1af

Add unicode_data helper functions

8f7d56e

Reimplement 'collapsed' unicode categories:

1cd7ac0

- Add all unicode categories. - Fix \s with non-ASCII problem.

jaime-m-p added 2 commits August 4, 2024 23:22

Add more comments

aeac342

Merge commit '978ba3d8' into tokenizer-codepoint-categs

8bd3749

ggerganov reviewed Aug 5, 2024


jaime-m-p added 6 commits August 5, 2024 20:52

minor: remove trailing whitespaces and extra semicolons

85c59df

Use GGML_ASSERT and GGML_ABORT

735105e

Update bruteforce test: fix pyright complaints

fd6d9b9

Update bruteforce test:

3b36703

- Faster failing text range selection. - Show unique failing texts differences. - Add more recent models.

Binary constants are a C++14 feature

d558c73

Fix copy/paste wrong variable

674f0fa

jaime-m-p added 12 commits August 5, 2024 23:55

Fix compiler complaints

2ca3138

Update bruteforce test: fix binary search

80f4123

Unicode data whitespaces as ranges

7afe6df

Reimplement unicode_regex_split()

c240638

Remove invalid assert

312c432

Update codepoint_categ:

b565148

- Reorganize category/subcategory bits. - Regex flags for \s \w \d.

Reimplement unicode_regex_split():

5a93d2e

- Using std::basic_regex. - Custom std::ctype specialization for 32bits codepoints. - Custom std::regex_traits specialization for 32bits codepoints. - Implementing custom 'character class expression' for \p{Xx}. - Single pass regex preparation.

Original regex for 'tekken'

7ff916e

Remove unused function

50e1b1e

Using 32bit wchar_t by default, uint32_t on Windows

dcac747

Fix previous commit

b67c81d

Fix compiler complaints

db78320
