Supporting SentencePiece tokenization algorithms other than BPE #7732
-
I guess we didn't fully understand the tokenization variants back when we designed this spec: for a long time, I thought that SPM and BPE were two different tokenization algorithms. We should be able to extend … Would be great to support more tokenizer types, especially for various embedding and T5 models that we don't currently support. Contributions in that direction are welcome.
-
Mentioning issue #7763 here as it's connected. @ggerganov, @slaren, and @fairydreaming might find the matmul-free paper I mentioned there interesting. The algorithm is included in the paper, and the quantization algorithm is in the appendix. Tagging @jart and @JohannesGaessler to get some extra eyes on the matmul-free approach. I have no idea how valid the math is and wouldn't mind some feedback on it. I think this is all interconnected, simply because a vocab-free and matmul-free LLM should be obviously advantageous.
-
I'd like to discuss what changes are needed in llama.cpp to support SentencePiece tokenization algorithms other than BPE.
Let me start with a short introduction to the problem: SentencePiece is not a single algorithm but a framework supporting several tokenization algorithms (BPE, unigram, char, and word), while llama.cpp currently implements only the BPE variant. I know that non-BPE tokenization algorithms are rarely used, but, for example, the T5 model uses the SentencePiece unigram tokenization algorithm, and adding T5 support is on the llama.cpp roadmap. There was also at least one bug report caused by this issue: #6717.
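To make the difference concrete, here is a minimal sketch of how unigram tokenization works, in contrast to BPE's greedy pairwise merging: it performs a Viterbi search for the segmentation with the highest total token log-probability. The vocabulary and scores below are toy values invented for illustration; a real SentencePiece model stores them in the model file.

```cpp
// Toy unigram tokenizer: Viterbi search for the most probable segmentation.
// The vocabulary and log-probabilities are made up for this example.
#include <cfloat>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // token -> log-probability (a real SentencePiece model provides these)
    const std::unordered_map<std::string, float> vocab = {
        {"in", -3.0f}, {"ter", -4.0f}, {"nation", -5.0f}, {"al", -3.5f},
        {"international", -20.0f},
        {"i", -5.0f}, {"n", -6.0f}, {"t", -5.0f}, {"e", -5.0f},
        {"r", -5.5f}, {"a", -5.5f}, {"o", -5.5f}, {"l", -6.0f},
    };

    const std::string text = "international";
    const size_t n = text.size();

    // best[i] = score of the best segmentation of text[0..i),
    // prev[i] = where the last token of that segmentation starts
    std::vector<float>  best(n + 1, -FLT_MAX);
    std::vector<size_t> prev(n + 1, 0);
    best[0] = 0.0f;

    for (size_t end = 1; end <= n; ++end) {
        for (size_t start = 0; start < end; ++start) {
            const auto it = vocab.find(text.substr(start, end - start));
            if (it != vocab.end() && best[start] != -FLT_MAX &&
                best[start] + it->second > best[end]) {
                best[end] = best[start] + it->second;
                prev[end] = start;
            }
        }
    }

    // backtrack from the end of the string to recover the tokens
    std::vector<std::string> tokens;
    for (size_t end = n; end > 0; end = prev[end]) {
        tokens.insert(tokens.begin(), text.substr(prev[end], end - prev[end]));
    }
    for (const auto & t : tokens) {
        printf("%s ", t.c_str());
    }
    printf("\n"); // prints: in ter nation al
}
```

With these scores the segmentation "in ter nation al" (total −15.5) beats the single token "international" (−20). Choosing a segmentation by global probability rather than by replaying learned merge rules is exactly what a BPE code path cannot do, which is presumably why running a unigram model through it produces incorrect tokenizations.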
I think the following steps are needed to fix this problem (a rough sketch of the kind of change involved is shown below):
Let me know what you think about this.
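As an illustration of what such steps might look like, here is a hypothetical sketch of a per-algorithm dispatch keyed on the vocab type. LLAMA_VOCAB_TYPE_UGM, the llama_vocab struct, and the tokenize_* helpers below are placeholder names invented for this sketch, not the actual llama.cpp API.

```cpp
// Hypothetical sketch of adding a unigram vocab type and dispatching on it.
// All names below are placeholders, not real llama.cpp declarations.
#include <cstdint>
#include <string>
#include <vector>

enum llama_vocab_type {
    LLAMA_VOCAB_TYPE_SPM, // SentencePiece model file, tokenized with BPE today
    LLAMA_VOCAB_TYPE_BPE, // byte-level BPE
    LLAMA_VOCAB_TYPE_WPM, // WordPiece
    LLAMA_VOCAB_TYPE_UGM, // proposed: SentencePiece unigram
};

struct llama_vocab {
    llama_vocab_type type;
    // token table, scores, etc. omitted
};

// stubs standing in for the per-algorithm tokenizer implementations
static std::vector<int32_t> tokenize_spm(const llama_vocab &, const std::string &) { return {}; }
static std::vector<int32_t> tokenize_bpe(const llama_vocab &, const std::string &) { return {}; }
static std::vector<int32_t> tokenize_wpm(const llama_vocab &, const std::string &) { return {}; }
// new: Viterbi decoding as sketched earlier in this thread
static std::vector<int32_t> tokenize_ugm(const llama_vocab &, const std::string &) { return {}; }

// single dispatch point keyed on the vocab type read from the model metadata
std::vector<int32_t> tokenize(const llama_vocab & vocab, const std::string & text) {
    switch (vocab.type) {
        case LLAMA_VOCAB_TYPE_SPM: return tokenize_spm(vocab, text);
        case LLAMA_VOCAB_TYPE_BPE: return tokenize_bpe(vocab, text);
        case LLAMA_VOCAB_TYPE_WPM: return tokenize_wpm(vocab, text);
        case LLAMA_VOCAB_TYPE_UGM: return tokenize_ugm(vocab, text);
    }
    return {};
}

int main() {
    llama_vocab vocab = { LLAMA_VOCAB_TYPE_UGM };
    const auto ids = tokenize(vocab, "hello world"); // routes to the unigram path
    return ids.empty() ? 0 : 1;
}
```

The idea is simply that the tokenizer type recorded in the converted model's metadata selects the algorithm at runtime, so unigram support would roughly amount to a new enum value, a converter change to emit it, and one new tokenizer implementation.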