Supporting SentencePiece tokenization algorithms other than BPE #7732
-
I guess we didn't fully understand the tokenization variants back when we designed this spec: for a long time, I thought that SPM and BPE were two different tokenization algorithms. We should be able to extend … Would be great to support more tokenizer types, especially for various embedding and T5 models that we don't currently support. Contributions in that direction are welcome.
-
Mentioning issue #7763 here as it's connected. @ggerganov, @slaren, and @fairydreaming might find the matmul-free paper I mentioned there interesting. The algorithm is included in the paper, and the quantization algorithm is in the appendix. Tagging @jart and @JohannesGaessler to get some extra eyes on the matmul-free approach. I have no idea how valid the math is and wouldn't mind some feedback on it. I think this is all interconnected, simply because a vocab-free and matmul-free LLM should be obviously advantageous.
-
I'd like to discuss what changes are needed in llama.cpp to support SentencePiece tokenization algorithms other than BPE.
Let me start with a short introduction to the problem: SentencePiece is not a single algorithm but a framework supporting several tokenization algorithms (BPE, unigram, char, and word), while llama.cpp currently implements only the BPE variant. I know that non-BPE tokenization algorithms are rarely used, but, for example, the T5 model uses the SentencePiece unigram tokenization algorithm, and adding T5 support is on the llama.cpp roadmap. There was also at least one bug report caused by this issue: #6717.
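To make the difference concrete, here is a minimal sketch of how unigram tokenization works, in contrast to BPE's greedy pairwise merging: it performs a Viterbi search for the segmentation with the highest total token log-probability. The vocabulary and scores below are toy values invented for illustration; a real SentencePiece model stores them in the model file.

```cpp
// Toy unigram tokenizer: Viterbi search for the most probable segmentation.
// The vocabulary and log-probabilities are made up for this example.
#include <cfloat>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // token -> log-probability (a real SentencePiece model provides these)
    const std::unordered_map<std::string, float> vocab = {
        {"in", -3.0f}, {"ter", -4.0f}, {"nation", -5.0f}, {"al", -3.5f},
        {"international", -20.0f},
        {"i", -5.0f}, {"n", -6.0f}, {"t", -5.0f}, {"e", -5.0f},
        {"r", -5.5f}, {"a", -5.5f}, {"o", -5.5f}, {"l", -6.0f},
    };

    const std::string text = "international";
    const size_t n = text.size();

    // best[i] = score of the best segmentation of text[0..i),
    // prev[i] = where the last token of that segmentation starts
    std::vector<float>  best(n + 1, -FLT_MAX);
    std::vector<size_t> prev(n + 1, 0);
    best[0] = 0.0f;

    for (size_t end = 1; end <= n; ++end) {
        for (size_t start = 0; start < end; ++start) {
            const auto it = vocab.find(text.substr(start, end - start));
            if (it != vocab.end() && best[start] != -FLT_MAX &&
                best[start] + it->second > best[end]) {
                best[end] = best[start] + it->second;
                prev[end] = start;
            }
        }
    }

    // backtrack from the end of the string to recover the tokens
    std::vector<std::string> tokens;
    for (size_t end = n; end > 0; end = prev[end]) {
        tokens.insert(tokens.begin(), text.substr(prev[end], end - prev[end]));
    }
    for (const auto & t : tokens) {
        printf("%s ", t.c_str());
    }
    printf("\n"); // prints: in ter nation al
}
```

With these scores the segmentation "in ter nation al" (total −15.5) beats the single token "international" (−20). Choosing a segmentation by global probability rather than by replaying learned merge rules is exactly what a BPE code path cannot do, which is presumably why running a unigram model through it produces incorrect tokenizations.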
I think the following steps are needed to fix this problem (a rough sketch of the kind of change involved is shown below):
Let me know what you think about this.
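As an illustration of what such steps might look like, here is a hypothetical sketch of a per-algorithm dispatch keyed on the vocab type. LLAMA_VOCAB_TYPE_UGM, the llama_vocab struct, and the tokenize_* helpers below are placeholder names invented for this sketch, not the actual llama.cpp API.

```cpp
// Hypothetical sketch of adding a unigram vocab type and dispatching on it.
// All names below are placeholders, not real llama.cpp declarations.
#include <cstdint>
#include <string>
#include <vector>

enum llama_vocab_type {
    LLAMA_VOCAB_TYPE_SPM, // SentencePiece model file, tokenized with BPE today
    LLAMA_VOCAB_TYPE_BPE, // byte-level BPE
    LLAMA_VOCAB_TYPE_WPM, // WordPiece
    LLAMA_VOCAB_TYPE_UGM, // proposed: SentencePiece unigram
};

struct llama_vocab {
    llama_vocab_type type;
    // token table, scores, etc. omitted
};

// stubs standing in for the per-algorithm tokenizer implementations
static std::vector<int32_t> tokenize_spm(const llama_vocab &, const std::string &) { return {}; }
static std::vector<int32_t> tokenize_bpe(const llama_vocab &, const std::string &) { return {}; }
static std::vector<int32_t> tokenize_wpm(const llama_vocab &, const std::string &) { return {}; }
// new: Viterbi decoding as sketched earlier in this thread
static std::vector<int32_t> tokenize_ugm(const llama_vocab &, const std::string &) { return {}; }

// single dispatch point keyed on the vocab type read from the model metadata
std::vector<int32_t> tokenize(const llama_vocab & vocab, const std::string & text) {
    switch (vocab.type) {
        case LLAMA_VOCAB_TYPE_SPM: return tokenize_spm(vocab, text);
        case LLAMA_VOCAB_TYPE_BPE: return tokenize_bpe(vocab, text);
        case LLAMA_VOCAB_TYPE_WPM: return tokenize_wpm(vocab, text);
        case LLAMA_VOCAB_TYPE_UGM: return tokenize_ugm(vocab, text);
    }
    return {};
}

int main() {
    llama_vocab vocab = { LLAMA_VOCAB_TYPE_UGM };
    const auto ids = tokenize(vocab, "hello world"); // routes to the unigram path
    return ids.empty() ? 0 : 1;
}
```

The idea is simply that the tokenizer type recorded in the converted model's metadata selects the algorithm at runtime, so unigram support would roughly amount to a new enum value, a converter change to emit it, and one new tokenizer implementation.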