feat(tokenizers): Add a vendored copy of the unicode support from llama.cpp

This is a much more efficient way to get this functionality working than a raw port. The original code carries MIT licensing, so the license is kept with a reference at the top of each file.

This does introduce some redundancy in the regex support, since the llama.cpp code relies on the STL (std::regex) rather than RE2. This seems OK since it does not introduce an additional dependency, but a future optimization could be to refactor the llama.cpp code to leverage the (faster) RE2 implementation. The tradeoff would be a change in which regexes are supported.

pytorch#1251

Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
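To make the "which regexes are supported" tradeoff concrete, here is a minimal sketch using only the STL. It runs a negative-lookahead pattern of the kind found in GPT-2 style pre-tokenizer expressions through `std::regex`. RE2 does not support lookaround assertions, so a pattern like this would have to be rewritten before moving to RE2. The helper name `collect_matches` is hypothetical and is not part of llama.cpp or of this commit.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (illustration only, not from llama.cpp or this commit):
// returns every non-overlapping match of `pattern` in `text`, using the
// default ECMAScript grammar of std::regex.
std::vector<std::string> collect_matches(const std::string& pattern,
                                         const std::string& text) {
  const std::regex re(pattern);
  std::vector<std::string> out;
  for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end;
       ++it) {
    out.push_back(it->str());
  }
  return out;
}

// Example pattern: a whitespace run that is NOT followed by a non-whitespace
// character. The `(?!...)` negative lookahead is accepted by std::regex but
// is rejected by RE2, which has no lookaround support.
const char* const kTrailingWhitespace = R"(\s+(?!\S))";
```

Calling `collect_matches(kTrailingWhitespace, "a  b  ")` yields a single space (the first of the two spaces before `b`) and then the two trailing spaces. The lookahead leaves the last space before `b` unmatched so a later alternative in a larger pattern could attach it to the following word.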
gabe-l-hart · Nov 15, 2024 · 0f1ba98
1 parent 7fbfb6b
commit 0f1ba98
Showing 5 changed files with 8,059 additions and 1 deletion.
diff --git a/tokenizer/CMakeLists.txt b/tokenizer/CMakeLists.txt
@@ -13,7 +13,10 @@ add_library(
     sentencepiece.cpp
     tiktoken.cpp
     tokenizers.cpp
-    pre_tokenizer.cpp)
+    pre_tokenizer.cpp
+    # llama.cpp unicode
+    unicode-data.cpp
+    unicode.cpp)
 
 target_include_directories(
     tokenizer PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}