feat(tokenizers): Add a vendored copy of the unicode support from llama.cpp

This is a much more efficient way to get this functionality working than a
raw port. The original code carries MIT licensing, so the license is kept
with a reference at the top of each file.

This does introduce some redundancy in the regex support, since the
llama.cpp code relies on the STL's regex support rather than RE2. This seems
ok since it does not introduce an additional dependency, but a future
optimization could be to refactor the llama.cpp code to leverage the
(faster) RE2 implementation. The tradeoff would be a change in which
regexes are supported.
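
As a rough illustration of that tradeoff (a sketch only, not code from this
commit): the STL's default ECMAScript grammar accepts lookahead assertions,
whereas RE2 does not support lookaround or backreferences. The helper names
`find_all` and `check_lookahead_example` and the sample pattern below are
hypothetical, chosen to show a pattern that would need rewriting under RE2.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of the vendored code): collect every match
// of `pattern` in `text` using std::regex's default ECMAScript grammar.
std::vector<std::string> find_all(const std::string& text,
                                  const std::string& pattern) {
  const std::regex re(pattern);
  std::vector<std::string> matches;
  for (std::sregex_iterator it(text.begin(), text.end(), re), end;
       it != end; ++it) {
    matches.push_back(it->str());
  }
  return matches;
}

// Illustrative check: a whitespace run that is NOT followed by a
// non-whitespace character. The `(?!...)` negative lookahead is accepted by
// std::regex; RE2 has no lookahead support, so this pattern would have to
// be restructured before it could run there.
void check_lookahead_example() {
  const std::vector<std::string> m = find_all("a  b   ", R"(\s+(?!\S))");
  assert(m.size() == 2);
  assert(m[0] == " ");    // first of the two spaces before 'b'
  assert(m[1] == "   ");  // the trailing run at the end of the input
}
```

Any pattern of this kind in the vendored code would need to be rewritten
before a switch to RE2, which is the change in supported regexes noted above.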

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
gabe-l-hart committed Nov 15, 2024
1 parent 7fbfb6b commit 0f1ba98
Showing 5 changed files with 8,059 additions and 1 deletion.
5 changes: 4 additions & 1 deletion tokenizer/CMakeLists.txt
@@ -13,7 +13,10 @@ add_library(
 sentencepiece.cpp
 tiktoken.cpp
 tokenizers.cpp
-pre_tokenizer.cpp)
+pre_tokenizer.cpp
+# llama.cpp unicode
+unicode-data.cpp
+unicode.cpp)
 
 target_include_directories(
 tokenizer PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}