Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
feat(tokenizers): Add a vendored copy of the unicode support from lla…
…ma.cpp This is a much more efficient way to get this functionality working than a raw port. The original code caries MIT licensing, so the license is kept with a reference at the top of each file. This does introduce a bit of a redundancy in the regex support since the llama.cpp code relies on the STL versus RE2. This seems ok since it does not introduce an additional depencency, but a future optimization could be to refactor the llama.cpp code to leverage the (faster) RE2 implementation. The tradeoff would be a change in which regexes are supported. pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
- Loading branch information