
add jieba tokenizer for Chinese #209

Merged: 1 commit merged into valeriansaliou:master on Nov 7, 2021

Conversation

@vincascm (Contributor) commented May 8, 2020

add jieba tokenizer for Chinese

@valeriansaliou (Owner)

Hello there! Your code is Chinese-specific, which I am not comfortable with in terms of code hygiene. Can you also explain the purpose of this PR? I am not familiar with the Jieba tokenizer.

@vincascm (Contributor, Author) commented May 9, 2020

@valeriansaliou

  1. Chinese does not have a trivial word segmentation process: https://en.wikipedia.org/wiki/Text_segmentation

  2. jieba is a widely used Chinese tokenizer: https://github.com/fxsjy/jieba#jieba-1 (see the sketch after this list)

  3. This may also allow adding Chinese-specific features later.
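For readers not familiar with jieba, here is a minimal sketch (not code from this PR) of what segmentation looks like with the jieba-rs crate, the Rust port of the jieba tokenizer linked above; whether this PR uses exactly this crate and API is an assumption here.

```rust
// Minimal sketch: segmenting a Chinese sentence with the jieba-rs crate.
use jieba_rs::Jieba;

fn main() {
    // Jieba ships with a built-in dictionary, so no external files are needed.
    let jieba = Jieba::new();

    // `cut` splits the sentence into words; the second argument enables the
    // HMM model for words that are not in the dictionary.
    let words = jieba.cut("小明硕士毕业于中国科学院计算所", true);
    println!("{:?}", words);
}
```

A naive whitespace split would return the whole sentence as a single token, which is why a search backend needs a segmenter to build useful index terms for Chinese text.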

@sftblw commented May 15, 2020

It is common to use a dedicated tokenizer for CJK text, even in Elasticsearch (see a Google search for "es + mecab tokenizer"). As an example, the NLP library spaCy requires an external tokenizer installed on the system for:

| Chinese | Japanese          | Korean                |
| ------- | ----------------- | --------------------- |
| jieba   | mecab via fugashi | mecab-ko via natto-py |

@dzcpy commented Jun 3, 2021

Have you guys found a solution for Chinese tokenization?
Maybe we can borrow some ideas from https://github.com/tantivy-search/tantivy, such as making the tokenizer configurable in order to support CJK languages.
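To illustrate the configurable-tokenizer idea, here is a purely hypothetical sketch, loosely inspired by tantivy's pluggable tokenizers; none of these type or function names exist in Sonic or tantivy.

```rust
// Hypothetical sketch of a pluggable tokenizer selected from configuration.
use jieba_rs::Jieba;

trait Tokenize {
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str>;
}

/// Default tokenizer: split on whitespace (fine for space-delimited languages).
struct WhitespaceTokenizer;

impl Tokenize for WhitespaceTokenizer {
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str> {
        text.split_whitespace().collect()
    }
}

/// Chinese tokenizer backed by jieba.
struct ChineseTokenizer {
    jieba: Jieba,
}

impl Tokenize for ChineseTokenizer {
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str> {
        self.jieba.cut(text, true)
    }
}

/// The tokenizer could then be chosen from configuration (e.g. a hypothetical
/// `tokenizer = "chinese"` setting) instead of being hard-coded per language.
fn tokenizer_from_config(name: &str) -> Box<dyn Tokenize> {
    match name {
        "chinese" => Box::new(ChineseTokenizer { jieba: Jieba::new() }),
        _ => Box::new(WhitespaceTokenizer),
    }
}

fn main() {
    let tokenizer = tokenizer_from_config("chinese");
    println!("{:?}", tokenizer.tokenize("中文分词是搜索的基础"));
}
```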

@hajiuxbz

Hoping this gets merged.

@rcy17 commented Nov 6, 2021

One and a half years later... Sincerely hoping this gets merged.

@valeriansaliou valeriansaliou merged commit 8793dae into valeriansaliou:master Nov 7, 2021
@valeriansaliou (Owner)

Thanks for the PR, and sorry it took so much time. I am currently updating the jieba library to the latest version and refactoring the code to make it an optional feature, since not all Sonic users need Chinese tokenization support (the library adds some size overhead).
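For context on what "optional feature" means here, a hedged sketch of Cargo feature gating; the feature name and crate version below are assumptions, not necessarily what Sonic ended up shipping.

```rust
// Sketch of feature-gated tokenization; "tokenizer-chinese" is a hypothetical
// feature name.
//
// In Cargo.toml, the dependency would be optional and tied to a feature:
//
//   [dependencies]
//   jieba-rs = { version = "0.6", optional = true }
//
//   [features]
//   tokenizer-chinese = ["jieba-rs"]

#[cfg(feature = "tokenizer-chinese")]
fn tokenize(text: &str) -> Vec<&str> {
    use jieba_rs::Jieba;
    // For brevity the Jieba instance is built per call here; a real code base
    // would construct it once and reuse it, since dictionary loading is costly.
    let jieba = Jieba::new();
    jieba.cut(text, true)
}

#[cfg(not(feature = "tokenizer-chinese"))]
fn tokenize(text: &str) -> Vec<&str> {
    // Fallback when the feature is disabled: plain whitespace splitting, so
    // users who do not need Chinese support avoid the binary-size overhead.
    text.split_whitespace().collect()
}

fn main() {
    println!("{:?}", tokenize("中文分词"));
}
```

Building with `cargo build --features tokenizer-chinese` would then opt into the Chinese tokenizer, while a default build leaves it out.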
