Preprocess Text: add Chinese segmentation module #536
Hi Orange text team. May I recommend a third option? I'm maintaining an open-source multilingual NLP package called HanLP, backed by state-of-the-art deep learning techniques as well as efficient traditional ML models. HanLP has been widely used in academia and in production environments (see our citations and the projects using HanLP). Recently a user told me you are planning to add Chinese support, so I'd like to suggest a more advanced option. If you're interested, you can try out our demo.
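For reference, a minimal sketch of what HanLP-based segmentation looks like in Python. The pretrained-model identifier below is an assumption (model names vary between HanLP releases); check `hanlp.pretrained.tok` for the identifiers shipped with your version.

```python
# Minimal sketch of Chinese word segmentation with HanLP 2.x.
# NOTE: the model identifier is an assumption; available names live in
# hanlp.pretrained.tok and differ between releases.
import hanlp

# Load a pretrained tokenization model (downloaded on first use).
tokenizer = hanlp.load(hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH)

print(tokenizer("商品和服务"))
# Expected output along the lines of: ['商品', '和', '服务']
```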
@hankcs Thanks for the suggestion. I've already heard about HanLP. I'd love to try your demo, but I simply cannot make sense of it 😆 (I don't speak any Chinese). Would you perhaps be interested in submitting a PR? Namely, adding HanLP to Preprocess Text (perhaps it can even be a separate preprocessor)? We would need a Chinese speaker to write tests at least.
Sure, glad to help. Let's decide the version first, since a new package means new dependencies. What kind of dependencies would you like to introduce?
Sorry this got put on hold for such a long time. :( Not sure how we managed to forget about this issue. I vote for HanLPerceptron, as using TF would add a large dependency for a single task.
Great, HanLPerceptron is a good choice. Let's see what needs to be done.
Basically, this would be a new Preprocessor; let's call it HanTokenization (feel free to come up with a more sensible name). It is added alongside the existing preprocessors. The most important part: tests should be added as well. I believe the task is quite trivial, but good tests need to be written to ensure the results make sense.
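To make the shape of the change concrete, here is a rough sketch of a HanTokenization preprocessor plus a test. This is illustrative only: the class and method names are assumptions, not the actual orangecontrib.text API, and Jieba merely stands in as a well-known segmenter where the real PR would call HanLPerceptron.

```python
# Illustrative sketch only: names are assumptions, not the actual
# orangecontrib.text API, and jieba stands in for HanLPerceptron.
import unittest

import jieba


class HanTokenization:
    """Hypothetical preprocessor that segments Chinese text into words."""

    name = "Chinese tokenization"

    def tokenize(self, text):
        # jieba.lcut returns a plain list of word tokens.
        return jieba.lcut(text)


class TestHanTokenization(unittest.TestCase):
    def test_segments_words(self):
        text = "我爱北京天安门"
        tokens = HanTokenization().tokenize(text)
        # Segmentation must preserve the original characters and group them
        # into words rather than falling back to one character per token.
        self.assertEqual("".join(tokens), text)
        self.assertLess(len(tokens), len(text))


if __name__ == "__main__":
    unittest.main()
```

The actual tests would assert on concrete expected segmentations written by a Chinese speaker, as discussed above.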
Sounds good. I'll work together with the author of HanLPerceptron.
Related to issue #781.
Hope this feature can be implemented ASAP. It's vital for Chinese text processing!
We are happy to accept contributions from the community. If you are willing to open a PR, we will review it with priority.
Chinese texts need a special kind of tokenization: they cannot simply be split on whitespace or into individual characters. It would be nice to add a separate module for segmenting Chinese texts.
Option 1: NLTK with Stanford segmenter.
Option 2: Jieba.
I would try NLTK first to avoid introducing new dependencies, then fall back to Jieba if NLTK proves insufficient.
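For a quick sense of option 2, a minimal Jieba illustration (assuming `pip install jieba`; the sample sentence is just an example):

```python
import jieba

text = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"

# Whitespace splitting does nothing useful here: the sentence has no spaces.
print(text.split())      # ['我来到北京清华大学'] - one unsegmented chunk

# Jieba's default (accurate) mode segments the sentence into words.
print(jieba.lcut(text))  # e.g. ['我', '来到', '北京', '清华大学']
```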