
Preprocess Text: add Chinese segmentation module #536

Open
ajdapretnar opened this issue May 28, 2020 · 10 comments
Comments

@ajdapretnar
Collaborator

Chinese texts need a special kind of tokenization: Chinese is written without spaces between words, so the text cannot simply be split on whitespace or into individual characters. It would be nice to add a separate module for segmenting Chinese texts.

Option 1: NLTK with the Stanford segmenter.

Option 2: Jieba.

I would try NLTK first to avoid introducing new dependencies, then fall back to Jieba if NLTK proves insufficient.
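
For reference, option 2 is essentially a one-liner; a minimal sketch using Jieba's default (accurate) mode, where the exact token boundaries depend on Jieba's dictionary:

```python
# Minimal sketch of option 2: segmentation with Jieba's default mode.
import jieba

text = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"
tokens = jieba.lcut(text)   # lcut() returns the segments as a plain list
print(tokens)               # e.g. ['我', '来到', '北京', '清华大学'], dictionary-dependent
```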

@hankcs

hankcs commented Nov 26, 2020

Hi Orange text team. May I recommend a third option? I maintain an open-source multilingual NLP package called HanLP, backed by state-of-the-art deep learning techniques as well as efficient traditional ML models. HanLP is widely used in academia and in production (see our citations and the projects using HanLP). Recently a user told me you are planning Chinese support, so I'd like to suggest a more advanced option. If you're interested, you can try out our demo.

@ajdapretnar
Collaborator Author

@hankcs Thanks for the suggestion. I've already heard of HanLP. I'd love to try your demo, but I simply cannot make sense of it 😆 (I don't speak any Chinese). Would you perhaps be interested in submitting a PR, namely adding HanLP to Preprocess Text (perhaps even as a separate preprocessor)? At the very least, we would need a Chinese speaker to write the tests.

@hankcs

hankcs commented Nov 26, 2020

Sure, glad to help. Let's decide on the version first, since a new package means new dependencies. What kind of dependencies would you like to introduce?

  • HanLP 2.x uses TensorFlow, which might be too heavy if you don't plan to do deep learning (DL), but 2.x delivers the best accuracy and the most functionality.
  • HanLP 1.x is written natively in Java, and pyhanlp is its Python wrapper built on JPype. 1.x has been stable for 5 years, but we are ambitious NLP people and decided to pursue state-of-the-art DL techniques in 2.x.
  • HanLPerceptron is a native Python reimplementation of the HanLP 1.x tokenizer by @fann1993814, which is neat and fast.
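
To give a feel for the 1.x route, pyhanlp exposes segmentation through a single static call; a minimal sketch (assuming pyhanlp is installed; on first import it downloads the HanLP jar and model data):

```python
# Minimal sketch of the HanLP 1.x route via pyhanlp (JPype-based wrapper).
from pyhanlp import HanLP

terms = HanLP.segment("我爱自然语言处理")  # "I love natural language processing"
tokens = [term.word for term in terms]     # each Term carries .word and .nature (POS tag)
print(tokens)                              # e.g. ['我', '爱', '自然语言处理'], model-dependent
```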

@ajdapretnar
Collaborator Author

Sorry this was put on hold for such a long time. :( Not sure how we managed to forget about this issue.

I vote for HanLPerceptron, as using TF would add a large dependency for a single task.

@hankcs

hankcs commented Aug 3, 2021

Great, HanLPerceptron is a good choice. Let's see what needs to be done.

@ajdapretnar
Collaborator Author

Basically, this would be a new Preprocessor; let's call it HanTokenization (feel free to come up with a more sensible name). It would be added to orangecontrib.text.preprocess and inherit from Preprocessor. I would not add it to Tokenizer, but make it a separate, special tokenizer. What do you think @PrimozGodec?
One downside is that HanLPerceptron doesn't seem to ship wheels. We need to make sure it can be installed on all platforms (Windows, macOS, Linux). If not, users are responsible for installing it themselves, and the preprocessor becomes usable once the dependency is present.
The preprocessor should simply set the corresponding functions and/or properties where necessary; a rough sketch is below.
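
A hypothetical sketch of such a preprocessor (the class name, the base-class hook, and the hanlperceptron API and model path are all illustrative assumptions, not the actual orangecontrib.text.preprocess interface):

```python
# Hypothetical sketch; names and hooks are illustrative, not the real API.
try:
    import hanlperceptron  # optional dependency; may lack wheels on some platforms
except ImportError:
    hanlperceptron = None

from orangecontrib.text.preprocess import Preprocessor  # assumed import path


class HanTokenization(Preprocessor):
    """Segment Chinese text with HanLPerceptron instead of whitespace splitting."""

    name = "Chinese segmentation"

    def __init__(self, model_path="cws.bin"):  # illustrative model path
        if hanlperceptron is None:
            raise ImportError(
                "HanTokenization requires the optional hanlperceptron package."
            )
        self._segmenter = hanlperceptron.Segmenter(model_path)

    def _preprocess(self, text):
        # Return a list of tokens, mirroring the add-on's other tokenizers.
        return self._segmenter.segment(text)
```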

Most importantly, tests should be added to orangecontrib.text.tests.test_preprocess.py to make sure the widget returns sensible results. Also, perhaps check the tokenizer in combination with other preprocessors, such as filtering and lowercasing, to make sure the output is still sensible for Chinese.

I believe the task itself is fairly simple, but good tests need to be written to ensure the results make sense.
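
A minimal test sketch (HanTokenization refers to the hypothetical class above, and the assertions are illustrative; the actual expected tokens should be written by a Chinese speaker against the real model output):

```python
# Hypothetical test sketch for test_preprocess.py; assertions are illustrative.
import unittest


class TestHanTokenization(unittest.TestCase):
    def test_segments_chinese(self):
        tokenizer = HanTokenization()
        tokens = tokenizer._preprocess("我爱自然语言处理")
        # Segmentation must not lose or alter characters...
        self.assertEqual("".join(tokens), "我爱自然语言处理")
        # ...and should keep multi-character words together,
        # rather than splitting everything into single characters.
        self.assertTrue(any(len(token) > 1 for token in tokens))
```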

@hankcs

hankcs commented Aug 21, 2021

Sounds good. I'll work with the author of HanLPerceptron to get the wheels built and tested first.

@ajdapretnar
Collaborator Author

Related to issue #781.

@fishfree

Hope this feature can be implemented ASAP. It's vital for Chinese text processing!

@ajdapretnar
Collaborator Author

We are happy to accept contributions from the community. If you are willing to submit a PR, we will review it with priority.
