
Preprocess Text: add Chinese segmentation module #536

Open
ajdapretnar opened this issue May 28, 2020 · 10 comments
Comments

@ajdapretnar
Collaborator

Chinese texts need a special kind of tokenization: Chinese is written without spaces between words, so the text cannot simply be split on whitespace or into individual characters. It would be nice to add a separate module for segmenting Chinese texts.

Option 1: NLTK with the Stanford segmenter.

Option 2: Jieba.

I would try NLTK first to avoid introducing new dependencies, then fall back to Jieba if NLTK proves insufficient.
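
For reference, option 2 is essentially a one-liner; a minimal sketch using Jieba's default (accurate) mode, where the exact token boundaries depend on Jieba's dictionary:

```python
# Minimal sketch of option 2: segmentation with Jieba's default mode.
import jieba

text = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"
tokens = jieba.lcut(text)   # lcut() returns the segments as a plain list
print(tokens)               # e.g. ['我', '来到', '北京', '清华大学'], dictionary-dependent
```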

@hankcs

hankcs commented Nov 26, 2020

Hi Orange text team. May I recommend a third option? I maintain an open-source multilingual NLP package called HanLP, backed by state-of-the-art deep learning techniques as well as efficient traditional ML models. HanLP is widely used in academia and in production (see our citations and the projects using HanLP). Recently a user told me you are planning Chinese support, so I'd like to suggest a more advanced option. If you're interested, you can try out our demo.

@ajdapretnar
Collaborator Author

@hankcs Thanks for the suggestion. I've already heard of HanLP. I'd love to try your demo, but I simply cannot make sense of it 😆 (I don't speak any Chinese). Would you perhaps be interested in submitting a PR, namely adding HanLP to Preprocess Text (perhaps even as a separate preprocessor)? At the very least, we would need a Chinese speaker to write the tests.

@hankcs

hankcs commented Nov 26, 2020

Sure, glad to help. Let's decide on the version first, since a new package means new dependencies. What kind of dependencies would you like to introduce?

  • HanLP 2.x uses TensorFlow, which might be too heavy if you don't plan to do deep learning (DL), but 2.x delivers the best accuracy and the most functionality.
  • HanLP 1.x is written natively in Java, and pyhanlp is its Python wrapper built on JPype. 1.x has been stable for 5 years, but we are ambitious NLP people and decided to pursue state-of-the-art DL techniques in 2.x.
  • HanLPerceptron is a native Python reimplementation of the HanLP 1.x tokenizer by @fann1993814, which is neat and fast.
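
To give a feel for the 1.x route, pyhanlp exposes segmentation through a single static call; a minimal sketch (assuming pyhanlp is installed; on first import it downloads the HanLP jar and model data):

```python
# Minimal sketch of the HanLP 1.x route via pyhanlp (JPype-based wrapper).
from pyhanlp import HanLP

terms = HanLP.segment("我爱自然语言处理")  # "I love natural language processing"
tokens = [term.word for term in terms]     # each Term carries .word and .nature (POS tag)
print(tokens)                              # e.g. ['我', '爱', '自然语言处理'], model-dependent
```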

@ajdapretnar
Collaborator Author

Sorry this was put on hold for such a long time. :( Not sure how we managed to forget about this issue.

I vote for HanLPerceptron, as using TF would add a large dependency for a single task.

@hankcs

hankcs commented Aug 3, 2021

Great, HanLPerceptron is a good choice. Let's see what needs to be done.

@ajdapretnar
Collaborator Author

Basically, this would be a new Preprocessor; let's call it HanTokenization (feel free to come up with a more sensible name). It would be added to orangecontrib.text.preprocess and inherit from Preprocessor. I would not add it to Tokenizer, but make it a separate, special tokenizer. What do you think @PrimozGodec?
One downside is that HanLPerceptron doesn't seem to ship wheels. We need to make sure it can be installed on all platforms (Windows, macOS, Linux). If not, users are responsible for installing it themselves, and the preprocessor becomes usable once the dependency is present.
The preprocessor should simply set the corresponding functions and/or properties where necessary; a rough sketch is below.
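
A hypothetical sketch of such a preprocessor (the class name, the base-class hook, and the hanlperceptron API and model path are all illustrative assumptions, not the actual orangecontrib.text.preprocess interface):

```python
# Hypothetical sketch; names and hooks are illustrative, not the real API.
try:
    import hanlperceptron  # optional dependency; may lack wheels on some platforms
except ImportError:
    hanlperceptron = None

from orangecontrib.text.preprocess import Preprocessor  # assumed import path


class HanTokenization(Preprocessor):
    """Segment Chinese text with HanLPerceptron instead of whitespace splitting."""

    name = "Chinese segmentation"

    def __init__(self, model_path="cws.bin"):  # illustrative model path
        if hanlperceptron is None:
            raise ImportError(
                "HanTokenization requires the optional hanlperceptron package."
            )
        self._segmenter = hanlperceptron.Segmenter(model_path)

    def _preprocess(self, text):
        # Return a list of tokens, mirroring the add-on's other tokenizers.
        return self._segmenter.segment(text)
```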

Most importantly, tests should be added to orangecontrib.text.tests.test_preprocess.py to make sure the widget returns sensible results. Also, perhaps check the tokenizer in combination with other preprocessors, such as filtering and lowercasing, to make sure the output is still sensible for Chinese.

I believe the task itself is fairly simple, but good tests need to be written to ensure the results make sense.
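
A minimal test sketch (HanTokenization refers to the hypothetical class above, and the assertions are illustrative; the actual expected tokens should be written by a Chinese speaker against the real model output):

```python
# Hypothetical test sketch for test_preprocess.py; assertions are illustrative.
import unittest


class TestHanTokenization(unittest.TestCase):
    def test_segments_chinese(self):
        tokenizer = HanTokenization()
        tokens = tokenizer._preprocess("我爱自然语言处理")
        # Segmentation must not lose or alter characters...
        self.assertEqual("".join(tokens), "我爱自然语言处理")
        # ...and should keep multi-character words together,
        # rather than splitting everything into single characters.
        self.assertTrue(any(len(token) > 1 for token in tokens))
```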

@hankcs

hankcs commented Aug 21, 2021

Sounds good. I'll work with the author of HanLPerceptron to get the wheels built and tested first.

@ajdapretnar
Collaborator Author

Related to issue #781.

@fishfree

Hope this feature can be implemented ASAP. It's vital for Chinese text processing!

@ajdapretnar
Collaborator Author

We are happy to accept contributions from the community. If you are willing to submit a PR, we will review it with priority.
