
Add support for other languages #306

Closed
ajdapretnar opened this issue Aug 9, 2017 · 5 comments

Comments

@ajdapretnar
Collaborator

Text version

0.2.5

Orange version

3.5.dev

Expected behavior

Orange supports many key languages.

Actual behavior

No support for Latin, (old) Greek, Polish... Poor support for French, German, Spanish, and Portuguese. Chinese, Hindi, and Arabic should also be considered.

@gengyabc

I wish this add-on could support Chinese; you could use https://github.com/fxsjy/jieba or https://github.com/hankcs/HanLP as the backend.

@ajdapretnar
Collaborator Author

It does support Chinese (UDPipe lemmatization and sentiment analysis). Word embedding is upcoming.

@ajdapretnar
Collaborator Author

Ah, do you mean for tokenization? We have no experience working with Chinese, so someone else would have to set this up (especially the tests). HanLP doesn't have English documentation, but Jieba seems nice!

@gengyabc

gengyabc commented Mar 27, 2020

English version of HanLP is here: https://github.com/hankcs/HanLP/tree/master.

I have forked this repo, translated some widgets to Chinese, and added some Chinese tokenization support, but I don't know much about NLP, so I'm not sure what to do next.

(I am a teacher of machine learning in China, and my students know little English, so I translated it.)

@ajdapretnar
Collaborator Author

OK, I've checked both.
HanLP is, in my opinion, a no-go, because it depends on TensorFlow, which is (from what I know) a big dependency. Also, Text already has NLTK, which covers most of HanLP's features.

From what I gathered, the only real issue in Chinese text processing is word segmentation. So we would need a specialized Chinese segmenter. This could be added to the new Preprocess Text as a separate option.
We could go with Jieba, but if possible I'd like to avoid another dependency. It looks like NLTK also has a segmenter available. Perhaps we could check that first, then fall back to Jieba if NLTK turns out to be insufficient.
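The "prefer one segmenter, fall back if unavailable" idea could be sketched roughly like this. Note this is an illustrative sketch, not code from the add-on: the `ChineseTokenizer` class name and the character-level fallback are assumptions; `jieba.lcut` is the real Jieba API for segmenting a string into a list of tokens.

```python
class ChineseTokenizer:
    """Illustrative sketch: segment Chinese text, preferring Jieba if installed."""

    def __init__(self):
        try:
            import jieba  # optional dependency; real library, real API
            self._cut = jieba.lcut  # returns a list of word tokens
        except ImportError:
            # Naive fallback: treat every character as a token.
            # A real implementation would use an NLTK-based segmenter instead.
            self._cut = list

    def tokenize(self, text):
        # Drop whitespace-only tokens that some segmenters emit.
        return [tok for tok in self._cut(text) if tok.strip()]
```

Keeping the import inside `__init__` means the add-on itself would not gain a hard dependency; users who want better segmentation could install Jieba separately.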

I would close this issue, as it is too broad, and open a separate, more specific one.

Continued in #536.
