
Add support for other languages #306

Closed
ajdapretnar opened this issue Aug 9, 2017 · 5 comments

Comments

@ajdapretnar
Collaborator

Text version

0.2.5

Orange version

3.5.dev

Expected behavior

Orange supports many key languages.

Actual behavior

No support for Latin, (old) Greek, Polish... Poor support for French, German, Spanish, and Portuguese. Chinese, Hindi, and Arabic should also be considered.

@gengyabc

I wish this add-on could support Chinese; you could use https://github.com/fxsjy/jieba or https://github.com/hankcs/HanLP as the backend.

@ajdapretnar
Collaborator Author

It does support Chinese (UDPipe lemmatization and sentiment analysis). Word embedding is upcoming.

@ajdapretnar
Collaborator Author

Ah, do you mean for tokenization? We have no experience working with Chinese, so someone else would have to set this up (especially the tests). HanLP doesn't have English documentation, but Jieba seems nice!

@gengyabc

gengyabc commented Mar 27, 2020

English version of HanLP is here: https://github.com/hankcs/HanLP/tree/master.

I have forked this repo, translated some widgets to Chinese, and added some Chinese tokenization support, but I don't know much about NLP, so I'm not sure what to do next.

(I am a teacher of machine learning in China, and my students know little English, so I translated it.)

@ajdapretnar
Collaborator Author

OK, I've checked both.
HanLP is, in my opinion, a no-go, because it depends on TensorFlow, which is (from what I know) a big dependency. Also, Text already has NLTK, which covers most of HanLP's features.

From what I gathered, the only real issue in Chinese text processing is word segmentation. So we would need a specialized Chinese segmenter. This could be added to the new Preprocess Text as a separate option.
We could go with Jieba, but if possible I'd like to avoid another dependency. It looks like NLTK also has a segmenter available. Perhaps we could check that first, then fall back to Jieba if NLTK turns out to be insufficient.
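The "prefer one segmenter, fall back if unavailable" idea could be sketched roughly like this. Note this is an illustrative sketch, not code from the add-on: the `ChineseTokenizer` class name and the character-level fallback are assumptions; `jieba.lcut` is the real Jieba API for segmenting a string into a list of tokens.

```python
class ChineseTokenizer:
    """Illustrative sketch: segment Chinese text, preferring Jieba if installed."""

    def __init__(self):
        try:
            import jieba  # optional dependency; real library, real API
            self._cut = jieba.lcut  # returns a list of word tokens
        except ImportError:
            # Naive fallback: treat every character as a token.
            # A real implementation would use an NLTK-based segmenter instead.
            self._cut = list

    def tokenize(self, text):
        # Drop whitespace-only tokens that some segmenters emit.
        return [tok for tok in self._cut(text) if tok.strip()]
```

Keeping the import inside `__init__` means the add-on itself would not gain a hard dependency; users who want better segmentation could install Jieba separately.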

I would close this issue, as it is too broad, and open a separate, more specific one.

Continued in #536.
