Does Stork support CJK languages? #294
-
Does Stork support CJK languages (Chinese, Japanese, and Korean)? I am interested in using Stork for Zola, and I proposed it here: getzola/zola#1849. It was mentioned there that Stork may not have specific stemmers or stopword lists for languages other than English.

EDIT: I researched this through some of the open issues. Please correct me if I am wrong on any of this, thank you.

Stopword lists: not implemented yet (#250).
Stemmers: multilingual stemming is already supported through Snowball (#48), but CJK languages are not on the list of Snowball stemmers: https://snowballstem.org/algorithms/

Next, I see that stemmers may not be applicable to CJK at all. Even if stemming is not applicable to CJK, it seems the text can still be analyzed and search improved with tokenization? https://www.microfocus.com/documentation/starteam/163/en/Help/SvrAdmin/GUID-DAC55170-60DC-490B-BC4F-42F4F45F6029.html
Replies: 3 comments 1 reply
-
I'm going to migrate this to a Discussion and continue there. Hope that's okay.
-
I'd love for Stork to support CJK languages. Unfortunately, I think there are a few areas where it falls short today.

I've discussed this previously with another user, @YikSanChan, in issue #191. My understanding is that there are two main hurdles. The first, as you mentioned, is the list of stopwords in different languages, which would be easy enough to procure. The second, and likely more complicated, is the assumption Stork makes today that a piece of text is made up of words separated by spaces. That's not the case in Chinese (and perhaps Japanese and Korean as well?), so searching Chinese text is blocked until that assumption is unwound.

@YikSanChan mentioned that there are algorithms written to parse text and extract words, which would be useful. I haven't done enough recent research to determine whether there's a Rust crate available that exposes such an algorithm.

Unfortunately, I only speak English, so I'm not going to be the best person to add CJK support, mostly because I won't be able to verify that it's working. I'm more than happy to accept contributions or to collaborate more closely with someone to implement this, and I can do some research (maybe looking back at the Meilisearch PRs that @YikSanChan linked me to) to see if I'm missing something that would make it easier to implement.

On another note, thank you for suggesting Stork in the Zola project! I'm honored to be considered and happy to help get it set up in that project, if you need any support from me.

-James
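To make the second hurdle concrete, here is a minimal Python sketch. It is an illustration only and not Stork's code (Stork itself is written in Rust). It assumes a rough Unicode range check for CJK characters and uses overlapping character bigrams as a stand-in for a real word segmenter; the function names `whitespace_tokens`, `is_cjk`, and `bigram_tokens` are hypothetical.

```python
from itertools import groupby


def whitespace_tokens(text: str) -> list:
    """Split on whitespace: the word-boundary assumption described above."""
    return text.split()


def is_cjk(ch: str) -> bool:
    """Rough check (an assumption, not exhaustive) for common CJK characters."""
    code = ord(ch)
    return (
        0x4E00 <= code <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= code <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= code <= 0xD7A3   # Hangul syllables
    )


def bigram_tokens(text: str) -> list:
    """Hypothetical fallback: emit overlapping character bigrams for each
    run of CJK characters, and keep non-CJK runs as whole tokens."""
    tokens = []
    for chunk in text.split():
        # Split each whitespace-delimited chunk into CJK and non-CJK runs.
        for cjk, group in groupby(chunk, key=is_cjk):
            run = "".join(group)
            if cjk and len(run) > 1:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
            else:
                tokens.append(run)
    return tokens


sentence = "我喜欢搜索引擎"  # "I like search engines", written without spaces
print(whitespace_tokens(sentence))  # one token: the entire sentence
print(bigram_tokens(sentence))      # six overlapping two-character tokens
```

With whitespace splitting, the whole sentence becomes a single token, so a query for 搜索 ("search") can only match as a substring of that token. With the bigram fallback, 搜索 is one of the emitted tokens and can be indexed directly. Bigrams also produce tokens that are not real words (欢搜, for example), which is why a dictionary-based or statistical segmenter, like the algorithms @YikSanChan mentioned, would likely give better results.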
-
I implemented Stork as an option in my Zola theme, abridge. You can see the demo here: https://jieiku.github.io/abridge-stork/ (search for "zola").