Chinese Languages support #503
-
Will this be supported in the next version?
-
Regarding Jieba producing too-long words: I have looked into the examples from the linked issues. For instance, in this issue, I would suggest checking whether the HMM is enabled and, if it is, disabling it. I always disable the HMM in my own use of Jieba.
-
The Pinyin system is suitable for representing the phonology of (Standard) Mandarin only, while IPA can represent basically any language or dialect. Pinyin is not "less precise"; it reflects only one variety of Chinese, but it does so precisely, and that variety is also the most commonly spoken one. Pinyin accurately represents both the standardized mainland China dialect of Mandarin and the standardized Taiwan dialect of Mandarin.
Different Sinitic languages have substantially different phonological systems, and their dialects also vary. If your targets are countries and/or regions as per ISO 3166, you can get away with implementing phonological representation only for Standard Mandarin (both mainland China and Taiwan) and Standard Cantonese. You might also want to implement Taiwanese Hokkien or Taigi, for which there are two standardized dialects, as well as Taiwanese Hakka; both are official languages of Taiwan in addition to Mandarin.
The Unihan database provides standardized readings for Mandarin (in Pinyin) and Cantonese (in Jyutping). The choice of romanization system is not important as long as it accurately reflects the phonology.
Note: due to the huge number of homophones, you likely want to do this only after segmentation. For example, 法治 fǎ zhì should also match 法制 fǎ zhì, but you don't want someone searching for 治 on its own to match 制. (These two phrases are not pronounced identically in Cantonese.)
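The note about matching only after segmentation can be illustrated with a minimal Python sketch. It is not Meilisearch or Charabia code: the `PINYIN` table is a tiny hand-written stand-in for a real source such as Unihan, and the rule "single characters match literally, multi-character words match by sound" is one assumed way to implement the suggestion.

```python
# Hypothetical hand-written readings (tone numbers instead of diacritics);
# a real system would load them from a source such as the Unihan database.
PINYIN = {"法": "fa3", "治": "zhi4", "制": "zhi4"}


def word_key(word):
    """Phonological key for one already-segmented word."""
    return " ".join(PINYIN[ch] for ch in word)


def build_index(words):
    """Group indexed words by their phonological key."""
    index = {}
    for word in words:
        index.setdefault(word_key(word), set()).add(word)
    return index


def search(index, query_word):
    """Assumed policy: homophone matching only for multi-character words."""
    if len(query_word) == 1:
        # A lone character must match literally, never by sound.
        return {w for words in index.values() for w in words if w == query_word}
    return index.get(word_key(query_word), set())


index = build_index(["法制", "制"])
print(search(index, "法治"))  # the homophone word 法制 is found
print(search(index, "治"))    # empty: 治 alone does not match 制
```

Keying on whole words keeps the homophone space small, which is the reason for running this step after the segmenter rather than on raw characters.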
-
I do not suggest using the IDS list for identifying visually similar characters. You should use https://github.com/hfhchan/irg/blob/master/kVariants.txt instead, which provides mappings for (1) visually similar characters with identical meaning (i.e. z-variants), (2) semantic variants, (3) simplified forms (both standard and non-standard), and (4) other erroneous forms. Note that this list only contains characters which are identical in meaning; it is somewhat prescriptive and intentionally ignores all irregular simplifications.
For visually similar characters only, you will also want to look at kSpoofingVariant and kZVariant in Unihan. Note that this list excludes similar-looking common characters that a primary school student should be able to distinguish, e.g. 土 vs 士. These are both very common characters, and normalizing them would produce more false positives than useful matches.
For a search that is actually useful to Chinese users, use the mappings in kTraditionalVariant, kSimplifiedVariant, kSemanticVariant and kSpecializedSemanticVariant of Unihan, which convert between characters of identical meaning (in all or certain contexts). Note that in Unihan, pairs such as 温 and 溫 are considered traditional/simplified pairs, but ordinary native speakers treat them as more or less identical. That's why my kVariants list treats them as z-variants instead.
You should also look into the MSR mappings for Chinese (used by ICANN in domain names). Whatever is blocked there has been determined to be (somewhat) identical in meaning or to carry a high probability of spoofing (i.e. indistinguishable at first glance). This list should be roughly identical to all the previous k* properties in Unihan.
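As a rough illustration of how several variant mappings could be combined, the Python sketch below merges variant pairs into equivalence classes with a small union-find and folds text onto one representative per class. The pairs are hand-written for the example and not parsed from kVariants.txt or Unihan, whose file formats are not assumed here; 土 and 士 are deliberately left out, as the comment recommends.

```python
# Hypothetical variant pairs written by hand for illustration only.
VARIANT_PAIRS = [("溫", "温"), ("說", "説"), ("説", "说")]


def build_canonical(pairs):
    """Map every character to one representative of its variant class."""
    parent = {}

    def find(ch):
        parent.setdefault(ch, ch)
        while parent[ch] != ch:
            parent[ch] = parent[parent[ch]]  # path halving
            ch = parent[ch]
        return ch

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller code point as representative (deterministic).
            low, high = sorted((root_a, root_b))
            parent[high] = low
    return {ch: find(ch) for ch in list(parent)}


CANONICAL = build_canonical(VARIANT_PAIRS)


def fold(text):
    """Replace each character by its class representative, if it has one."""
    return "".join(CANONICAL.get(ch, ch) for ch in text)


print(fold("溫度") == fold("温度"))  # True: both spellings converge
print(fold("土") == fold("士"))      # False: distinct characters stay distinct
```

Which character represents a class does not matter for search, as long as indexing and querying use the same table.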
-
Hello all!
All these issues are open to external contributions during the whole month, so don't hesitate to contribute! 🧑‍💻 This is another step in enhancing Chinese language support; depending on future feedback, we will be able to go further. Thanks for all your feedback! ✍️ 🇨🇳
-
Excuse me, is it possible to support Traditional Chinese display for synonym output?
-
Hello everyone 📣 📣
Meilisearch has just released its first RC (Release Candidate) for v1.0.0! This new version of Meilisearch contains changes to Chinese language support, so you might want to test it. See: How do we improve support for Chinese language?
Please let us know here, in the thread, how these changes impact your usage 👇 👇 Thanks in advance for your help! 🙏
-
It is now the era of AI. We use AI-based NLP for Chinese word segmentation to achieve better industry-specific segmentation results. If we use this method to pre-segment Chinese text into whitespace-separated strings, how can we disable Charabia's Jieba segmentation and let unicode-segmentation split on whitespace only?
-
I notice that when use jieba.cu,Chinese word |
-
I had the same issue: "新" ==> "芯".
Version used: meilisearch-windows-amd64_1.5.0
I think this Chinese language issue is very important. If it's not solved, Meilisearch cannot be applied in a production environment.
Data List:
[{
"id": 1763204760840700001,
"create_date": "2023/9/8 17:20:23",
"search_text": "新中",
"search_times": 10
}, {
"id": 1763204760840700002,
"create_date": "2023/9/8 17:20:23",
"search_text": "浮雕装饰画",
"search_times": 155
}, {
"id": 1763204760840700003,
"create_date": "2023/9/8 17:20:23",
"search_text": "新中背景墙",
"search_times": 20
}, {
"id": 1763204760840700004,
"create_date": "2023/9/8 17:20:23",
"search_text": "中式",
"search_times": 50
}, {
"id": 1763204760840700005,
"create_date": "2023/9/8 17:20:23",
"search_text": "画芯",
"search_times": 100
}, {
"id": 1763204760840700006,
"create_date": "2023/9/8 17:20:23",
"search_text": "抽象装饰画",
"search_times": 550
}, {
"id": 1763204760840700007,
"create_date": "2023/9/8 17:20:23",
"search_text": "中式背景墙",
"search_times": 120
}
]
Search Request:
{
"q": "\"新\"",
"matchingStrategy": "all",
"attributesToSearchOn": [
"search_text"
],
"attributesToHighlight": [
"search_text"
],
"attributesToRetrieve": [
"search_text"
],
"limit": 10,
"offset": 0,
"showRankingScore": true
}
Search Result:
{
"hits": [
{
"search_text": "新中",
"_formatted": {
"search_text": "<em>新</em>中"
},
"_rankingScore": 0.49242424242424243
},
{
"search_text": "新中背景墙",
"_formatted": {
"search_text": "<em>新</em>中背景墙"
},
"_rankingScore": 0.49242424242424243
},
{
"search_text": "画芯",
"_formatted": {
"search_text": "画<em>芯</em>"
},
"_rankingScore": 0.4621212121212121
}
],
"query": "\"新\"",
"processingTimeMs": 0,
"limit": 10,
"offset": 0,
"estimatedTotalHits": 3
}
Moreover, I found that words added to the stop_words list still seem to be searchable.
localhost:7700/indexes/ithome_search_suggestions/settings/stop-words
Response:
[
"画芯"
]
Test Request:
{
"q": "\"画芯\"",
"matchingStrategy": "all",
"attributesToSearchOn": [
"search_text"
],
"attributesToHighlight": [
"search_text"
],
"attributesToRetrieve": [
"search_text"
],
"limit": 10,
"offset": 0,
"showRankingScore": true
}
Response Result (you can see that "画芯" still seems to be searchable):
{
"hits": [
{
"search_text": "画芯",
"_formatted": {
"search_text": "<em>画</em><em>芯</em>"
},
"_rankingScore": 0.5
}
],
"query": "\"画芯\"",
"processingTimeMs": 0,
"limit": 10,
"offset": 0,
"estimatedTotalHits": 1
}
-
I agree with @houlang's example of Douyin's search: when you type Chinese in an IME, the input is Pinyin first, and that Pinyin goes into the search box as you type. Meanwhile, if you enter a character, Douyin doesn't seem to apply any typo tolerance. They probably figure that if you weren't sure which character it was, you would not select one from your IME; you would leave the Pinyin as is and pick one of the search suggestions. This is my guess.
My suggestion is that the matching algorithm should prioritize exact matches on the Chinese characters if Chinese is entered. If Pinyin is entered, it can fuzzy search over the Pinyin representations of the Chinese (which appears to be what the pinyin normalization was developed for). The pinyin normalization seems good, but some part of the code needs to prevent a CJK character from matching through that pinyin normalization.
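To make the suggestion concrete, here is a minimal Python sketch of routing a query by script. It is not Meilisearch's matching code: the `READINGS` table is hand-written, the CJK test covers only the basic CJK Unified Ideographs block, and a plain prefix comparison stands in for the fuzzy Pinyin search a real engine would apply.

```python
import re

# Assumption: checking only the basic CJK Unified Ideographs block is enough here.
CJK = re.compile(r"[\u4e00-\u9fff]")

# Hypothetical toneless Pinyin readings for the indexed texts.
READINGS = {"新中": "xinzhong", "画芯": "huaxin", "中式": "zhongshi"}


def search(query, documents):
    """Exact match for Chinese input, Pinyin match for Latin input."""
    if CJK.search(query):
        # Characters were chosen in the IME: never go through Pinyin.
        return [doc for doc in documents if query in doc]
    # Latin input: compare against the Pinyin form (prefix match as a stand-in
    # for typo-tolerant matching).
    latin = query.lower().replace(" ", "")
    return [doc for doc in documents if READINGS.get(doc, "").startswith(latin)]


docs = ["新中", "画芯", "中式"]
print(search("新", docs))  # 画芯 is not returned even though 芯 sounds like 新
print(search("xin", docs))
```

With this split, the "新" query from the report above would no longer highlight "芯", while Pinyin input keeps its phonological matching.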
-
Chinese Languages support
Current behavior and pointed out issues
Segmentation
For Meilisearch, the segmenter's goal is to cut a text into several "words" which will be searchable in Meilisearch during a search query.
To segment Chinese texts, and because Chinese words are not always space-separated, we currently use Jieba instead of the default unicode-segmentation.
Drawbacks
Jieba helps us segment non-space-separated words, but our community reported that the quality of the segmentation was not sufficient:
Normalization
For Meilisearch, the normalizer's goal is to alter the words that can be considered equivalent in order to make them converge to a common representation.
To normalize Chinese words, we convert traditional characters into simplified ones using character_converter.
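As an illustration of what this conversion step does, the Python sketch below folds traditional characters to simplified ones with a translation table. It does not use or reproduce the character_converter API; the five-entry table is a hand-written excerpt, and a real table contains thousands of mappings.

```python
# Hypothetical excerpt of a traditional-to-simplified table.
T2S = str.maketrans({"語": "语", "體": "体", "簡": "简", "檢": "检", "測": "测"})


def normalize(text):
    """Convert known traditional characters; leave everything else untouched."""
    return text.translate(T2S)


# Both spellings converge to the same indexed form.
print(normalize("簡體") == normalize("简体"))
```

Because documents and queries pass through the same function, a query typed in either script finds documents written in the other.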
Drawbacks
This normalization process allows Meilisearch to find documents containing both traditional and simplified characters.
However, our community reports that this is insufficient, and the typo tolerance struggles to find the real user typo:
Potential enhancements
Segmentation
As written before, Jieba segmentation creates too-long words that decrease the number of relevant documents found by Meilisearch.
However, we need an equivalent tokenizer, and we can't simply treat each character as a word because that would make Meilisearch return a lot of irrelevant documents (meilisearch/meilisearch#2390).
We have to find another tokenizer that cuts the provided text into words without creating too-long words.
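To show how word length affects what can be found, here is a small Python sketch of forward maximum matching with a configurable length cap. This is a swapped-in textbook technique used only for illustration, not Jieba's algorithm and not a proposed Meilisearch tokenizer; the dictionary is a hypothetical hand-written word list.

```python
# Hypothetical dictionary for illustration only.
DICTIONARY = {"中式", "背景", "背景墙", "装饰", "装饰画"}


def segment(text, dictionary, max_len):
    """Greedy longest-match segmentation, never exceeding max_len characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always succeeds.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


print(segment("中式背景墙", DICTIONARY, max_len=3))  # ['中式', '背景墙']
print(segment("中式背景墙", DICTIONARY, max_len=2))  # ['中式', '背景', '墙']
```

With the longer tokens, a query for 背景 has no exact token to match in this document; with the cap at two characters it does, at the cost of more tokens per document.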
In the discussion below, some contributors suggested to:
Normalization
The current normalization process for Chinese script is meant to unify traditional and simplified Chinese, but we could change the approach by encoding the Chinese characters phonologically or visually. In Visually and Phonologically Similar Characters in Incorrect Simplified Chinese Words, we can read:
Phonological normalization
In Phonology of Mandarin Chinese: Pinyin vs. IPA we can read:
Because many errors are phonological, and because Chinese script is not a phonological writing system, we should normalize tokens into a phonological representation.
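One detail of such a representation is whether tones are kept. The Python sketch below, using only the standard library, strips the four tone diacritics from a Pinyin string while preserving ü. Dropping tones is an assumption made for illustration (it lets a query typed without tone marks match), not something decided in this discussion.

```python
import unicodedata

# Combining marks for the four Mandarin tones: macron, acute, caron, grave.
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def strip_tones(pinyin):
    """Remove tone diacritics but keep the diaeresis of ü."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


print(strip_tones("fǎ zhì"))  # fa zhi
```

Stripping every combining mark would also turn ü into u and merge syllables such as lü and lu, which is why only the tone marks are removed here.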
The main issue with this is that Chinese dialects don't always have the same pronunciation for the same word. In Visually and Phonologically Similar Characters in Incorrect Simplified Chinese Words, we can read:
In his comment below, hfhchan suggested several phonological normalizations:
Visual normalization
Visually and Phonologically Similar Characters in Incorrect Simplified Chinese Words addresses this subject. Its solution is to encode each character as the Cangjie codes of the components that compose it, linked to the structure of the original character. This method would allow Meilisearch to retrieve similar characters via the typo criterion.
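The idea can be sketched in Python as follows: once characters are encoded by their components, an ordinary edit distance (the kind of measure a typo criterion relies on) becomes a rough measure of visual similarity. The three Cangjie codes are a hand-written excerpt that should be checked against a full Cangjie table, the structural information mentioned in the paper is left out, and nothing here reflects how Meilisearch computes typos internally.

```python
# Hypothetical excerpt of a Cangjie table (components as Latin key letters).
CANGJIE = {"明": "AB", "朋": "BB", "林": "DD"}


def edit_distance(a, b):
    """Classic Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def visual_distance(x, y):
    """Distance between two characters measured on their component codes."""
    return edit_distance(CANGJIE[x], CANGJIE[y])


print(visual_distance("明", "朋"))  # 1: one component differs
print(visual_distance("明", "林"))  # 2: no component in common
```

On the raw characters, both pairs would simply count as one substitution, so the encoding is what lets 明 and 朋 rank as closer than 明 and 林.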
I found a promising GitHub repository, maintained by @hfhchan, where we can find Ideographic Description Sequence dictionaries to decompose CJK characters.
Phonological vs visual normalization: which should we choose? Could we have both?
TBD @ManyTheFish with the potential help of the community
Contribute!
At Meilisearch, we don't speak or understand all the languages in the world, so we could be wrong in our interpretation of how to support a new language in order to provide a relevant search experience.
However, if you are a native speaker, don't hesitate to contribute to enhancing this experience:
Thanks for your help!