icu_tokenizer

The icu_tokenizer uses the same Unicode Text Segmentation algorithm as the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.

For instance, compare the tokens that the standard tokenizer and the icu_tokenizer produce when tokenizing ``Hello. I am from Bangkok.'' in Thai:

GET /_analyze?tokenizer=standard
สวัสดี ผมมาจากกรุงเทพฯ

The standard tokenizer produces two tokens, one for each sentence: สวัสดี, ผมมาจากกรุงเทพฯ. That is useful only if you want to search for the whole sentence ``I am from Bangkok,'' but not if you want to search for just ``Bangkok.''

GET /_analyze?tokenizer=icu_tokenizer
สวัสดี ผมมาจากกรุงเทพฯ

The icu_tokenizer, on the other hand, is able to break up the text into the individual words (สวัสดี, ผม, มา, จาก, กรุงเทพฯ), making them easier to search.
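The two requests above pass the tokenizer in the query string. As a minimal sketch, assuming an Elasticsearch release where `_analyze` takes its parameters in a JSON body with `tokenizer` and `text` fields, the same comparison could be prepared as follows. The helper name `analyze_request` is hypothetical, and the code only builds the request bodies; it does not contact a cluster.

```python
import json

THAI = "สวัสดี ผมมาจากกรุงเทพฯ"


def analyze_request(tokenizer: str, text: str) -> str:
    """Build the JSON body for an _analyze call (hypothetical helper, no network I/O)."""
    body = {"tokenizer": tokenizer, "text": text}
    # ensure_ascii=False keeps the Thai text readable instead of \u escapes.
    return json.dumps(body, ensure_ascii=False)


if __name__ == "__main__":
    for name in ("standard", "icu_tokenizer"):
        print("GET /_analyze")
        print(analyze_request(name, THAI))
        print()
```

The `icu_tokenizer` request succeeds only on a node where the ICU analysis plugin is installed.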

In contrast, the standard tokenizer ``over-tokenizes'' Chinese and Japanese text, often breaking up whole words into single characters. Because there are no spaces between words, it can be difficult to tell whether consecutive characters are separate words or form a single word. For instance:

  • 向 means facing, 日 means sun, and 葵 means hollyhock. When written together, 向日葵 means sunflower.

  • 五 means five or fifth, 月 means month, and 雨 means rain. The first two characters written together as 五月 mean the month of May, and adding the third character, 五月雨 means continuous rain. When combined with a fourth character, 式, meaning style, the word 五月雨式 becomes an adjective for anything consecutive or unrelenting.

Although each character may be a word in its own right, tokens are more meaningful when they retain the bigger original concept instead of just the component parts:

GET /_analyze?tokenizer=standard
向日葵

GET /_analyze?tokenizer=icu_tokenizer
向日葵

The standard tokenizer in the preceding example would emit each character as a separate token: 向, 日, 葵. The icu_tokenizer would emit the single token 向日葵 (sunflower).
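The difference between per-character and dictionary-based tokenization can be sketched in a few lines of Python. This is an illustration only: it uses a greedy longest-match lookup against a toy dictionary, which is an assumption made for the example and not ICU's actual algorithm or word list. The names `DICTIONARY`, `per_character`, and `longest_match` are hypothetical.

```python
# Toy dictionary built from the words discussed above (illustrative assumption;
# the dictionaries ICU ships are far larger).
DICTIONARY = {"向日葵", "五月", "五月雨", "五月雨式"}


def per_character(text):
    """Emit every character as its own token, as in the over-tokenized output."""
    return list(text)


def longest_match(text, dictionary=DICTIONARY):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    print(per_character("向日葵"))
    print(longest_match("向日葵"))
    print(longest_match("五月雨式"))
```

With this toy dictionary, `longest_match` keeps 向日葵 and 五月雨式 whole, while `per_character` splits them into their component characters.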

Another difference between the standard tokenizer and the icu_tokenizer is that the latter will break a word containing characters written in different scripts (for example, βeta) into separate tokens (β and eta), while the former will emit the word as a single token: βeta.
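Splitting at script boundaries can be approximated with the Python standard library. The sketch below is a rough stand-in, not how ICU works internally: it assumes the first word of a character's Unicode name (for example, GREEK or LATIN) is a good enough proxy for its script, whereas ICU uses the real Unicode Script property. The function names `script_of` and `split_by_script` are hypothetical.

```python
import unicodedata
from itertools import groupby


def script_of(ch):
    """Rough script label: the first word of the character's Unicode name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def split_by_script(word):
    """Split a word wherever the approximate script label changes."""
    return ["".join(group) for _, group in groupby(word, key=script_of)]


if __name__ == "__main__":
    print(split_by_script("βeta"))
    print(split_by_script("beta"))
```

Here `split_by_script("βeta")` separates the Greek β from the Latin eta, while the all-Latin `beta` stays in one piece. The name-based heuristic mislabels digits and punctuation, so it is suitable only for illustrating the idea.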