Non-English search support #2393

Open · 4 of 9 tasks

Sunshine40 opened this issue May 20, 2024 · 5 comments
Labels
C-enhancement Category: Enhancement or feature request

Comments

Sunshine40 commented May 20, 2024

#1081 has been stuck for a while, so I tried implementing my own version.

Preview the search functionality online

Inspired by #1496.

Major implementation steps:

  • Land Add non-English search support #2392
    • Fix buggy teaser (preview) 1
      • Keyword highlighting doesn't work as expected for Chinese
      • The selected range displayed as teaser sometimes contains no keywords at all (the target chapter does contain the keyword though)
    • Enhance mixed multi-language search support
      • Implement a "fallback" strategy for text in mixed languages (detail)
  • Support Emoji search
  • Localize UI rendered by HTML renderer

Unresolved questions:

  • Should the search-non-english feature be enabled by default? (Not likely, since it causes severe binary size bloat.)
    • Should each supported language have an individual feature flag? Would it make the feature flag list too noisy?
  • Should we consider other search solutions than elasticlunr.js & lunr-languages?
  • Should the new "fallback" search implementation apply to English books as well?

Footnotes

  1. (screenshot)

@Sunshine40 Sunshine40 added the C-enhancement Category: Enhancement or feature request label May 20, 2024
@duskmoon314

Another solution is to use fzf, as in #2052. I think it is simpler than adding a bunch of JS for different languages, but I haven't compared the search results to see whether it is a good solution.


Sunshine40 commented Jun 4, 2024

Update

I've fixed the major problems mentioned before.

Now the teaser works well with CJK text that has no spaces as word separators.

(screenshot)

And it can also return the correct result for "multiple words joined together as a single keyword", like this:

(screenshot)

I added keyword highlighting in the breadcrumbs part, in case the keywords don't occur in the document body at all:

(screenshot)

(Opinionated) Line breaks are preserved in teasers (indentation is removed, because it's hard to get right).

All of these are the behavior of the "fallback" strategy, and they are currently only enabled if the book's book.language is set and is not English.


Sunshine40 commented Jun 5, 2024

How does "fallback" strategy work?

The main reason Chinese search didn't work before is a mixture of 2 facts:

  1. The search index building process of mdBook's HTML renderer used the space character as the only word separator and deliberately stripped everything other than the 26 English letters (case-insensitive) and the digits 0~9 (a sketch below illustrates this).
  2. Even if the search index were built with Chinese word splitting, one could not search for a series of words ("phrase searching") without adding spaces between them. And once spaces are added, the words are no longer required to appear one after another, which is rarely what users (at least Chinese users) want.

So I put my effort into implementing "phrase search", and with that no "Chinese word splitting" algorithm is needed.
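
To make point 1 concrete, here is a minimal sketch (my own illustration, not the actual mdBook or elasticlunr-rs code) of a tokenizer that splits on whitespace and keeps only ASCII alphanumerics, which is roughly the old behavior described above. Chinese text yields no tokens at all, so it can never be found:

```rust
/// Rough sketch of the old behavior described in point 1 (illustrative only):
/// split on whitespace, keep only ASCII letters and digits, lowercase.
fn naive_tokenize(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|word| {
            word.chars()
                .filter(|c| c.is_ascii_alphanumeric())
                .collect::<String>()
                .to_lowercase()
        })
        .filter(|token| !token.is_empty())
        .collect()
}

fn main() {
    // ASCII text tokenizes fine...
    assert_eq!(naive_tokenize("Hello, world!"), vec!["hello", "world"]);
    // ...but Chinese characters are stripped entirely, so a Chinese page
    // contributes nothing to the index and can never be found.
    assert!(naive_tokenize("安全抽象").is_empty());
}
```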

Tokenization

Text is divided into 4 categories:

  • Ideographs and similar characters (Han Characters, Hangul Syllables)
  • Emoji
  • Other symbols, punctuation, spaces (including line breaks), and any other characters that do not form words
  • Default (everything not covered by the above categories)

Ideographs

When indexed, each character is treated as a separate word.

When counted during teaser generation, 2 characters are counted as 1 word.

Ideographs such as Chinese characters come from such a large character set that even indexing single characters narrows the results down enough for further processing by the client-side JS.

Hangul syllables are technically not ideographs, but they share similar characteristics.

Emoji

Emoji Modifier Sequences and Zero Width Joiners are handled so that an emoji icon is treated as 1 word.
(This is currently not working due to upstream issues; see "Why doesn't Emoji work?" below.)
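
For illustration, here is a standalone Rust sketch of keeping an emoji ZWJ sequence together as one token via extended grapheme clusters. The unicode-segmentation crate used here is my assumption for the sketch, not necessarily what the patch uses:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // "Family" emoji: four person emoji joined by zero-width joiners (ZWJ).
    let family = "👨\u{200D}👩\u{200D}👧\u{200D}👦";

    // Iterating by Unicode scalar value sees 7 items (4 emoji + 3 ZWJs)...
    assert_eq!(family.chars().count(), 7);

    // ...while extended grapheme clusters keep it as one user-perceived
    // character, which is what we'd like to index and count as one word.
    assert_eq!(family.graphemes(true).count(), 1);
}
```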

Non Word

These are not indexed, and not counted as words during teaser generation.

Default

These are separated by anything that is not Default text, so 中文English混合 will be tokenized into

中, 文, English, 混, 合

even though there are no spaces between them.
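
A rough sketch of this category-based tokenization (illustration only: it omits the Emoji category, uses only approximate Unicode ranges, and is not the exact code from #2392):

```rust
enum Kind {
    Ideograph, // Han characters, Hangul syllables, ...
    Default,   // letters, digits, and other word-forming characters
    NonWord,   // punctuation, whitespace, other symbols
}

fn kind(c: char) -> Kind {
    // Approximation of the categories described above; a real implementation
    // would cover the full Unicode ranges.
    match c as u32 {
        0x4E00..=0x9FFF | 0xAC00..=0xD7A3 => Kind::Ideograph, // CJK ideographs, Hangul syllables
        _ if c.is_alphanumeric() => Kind::Default,
        _ => Kind::NonWord,
    }
}

fn tokenize(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        match kind(c) {
            // Each ideograph becomes its own token.
            Kind::Ideograph => {
                if !current.is_empty() {
                    tokens.push(std::mem::take(&mut current));
                }
                tokens.push(c.to_string());
            }
            // Default text accumulates until something non-Default ends it.
            Kind::Default => current.push(c),
            // Non-word characters only terminate the current token.
            Kind::NonWord => {
                if !current.is_empty() {
                    tokens.push(std::mem::take(&mut current));
                }
            }
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

fn main() {
    assert_eq!(tokenize("中文English混合"), vec!["中", "文", "English", "混", "合"]);
}
```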

Teaser generation

This is a huge pain so I'll make it brief.

Filtering

For "phrase searching", one thing worth noting is that the result returned by elasticlunr might not be valid, for example with keyword 安全抽象 (auto-splitted into 安, 全, 抽, 象) would match:

……部……取……平

But it's obviously not what we want.

So the new search strategy only uses elasticlunr as a first pass of filtering, and then applies an extra regex-based filter to the returned results.
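
The second pass can be pictured roughly like this. This is a Rust sketch for illustration only (the helper name and the strict-adjacency rule are my own simplifications), and in mdBook the actual filtering runs in the browser-side JS:

```rust
use regex::Regex;

/// Sketch: after elasticlunr returns candidate documents, verify that the
/// phrase's characters actually occur in order and adjacent before keeping
/// the result. Illustrative only.
fn phrase_matches(body: &str, phrase: &str) -> bool {
    let pattern: String = phrase.chars().map(|c| regex::escape(&c.to_string())).collect();
    Regex::new(&pattern).map(|re| re.is_match(body)).unwrap_or(false)
}

fn main() {
    // Candidate where 安, 全, 抽, 象 all appear, but scattered: rejected.
    assert!(!phrase_matches("安保很全面，抽到了大象", "安全抽象"));
    // Candidate containing the actual phrase: kept.
    assert!(phrase_matches("这是一个安全抽象的例子", "安全抽象"));
}
```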

Highlighting

Only the above-mentioned Default text is highlighted as whole words; ideograph-like and non-word text is highlighted as-is.
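
A sketch of that distinction (illustrative only, not the theme's actual highlighting code): Default tokens get word boundaries, everything else is matched literally:

```rust
use regex::Regex;

/// Build a highlight pattern: Default tokens are matched as whole words,
/// ideograph-like and non-word tokens are matched exactly as they appear.
fn highlight_pattern(tokens: &[&str]) -> String {
    tokens
        .iter()
        .map(|t| {
            let escaped = regex::escape(t);
            if t.chars().all(|c| c.is_ascii_alphanumeric()) {
                format!(r"\b{}\b", escaped) // whole-word match for Default text
            } else {
                escaped // ideographs and non-word text matched as-is
            }
        })
        .collect::<Vec<_>>()
        .join("|")
}

fn main() {
    let re = Regex::new(&highlight_pattern(&["rust", "安", "全"])).unwrap();
    // "rust" only matches as a whole word, so "trusted" is not highlighted...
    assert!(!re.is_match("trusted"));
    // ...while the ideographs are highlighted wherever they occur.
    assert!(re.is_match("rust 是安全的"));
}
```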

Display range selection

One key difference from the existing implementation is that the range might not be contiguous; there might be …… in the middle of it.

Another difference, which is opinionated, is that I chose to force the teaser to include each matched keyword at least once. So with the keyword hello world, the following document body:

Hello hello and hello (10000 words omitted) hello (10000 words omitted) world. (10000 words omitted) world better.

would get a teaser like:

Hello hello and hello (14 words omitted) …… (11 words omitted) world.……

By the way, I decided it's quite pointless to show only half of a clause (a sentence is composed of clauses) containing a highlighted keyword. As a result, if the book's output.html.search.teaser-word-count is too small and there are too many matched keywords, the limit might be exceeded.
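
To make the non-contiguous selection concrete, here is a rough sketch of the range-picking step (illustration only, operating on a pre-split word list; the real logic lives in the theme's JS and also handles clause boundaries, highlighting, and the word budget):

```rust
/// Sketch: take a small window around the first occurrence of each keyword,
/// merge windows that touch or overlap, and join the remaining gaps with "……".
fn teaser(words: &[&str], keywords: &[&str], context: usize) -> String {
    // One window per keyword, so every matched keyword shows up at least once.
    let mut ranges: Vec<(usize, usize)> = keywords
        .iter()
        .filter_map(|kw| words.iter().position(|w| w.eq_ignore_ascii_case(kw)))
        .map(|i| (i.saturating_sub(context), (i + context + 1).min(words.len())))
        .collect();
    ranges.sort();

    // Merge overlapping or adjacent windows into contiguous ranges.
    let mut merged: Vec<(usize, usize)> = Vec::new();
    for (start, end) in ranges {
        match merged.last_mut() {
            Some((_, last_end)) if start <= *last_end => *last_end = (*last_end).max(end),
            _ => merged.push((start, end)),
        }
    }

    // Join the selected slices, marking the omitted gaps.
    merged
        .iter()
        .map(|&(s, e)| words[s..e].join(" "))
        .collect::<Vec<_>>()
        .join(" …… ")
}

fn main() {
    let body = "Hello hello and hello aaa bbb ccc ddd eee fff world better";
    let words: Vec<&str> = body.split_whitespace().collect();
    // Prints: "Hello hello and …… eee fff world better"
    println!("{}", teaser(&words, &["hello", "world"], 2));
}
```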


Sunshine40 commented Jun 5, 2024

Why doesn't Emoji work?

mdBook uses elasticlunr-rs to generate the search index, which is then consumed by elasticlunr.js.

But there's a flaw in elasticlunr-rs's implementation which makes it inconsistent with the original JS library:

elasticlunr-rs/src/inverted_index.rs
Lines 40 to 42 in 29d97e4

let mut iter = token.chars();
if let Some(character) = iter.next() {
    let mut item = self

During index building, elasticlunr-rs iterates over the token &str's content as Unicode scalar values (Rust chars).

The JS library, on the other hand, does it this way:

elasticlunr.InvertedIndex.prototype.addToken = function (token, tokenInfo, root) {
  var root = root || this.root,
      idx = 0;

  while (idx <= token.length - 1) {
    var key = token[idx];

The JS string is iterated in UTF-16 code units, which correspond to whole characters for English, most alphabetic text, and common Chinese characters, but not for emoji and rare Chinese characters, which require surrogate pairs.

And currently mdBook cannot handle these, with or without my patch.
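
The mismatch is easy to demonstrate with a standalone Rust sketch (not code from either library):

```rust
fn main() {
    let token = "🦀"; // U+1F980, outside the Basic Multilingual Plane

    // elasticlunr-rs walks the token as Unicode scalar values: one step.
    assert_eq!(token.chars().count(), 1);

    // elasticlunr.js walks the same token as UTF-16 code units: two steps
    // (a surrogate pair), so the two sides index the token differently.
    assert_eq!(token.encode_utf16().count(), 2);

    // A common Chinese character sits in the BMP, so both sides agree on it.
    assert_eq!("中".chars().count(), "中".encode_utf16().count());
}
```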

@Sunshine40

@ehuss what do you think about these design choices?

fmease added a commit to fmease/rust that referenced this issue Jun 8, 2024
…ish, r=notriddle

Make html rendered by rustdoc allow searching non-English identifier / alias

Fix alias search result showing `undefined` description.

Inspired by rust-lang/mdBook#2393 .

Not sure if it's worth it adding full-text search functionality to rustdoc rendered html.
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Jun 8, 2024
Rollup merge of rust-lang#126057 - Sunshine40:rustdoc-search-non-english, r=notriddle

flip1995 pushed a commit to flip1995/rust-clippy that referenced this issue Jun 28, 2024