
trimmer #12

Closed
arve0 opened this issue Jun 29, 2015 · 4 comments

Comments

@arve0
Contributor

arve0 commented Jun 29, 2015

When doing

idx.use(lunr.de)

lunr.trimmer is removed from the pipeline, so words containing punctuation and the like enter the index. E.g., both "word." and "word" end up in the index.

Adding lunr.trimmer back to the pipeline manually is not really a good solution, as lunr.trimmer uses \W to match non-word characters (Unicode support in regexes is only available as of ES6).
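For illustration, here is what a Unicode-aware trimmer could look like using Unicode property escapes (supported in modern JavaScript engines). The name `unicodeTrimmer` is hypothetical, not part of lunr's API:

```javascript
// Hypothetical Unicode-aware replacement for lunr.trimmer.
// \p{L} matches any Unicode letter and \p{N} any Unicode number, so
// accented characters like æ, ø, å are kept while leading/trailing
// punctuation is stripped. Requires the /u flag.
function unicodeTrimmer(token) {
  return token
    .replace(/^[^\p{L}\p{N}]+/u, '')
    .replace(/[^\p{L}\p{N}]+$/u, '');
}

console.log(unicodeTrimmer('"word."')); // → word
console.log(unicodeTrimmer('år,'));     // → år
```

Unlike \W, this keeps non-ASCII letters intact, so "word." and "word" would map to the same token without mangling Norwegian or Spanish words.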

A solution could be to normalize characters like æøå -> aoa, as done here: https://github.com/cvan/lunr-unicode-normalizer/blob/master/lunr.unicodeNormalizer.js
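A minimal sketch of that kind of per-character normalization (the map below and the name `normalizeToken` are illustrative; the linked lunr.unicodeNormalizer uses a much larger lookup table):

```javascript
// Tiny illustrative character map; a real normalizer covers far more
// of the Unicode range.
const charMap = { 'æ': 'a', 'ø': 'o', 'å': 'a', 'é': 'e', 'ñ': 'n', 'ó': 'o' };

function normalizeToken(token) {
  return token
    .split('')
    .map(function (c) { return charMap[c] || c; })
    .join('');
}

console.log(normalizeToken('æøå')); // → aoa
```

Run as a pipeline step before stemming, this would make "jubilación" and "jubilacion" index to the same token.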

Thoughts?

@matiasgarciaisaia

I've made a sample page for this. Using Spanish, if you search for jubilación or jubilacion (a misspelled version of the former), lunr gives different results - something that shouldn't really happen, given that lunr is a full-text search engine.

We've discussed this a bit in manastech/middleman-search#23 (that's where the example comes from), and I think this should be solved by lunr-languages rather than the user having to load lunr.unicodeNormalizer themselves.

Whether lunr-languages should load lunr.unicodeNormalizer or do something different, I'm not sure. But if I'm enabling Spanish full-text search, I definitely want accented words to yield the exact same results as the non-accented version of the word.

I can totally try to fix lunr-languages if you give me some pointers on how to do it. It's just that I'm not sure where/how I should do it.

I'm pretty sure @eemi wants to know about this issue.

@drzraf

drzraf commented Apr 28, 2017

About handling accents, see fortnightlabs/snowball-js#2

@drzraf

drzraf commented Jun 10, 2017

And back to snowballstem/snowball#55

@saawsan

saawsan commented Mar 14, 2019

Hi, any news on this issue?

I'm currently working on an offline & multi-language search client with pouchdb-quick-search, and I'm facing the same limitations.

> But if I'm enabling Spanish full-text search, I definitely want accented words to yield the exact same results as the non-accented version of the word.

I completely agree with @matiasgarciaisaia.
Ignoring all diacritical marks (à, ñ, ç, é, ...) would greatly improve the relevancy of the results.

Right now, the only workaround I can think of is to strip all diacritical marks before indexing the data.
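As a sketch of that workaround (a generic helper, not part of pouchdb-quick-search or lunr): decompose strings to NFD and drop the combining marks. Note that letters like æ and ø don't decompose this way, so they would still need a character map on top of this.

```javascript
// Strip combining diacritical marks by decomposing to NFD first.
// U+0300–U+036F is the Combining Diacritical Marks block.
function stripDiacritics(s) {
  return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(stripDiacritics('jubilación')); // → jubilacion
console.log(stripDiacritics('à ñ ç é'));    // → a n c e
```

Applying this to both the documents before indexing and the query before searching keeps the accented and non-accented forms in sync.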

@arve0 arve0 closed this as completed Apr 5, 2020