
trimmer #12

Closed
arve0 opened this issue Jun 29, 2015 · 4 comments

Comments

@arve0
Contributor

arve0 commented Jun 29, 2015

When doing

idx.use(lunr.de)

lunr.trimmer is removed from the pipeline, so words containing punctuation and the like enter the index. E.g., both "word." and "word" end up in the index.

Adding lunr.trimmer back to the pipeline manually is not really a good solution, as lunr.trimmer uses \W to match non-word characters (Unicode support in regexes is only available as of ES6).
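For illustration, here is what a Unicode-aware trimmer could look like using Unicode property escapes (supported in modern JavaScript engines). The name `unicodeTrimmer` is hypothetical, not part of lunr's API:

```javascript
// Hypothetical Unicode-aware replacement for lunr.trimmer.
// \p{L} matches any Unicode letter and \p{N} any Unicode number, so
// accented characters like æ, ø, å are kept while leading/trailing
// punctuation is stripped. Requires the /u flag.
function unicodeTrimmer(token) {
  return token
    .replace(/^[^\p{L}\p{N}]+/u, '')
    .replace(/[^\p{L}\p{N}]+$/u, '');
}

console.log(unicodeTrimmer('"word."')); // → word
console.log(unicodeTrimmer('år,'));     // → år
```

Unlike \W, this keeps non-ASCII letters intact, so "word." and "word" would map to the same token without mangling Norwegian or Spanish words.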

A solution could be to normalize characters like æøå -> aoa, as done here: https://github.com/cvan/lunr-unicode-normalizer/blob/master/lunr.unicodeNormalizer.js
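A minimal sketch of that kind of per-character normalization (the map below and the name `normalizeToken` are illustrative; the linked lunr.unicodeNormalizer uses a much larger lookup table):

```javascript
// Tiny illustrative character map; a real normalizer covers far more
// of the Unicode range.
const charMap = { 'æ': 'a', 'ø': 'o', 'å': 'a', 'é': 'e', 'ñ': 'n', 'ó': 'o' };

function normalizeToken(token) {
  return token
    .split('')
    .map(function (c) { return charMap[c] || c; })
    .join('');
}

console.log(normalizeToken('æøå')); // → aoa
```

Run as a pipeline step before stemming, this would make "jubilación" and "jubilacion" index to the same token.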

Thoughts?

@matiasgarciaisaia

I've made a sample page for this. Using Spanish, if you search for jubilación or jubilacion (a misspelled version of the former), lunr gives different results - something that shouldn't really happen, given that lunr is a full-text search engine.

We've discussed this a bit in manastech/middleman-search#23 (that's where the example comes from), and I think this should be solved by lunr-languages rather than the user having to load lunr.unicodeNormalizer themselves.

Whether lunr-languages should load lunr.unicodeNormalizer or do something different, I'm not sure. But if I'm enabling Spanish full-text search, I definitely want accented words to yield the exact same results as the non-accented version of the word.

I can totally try to fix lunr-languages if you give me some pointers on how to do it. It's just that I'm not sure where/how I should do it.

I'm pretty sure @eemi wants to know about this issue.

@drzraf

drzraf commented Apr 28, 2017

About handling accents, see fortnightlabs/snowball-js#2

@drzraf

drzraf commented Jun 10, 2017

And back to snowballstem/snowball#55

@saawsan

saawsan commented Mar 14, 2019

Hi, any news on this issue?

I'm currently working on an offline & multi-language search client with pouchdb-quick-search, and I'm facing the same limitations.

> But if I'm enabling Spanish full-text search, I definitely want accented words to yield the exact same results as the non-accented version of the word.

I completely agree with @matiasgarciaisaia.
Ignoring all diacritical marks (à, ñ, ç, é, ...) would greatly improve the relevancy of the results.

Right now, the only workaround I can think of is to strip all diacritical marks before indexing the data.
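As a sketch of that workaround (a generic helper, not part of pouchdb-quick-search or lunr): decompose strings to NFD and drop the combining marks. Note that letters like æ and ø don't decompose this way, so they would still need a character map on top of this.

```javascript
// Strip combining diacritical marks by decomposing to NFD first.
// U+0300–U+036F is the Combining Diacritical Marks block.
function stripDiacritics(s) {
  return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(stripDiacritics('jubilación')); // → jubilacion
console.log(stripDiacritics('à ñ ç é'));    // → a n c e
```

Applying this to both the documents before indexing and the query before searching keeps the accented and non-accented forms in sync.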

@arve0 arve0 closed this as completed Apr 5, 2020