Search: Accented characters, ampersands, negative numbers and other special characters #820

Closed
kcondon opened this issue Aug 12, 2014 · 10 comments · Fixed by #7378

@kcondon
Contributor

kcondon commented Aug 12, 2014

This was suggested by Eleni during discussions around search issues with accented characters.

The suggestion is to allow searching both with the original accented characters and without them, for cases where a user does not have access to a keyboard with accented characters.

@kcondon kcondon added this to the In Review - Dataverse 4.0 milestone Aug 12, 2014
@pdurbin
Member

pdurbin commented Aug 14, 2014

See also #818 (comment) and the search internationalization ticket at #326.

@pdurbin pdurbin modified the milestones: Beta 10 - Dataverse 4.0, In Review - Dataverse 4.0 Nov 4, 2014
@scolapasta scolapasta modified the milestones: In Review - Long Term, In Review - Short Term May 8, 2015
@pdurbin
Member

pdurbin commented Nov 12, 2015

@mheppler pointed out an interesting answer at http://stackoverflow.com/questions/16627062/not-able-to-search-spanish-word-with-accent-in-solr/20657529#20657529

Here it is in full (note downsides though):

You can try using the ASCIIFoldingFilterFactory filter.

It converts accented characters into their unaccented counterparts.
Put this in your schema.xml:

<filter class="solr.ASCIIFoldingFilterFactory"/>

Note: The downside is that words like "cañon" and "canon" are now equivalent and both hit the same documents IIRC.
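
One way to soften that downside, sketched under assumptions: fold accents only at index time while also keeping the original token, and leave the query side unfolded. The field type name `text_folded_example` and the tokenizer and filter chain below are hypothetical and are not taken from the Dataverse schema; `preserveOriginal` is an attribute of `solr.ASCIIFoldingFilterFactory`.

```xml
<!-- Hypothetical field type; the name and the filter chain are assumptions for illustration. -->
<fieldType name="text_folded_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index both the folded token ("canon") and the original ("cañon"). -->
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- No folding here: the query term is matched as typed. -->
  </analyzer>
</fieldType>
```

With a chain like this, a query for "canon" should still find documents containing "cañon", while a query for "cañon" should no longer match documents that only contain "canon". The trade-off is a larger index, and existing content has to be reindexed after the schema change.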

@pdurbin pdurbin removed their assignment Jan 21, 2016
@scolapasta scolapasta removed this from the Not Assigned to a Release milestone Jan 28, 2016
@pdurbin pdurbin removed the zTriaged label Jun 30, 2017
@pdurbin pdurbin added the "User Role: Guest" label and removed the "zEffort 1: Small" label Jul 12, 2017
@mheppler mheppler added the "Type: Bug" label and removed the "UX & UI: Design", "Type: Feature", and "User Role: Guest" labels Feb 3, 2020
@mheppler mheppler changed the title from "Search: Support searching on accented words by typing unaccented characters instead." to "Search: Accented characters, ampersands, negative numbers and other special characters" Feb 3, 2020
@mheppler
Contributor

mheppler commented Feb 3, 2020

Dusted off this oldie but goodie to be a representative issue for many other search bugs/feature requests. I have closed the following issues, to consolidate them here, and moved over any pertinent information in their comments.

@pdurbin
Member

pdurbin commented Feb 4, 2020

@mheppler I'm still haunted by the bug report at #1928 (comment) that a search for "Experiment" found datasets with "Experience" but I haven't tested lately. That was nearly five years ago. 😄

@BPeuch
Contributor

BPeuch commented Oct 12, 2020

We here in Belgium second this, as our three official languages – Dutch, French and German – all contain special characters (á, à, â, ç, é, è, ê, ë, í, ó, ö, ú, ù, û, ü…).

Here is an illustration of how accented characters can hinder the search for / the discovery of datasets:

[Image: freq1]

[Image: freq2]

@pdurbin
Member

pdurbin commented Oct 13, 2020

@BPeuch in a report from a French installation, switching from text_en to text_fr seems to have helped: https://groups.google.com/g/dataverse-community/c/9sjpBpPRuFk/m/uxH2KKJnAQAJ

Since you have three official languages, however, I'm not sure if this will work for you.

@BPeuch
Contributor

BPeuch commented Oct 13, 2020

That's valuable information still. Thank you, @pdurbin!

I fear that it might indeed not work, because some characters are specific to some of these languages (e.g. á and ó for Dutch). It could be worth a try though.

@qqmyers
Member

qqmyers commented Oct 13, 2020

FWIW: For QDR, we addressed some of this with changes in the solr schema.xml leveraging some filters:
solr.WordDelimiterGraphFilterFactory
solr.ASCIIFoldingFilterFactory
solr.PatternReplaceFilterFactory

I can dig up that code if it's helpful. I don't know much about solr, so what we did was mostly cut and paste from sources I found on the web; you might be better off searching for the latest on this issue, perhaps including some of those filter names. (Our interest was primarily in handling characters from other languages and contractions, so it may not be as general as others might want.)

One thing I think is helpful to convey, though: I think this can be solved or significantly improved with solr changes alone, without requiring Dataverse code changes. So looking for answers related to solr, or for solr expertise at our institutions, might be a good approach.

@qqmyers
Member

qqmyers commented Oct 29, 2020

FWIW: QDR's solution can be seen in https://github.com/QualitativeDataRepository/dataverse/blame/develop/conf/solr/7.7.2/schema.xml - see the qqmyers changes starting from https://github.com/QualitativeDataRepository/dataverse/blame/develop/conf/solr/7.7.2/schema.xml#L573.
The main thing for non-ASCII characters was to enable the solr.ASCIIFoldingFilterFactory filter during indexing and querying. We also used the solr.PatternReplaceFilterFactory filter to try to recognize contractions (e.g. qu'est que) - you can see from the pattern there that this is fairly limited. The use of the solr.ASCIIFoldingFilterFactory is probably generic and could be pulled into a separate PR if there's interest in having it as the default for the community.

(You'll also see some modifications to the solr.WordDelimiterGraphFilterFactory params - I think those are related to file names containing numbers - basically an unrelated but also potentially useful change).
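
A rough sketch of the approach described above, not QDR's actual configuration (see the linked schema.xml for that): the folding filter runs at both index and query time, and a pattern filter strips a leading elided word from tokens such as "qu'est". The field type name and the regular expression are assumptions made up for this example.

```xml
<!-- Hypothetical field type; not copied from the QDR schema. -->
<fieldType name="text_folded_both_example" class="solr.TextField" positionIncrementGap="100">
  <!-- An <analyzer> without a type attribute is used for both indexing and querying. -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumed pattern: drop a short elided prefix, e.g. "qu'est" becomes "est". -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^(qu|[cdjlmnst])['’]" replacement="" replace="first"/>
    <!-- Fold accented characters, e.g. "é" becomes "e". -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the same chain is applied on both sides, accented and unaccented spellings of a word end up as the same indexed term, which is the behaviour requested in this issue and also the source of the "cañon"/"canon" collision noted earlier in the thread.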

@poikilotherm
Contributor

This might be linked to #6675 and #7375.
