Search: Accented characters, ampersands, negative numbers and other special characters #820
Comments
See also #818 (comment) and the search internationalization ticket at #326.
@mheppler pointed out an interesting answer at http://stackoverflow.com/questions/16627062/not-able-to-search-spanish-word-with-accent-in-solr/20657529#20657529
Here it is in full (note the downsides, though):
You can try using the ASCIIFoldingFilterFactory filter. It converts accented characters into their unaccented counterparts.
Note: The downside is that words like "cañon" and "canon" are now equivalent and both hit the same documents IIRC.
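To make the downside concrete, here is a minimal Python sketch of accent folding. It is only an approximation of what Solr's ASCIIFoldingFilterFactory does: it strips combining marks after Unicode decomposition, whereas the Solr filter uses its own, broader mapping table. The function name `fold_accents` is made up for this illustration.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Approximate accent folding: decompose, then drop combining marks.

    Hypothetical helper for illustration; not Solr's actual algorithm.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Both spellings collapse to the same token, so a folded index
# can no longer tell them apart.
print(fold_accents("cañon"))  # canon
print(fold_accents("cañon") == fold_accents("canon"))  # True
```

Once both the indexed text and the query go through a step like this, "cañon" and "canon" are the same term as far as the index is concerned, which is exactly the trade-off the answer warns about.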
Dusted off this oldie but goodie to be a representative issue for many other search bugs/feature requests. I have closed the following issues, to consolidate them here, and moved over any pertinent information in their comments.
@mheppler I'm still haunted by the bug report at #1928 (comment) that a search for "Experiment" found datasets with "Experience" but I haven't tested lately. That was nearly five years ago. 😄
We here in Belgium second this, as our three official languages (Dutch, French and German) all contain special characters (á, à, â, ç, é, è, ê, ë, í, ó, ö, ú, ù, û, ü…). Here is an illustration of how accented characters can hinder the search for / the discovery of datasets:
@BPeuch in a report from a French installation, switching from text_en to text_fr seems to have helped: https://groups.google.com/g/dataverse-community/c/9sjpBpPRuFk/m/uxH2KKJnAQAJ
Since you have three official languages, however, I'm not sure if this will work for you.
That's valuable information still. Thank you, @pdurbin! I fear that indeed it might not work.
FWIW: For QDR, we addressed some of this with changes in the Solr schema.xml, leveraging some filters. I can dig up that code if it's helpful. I don't know much about Solr, so what we did was mostly cut/paste from sources I found on the web; you might be better off searching for the latest on this issue, perhaps including some of those filter names. (Our interest was primarily in handling characters from other languages and contractions, so it may not be as general as others might want.) One thing I think is helpful to convey, though: I think this can be solved or significantly improved just by making Solr changes, versus requiring Dataverse code changes. So looking for answers related to Solr, or for Solr expertise at our institutions, might be a good approach.
FWIW: QDR's solution can be seen in https://github.com/QualitativeDataRepository/dataverse/blame/develop/conf/solr/7.7.2/schema.xml (see the qqmyers changes starting from https://github.com/QualitativeDataRepository/dataverse/blame/develop/conf/solr/7.7.2/schema.xml#L573). You'll also see some modifications to the solr.WordDelimiterGraphFilterFactory params. I think those are related to file names containing numbers: basically an unrelated but also potentially useful change.
This was suggested by Eleni during discussions around search issues with accented characters.
The suggestion is to allow searching both with the original accented characters and without them, for cases where a user may not have access to a keyboard with accented characters, etc.
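One way to picture this suggestion is a toy in-memory index that stores each token under both its original and its accent-folded form. This is a sketch under stated assumptions, not how Dataverse or Solr implement search: `fold_accents`, `build_index`, `search`, and the sample documents are all hypothetical, and the folding is a rough stand-in for a real analyzer. In Solr terms it loosely resembles folding at index time while keeping the original token.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Set


def fold_accents(text: str) -> str:
    # Hypothetical helper: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Index every lowercased token under its original and folded form."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)                # accented form, as written
            index[fold_accents(token)].add(doc_id)  # unaccented form
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Look the query up exactly as typed (lowercased), without folding it."""
    return set(index.get(query.lower(), set()))


# Made-up sample documents for illustration only.
docs = {
    "d1": "Gran cañon",
    "d2": "Biblical canon",
    "d3": "Études économiques",
}
index = build_index(docs)

print(sorted(search(index, "canon")))   # ['d1', 'd2']
print(sorted(search(index, "cañon")))   # ['d1']
print(sorted(search(index, "etudes")))  # ['d3']
```

With this design, a user without an accented keyboard still finds "cañon" by typing "canon", while a user who does type the accent gets only the accented matches. The cost is a larger index, since most accented tokens are stored twice.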