Support Case Insensitive search for foreign characters in wildcard field type #95120

Open
jypan0115 opened this issue Apr 11, 2023 · 10 comments
Labels
>enhancement, :Search Relevance/Search, Team:Search Relevance

Comments

@jypan0115

Description

In the `toCaseInsensitiveChar` function in server/src/main/java/org/elasticsearch/common/lucene/search/AutomatonQueries.java, case-insensitive matching currently works only for ASCII characters. Is there a reason it does not support non-ASCII characters, such as those used in Vietnamese? This is inconsistent with the keyword field type.

@jypan0115 added the >enhancement and needs:triage labels Apr 11, 2023
@astefan added the :Search/Search label and removed the needs:triage label Apr 11, 2023
@elasticsearchmachine added the Team:Search label Apr 11, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@cbuescher
Member

Comment from this closed issue:

If we store it as a keyword field and use a case-insensitive term query to search for Ngô Đức (uppercase) or ngô đức (lowercase), it works fine. But if it is stored as a wildcard field, the same case-insensitive term query fails.

@cbuescher
Member

@jypan0115 as you pointed out in #61596 (comment) we don't apply the logic that handles case-insensitivity for codepoints outside the ASCII range. The reason for this currently isn't clear to me, we'd need to do some digging here to check for the reason to do so before removing this limitation, but it makes sense to me that we should also support characters outside the ASCII range if possible.

@jypan0115
Author

> @jypan0115 as you pointed out in #61596 (comment) we don't apply the logic that handles case-insensitivity for codepoints outside the ASCII range. The reason for this currently isn't clear to me, we'd need to do some digging here to check for the reason to do so before removing this limitation, but it makes sense to me that we should also support characters outside the ASCII range if possible.

@cbuescher Kindly checking any updates here?

@cbuescher
Member

Just for reference, the choice to limit the case_insensitive option at the time of introduction was a deliberate one. While digging a bit further to find the history around this decision, I found this discussion on a related Lucene PR that adds a case insensitivity flag to the RegExp automaton class there. At first glance, it seems there is no straightforward 1:1 mapping between lower and upper case letters for Unicode in general. While there might be such an unambiguous mapping in a lot of cases, this needs more careful investigation and discussion.
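
To make the "no 1:1 mapping" point concrete, here is a minimal sketch using only standard JDK calls. It is not Elasticsearch or Lucene code; the class name `UnicodeCaseMappingDemo` and the helper `naiveAltCase` are hypothetical names chosen for illustration. It shows three well-known cases where toggling the case of a single code point does not cover every variant a user might expect to match.

```java
import java.util.Locale;

// Illustrative only: plain JDK calls, not Elasticsearch or Lucene code.
class UnicodeCaseMappingDemo {

    // Hypothetical helper: toggle the case of one code point, assuming a 1:1 mapping.
    static int naiveAltCase(int codepoint) {
        return Character.isLowerCase(codepoint)
            ? Character.toUpperCase(codepoint)
            : Character.toLowerCase(codepoint);
    }

    public static void main(String[] args) {
        // German sharp s (U+00DF): the full uppercase mapping is the two-letter
        // string "SS", so the single-code-point API returns the input unchanged.
        System.out.println("\u00DF".toUpperCase(Locale.ROOT));
        System.out.println(naiveAltCase(0x00DF) == 0x00DF);

        // Greek: both lowercase sigmas (U+03C3 and final U+03C2) uppercase to
        // U+03A3, but U+03A3 lowercases only to U+03C3. Toggling the capital
        // letter therefore never produces the final sigma.
        System.out.println(naiveAltCase(0x03C2) == 0x03A3);
        System.out.println(naiveAltCase(0x03A3) == 0x03C3);

        // KELVIN SIGN (U+212A) lowercases to ASCII 'k', but 'k' uppercases to
        // ASCII 'K', so toggling 'k' never reaches U+212A.
        System.out.println(Character.toLowerCase(0x212A) == 'k');
        System.out.println(naiveAltCase('k') == 'K');
    }
}
```

An automaton that accepts only a code point and its single toggled form would miss these variants, which is presumably part of what needs the careful investigation mentioned above.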

@jypan0115
Author

@cbuescher In the `toCaseInsensitiveChar` function in server/src/main/java/org/elasticsearch/common/lucene/search/AutomatonQueries.java, if the ASCII-only guard `if (codepoint > 128) { return case1; }` is removed, Unicode characters will be handled by this line: `int altCase = Character.isLowerCase(codepoint) ? Character.toUpperCase(codepoint) : Character.toLowerCase(codepoint);`. Character.isLowerCase(), Character.toUpperCase(), and Character.toLowerCase() can all handle Unicode code points.
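
A rough sketch of what that change would mean for the characters in "Ngô Đức", using only JDK calls. The class `AltCaseSketch`, the method `altCase`, and the `asciiOnly` flag are hypothetical and exist only to compare the two behaviours side by side; this is not the actual `toCaseInsensitiveChar` implementation, which builds automata rather than returning code points.

```java
// Illustrative only: compares the alternate-case lookup with and without
// the ASCII guard quoted in this comment. Not the Elasticsearch code.
class AltCaseSketch {

    // Hypothetical helper: returns the other-case form of a code point, or the
    // code point itself when the guard applies or no simple mapping exists.
    static int altCase(int codepoint, boolean asciiOnly) {
        if (asciiOnly && codepoint > 128) {
            return codepoint; // mirrors the guard: no alternate case is considered
        }
        return Character.isLowerCase(codepoint)
            ? Character.toUpperCase(codepoint)
            : Character.toLowerCase(codepoint);
    }

    public static void main(String[] args) {
        // 'N', U+00F4 (o with circumflex), U+0110 (capital D with stroke),
        // U+1EE9 (small u with horn and acute)
        int[] samples = { 'N', 0x00F4, 0x0110, 0x1EE9 };
        for (int cp : samples) {
            System.out.printf("U+%04X  ascii-only: U+%04X  unrestricted: U+%04X%n",
                cp, altCase(cp, true), altCase(cp, false));
        }
    }
}
```

With the guard in place, only `N` gets an alternate form; without it, the three non-ASCII letters also map to their other case (U+00D4, U+0111, U+1EE8). This covers the simple one-to-one cases only and does not address characters whose case mapping is not one-to-one.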

@jypan0115
Author

@cbuescher @javanna Kindly checking any updates here?

@nemphys

nemphys commented May 10, 2023

+1 on this one. I was trying to figure out why I was not getting the expected results when using the new case_insensitive setting (on Greek strings) until I stumbled upon this issue.

@jypan0115
Author

@cbuescher @javanna Kindly checking any updates here?

@benwtrent added the :Search Relevance/Search label and removed the :Search/Search label Jul 12, 2024
@elasticsearchmachine added the Team:Search Relevance label and removed the Team:Search label Jul 12, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)


7 participants