Support case insensitive search on new wildcard field and keyword #53603
Pinging @elastic/es-search (:Search/Mapping)
Maybe there's a third solution.
I opened a PR for this third option. #53814
From ECS' POV it's important to preserve the ability to query case sensitively, as well as to offer the ability to query case insensitively. Both are important, but my understanding is that case insensitivity is the more important of the two, especially on a fuzzy kind of search like wildcard. Currently in ECS most fields are `keyword`, and if we add wildcard fields to the mix they would likely be added as another multi-field. If the above is correct, I would like to suggest we flip the behaviours around instead. By default, I think most people would want case insensitive search on a wildcard field - wildcards are already a fuzzy search. Case sensitivity would then be requested explicitly, only when it's needed. In other words, can we make case insensitive the default and case sensitive the opt-in?
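For concreteness, here is a minimal sketch of what such an ECS-style multi-field mapping could look like. The index name, field names, and the `wildcard` sub-field in particular are illustrative assumptions, not an ECS decision:

```json
PUT logs-endpoint
{
  "mappings": {
    "properties": {
      "process": {
        "properties": {
          "command_line": {
            "type": "keyword",
            "fields": {
              "text": { "type": "text" },
              "wildcard": { "type": "wildcard" }
            }
          }
        }
      }
    }
  }
}
```

A query would then pick the behaviour by targeting `process.command_line`, `process.command_line.text`, or `process.command_line.wildcard`.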
@rw-access was telling me Endgame only supports case insensitive search, and additional filtering or analysis is done afterwards if case is important. And actually, based on what Ross was telling me, perhaps we could even consider making case insensitive the only behaviour. Ping @neu5ron.
I think the differences come down to:
Caveat 1: somewhat slower, as doc values are retrieved from compressed blocks of 32
That would be faster to search. There's an overhead in my option 3 from converting stored mixed-case values to lower-case at query time. While it minimises the disk storage required to support both case-sensitive and case-insensitive search, the better trade-off might be to just make the field fast for the primary use case (case-insensitive).
I like the fact that the
I opened #53851 to add the normalizer support. There's no normalisation by default, but I would like to make things easier for the simple case of users wanting case-insensitive search. Having to declare an
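For reference, a minimal sketch of the boilerplate currently needed to get a case-insensitive `keyword` field - the index and field names here are just examples:

```json
PUT case-insensitive-example
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "file_path": {
        "type": "keyword",
        "normalizer": "lowercase_normalizer"
      }
    }
  }
}
```

Presumably the appeal of a simpler option is collapsing the `settings` half of this into a single mapping parameter.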
Hey all, co-author of the ugly regex blog here 🙃 I like the proposed solution of making keyword and wildcard case insensitive. Regarding storage - I believe solving a visibility gap is critical, and the side effect of increased storage is an acceptable downside. Also:
There should still be a good analyzed field similar to the text analyzer - maybe even an improved one for security use cases. Because if we don't have an analyzed field, then we miss out on a lot of the additional powers of Lucene, like the fuzzy query and the newer terms query.
Let me phrase that as a more direct question :-) If `wildcard` fields answer the same query types as `keyword`, can `wildcard` act as a drop-in replacement for `keyword`? Also, can we do aggregations on `wildcard` fields? Understanding whether we can replace `keyword` with `wildcard` is important for how ECS adopts it.
I think the gist is we need to have `keyword` and to have `wildcard` lowercase. If the standard is set, implemented in Beats or whatnot, and communicated, then we should not have to worry about such overlap of fields - right @webmat?
All field types are expected to respond to requests for the same set of query types (prefix/term/wildcard/range/fuzzy etc.) - see my feature comparison table here. I'm also working on a blog for the 7.7 release to help expand on choosing between field types now that we have wildcard in the mix.
@markharwood Actually I like your solution 2. It is simple, does not increase storage requirements, and I agree with you that false positives shouldn't increase dramatically. I wouldn't mind adding a query-time case-sensitivity option either. I agree with you that case-insensitive search is important. Among all the ways that content may be normalized, I believe that case folding is a bit special.
Yes, my assumption is that it should be optimising for string equivalence as determined by machines rather than any sloppier equivalence that might be acceptable to humans. In other words, the stricter set of normalization rules that are permitted by the OS when referring to files (so I doubt accent removal is required).
Greetings All - I have been following this thread closely, as it directly impacts the project I work on (@Security-Onion-Solutions). At this point, have any final decisions been made about how this should be handled?
We're still deciding. The options and their pros/cons are outlined here.
It's also possible that option 1 could co-exist with options 2 or 3, but it would become confusing if any index-time choices contradict the query-time choices, e.g. a case-sensitive mixed-case query targeting a field which has opted to use a lower-case normalizer.
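To illustrate the kind of contradiction meant here, consider an index that lower-cases at write time and a mixed-case query that asks for case-sensitive matching. This is a sketch: the `case_insensitive` query flag is the sort of query-time option under discussion, not an existing parameter, and the names are examples:

```json
PUT contradiction-example
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lc": { "type": "custom", "filter": [ "lowercase" ] }
      }
    }
  },
  "mappings": {
    "properties": {
      "file_path": { "type": "keyword", "normalizer": "lc" }
    }
  }
}

GET contradiction-example/_search
{
  "query": {
    "wildcard": {
      "file_path": {
        "value": "*Program Files*",
        "case_insensitive": false
      }
    }
  }
}
```

The index only contains lower-cased terms, so this "case-sensitive" mixed-case query can never match anything - a silent failure rather than a loud one.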
Thanks @markharwood. I think that increased storage is an acceptable trade-off in this particular situation - my vote would be option 1. There are two parts to solving this issue: 1) developing the solution, and 2) getting people to use the solution. If the solution is optional, it will become yet another esoteric setting that users will need to figure out. TL;DR: it needs to be non-optional, or at least default to the proposed solution, with a way to disable it if need be.
That's the dilemma. We could up-front automatically optimise search for every conceivable query type (wildcard, exact-value match, word-based matches, case sensitive, case insensitive) but that would require multiple data structures, which means more disk space. If users only want to pay for what they need, they must opt in to these specialised field configurations or live with the limitations of an unoptimised field for some queries (e.g. wildcard searches on a keyword field). All of the above is a statement of Elasticsearch's general policy for handling string fields.
I think this has been understated in this thread. There's been some back and forth largely about what the defaults are, but in my opinion this will largely come down to the mappings provided with ECS. Personally, I would lean towards conservative defaults within Elasticsearch and communicating well what those are and how to change them. It isn't necessarily fair to all users to be affected by a bias towards common SIEM use cases. ECS is where it seems most appropriate to define both wildcard and case sensitivity on a per-field basis.
++ for this query-time transformation. It also has the nice property of backwards compatibility. We've been talking about case-sensitivity with EQL as well, and I was considering something like this for when we want a case-insensitive search on a field indexed with its original case. Being able to do this on the fly and automagically, without the hoops @neu5ron mentions in his post, is a big win.
Worth noting that while this is a win for avoiding reindexing, it's a loss for the user attempting a case-sensitive search on a field indexed as lower-case. It's a break with the long-held principle of case-sensitivity being determined by choice of field names, not query flags. That's why I proposed option 3.
Yes, support for 1 and 3 is perfect IMO. With 2 and 3, it looks like it's still the same transformation, just a difference in whether it's applied to one field or all fields in the query. Would the proposed
That's possible, but I'm concerned about how other query types (term, terms, prefix) would be expected to behave on a keyword virtual field. With a wildcard field, all supported query types (wildcard/term/prefix etc.) have a quick approximate ngram match which must be verified by retrieving the doc value, and that is where we get the opportunity to lowercase on the fly if required. Some slowness is a built-in expectation for all query types, so the cost of lowercasing on the fly in a wildcard's virtual subfield is comparatively small. However, with a
After further discussion we concluded it would be useful to offer query-time case insensitive search options with the assumption they could be used on both wildcard and keyword fields. There are a number of open questions still at this stage:
Unlike the wildcard field, we can't always guarantee a keyword field will have the original un-normalised strings easily accessible from doc values. If the content was normalised to lowercase and the query is a mixed-case string with a case-sensitive parameter, we should probably error loudly rather than fail to match silently. Errors can be worked around, but silent failures go unnoticed and can mislead users. We agreed that a query with a case-sensitivity parameter set should fail when used on a
In terms of implementing a case insensitive regex query on keyword fields - which of these two approaches would we use? A is simple to implement but slow.
I prefer option B: the automaton is intersected with the terms dictionary, so the expansion is limited to matching terms. In the worst case, all permutations exist in the dictionary, but the multi-term query handles it smoothly with the
I have a slight preference for a new
Update - a PR for case insensitive regex searches is happening in Lucene.
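Assuming that Lucene change lands, a case insensitive regexp query could then be surfaced in the query DSL along the lines discussed above. A sketch, with the `case_insensitive` parameter name and the field name both assumptions:

```json
GET logs/_search
{
  "query": {
    "regexp": {
      "process.command_line": {
        "value": "powershell\\.exe.*",
        "case_insensitive": true
      }
    }
  }
}
```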
Following some discussion we concluded that
The "legacy" mode would be the default, preserving the current (inconsistent) matching behaviour. The two other modes would do exactly what you would expect in relation to matching the search input with the indexed tokens (which can differ from the JSON source).
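If the explicit modes were exposed as a query-level flag, usage might look like the sketch below (not a committed API; the flag and field names are assumptions). Here the query matches `web-server-01`, `Web-Server-01`, etc., regardless of how the keyword was normalised at index time:

```json
GET logs/_search
{
  "query": {
    "term": {
      "host.name": {
        "value": "WEB-SERVER-01",
        "case_insensitive": true
      }
    }
  }
}
```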
A tricky situation, but I like this proposal 👍
@markharwood Is the plan first to correct term queries to remove normalization from them?
@mayya-sharipova that's not clear to me - it will need thrashing out on #25487. For the moment I'm assuming we're not relying on getting that fix, because it's a breaking change that will need to wait for 8.0 and we want case insensitive search out in 7.x.
No - we could choose to keep the provided query string's case if case-sensitive is explicitly set in the query's param, but the problem is
I'm not sure if the last "case insensitive" in the above statement should have read "case sensitive". Either way, I took that to mean we are not going to try to warn or error if a user picks an inappropriate combination of query and index settings, e.g. a case sensitive search on a
@markharwood Thanks for the clarification, makes sense.
This makes sense to me: when
Query_string will be a challenge

Another consideration is that the most popular way of writing wildcard queries is not JSON - it's more likely done via the `query_string` query. These existing design choices I assume are guided by the idea that normalisation is a base level of functionality that sits below "analysis" choices like stemming etc. As such, it is always on, regardless of the analyzer chosen.

So, any new flag I assume we add to query_string would be providing a form of query-time normalization for those fields that have had no index-time normalisation (wildcard, and keyword-with-no-normalizer). This will be tricky to name and document - I expect most users will struggle to grasp the subtleties of this new flag vs. the existing behaviour.

KQL might be simpler

Practically speaking, many users will be entering queries using Kibana and therefore KQL, so I have opened a Kibana issue to add regex syntax where case sensitivity would be controllable using regex-style flags.
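To make the query_string concern concrete, a typical security search today looks something like the sketch below (the field name is an example; `analyze_wildcard` is the existing option that passes wildcard terms through analysis). Whether it matches mixed-case data depends on the target field's index-time normalisation, which is invisible in the query itself:

```json
GET logs/_search
{
  "query": {
    "query_string": {
      "query": "file_path:*PowerShell*",
      "analyze_wildcard": true
    }
  }
}
```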
Great discussion here all! @markharwood - Wishing to clarify an earlier discussion point:
As ECS looks to adopt wildcard, we continue to evaluate where wildcard could be a transparent replacement for keyword. Per your earlier feature comparison and your reply to @webmat's ask,
Closing in favour of #61162
So how do we deal with case insensitive search on the wildcard datatype in query_string now? @markharwood
Currently the wildcard field only supports case sensitive search, but it is vital that we find a way to offer case insensitive search too. A recent blog post highlighted general string-matching problems and how users have resorted to ugly regex expressions like this one to overcome issues with case sensitivity:
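(The blog's exact expression didn't survive here; the following is an illustrative reconstruction of that style of workaround, spelling out both cases of every character because regexp matching is case sensitive. The field name is an example.)

```json
GET logs/_search
{
  "query": {
    "regexp": {
      "process.command_line": "[pP][oO][wW][eE][rR][sS][hH][eE][lL][lL].*"
    }
  }
}
```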
The example above is a search for a string from a case-insensitive operating system, where hackers may have used mixed-case commands deliberately to try to evade simpler rule detection.
Solution 1: Index-time case choices
We could make the wildcard field accept an optional normalizer to lower-case the content at index time (much like the `keyword` field). However, in a centralised logging system we may be storing content from both Windows and Unix machines, which use case-insensitive and case-sensitive file systems respectively. The importance of case may vary from one document to the next. This would typically mean that we would be forced to index with multi-fields (one case sensitive, the other not), which would double the storage costs.
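Under solution 1 a mapping might look like the sketch below. This is hypothetical - `normalizer` is not currently an option the wildcard field accepts, and the sub-field layout is an assumption:

```json
PUT logs
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lc": { "type": "custom", "filter": [ "lowercase" ] }
      }
    }
  },
  "mappings": {
    "properties": {
      "file_path": {
        "type": "wildcard",
        "fields": {
          "lower": { "type": "wildcard", "normalizer": "lc" }
        }
      }
    }
  }
}
```

Every value is then indexed twice - once as-is in `file_path` and once lower-cased in `file_path.lower` - which is where the doubled storage cost comes from.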
Solution 2: Query-time choices

The wildcard field already has 2 representations of the original content - an ngram index for approximate matching and a binary doc value of the original bytes for verification of approximate matches. If the ngram index is changed to always use lower-case, then the decision to have case-sensitive matching or not becomes a query-time option when verifying candidate matches. There would be a (likely small) increase in the number of false positives from the approximate matching, but the big advantage is no increase in today's storage costs (actually a decrease if we normalise ngrams).
In either solution the searcher has to make a conscious decision - either to search a case-insensitive field or to declare the query clause as case-insensitive.
Solution 2 looks preferable to me from the back end, but it is a break with existing approaches where case-sensitivity is an index-time mapping decision, not a property of a query clause. This means that the wildcard query clause would have a case-sensitive parameter that is relevant if you target a `wildcard` field but not a `text` or `keyword` field (although we could amend keyword field logic to support this too).
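A minimal sketch of such a query clause, assuming a `case_insensitive` flag (the parameter and field names are illustrative, not a committed API):

```json
GET logs/_search
{
  "query": {
    "wildcard": {
      "file_path": {
        "value": "*powershell*",
        "case_insensitive": true
      }
    }
  }
}
```

At query time the lower-cased ngram index selects candidate documents, and verification against the stored doc value then either respects or ignores case depending on the flag.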
Thoughts @jimczi @jpountz ?