Flattened object fields design + implementation #33003
Pinging @elastic/es-search-aggs
The initial example on #25312 suggests that object fields would be indexed like text since it indexes
Why do we plan to index individual tokens alone? I suspect most users won't want/need to search the entire object, meaning that this feature will double the size of the inverted index for a feature they don't need? Since you already mentioned having some sort of
If we go with keyword-style indexing, then we should probably skip highlighting, which is only useful with text fields? (matched queries are typically used instead for structured content)
+1 I suspect it will be quite easy actually.
That would be nice of course, but I'm not worried about it being slow since term queries on an indexed object would translate to a term query at the lucene level.
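A toy sketch of what that translation could look like (a hypothetical helper, not the actual Elasticsearch implementation; it assumes the `key\0value` token scheme discussed later in this thread):

```python
# Hypothetical sketch: a term query addressed to a key inside an object
# field rewrites to a single exact term lookup on the object's one Lucene
# field, which is why it should be as fast as a plain keyword term query.

DELIM = "\0"  # delimiter between key and value in indexed tokens (assumed)

def rewrite_term_query(field_path, value, object_field="header"):
    """Rewrite a query path like 'header.content-type' into one
    (lucene_field, token) pair for an exact term lookup."""
    if field_path == object_field:
        # Query on the whole object: match the un-prefixed value token.
        return (object_field, value)
    key = field_path[len(object_field) + 1:]  # strip the 'header.' prefix
    return (object_field, f"{key}{DELIM}{value}")

print(rewrite_term_query("header.content-type", "application/json"))
```

Either way the result is a single term lookup at the Lucene level, with no per-key field mappings involved.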
Thanks @jpountz for your thoughts.
I agree — in the potential use cases we’ve seen, the data is better modelled as keywords than text. The most critical feature is the ability to filter by an exact match on key-value pairs, and performing aggregations and sorting on the values would also be nice. A couple examples of these use cases:
My sense is that we should focus on keywords for now, but in the future we could consider support for some simple analysis/normalization, pending feedback on the feature.
It could be nice if users were able search an entire object field (e.g.
Right, that makes sense! I’ll just plan on a sanity check here.
I also wanted to clarify a point that was a bit fuzzy to me until I did a prototype. Under the proposed implementation, we are planning to index the entire JSON blob as a Lucene field, and apply a special analyzer to create tokens that resemble keywords. This is in contrast to an approach where we create a new field mapping for each key in the object (which I think would be messy and would negate some of the benefit of the feature). Taking this approach means that in Kibana and other clients, the field will be displayed as a single block of JSON. I created a quick example using the prototype implementation:
As these JSON blobs can be quite large, highlighting seemed useful for showing where the match actually occurred. I also wonder if we should support highlighting for consistency with keyword fields, as some clients (like Kibana) do depend on highlighting for displaying these matches?
I'm curious how this will work in practice as mixing up pre/post tags with a JSON structure sounds challenging? The other thing that worries me a bit is that if we want to support any highlighter that is able to use indexed offsets or term vectors, then when we extract the JSON object from the source document at search time, we must make sure that it produces exactly the same string as what was passed to the object mapper so that offsets are comparable, and any new/removed spaces and line breaks or reordering of keys would break highlighting?
Now that I read my comment again, the latter doesn't make sense: we will not allow enabling term vectors or indexing offsets, so the highlighter will have to recompute the matched offsets anyway.
I don't think I've dug enough into the details of highlighting to understand the concerns, but my takeaway is that it may be tricky to find a robust approach (and we should be open to punting on highlighting for v1). As for the question around non-prefixed tokens, what do you think about this plan? Whether or not to index non-prefixed tokens can be controlled through a flag, to give users the opportunity to try it out without forcing them to double their inverted index size. From my initial experiment, indexing the raw tokens doesn't add much more work/complexity. We can mark the feature 'experimental' at first, to allow time to collect feedback about this flag, and also about analysis.
I agree that complexity of the implementation is fine, I'm more concerned about the API as we should strive to have as few switches as possible, especially for a v1. To me the question of this switch boils down to the problem that we are trying to solve: either we want to allow users to actually index objects, in which case indexing raw values makes sense, or we want to allow users to avoid the overhead of mappings and Lucene fields when indexing keywords and then it makes less sense?
That is a nice way to frame it! I am thinking of it as the former (providing a true 'object' field). I think it fits better with the use cases/data we’ve seen, which center on indexing opaque JSON objects (Metricbeat data, user-provided blobs of data, etc.). To me, the most compelling use for this feature is being able to work with object data that is difficult to model otherwise, not just saving on indexing cost when working with keywords. I will try to get some more consensus/clarity on this point, and then loop back.
We had a discussion offline, and came to the following conclusions:
@colings86 @romseygeek I’ve given some thought to naming and have laid out some options. It would be great to get your opinions as well. Options I don’t think are very strong:
Current favorites:
@jtibshirani I agree with you on the ones you list as "Options I don’t think are very strong". On the "current favourites" I have the following thoughts:
Just throwing out a few ideas. Don't really think any of them are winners, but may spark an idea elsewhere. :)
@jtibshirani thanks again for letting me know the feature branch was ready to look at! I created an issue on the Kibana repo to start tracking our research. At the moment the biggest issue I'm seeing is that Kibana has no way to know what sub fields might be present in the objects. This prevents us from autocompleting those field names in the query bar and it also prevents the user from creating filters (the pills below the query bar) on those fields because we currently present them with a dropdown to select the field, populated from our index pattern's field list. I realize this is sort of the point of the new type, but I'm wondering if ES could somehow track which sub field names it has seen and expose that information to Kibana? I think it would dramatically improve the user experience for querying on these fields.
(Moved to #25312 (comment))
@jtibshirani sure, moved. |
A note to document the results of performance benchmarks. In summary, the results looked good overall; the only surprise was a small increase in index size when using a JSON field.

For the testing set-up, I ran the metricbeat track on an n1-standard-8 GCP instance. In the baseline, the track is run without modifications, and all fields are mapped individually. To test the performance of JSON fields, the object field
In the context of metricbeat data,

Term Query

To test query performance, the following operation was added:
As expected, queries perform very similarly to the baseline, where the subfield had been mapped individually as a
Terms Aggregation

The following terms aggregation was also tested:
Terms aggregations were slower than the baseline, but the performance was still acceptable. From the profiling output, the bulk of the time is spent in
Indexing Performance

Indexing throughput and service time looked good; the tests showed no decline in performance. To confirm the effect, I also repeated these indexing tests with an (unrealistic) set-up where all top-level fields were mapped as
Index Size

In all tests I ran, index size actually increased by a small amount. This was a bit counterintuitive for me, as I had assumed that using
Thanks for testing, this looks great.
This suggests that we first map segment ordinals to global ordinals before checking whether the global ordinal is in the expected range. We could probably speed it up by computing the range of valid segment ordinals and then calling nextOrd on the segment-level

If you still have the index handy, I'd be curious to know what the difference is in terms of Lucene memory usage. I'm expecting a bit more, but I'd be curious to know how much more.
That makes sense, I will take a look if it's possible to do this in a clean way. I think this benchmark presents a tough case for aggregations, in that each

As for Lucene memory usage, here is the relevant output from the rally benchmarks. Let me know if there are any other measurements that would be interesting:
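The speed-up suggested above can be modeled with a toy example (an assumed Python model, not actual Lucene code): instead of mapping every segment ordinal up to a global ordinal and range-checking it, compute once per segment which contiguous range of segment ordinals is valid, then filter with plain integer comparisons.

```python
# Toy model of per-segment ordinal filtering. A segment's term dictionary
# is sorted, so the segment ordinals whose terms fall inside a term range
# form one contiguous block that can be found with two binary searches.

import bisect

def valid_segment_ord_range(segment_terms, lo_term, hi_term):
    """Return the half-open [lo, hi) range of segment ordinals whose
    terms lie in [lo_term, hi_term)."""
    lo = bisect.bisect_left(segment_terms, lo_term)
    hi = bisect.bisect_left(segment_terms, hi_term)
    return lo, hi

# Example: key-prefixed tokens in one segment, sorted as Lucene stores them.
terms = ["a\x001", "a\x002", "b\x001", "b\x009", "c\x003"]
lo, hi = valid_segment_ord_range(terms, "b\x00", "b\x00\uffff")
print(lo, hi)  # prints: 2 4  (ordinals 2 and 3 hold the 'b'-prefixed tokens)

# The per-document check is now a cheap comparison, with no per-value
# segment-to-global ordinal lookup:
doc_ord = 3
in_range = lo <= doc_ord < hi
```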
Thanks for running this test, I expected memory usage to be higher for
Main issue: #25312
Feature branch: https://github.com/elastic/elasticsearch/tree/object-fields
Note: this field type was previously called `embedded_json`, so many PRs + comments will refer to that name.

Motivation
Documents sometimes contain large objects, where only a small number of the fields are frequently used in searches. By default, we create dynamic mappings for all key-value pairs in the object, and index each one as a separate field. This has a number of downsides:
In some cases, the number of field keys is not just a large known number, but unbounded. Here, it can be difficult to successfully model the data at all.
Feature Summary
This feature will allow an entire JSON object to be indexed into a field, and provide limited search functionality over the field's contents. Given an object field `header` of the form `{"content-type": "text/html", "referer": "https://google.com"}`, its content will be analyzed into the individual tokens `content-type\0text/html` and `referer\0https://google.com` (where `\0` is some suitable delimiter). Additionally, tokens are created for each value alone: `text/html`, `https://google.com`. Each leaf value in the object becomes its own token, and no further analysis is applied to the individual values.

In addition to being able to retrieve the JSON blob (through fetching source, or as a stored field), we plan to support queries of the following forms:
- Field: `header`, value: `application/json`, for example `{"term": {"header": "application/json"}}`
- Field: `header.content-type`, value: `application/json`, for example `{"term": {"header.content-type": "application/json"}}`
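As a concrete illustration of the analysis described above, here is a minimal sketch (function names are illustrative, not from Elasticsearch; it assumes a `\0` delimiter and dotted paths for nested keys):

```python
# Minimal sketch of the proposed analysis: every leaf value yields an
# un-prefixed token plus a key-prefixed token joined by a delimiter.
# Names and nested-key handling are assumptions, not the real mapper.

DELIM = "\0"

def flatten_tokens(obj, prefix=""):
    """Return the tokens produced for a JSON object: for each leaf,
    the raw value and 'path<DELIM>value'."""
    tokens = []
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            tokens.extend(flatten_tokens(value, path))  # recurse into sub-objects
        else:
            text = str(value)
            tokens.append(text)                    # value-only token
            tokens.append(f"{path}{DELIM}{text}")  # key-prefixed token
    return tokens

header = {"content-type": "text/html", "referer": "https://google.com"}
print(flatten_tokens(header))
```

A term query on `header.content-type` then reduces to an exact match against the prefixed token `content-type\0text/html`.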
Note that it is not possible to search the prefixed tokens directly, i.e. the following query will not return results: `{"term": {"header": "content-type\0application/json"}}`.
As a first pass, the following query types will be allowed: `term`, `terms`, `terms_set`, `range` (without special support for numerics), `prefix`, the `match` family (insofar as they work for keyword fields), `query_string`, `simple_query_string`, and `exists`.

In this first version, it will not be possible to refer to field keys using wildcards, as in `{"header.content-*": "application/json"}`. Under the proposed API/implementation, supporting field wildcards would add significant complexity and uncertainty around performance.

Potential Extensions
- `copy_to` to work on entire objects, so that the same JSON blob could be added both as a 'queryable object' field, and also as a normal object with explicit subfield definitions.
- `prefix_length`: we could likely support `wildcard`, `regexp`, and `fuzzy` queries.
- `match_phrase`.
.Implementation Plan
Core items:
- Support queries on the whole field, as in `{"header": "application/json"}`. Add a simple JSON field mapper. #33923
- Support queries on specific keys, as in `{"header.content-type": "application/json"}`. When parsing JSON fields, also create tokens prefixed with the field key. #34207 Add support for querying JSON fields based on key. #34621
- Rename the field type to `embedded_json`. Rename 'json' to 'embedded_json'. #40712