Scripts to expose whole values for fields of the text family #81246
Comments
Pinging @elastic/es-search (Team:Search) |
+1. Combined with an 'exact' query, this would also allow us to do away with the keyword + text multifield we generate by default for text input, replacing it with a fielddata-enabled text field. |
+1
The pattern for "discover then drill-down" on text fields is to use The change proposed on this issue shouldn't affect that technique. |
This seems like a great idea! I imagine it's very rare that a user wants the analyzed data when they access text in this form. This seems like a natural extension of (#80504) as well, where the scripting fields API can fall back to source for any text field even when it exists in the mappings. Edit: My main concern is bwc given that some scripts won't work anymore, so we will likely need a long deprecation period, but I think starting out w/ this in the scripting fields API only, while still giving access through script doc values, makes the deprecation easier. |
When discussing this issue last Friday, @jtibshirani suggested that we decouple scripts and other Elasticsearch functionality when making this change, so that scripts would start seeing whole values but e.g. aggregations could keep returning one value per token for some more time. This would introduce some inconsistency, e.g. aggregating over a |
Commenting on this from a pure user perspective. I stumbled over this when building a script for a runtime field. I used:
I get an error on the
For me the ideal experience is that I don't have to know about the underlying implementation details of Elasticsearch to access the data. I want to have just one way to access all the data, independent of the type of the field. |
@ruflin that is the plan, the Painless team has been iterating on a new API to access field values from a script, that would transparently load values from where they are available, without users having to worry about this aspect. |
We discussed this issue with the team, and we said that scripts should load whole values for text fields, but without affecting, for now, existing consumers of text fields that rely on them returning all the different tokens (e.g. significant terms). This goes along with the conclusions from the discussion on falling back to loading from _source: once the script fields API is able to fall back to _source when doc_values are disabled, text fields will instead always load from _source when referred to from a script, regardless of whether fielddata is enabled or not. We could possibly also load from a keyword sub-field if available, although there are complications (if a normalizer is configured, the content differs from what you have in _source, etc.), and maybe we should rather address #53181 then. |
@jpountz given that you opened this issue specifically to address scripting needs, do you feel like we should discuss further what to do for text fields outside of scripting, or can we declare this issue resolved once scripts are able to transparently load whole values for text fields? |
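The loading rule described in this comment can be sketched in a few lines of Python. This is only an illustration of the decision as stated above, not Elasticsearch code: the function name `load_for_script`, its parameters, and the sample values are all hypothetical.

```python
# Hypothetical sketch of the rule discussed above; none of these names
# exist in Elasticsearch.
TEXT_FAMILY = ("text", "match_only_text")


def load_for_script(field_type, source_value, doc_values=None):
    """Return the values a script would see for one field of one document.

    field_type   -- mapping type of the field (assumed to be a plain string)
    source_value -- the original value as it appears in _source
    doc_values   -- per-document doc values / fielddata, or None if disabled
    """
    if field_type in TEXT_FAMILY:
        # Text-family fields: always the whole value from _source,
        # regardless of whether fielddata is enabled.
        return [source_value]
    if doc_values is not None:
        # Other fields: prefer doc values when they are available.
        return doc_values
    # Otherwise fall back to _source.
    return [source_value]


text_result = load_for_script("text", "GET /a 200", doc_values=["200", "a", "get"])
keyword_result = load_for_script("keyword", "GET /a 200", doc_values=["GET /a 200"])
fallback_result = load_for_script("keyword", "GET /a 200")
```

Under these assumptions the text field ignores its per-token fielddata and yields the single whole value, while the keyword field only reads _source when doc values are missing.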
Would the fallback for If not that would be an issue as we'd like to use synthetic source for logs but the elasticsearch/x-pack/plugin/core/src/main/resources/data-streams-mappings.json Lines 17 to 20 in 7bbdf6a
There were discussions about enabling only certain fields for _source and to rely on synthetic source for others. Maybe |
@javanna I would be ok with closing this issue if the scripting concern is addressed. @felixbarny Indeed, there have been a few discussions about this, e.g. should we store ( |
@felixbarny what synthetic source currently does for text fields is load from a keyword sub-field if it exists. We have discussed loading from stored fields when available but it's not implemented yet. We also discussed that we may need a synthetic source mode that automatically stores text fields separately so that they can be supported out-of-the-box when synthetic source is enabled, without users having to set store:true manually to all of their text fields. @nik9000 can you confirm that match_only_text behaves the same as text? |
Currently |
I have updated the title of this issue to reflect the current goal, which is limited to scripts. |
Closing as this feature is now supported for scripting via (#89396). |
…89396) This change adds access to mapped text fields via the Painless scripting fields API. The values returned from a text field via the scripting fields API always use source as described by (elastic#81246). Access via the old-style through doc will still depend on field data, so there is no change and avoids bwc issues.
I just gave it a spin. Works great on |
thanks @felixbarny for the feedback! Let me reopen this then. |
@felixbarny Thank you for the rapid feedback! After speaking with @javanna I posted #89473 which should cover match_only_text. Once that PR is merged, I would appreciate if you could check this covers what you need. |
This change adds access to mapped match_only_text fields via the Painless scripting fields API. The values returned from a match_only_text field via the scripting fields API always use source as described by (#81246). These are not available via doc values so there are no bwc issues.
@felixbarny The change has been merged so now there should be support for both text and match_only_text. I'm going to close this again, but please let me know if there's any issues you encounter. |
Works great, thanks! |
One issue I keep hearing about is that it's too hard to define a runtime field that extracts some information from a message field with Painless. Something like extracting the HTTP status code from a log line of an Apache access log.
I think that this issue has been put into the general meta issue of "doing simple things with Painless should be simpler" but in my opinion this particular issue has more to do with mappings than with Painless. Historically, fielddata on analyzed string fields would uninvert the inverted index in memory and Elasticsearch would consider that the value of a field is the set of analyzed terms that it contains. This would require lots of memory, and over time we've increasingly discouraged users from doing it.
These semantics don't work well with runtime extraction of data. If you try to extract data using a regular expression that applies to doc['message'], you'll get an exception that fielddata is disabled by default on text fields. And even if Elasticsearch returned values, you'd get individual terms, which you cannot leverage to properly extract data from the message.
I suggest that we change the semantics of fielddata on fields of the text family (including text and match_only_text) so that it returns whole values instead. This will enable us to give a more intuitive experience with scripts, where doc could read data from _source on text fields (#80504).
Note that this brings a downside: in order to make it easy to slice and dice the data, Elasticsearch allows users to use terms produced by terms aggregations in term filters, in order to dig further into data that falls within a given bucket. This would not work on text fields. I don't think it's the end of the world, since terms aggregations do not work on text fields today anyway given that we disallow fielddata, but I wanted to highlight it since it would create an exception to a rule that is otherwise honored by keyword, ip or numeric fields.
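To make the difference between whole values and analyzed terms concrete, here is a minimal Python sketch (not Painless, and not Elasticsearch code). The sample log line, the regular expression, and the `simplified_terms` helper are assumptions made for illustration; the helper only roughly imitates analysis followed by uninverting (lowercase, split on non-alphanumerics, deduplicate, sort) and is not the real standard analyzer.

```python
import re

# Hypothetical whole value of a `message` field: one Apache access log line.
message = '127.0.0.1 - - [10/Oct/2021:13:55:36 +0000] "GET /index.html HTTP/1.1" 404 2326'

# The status code is the 3-digit number right after the quoted request.
STATUS = re.compile(r'" (\d{3}) ')


def extract_status(value):
    """Return the HTTP status code found in `value`, or None if absent."""
    match = STATUS.search(value)
    return int(match.group(1)) if match else None


def simplified_terms(value):
    """Rough stand-in for what token-based fielddata exposes:
    a sorted set of lowercased terms, with position and punctuation lost."""
    return sorted({t for t in re.split(r"[^0-9a-z]+", value.lower()) if t})


# With the whole value, the positional context needed by the regex survives.
whole_value_status = extract_status(message)

# With individual terms, "404" is just one number among several
# ("127", "2021", "2326", ...) and nothing marks it as the status code.
terms = simplified_terms(message)
per_term_statuses = [extract_status(term) for term in terms]
```

In this sketch the extraction succeeds on the whole value, while every individual term fails to match because the surrounding structure of the log line is gone.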