Scripts to expose whole values for fields of the `text` family #81246

jpountz · 2021-12-02T10:04:24Z

One issue I keep hearing about is that it's too hard to define a runtime field that extracts some information from a message field with Painless. Something like extracting the HTTP status code from a log line of an Apache access log.

I think that this issue has been put into the general meta issue of "doing simple things with Painless should be simpler" but in my opinion this particular issue has more to do with mappings than with Painless. Historically, fielddata on analyzed string fields would uninvert the inverted index in memory and Elasticsearch would consider that the value of a field is the set of analyzed terms that it contains. This would require lots of memory, and over time we've increasingly discouraged users from doing it.

These semantics don't work well with runtime extraction of data. If you try to extract data using a regular expression that applies to doc['message'], you'll get an exception that fielddata is disabled by default on text fields. And even if Elasticsearch returned values, you'd get individual terms, which you cannot leverage to properly extract data from the message.
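To make the difference concrete, here is a minimal Python sketch (not Elasticsearch code). The `analyze` function is a hypothetical stand-in for an analyzer: it lowercases and splits on non-alphanumeric characters, which only approximates what a real analyzer does. The log line is made up for illustration.

```python
import re

# Hypothetical Apache-style access log line, made up for illustration.
MESSAGE = '127.0.0.1 - - [02/Dec/2021:10:04:24 +0000] "GET /index.html HTTP/1.1" 200 2326'

def analyze(text):
    """Rough stand-in for an analyzer: lowercase, split on non-alphanumerics.

    Returns a sorted, deduplicated list of terms, mimicking how fielddata on
    an analyzed field exposes terms rather than the original string.
    """
    return sorted(set(t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t))

def extract_status(whole_value):
    """Extract the 3-digit HTTP status code that follows the quoted request."""
    match = re.search(r'" (\d{3}) ', whole_value)
    return int(match.group(1)) if match else None

terms = analyze(MESSAGE)          # individual terms, original order and punctuation lost
status = extract_status(MESSAGE)  # works only because the whole value is available
```

The term `200` is present in `terms`, but nothing in the term list says it was the status code rather than, say, a response size, which is why extraction needs the whole value.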

I suggest that we change the semantics of fielddata on fields of the text family (including text and match_only_text) so that it returns whole values instead. This will enable us to give a more intuitive experience with scripts, where doc could read data from _source on text fields (#80504).

Note that this brings a downside: in order to make it easy to slice and dice the data, Elasticsearch allows users to use terms produced by terms aggregations in term filters, in order to dig further into data that falls within a given bucket. This would not work on text fields. I don't think it's the end of the world, since terms aggregations do not work on text fields today anyway given that we disallow fielddata, but I wanted to highlight it since it would create an exception to a rule that is otherwise honored by keyword, ip or numeric fields.

elasticmachine · 2021-12-02T10:04:27Z

Pinging @elastic/es-search (Team:Search)

romseygeek · 2021-12-02T10:21:24Z

+1. Combined with an 'exact' query, this would also allow us to do away with the keyword + text multifield we generate by default for text input, replacing it with a fielddata-enabled text field.

markharwood · 2021-12-02T10:40:42Z

+1

> Note that this brings a downside ... Elasticsearch allows users to use terms produced by terms aggregations in term filters, in order to dig further into data that falls within a given bucket. This would not work on text fields.

The pattern for "discover then drill-down" on text fields is to use significant_text agg and then a term query.
I use this daily to see trending terms/shingles in news headlines and then click to see details:

The change proposed on this issue shouldn't affect that technique.

jdconrad · 2021-12-02T16:03:43Z

This seems like a great idea! I imagine it's very rare that a user wants the analyzed data when they access text in this form. This also seems like a natural extension of (#80504), where the scripting fields API can fall back to source for any text field even when it exists in the mappings.

Edit: My main concern is bwc given that some scripts won't work anymore, so we will likely need a long deprecation period, but I think starting out w/ this in the scripting fields api only while still giving access through script doc values makes the deprecation easier.

jpountz · 2021-12-06T14:04:37Z

When discussing this issue last Friday, @jtibshirani suggested that we decouple scripts and other Elasticsearch functionality when making this change, so that scripts would start seeing whole values but e.g. aggregations could keep returning one value per token for some more time. This would introduce some inconsistency, e.g. aggregating over a text field called body and over a script that does doc['body'].values would return different results, but maybe in a less surprising way than if we were to change the semantics of fielddata on text fields all at once?

ruflin · 2022-06-15T08:04:44Z

Commenting on this from a pure user perspective. I stumbled over this when building a script for a runtime field. I used:

    String timestamp=dissect('%{foo}[%{timestamp}]%{bar} ').extract(doc["message"].value)?.timestamp;

I get an error on the message field because of the above issue and had to convert it to:

    String timestamp=dissect('%{foo}[%{timestamp}]%{bar} ').extract(params["_source"]["message"])?.timestamp;

For me the ideal experience is that I don't have to know about the underlying implementation details of Elasticsearch to access the data. I want to have just one way to access all the data, independent of the type of the field.
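To illustrate what the snippets above are after, here is a rough Python emulation of a dissect-style pattern. It is an assumption-laden sketch: the `dissect` helper below is hypothetical, it simply turns each `%{key}` into a non-greedy named group, and it does not reproduce all the semantics of the real dissect implementation. The sample message is made up.

```python
import re

def dissect(pattern):
    """Build an extractor from a dissect-like pattern (hypothetical helper).

    Literal text is matched verbatim; each %{key} becomes a non-greedy
    named capture group. Returns a function mapping text -> dict or None.
    """
    # re.split with a capture group alternates: literal, key, literal, key, ...
    parts = re.split(r"%\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)      # literal delimiter
        else:
            regex += f"(?P<{part}>.*?)"   # named field
    compiled = re.compile(regex + r"\Z", re.DOTALL)

    def extract(text):
        match = compiled.match(text)
        return match.groupdict() if match else None

    return extract

# Made-up message; note the trailing space required by the pattern above.
MESSAGE = "host [2022-06-15T08:04:44Z] GET / "
result = dissect("%{foo}[%{timestamp}]%{bar} ")(MESSAGE)
timestamp = result["timestamp"] if result else None
```

The extractor depends on the delimiters `[` and `]` and on the order of the text, all of which exist only in the whole value, not in a set of analyzed terms.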

javanna · 2022-06-24T20:44:17Z

@ruflin that is the plan. The Painless team has been iterating on a new API to access field values from a script that would transparently load values from wherever they are available, without users having to worry about this aspect.

javanna · 2022-06-24T21:44:18Z

We discussed this issue with the team, and we said that scripts should load whole values for text fields, but without affecting for now existing consumers of text fields that rely on it returning all the different tokens (e.g. significant terms).

This goes along with the conclusions from the discussion on falling back to loading from _source: once the script fields API is able to fall back to _source when doc_values are disabled, text fields will instead always load from _source when referred to from a script, regardless of whether fielddata is enabled or not.

We could possibly also load from a keyword sub-field if available, although there are complications (if a normalizer is configured, the content differs from what you have in _source etc.), and maybe we should rather address #53181 then.
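The loading rule described in this comment can be sketched as follows. This is a minimal Python illustration, not Elasticsearch code: `resolve_script_value`, `TEXT_FAMILY`, and the dict-based `source` and `doc_values` arguments are hypothetical names chosen for the example, and the keyword sub-field option is deliberately left out.

```python
# Field types treated as the `text` family in this sketch.
TEXT_FAMILY = {"text", "match_only_text"}

def resolve_script_value(field_name, field_type, source, doc_values=None):
    """Pick where a script reads a field value from (hypothetical rule).

    - text-family fields: always the whole value from _source, even if
      fielddata (modelled here as doc_values) is available
    - other fields: doc values when present, otherwise fall back to _source
    """
    if field_type in TEXT_FAMILY:
        return source.get(field_name)
    if doc_values is not None and field_name in doc_values:
        return doc_values[field_name]
    return source.get(field_name)

# Made-up document for illustration.
source = {"message": "GET /index.html 200", "status": 200}
fielddata = {"message": ["200", "get", "html", "index"], "status": 200}

whole = resolve_script_value("message", "text", source, fielddata)
numeric = resolve_script_value("status", "long", source, fielddata)
fallback = resolve_script_value("status", "long", source, None)
```

In this sketch, consumers outside of scripts (for example aggregations) would keep reading the tokenized `fielddata` view, which matches the decision not to affect them for now.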

javanna · 2022-06-24T21:45:59Z

@jpountz given that you opened this issue specifically to address scripting needs, do you feel like we should discuss further what to do for text fields outside of scripting, or can we declare this issue resolved once scripts are able to transparently load whole values for text fields?

felixbarny · 2022-06-29T06:36:32Z

Would the fallback for match_only_text also work if synthetic _source is enabled?

If not, that would be an issue, as we'd like to use synthetic source for logs but the message field is mapped as match_only_text.

elasticsearch/x-pack/plugin/core/src/main/resources/data-streams-mappings.json

Lines 17 to 20 in 7bbdf6a

    
    "match": "message",
    "mapping": {
      "type": "match_only_text"
    }

There were discussions about enabling only certain fields for _source and to rely on synthetic source for others. Maybe message would be a field where we'd still rely on source.

jpountz · 2022-06-29T07:02:56Z

@javanna I would be ok with closing this issue if the scripting concern is addressed.

@felixbarny Indeed, there have been a few discussions about this, e.g. should we store (store: true) text and match_only_text fields to make them compatible with synthetic source.

javanna · 2022-06-29T09:48:13Z

@felixbarny what synthetic source currently does for text fields is load from a keyword sub-field if it exists. We have discussed loading from stored fields when available, but it's not implemented yet. We also discussed that we may need a synthetic source mode that automatically stores text fields separately so that they can be supported out-of-the-box when synthetic source is enabled, without users having to set store: true manually on all of their text fields. @nik9000 can you confirm that match_only_text behaves the same as text?

romseygeek · 2022-06-29T10:22:13Z

Currently match_only_text doesn't support synthetic source. I don't think it would be too tricky to add, though.

javanna · 2022-08-03T16:34:47Z

I have updated the title of this issue to reflect the current goal, which is limited to scripts.

jdconrad · 2022-08-17T16:14:03Z

Closing as this feature is now supported for scripting via (#89396).

…89396) This change adds access to mapped text fields via the Painless scripting fields API. The values returned from a text field via the scripting fields API always use source as described by (elastic#81246). Old-style access through doc still depends on fielddata, so that path is unchanged and bwc issues are avoided.

felixbarny · 2022-08-18T13:09:47Z

I just gave it a spin. Works great on text fields, but match_only_text support is missing. As the message field is mapped as match_only_text by default, we can still only access it via params["_source"]["message"] rather than $('message', ''). Any plans to add support for match_only_text? Should we create a separate issue or maybe re-open this one (as it's about adding support for text-family fields)?

javanna · 2022-08-18T14:20:25Z

thanks @felixbarny for the feedback! Let me reopen this then.

jdconrad · 2022-08-18T17:45:06Z

@felixbarny Thank you for the rapid feedback! After speaking with @javanna I posted #89473, which should cover match_only_text. Once that PR is merged, I would appreciate it if you could check that this covers what you need.

This change adds access to mapped match_only_text fields via the Painless scripting fields API. The values returned from a match_only_text field via the scripting fields API always use source as described by (#81246). These are not available via doc values so there are no bwc issues.

jdconrad · 2022-08-22T14:33:41Z

@felixbarny The change has been merged, so now there should be support for both text and match_only_text. I'm going to close this again, but please let me know if there are any issues you encounter.

felixbarny · 2022-08-22T14:56:14Z

Works great, thanks!

jpountz added >enhancement >breaking :Search Foundations/Mapping Index mappings, including merging and defining field types team-discuss labels Dec 2, 2021

elasticmachine added the Team:Search Meta label for search team label Dec 2, 2021

jpountz mentioned this issue Dec 6, 2021

Automatically load unmapped fields from _source? #81357

Open

stu-elastic mentioned this issue Dec 7, 2021

Add mapped types to scripting fields api #79105

Open

49 tasks

jdconrad mentioned this issue Jun 14, 2022

Mappings should fall back to _source when doc values are disabled #80504

Open

javanna removed the team-discuss label Jun 24, 2022

javanna removed the >breaking label Jun 29, 2022

javanna changed the title ~~Change the semantics of fielddata on fields of the text family to return whole values.~~ Scripts to expose whole values for fields of the text family Aug 3, 2022

jdconrad mentioned this issue Aug 16, 2022

Add text field support in the Painless scripting fields API #89396

Merged

javanna assigned jdconrad Aug 17, 2022

jdconrad closed this as completed Aug 17, 2022

javanna reopened this Aug 18, 2022

javanna mentioned this issue Aug 18, 2022

Formalize dual text/keyword mappings #53181

Open

jdconrad mentioned this issue Aug 18, 2022

Add source fallback support for match_only_text mapped type #89473

Merged

jdconrad closed this as completed Aug 22, 2022

javanna added Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 16, 2024
