Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scripts to expose whole values for fields of the text family #81246

Closed
jpountz opened this issue Dec 2, 2021 · 20 comments
Closed

Scripts to expose whole values for fields of the text family #81246

jpountz opened this issue Dec 2, 2021 · 20 comments
Assignees
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@jpountz
Copy link
Contributor

jpountz commented Dec 2, 2021

One issue I keep hearing about is that it's too hard to define a runtime field that extracts some information from a message field with Painless. Something like extracting the HTTP status code from a log line of an Apache access log.

I think that this issue has been put into the general meta issue of "doing simple things with Painless should be simpler" but in my opinion this particular issue has more to do with mappings than with Painless. Historically, fielddata on analyzed string fields would uninvert the inverted index in memory and Elasticsearch would consider that the value of a field is the set of analyzed terms that it contains. This would require lots of memory, and over time we've increasingly discouraged users from doing it.

These semantics don't work well with runtime extraction of data. If you try to extract data using a regular expression that applies to doc['message'], you'll get an exception that fielddata is disabled by default on text fields. And even if Elasticsearch returned values, you'd get individual terms, which you cannot leverage to properly extract data from the message.

I suggest that we change the semantics of fielddata on fields of the text family (including text and match_only_text) so that it returns whole values instead. This will enable us to give a more intuitive experience with scripts, where doc could read data from _source on text fields (#80504).

Note that this brings a downside: in order to make it easy to slice and dice the data, Elasticsearch allows users to use terms produce by terms aggregations in term filters, in order to dig further data that falls within a given bucket. This would not work on text fields. I don't think it's the end of the world, since terms aggregations do not work on text fields today anyway given that we disallow fielddata, but I wanted to highlight it since it would create an exception to a rule that is otherwise honored by keyword, ip or numeric fields.

@jpountz jpountz added >enhancement >breaking :Search Foundations/Mapping Index mappings, including merging and defining field types team-discuss labels Dec 2, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Dec 2, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@romseygeek
Copy link
Contributor

+1. Combined with an 'exact' query, this would also allow us to do away with the keyword + text multifield we generate by default for text input, replacing it with a fielddata-enabled text field.

@markharwood
Copy link
Contributor

+1

Note that this brings a downside ... Elasticsearch allows users to use terms produce by terms aggregations in term filters, in order to dig further data that falls within a given bucket. This would not work on text fields.

The pattern for "discover then drill-down" on text fields is to use significant_text agg and then a term query.
I use this daily to see trending terms/shingles in news headlines and then click to see details:

Kibana

The change proposed on this issue shouldn't effect that technique.

@jdconrad
Copy link
Contributor

jdconrad commented Dec 2, 2021

This seems like a great idea! I imagine it's very rare that a user wants the analyzed data when they access text in this form. This seems like a natural extension of (#80504) as well where the scripting fields api can fallback to source for any text field even when it exists in the mappings.

Edit: My main concern is bwc given that some scripts won't work anymore, so we will likely need a long deprecation period, but I think starting out w/ this in the scripting fields api only while still giving access through script doc values makes the deprecation easier.

@jpountz
Copy link
Contributor Author

jpountz commented Dec 6, 2021

When discussing this issue last Friday, @jtibshirani suggested that we decouple scripts and other Elasticsearch functionality when making this change, so that scripts would start seeing whole values but e.g. aggregations could keep returning one value per token for some more time. This would introduce some inconsistency, e.g. aggregating over a text field called body and over a script that does doc['body'].values would return different results, but maybe in a less surprising way than if we were to change the semantics of fielddata on text fields all at once?

@ruflin
Copy link
Member

ruflin commented Jun 15, 2022

Commenting on this from a pure user perspective. I stumbled over this when building a script for a runtime field. I used:

String timestamp=dissect('%{foo}[%{timestamp}]%{bar} ').extract(doc["message"].value)?.timestamp;

I get an error on the message field because of the above issue and had to convert it to:

String timestamp=dissect('%{foo}[%{timestamp}]%{bar} ').extract(params["_source"]["message"])?.timestamp;

For me the ideal experiences is that I don't have to know about the underlying implementation details of Elasticsearch to access the data. I want to just have 1 way to access all the data independent of the type of the field.

@javanna
Copy link
Member

javanna commented Jun 24, 2022

@ruflin that is the plan, the Painless team has been iterating on a new API to access field values from a script, that would transparently load values from where they are available, without users having to worry about this aspect.

@javanna
Copy link
Member

javanna commented Jun 24, 2022

We discussed this issue with the team, and we said that scripts should load whole values for text fields, but without affecting for now existing consumers of text fields that rely on it returning all the different tokens (e.g. significant terms).

This goes along with the conclusions from the discussion on falling back to loading from _source : once the script fields API is able to fallback to _source when doc_values are disabled, text fields will instead always load from _source when referred to from a script, regardless of whether fielddata is enabled or not.

We could possibly also load from a keyword sub-field if available, although there are complications (if a normalizer is configured, the content differs from what you have in _source etc.), and maybe we should rather address #53181 then.

@javanna
Copy link
Member

javanna commented Jun 24, 2022

@jpountz given that you open this issue specifically to address scripting needs, do you feel like we should discuss further what to do for text fields outside of scripting, or can we declare this issue resolved once scripts are able to transparently load whole values for text fields?

@felixbarny
Copy link
Member

Would the fallback for match_only_text also work if synthetic _source is enabled?

If not that would be an issue as we'd like to use synthetic source for logs but the message field is mapped as match_only_text.

"match": "message",
"mapping": {
"type": "match_only_text"
}

There were discussions about enabling only certain fields for _source and to rely on synthetic source for others. Maybe message would be a field where we'd still rely on source.

@jpountz
Copy link
Contributor Author

jpountz commented Jun 29, 2022

@javanna I would be ok with closing this issue if the scripting concern is addressed.

@felixbarny Indeed, there have been a few discussions about this, e.g. should we store (store: true) text and match_only_text fields to make them compatible with synthetic source.

@javanna
Copy link
Member

javanna commented Jun 29, 2022

@felixbarny what synthetic source currently does for text fields is load from a keyword sub-field if it exists. We have discussed loading from stored fields when available but it's not implemented yet. We also discussed that we may need a synthetic source mode that automatically stores text fields separately so that they can be supported out-of-the-box when synthetic source is enabled, without users having to set store:true manually to all of their text fields. @nik9000 can you confirm that match_only_text behaves the same as text?

@romseygeek
Copy link
Contributor

Currently match_only_text doesn't support synthetic source. I don't think it would be too tricky to add, though.

@javanna javanna removed the >breaking label Jun 29, 2022
@javanna javanna changed the title Change the semantics of fielddata on fields of the text family to return whole values. Scripts to expose whole values for fields of the text family Aug 3, 2022
@javanna
Copy link
Member

javanna commented Aug 3, 2022

I have updated the title of this issue to reflect the current goal, which is limited to scripts.

@jdconrad
Copy link
Contributor

jdconrad commented Aug 17, 2022

Closing as this feature is now supported for scripting via (#89396).

pull bot pushed a commit to NOUIY/elasticsearch that referenced this issue Aug 17, 2022
…89396)

This change adds access to mapped text fields via the Painless scripting fields API. The values returned 
from a text field via the scripting fields API always use source as described by (elastic#81246). Access via the 
old-style through doc will still depend on field data, so there is no change and avoids bwc issues.
@felixbarny
Copy link
Member

I just gave it a spin. Works great on text fields, but match_only_text support is missing. As the message field is mapped as match_only_text my default, we can still only access it via params["_source"]["message"] rather than $('message', ''). Any plans to add support for match_only_text? Should we create a separate issue or maybe re-open this one (as it's about adding support text-family fields).

@javanna
Copy link
Member

javanna commented Aug 18, 2022

thanks @felixbarny for the feedback! Let me reopen this then.

@jdconrad
Copy link
Contributor

jdconrad commented Aug 18, 2022

@felixbarny Thank you for the rapid feedback! After speaking with @javanna I posted #89473 which should cover match_only_text. Once that PR is merged, I would appreciate if you could check this covers what you need.

jdconrad added a commit that referenced this issue Aug 22, 2022
This change adds access to mapped match_only_text fields via the Painless scripting fields API. The 
values returned from a match_only_text field via the scripting fields API always use source as described 
by (#81246). These are not available via doc values so there are no bwc issues.
@jdconrad
Copy link
Contributor

@felixbarny The change has been merged so now there should be support for both text and match_only_text. I'm going to close this again, but please let me know if there's any issues you encounter.

@felixbarny
Copy link
Member

Works great, thanks!

@javanna javanna added Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch
Projects
None yet
Development

No branches or pull requests

8 participants