Scripted analysis components #26100

jpountz · 2017-08-08T15:58:49Z

If you have specific analysis needs, the only way to address them today is to write a plugin. This is quite a lot of work, since the plugin needs to be rebuilt for every release, which can be frustrating if your needs are simple.

We could give users the ability to write custom analysis components using scripts so that such simple needs could be addressed with a vanilla installation of Elasticsearch.

jpountz · 2017-08-11T15:09:27Z

We agreed to do it, but there are some challenges along the way, in particular:

What API do we expose? There is a simplicity / flexibility trade-off.
How do we deal with the fact that stored scripts may be updated while analysis components used at index-time should remain the same for the entire lifetime of the index?

rjernst · 2017-08-14T18:58:58Z

> How do we deal with the fact that stored scripts may be updated while analysis components used at index-time should remain the same for the entire lifetime of the index?

This is what concerns me the most. Perhaps this is a case where the scripting capability needs an API shim over it, so that these scripts cannot be modified once they are in use?

nik9000 · 2017-08-15T15:51:52Z

> This is what concerns me the most.

Yeah. We talked about, for example, forcing you to use inline scripts in the index settings. Then they can't change but you are stuck using inline scripts.
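To make the trade-off concrete, here is a minimal sketch of the difference between looking a script up by id and copying its source into the index settings. Both classes are hypothetical stand-ins written for illustration; they are not Elasticsearch classes, and the script sources are placeholder strings.

```java
/** Hypothetical cluster-wide store: an id can be re-pointed at new source at any time. */
final class StoredScriptRegistry {
    private final java.util.Map<String, String> sourcesById = new java.util.HashMap<>();

    void put(String id, String source) {
        sourcesById.put(id, source);
    }

    String get(String id) {
        return sourcesById.get(id);
    }
}

/** Hypothetical per-index analysis setting that holds the script source itself. */
final class InlineScriptSetting {
    // final: the source is fixed when the index is created and never re-resolved
    private final String source;

    InlineScriptSetting(String source) {
        this.source = java.util.Objects.requireNonNull(source);
    }

    String source() {
        return source;
    }
}
```

An index that keeps an `InlineScriptSetting` still sees the original source after the registry entry is overwritten, which is the property index-time analysis needs. The cost is the one mentioned above: the script cannot be shared or updated centrally.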

romseygeek · 2018-03-14T14:09:22Z

cc @elastic/es-search-aggs

romseygeek · 2018-09-06T13:17:59Z

We have condition filters now (#31958) and we should have scriptable stop filters soon (#33431). We also have the multiplexing filter which allows you to emit different variations on a token at the same position. I think the only thing remaining is to have a token mutating filter that allows you to change the bytes of a token using a script.

The tricky part here will be the API. We could just expose CharTermAttribute but that's not a trivial class to use, particularly if you want to prepend characters to a token. I'm also slightly worried that it would be easy to write a script that ends up inadvertently creating a bunch of objects, which you really don't want to do in the fast loop of an analysis chain.

One option could be to create a new type that wraps CharTermAttribute, with append(), prepend() and substring() methods, which would cover most use cases. More complicated substitutions can already be done using PatternReplaceFilter.
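A rough sketch of what such a wrapper type might look like is below. The class name `ScriptedTerm` and its method signatures are assumptions made for illustration, and a plain `char[]` stands in for the buffer that CharTermAttribute manages; a real implementation would delegate to the attribute instead. All three operations work in place on a reused buffer, so a script calling them does not allocate per token unless the buffer has to grow.

```java
/**
 * Hypothetical wrapper a script could receive instead of CharTermAttribute.
 * A plain char[] stands in for the attribute's internal buffer.
 */
final class ScriptedTerm {
    private char[] buf = new char[16];
    private int len = 0;

    /** Replaces the current term; the buffer is reused across tokens. */
    ScriptedTerm set(CharSequence term) {
        len = 0;
        return append(term);
    }

    /** Adds characters after the current term. */
    ScriptedTerm append(CharSequence s) {
        int n = s.length();
        ensureCapacity(len + n);
        for (int i = 0; i < n; i++) {
            buf[len + i] = s.charAt(i);
        }
        len += n;
        return this;
    }

    /** Adds characters before the current term by shifting it right in place. */
    ScriptedTerm prepend(CharSequence s) {
        int n = s.length();
        ensureCapacity(len + n);
        System.arraycopy(buf, 0, buf, n, len); // arraycopy handles the overlap
        for (int i = 0; i < n; i++) {
            buf[i] = s.charAt(i);
        }
        len += n;
        return this;
    }

    /** Keeps only the characters in [start, end), in place. */
    ScriptedTerm substring(int start, int end) {
        if (start < 0 || end > len || start > end) {
            throw new IndexOutOfBoundsException(
                "start=" + start + ", end=" + end + ", length=" + len);
        }
        System.arraycopy(buf, start, buf, 0, end - start);
        len = end - start;
        return this;
    }

    int length() {
        return len;
    }

    /** Allocates a String; intended for debugging and tests, not the hot loop. */
    @Override
    public String toString() {
        return new String(buf, 0, len);
    }

    /** Grows the buffer only when the term no longer fits. */
    private void ensureCapacity(int needed) {
        if (needed > buf.length) {
            buf = java.util.Arrays.copyOf(buf, Math.max(needed, buf.length * 2));
        }
    }
}
```

With this shape, a script body could be as short as `term.prepend("_")`, while the shifting and resizing that make prepending awkward stay inside the wrapper.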

jpountz · 2018-09-07T12:37:30Z

> have a token mutating filter that allows you to change the bytes of a token using a script

+1 to only allow mutating bytes

> One option could be to create a new type that wraps CharTermAttribute, with append(), prepend() and substring() methods, which would cover most use cases.

Maybe we can look at existing stemmers and see what building blocks would be useful to implement them. For instance I think reimplementing EnglishMinimalStemFilter would make a great example in our docs.
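To get a feel for the building blocks involved, here is a sketch of plural-only English stemming in the style of the "S-stemmer" rules. It is a simplified illustration, not a copy of Lucene's EnglishMinimalStemFilter, and the class and method names are made up. It needs only three primitives: read a character near the end of the term, overwrite one character, and shorten the term.

```java
/**
 * Sketch of plural-only ("S-stemmer" style) English stemming on a term buffer.
 * The rules are a simplification chosen for illustration.
 */
final class MinimalPluralStemSketch {

    /** Stems buf[0..len) in place and returns the new length. */
    static int stem(char[] buf, int len) {
        if (len < 3 || buf[len - 1] != 's') {
            return len;                 // too short, or does not end in 's'
        }
        char beforeS = buf[len - 2];
        if (beforeS == 'u' || beforeS == 's') {
            return len;                 // "bus", "glass": leave alone
        }
        if (beforeS == 'e') {
            char c = buf[len - 3];
            if (c == 'i' && len > 3 && buf[len - 4] != 'a' && buf[len - 4] != 'e') {
                buf[len - 3] = 'y';     // "ponies" becomes "pony"
                return len - 2;
            }
            if (c == 'i' || c == 'a' || c == 'o' || c == 'e') {
                return len;             // "goes", "sees": leave alone
            }
        }
        return len - 1;                 // "cats" becomes "cat"
    }

    /** Convenience for tests; allocates, so not meant for the analysis loop. */
    static String stem(String term) {
        char[] buf = term.toCharArray();
        return new String(buf, 0, stem(buf, buf.length));
    }
}
```

Nothing here appends or prepends, which suggests that for stemmer-like filters the most useful script-facing operations are character access near the end of the term and truncation.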

nik9000 · 2018-09-07T12:46:36Z

> For instance I think reimplementing EnglishMinimalStemFilter would make a great example in our docs.

❤️

We have a tool in the docs that calls the analyze API on a bunch of strings with two different analyzers and compares the results. It might help here.

cbuescher · 2018-10-24T11:15:50Z

As an additional datapoint where a scripted filter might be interesting: #34402.
Trying to summarize the request there: support for a token filter/char filter that cleans up certain boundary characters without having to use a regex in a pattern_replace filter, and that can be used in a normalizer for the keyword data type.
While this is probably a bit too specific to implement as a dedicated filter, it would probably fit nicely into a more generalized "script" filter.
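As a sketch of the kind of logic such a script would carry, the following trims leading and trailing boundary characters from a term in place, without a regex. It is an illustration only: the class name is made up, and `Character.isLetterOrDigit` is used as a crude stand-in for the Unicode Text Segmentation rules that #34402 asks for. It also works on individual `char` values, so characters outside the Basic Multilingual Plane are not handled correctly.

```java
/** Sketch: strip non-letter, non-digit characters from both ends of a term. */
final class BoundaryTrimSketch {

    /** Trims buf[0..len) in place and returns the new length. */
    static int trim(char[] buf, int len) {
        int start = 0;
        while (start < len && !Character.isLetterOrDigit(buf[start])) {
            start++;                    // skip leading boundary characters
        }
        int end = len;
        while (end > start && !Character.isLetterOrDigit(buf[end - 1])) {
            end--;                      // skip trailing boundary characters
        }
        // Move the kept region to the front of the buffer.
        System.arraycopy(buf, start, buf, 0, end - start);
        return end - start;
    }

    /** Convenience for tests; allocates, so not meant for the analysis loop. */
    static String trim(String value) {
        char[] buf = value.toCharArray();
        return new String(buf, 0, trim(buf, buf.length));
    }
}
```

Characters in the middle of the term are left untouched, so `"--Hello, World!!"` keeps its inner comma and space.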

javanna · 2022-11-16T20:56:38Z

This issue has been open for 5 years and has had no activity in the last 2. A couple of scripted components have been added, and we are not currently planning to add more, so I am closing it. Let's reopen if the need comes up again in the future.

jpountz added the :Search Relevance/Analysis and discuss labels Aug 8, 2017

jpountz added the >feature label and removed the discuss label Aug 11, 2017

cbuescher mentioned this issue Oct 24, 2018

Support for token filter/char filter that cleanups word boundaries (remove or replace with space) according to Unicode Text Segmentation algorithm #34402

Closed

rjernst added the Team:Search label May 4, 2020

javanna closed this as completed Nov 16, 2022

javanna added the Team:Search Relevance label and removed the Team:Search label Jul 12, 2024
