Scripted analysis components #26100

Closed
jpountz opened this issue Aug 8, 2017 · 9 comments
Labels: >feature, :Search Relevance/Analysis (How text is split into tokens), Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

Comments

@jpountz
Contributor

jpountz commented Aug 8, 2017

If you have specific analysis needs, the only way to address them today is to write a plugin. This is quite a bit of work, since the plugin needs to be rebuilt for every release, which can be frustrating if your needs are simple.

We could give users the ability to write custom analysis components using scripts so that such simple needs could be addressed with a vanilla installation of Elasticsearch.

@jpountz jpountz added :Search Relevance/Analysis How text is split into tokens discuss labels Aug 8, 2017
@jpountz
Contributor Author

jpountz commented Aug 11, 2017

We agreed to do it but there are some challenges on the way, in particular:

  • What API do we expose? There is a simplicity / flexibility trade-off.
  • How do we deal with the fact that stored scripts may be updated while analysis components used at index-time should remain the same for the entire lifetime of the index?

@jpountz jpountz added >feature and removed discuss labels Aug 11, 2017
@rjernst
Member

rjernst commented Aug 14, 2017

> How do we deal with the fact that stored scripts may be updated while analysis components used at index-time should remain the same for the entire lifetime of the index?

This is what concerns me the most. Perhaps this is a case where the scripting capability needs an API shim over it, so that these scripts cannot be modified once they are in use?

@nik9000
Member

nik9000 commented Aug 15, 2017

> This is what concerns me the most.

Yeah. We talked about, for example, forcing you to use inline scripts in the index settings. Then they can't change but you are stuck using inline scripts.
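To make the trade-off concrete, here is a minimal plain-Java sketch of the difference between a script referenced by id and a script whose source is pinned inline in the index settings. Everything in it (`ScriptPinningSketch`, `storedScripts`, `Index`, the placeholder source strings) is hypothetical and only models the idea; none of it is Elasticsearch API.

```java
import java.util.HashMap;
import java.util.Map;

final class ScriptPinningSketch {
    /** Hypothetical stand-in for stored scripts, which can be updated at any time. */
    static final Map<String, String> storedScripts = new HashMap<>();

    /** Hypothetical index that keeps the inline script source it was created with. */
    static final class Index {
        private final String pinnedSource;

        Index(String inlineSource) {
            this.pinnedSource = inlineSource;
        }

        /** Always returns the source captured at index creation. */
        String analysisScript() {
            return pinnedSource;
        }
    }

    /** Resolving by id sees whatever the stored script currently contains. */
    static String resolveStored(String scriptId) {
        String source = storedScripts.get(scriptId);
        if (source == null) {
            throw new IllegalArgumentException("unknown script: " + scriptId);
        }
        return source;
    }

    public static void main(String[] args) {
        storedScripts.put("my_filter", "source v1");
        Index index = new Index("source v1");
        storedScripts.put("my_filter", "source v2"); // later update of the stored script
        System.out.println("inline: " + index.analysisScript());
        System.out.println("stored: " + resolveStored("my_filter"));
    }
}
```

The inline copy never drifts from what was used at index time, at the cost of repeating the source in every index that needs it.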

@romseygeek
Contributor

cc @elastic/es-search-aggs

@romseygeek
Contributor

We have condition filters now (#31958) and we should have scriptable stop filters soon (#33431). We also have the multiplexing filter which allows you to emit different variations on a token at the same position. I think the only thing remaining is to have a token mutating filter that allows you to change the bytes of a token using a script.
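For readers unfamiliar with these three building blocks, the following plain-Java analogy sketches what each one does to a token stream. It is only an illustration under simplifying assumptions: tokens are plain strings, "scripts" are Java lambdas, and the class and method names (`ScriptedFilterAnalogy`, `conditional`, `predicate`, `multiplex`) are made up for this example rather than taken from Elasticsearch or Lucene.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

final class ScriptedFilterAnalogy {
    /** Condition filter: apply `filter` only to tokens for which `condition` holds. */
    static List<String> conditional(List<String> tokens, Predicate<String> condition,
                                    UnaryOperator<String> filter) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(condition.test(token) ? filter.apply(token) : token);
        }
        return out;
    }

    /** Scriptable stop filter: keep only tokens for which `keep` holds. */
    static List<String> predicate(List<String> tokens, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (keep.test(token)) {
                out.add(token);
            }
        }
        return out;
    }

    /**
     * Multiplexing: each inner list holds the variants emitted at one position,
     * starting with the original token; identical variants are collapsed.
     */
    static List<List<String>> multiplex(List<String> tokens, List<UnaryOperator<String>> variants) {
        List<List<String>> positions = new ArrayList<>();
        for (String token : tokens) {
            Set<String> atPosition = new LinkedHashSet<>();
            atPosition.add(token);
            for (UnaryOperator<String> variant : variants) {
                atPosition.add(variant.apply(token));
            }
            positions.add(new ArrayList<>(atPosition));
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        tokens.add("Quick");
        tokens.add("a");
        tokens.add("Fox");
        System.out.println(conditional(tokens, t -> t.length() > 2, t -> t.toLowerCase(Locale.ROOT)));
        System.out.println(predicate(tokens, t -> t.length() > 1));
    }
}
```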

The tricky part here will be the API. We could just expose CharTermAttribute but that's not a trivial class to use, particularly if you want to prepend characters to a token. I'm also slightly worried that it would be easy to write a script that ends up inadvertently creating a bunch of objects, which you really don't want to do in the fast loop of an analysis chain.

One option could be to create a new type that wraps CharTermAttribute, with append(), prepend() and substring() methods, which would cover most use cases. More complicated substitutions can already be done using PatternReplaceFilter.
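A rough sketch of what such a wrapper might look like, written in plain Java over a reusable `char[]` rather than against the real `CharTermAttribute`. The class name `MutableTerm` and its `set()` helper are hypothetical; only `append()`, `prepend()` and `substring()` come from the suggestion above. The point is that all three operations mutate one buffer in place, so a script using them does not allocate per token unless the buffer has to grow.

```java
import java.util.Arrays;

/** Hypothetical in-place term wrapper; not a Lucene or Elasticsearch class. */
final class MutableTerm {
    private char[] buffer;
    private int length;

    MutableTerm(int initialCapacity) {
        buffer = new char[Math.max(initialCapacity, 16)];
    }

    /** Replaces the current term, reusing the buffer when it is large enough. */
    MutableTerm set(CharSequence text) {
        length = 0;
        return append(text);
    }

    MutableTerm append(CharSequence suffix) {
        ensureCapacity(length + suffix.length());
        for (int i = 0; i < suffix.length(); i++) {
            buffer[length + i] = suffix.charAt(i);
        }
        length += suffix.length();
        return this;
    }

    MutableTerm prepend(CharSequence prefix) {
        int shift = prefix.length();
        ensureCapacity(length + shift);
        // Shift the existing chars right; arraycopy handles overlapping ranges.
        System.arraycopy(buffer, 0, buffer, shift, length);
        for (int i = 0; i < shift; i++) {
            buffer[i] = prefix.charAt(i);
        }
        length += shift;
        return this;
    }

    /** Keeps only the chars in [start, end) of the current term, in place. */
    MutableTerm substring(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException("start=" + start + ", end=" + end + ", length=" + length);
        }
        System.arraycopy(buffer, start, buffer, 0, end - start);
        length = end - start;
        return this;
    }

    int length() {
        return length;
    }

    @Override
    public String toString() {
        return new String(buffer, 0, length);
    }

    private void ensureCapacity(int needed) {
        if (needed > buffer.length) {
            buffer = Arrays.copyOf(buffer, Math.max(needed, buffer.length * 2));
        }
    }

    public static void main(String[] args) {
        MutableTerm term = new MutableTerm(16).set("token");
        System.out.println(term.prepend("pre_").append("_post"));
    }
}
```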

@jpountz
Contributor Author

jpountz commented Sep 7, 2018

> have a token mutating filter that allows you to change the bytes of a token using a script

+1 to only allow mutating bytes

> One option could be to create a new type that wraps CharTermAttribute, with append(), prepend() and substring() methods, which would cover most use cases.

Maybe we can look at existing stemmers and see what building blocks would be useful to implement them. For instance I think reimplementing EnglishMinimalStemFilter would make a great example in our docs.
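As a starting point for that exercise, here is a small plain-Java sketch of a minimal English plural stemmer in the style of the classic "S-stemmer" rules. It is written from memory as an illustration and may not match `EnglishMinimalStemFilter` in every edge case; the class name `MinimalPluralStemmer` is made up. What it suggests is that the building blocks needed are modest: read a char at an offset from the end, overwrite a char, and shorten the term.

```java
/** Illustrative plural stemmer; an approximation, not Lucene's implementation. */
final class MinimalPluralStemmer {
    /** Stems the first `len` chars of `s` in place and returns the new length. */
    static int stem(char[] s, int len) {
        if (len < 3 || s[len - 1] != 's') {
            return len; // too short, or not ending in "s"
        }
        switch (s[len - 2]) {
            case 'u':
            case 's':
                return len; // endings "-us" and "-ss" are left alone
            case 'e':
                if (len > 3 && s[len - 3] == 'i' && s[len - 4] != 'a' && s[len - 4] != 'e') {
                    s[len - 3] = 'y'; // "-ies" becomes "-y"
                    return len - 2;
                }
                if (s[len - 3] == 'i' || s[len - 3] == 'a' || s[len - 3] == 'o' || s[len - 3] == 'e') {
                    return len; // "-ies", "-aes", "-oes", "-ees" are left alone
                }
                return len - 1; // other "-es" endings lose only the "s"
            default:
                return len - 1; // plain trailing "s" is dropped
        }
    }

    /** Convenience wrapper for strings; allocates, so meant for demos only. */
    static String stem(String token) {
        char[] buffer = token.toCharArray();
        return new String(buffer, 0, stem(buffer, buffer.length));
    }

    public static void main(String[] args) {
        for (String token : new String[] {"ponies", "dogs", "class", "horses"}) {
            System.out.println(token + " -> " + stem(token));
        }
    }
}
```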

@nik9000
Member

nik9000 commented Sep 7, 2018

> For instance I think reimplementing EnglishMinimalStemFilter would make a great example in our docs.

❤️

We have a tool in the docs that calls the analyze API on a bunch of strings with two different analyzers and compares the results. It might help here.

@cbuescher
Member

As an additional data point where a scripted filter might be interesting: #34402.
To summarize the request there: support for a token filter or char filter that cleans up certain boundary characters without resorting to a regex in a pattern_replace filter, and that can be used in a normalizer for the keyword data type.
While this is probably a bit too specific to implement as a dedicated filter, it would fit in nicely with a more generalized "script" filter.
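For illustration, the kind of logic such a script would need is small. The plain-Java sketch below strips leading and trailing boundary characters from a term buffer in place, without any regex. The class name `BoundaryTrimmer` and the particular set of boundary characters are assumptions made for this example; #34402 may call for a different set.

```java
/** Illustrative boundary-character trimmer; not an Elasticsearch component. */
final class BoundaryTrimmer {
    /** Assumed boundary set for the example; a real filter would make this configurable. */
    private static boolean isBoundary(char c) {
        return c == '-' || c == '_' || c == '.' || c == '\'' || Character.isWhitespace(c);
    }

    /** Trims the first `len` chars of `buffer` in place and returns the new length. */
    static int trim(char[] buffer, int len) {
        int start = 0;
        while (start < len && isBoundary(buffer[start])) {
            start++;
        }
        int end = len;
        while (end > start && isBoundary(buffer[end - 1])) {
            end--;
        }
        // Move the kept range to the front of the buffer; interior boundary chars are untouched.
        System.arraycopy(buffer, start, buffer, 0, end - start);
        return end - start;
    }

    /** Convenience wrapper for strings; allocates, so meant for demos only. */
    static String trim(String token) {
        char[] buffer = token.toCharArray();
        return new String(buffer, 0, trim(buffer, buffer.length));
    }

    public static void main(String[] args) {
        System.out.println(trim("--foo-bar__"));
    }
}
```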

@javanna
Member

javanna commented Nov 16, 2022

This issue has been open for 5 years and has had no activity in the last 2. A couple of scripted components have been added, and we are not currently planning to add more, hence I am closing it. Let's reopen it if the need comes up again in the future.

@javanna javanna closed this as completed Nov 16, 2022
@javanna javanna added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 12, 2024

6 participants