Add conditional token filter to elasticsearch #31958
Conversation
Pinging @elastic/es-search-aggs
This is a first step adding the analysis script context, and illustrating how to use it with the conditional token filter. I'd also like to look at adding scriptable filtering filters (ie, 'ignore this token if it matches a predicate'), and possibly the ability to make changes to the token itself.
int endOffset
String type
boolean isKeyword
}
I'm exposing this information directly at the moment, but it might be better to expose getters instead so that consumers don't try and do things like change offsets or position increments
+1 to exposing getters
also should we expose the absolute position rather than the position increment? This looks more useful to me for filtering. Or both?
Painless doesn't require the `get` part of the getter anyway, so I'd prefer getters.
public void testAnalysisScript() {
    AnalysisPredicateScript.Factory factory = scriptEngine.compile("test", "return \"one\".contentEquals(term.term)",
        AnalysisPredicateScript.CONTEXT, Collections.emptyMap());
Having to use `contentEquals` is a bit trappy here as well, but I want to avoid creating Strings on every call to `incrementToken()`.
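To illustrate the trade-off being discussed (this is not the PR's code, just a sketch): Lucene's `CharTermAttribute` is a reused `CharSequence` buffer, so calling `toString()` on it inside every `incrementToken()` would allocate a fresh `String` per token, whereas `String.contentEquals(CharSequence)` compares without that allocation. Here a `StringBuilder` stands in for the reused term buffer:

```java
public class ContentEqualsDemo {
    public static void main(String[] args) {
        // StringBuilder stands in for the reused CharTermAttribute buffer
        StringBuilder termBuffer = new StringBuilder("one");

        // String.contentEquals accepts any CharSequence, so no String
        // needs to be allocated on the token side for the comparison
        System.out.println("one".contentEquals(termBuffer)); // true

        // The same buffer is reused for the next token
        termBuffer.setLength(0);
        termBuffer.append("two");
        System.out.println("one".contentEquals(termBuffer)); // false
    }
}
```

The trappiness is that script authors must remember to use `contentEquals` rather than `equals`, since `"one".equals(termBuffer)` is always false for a non-String `CharSequence`.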
I left some comments but like it overall.
@@ -175,3 +175,13 @@ class org.elasticsearch.index.similarity.ScriptedSimilarity$Doc {
  int getLength()
  float getFreq()
}

class org.elasticsearch.script.AnalysisPredicateScript$Term {
should it be called Token rather than Term? It might be just me but I feel like it better carries the information that there are other attributes here like positions and offsets.
I think we should look at creating a new context and adding this to the whitelist of that context. See `plugins/example/painless-whitelist` for some of that.
@@ -202,6 +222,8 @@
filters.put("classic", ClassicFilterFactory::new);
filters.put("czech_stem", CzechStemTokenFilterFactory::new);
filters.put("common_grams", requriesAnalysisSettings(CommonGramsTokenFilterFactory::new));
filters.put("condition",
    requriesAnalysisSettings((i, e, n, s) -> new ScriptedConditionTokenFilterFactory(i, n, s, scriptService.get())));
`requries` has a typo?
this is a pre-existing typo; I'll open a separate PR to fix it
[float]
=== Options
[horizontal]
filters:: a list of token filters to apply to the current token if the predicate
We should be explicit that these filters are chained. Maybe call it `filter` (no plural) for consistency with analyzers?
}
this.factory = scriptService.compile(script, AnalysisPredicateScript.CONTEXT);
this.filterNames = settings.getAsList("filters");
should we fail on an empty list?
"type" : "condition",
"filters" : [ "lowercase" ],
"script" : {
    "source" : "return term.term.length() < 5" <1>
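For context, a sketch of how this filter definition might be wired into an index's analysis settings, using the option names shown in this PR's docs and diff (`filters`, `script`, the `term.term` script variable); the index, analyzer, and filter names are placeholders:

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [ "my_condition" ]
        }
      },
      "filter": {
        "my_condition": {
          "type": "condition",
          "filters": [ "lowercase" ],
          "script": {
            "source": "return term.term.length() < 5"
          }
        }
      }
    }
  }
}
```

With this configuration, `lowercase` is only applied to tokens shorter than five characters; all other tokens pass through unchanged.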
I don't believe you need the `return` here and I think it'd read a little better without it.
@Override
public Collection<Object> createComponents(Client client, ClusterService clusterService, ThreadPool threadPool, ResourceWatcherService resourceWatcherService, ScriptService scriptService, NamedXContentRegistry xContentRegistry, Environment environment, NodeEnvironment nodeEnvironment, NamedWriteableRegistry namedWriteableRegistry) {
    this.scriptService.set(scriptService);
We're never very happy with using `SetOnce` like this. It gets the job done but it reeks of guice's "everything depends on everything"-ness that we've worked so hard to remove over the years. Not that I have anything better though.
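For readers unfamiliar with the pattern being discussed: Lucene ships `org.apache.lucene.util.SetOnce`, a holder whose value can be written exactly once and read freely afterwards. The plugin uses it because the `ScriptService` only becomes available when `createComponents` runs, after the filter factories have been registered. A minimal stand-in (not Lucene's implementation) showing the semantics:

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of the SetOnce pattern: one write allowed, then read-only.
final class SetOnceSketch<T> {
    private final AtomicReference<T> ref = new AtomicReference<>();

    void set(T value) {
        // compareAndSet succeeds only if no value has been stored yet
        if (!ref.compareAndSet(null, value)) {
            throw new IllegalStateException("already set");
        }
    }

    T get() {
        return ref.get();
    }
}

public class SetOnceDemo {
    public static void main(String[] args) {
        SetOnceSketch<String> scriptService = new SetOnceSketch<>();
        scriptService.set("service");      // first write succeeds
        System.out.println(scriptService.get());
        try {
            scriptService.set("again");    // second write is rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The reviewer's complaint is that this reintroduces a hidden initialization-order dependency: the factory lambda captures the holder and only works if `createComponents` has filled it in first.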
import java.util.List;
import java.util.Map;

public class AnalysisScriptTests extends ScriptTestCase {
I think this should be in your plugin instead of painless if we can manage it.
I don't follow - do you mean moving the analysis script context out of core and into the common analysis plugin, or do you mean creating an entirely new plugin just for this token filter (and any other script-based filters we come up with)?
I mean I feel like it should live in the common analysis plugin and not in the server at all. Since it is a new thing I figure we should isolate it as much as we can so we know when stuff is using it. Not because it is dirty or anything, just because a smaller core is easier to reason about. And keeping this entirely within the analysis-common plugin helps to prove out the work that we've done for plugins extending painless.
The tricky part here will be unit testing, as ScriptTestCase is a painless-specific class. The painless extension module uses rest tests only, but I'd much rather have compiled unit tests here. Maybe we should build a separate painless-testing module, that can be used for tests elsewhere?
In general we try to unit test this sort of thing without painless and use a mock script engine instead that just calls code on the test class. It is totally reasonable to have a few integration tests that do use painless though.
The previous errors in compileJava were not caused by the brackets but by the content of the @link section. Corrected this so it's a working javadoc link again.
…2042) When a replica is fully recovered (i.e., in `POST_RECOVERY` state) we send a request to the master to start the shard. The master changes the state of the replica and publishes a cluster state to that effect. In certain cases, that cluster state can be processed on the node hosting the replica *together* with a cluster state that promotes that, now started, replica to a primary. This can happen due to cluster state batched processing or if the master died after having committed the cluster state that starts the shard but before publishing it to the node with the replica. If the master also held the primary shard, the new master node will remove the primary (as it failed) and will also immediately promote the replica (thinking it is started). Sadly our code in IndexShard didn't allow for this which caused [assertions](https://github.com/elastic/elasticsearch/blob/13917162ad5c59a96ccb4d6a81a5044546c45c22/server/src/main/java/org/elasticsearch/index/seqno/ReplicationTracker.java#L482) to be tripped in some of our tests runs.
Add EC2 credential test for repository-s3 Relates to #26913
This change adds two contexts to execute scripts against:

* SEARCH_SCRIPT: Allows to run scripts in a search script context. This context is used in `function_score` query's script function, script fields, script sorting and `terms_set` query.
* FILTER_SCRIPT: Allows to run scripts in a filter script context. This context is used in the `script` query.

In both contexts an index name needs to be specified and a sample document. The document is needed to create an in-memory index that the script can access via the `doc[...]` and other notations. The index name is needed because a mapping is needed to index the document.

Examples:

```
POST /_scripts/painless/_execute
{
  "script": {
    "source": "doc['field'].value.length()"
  },
  "context" : {
    "search_script": {
      "document": {
        "field": "four"
      },
      "index": "my-index"
    }
  }
}
```

Returns:

```
{
  "result": 4
}
```

```
POST /_scripts/painless/_execute
{
  "script": {
    "source": "doc['field'].value.length() <= params.max_length",
    "params": {
      "max_length": 4
    }
  },
  "context" : {
    "filter_script": {
      "document": {
        "field": "four"
      },
      "index": "my-index"
    }
  }
}
```

Returns:

```
{
  "result": true
}
```

Also changed PainlessExecuteAction.TransportAction to use TransportSingleShardAction instead of HandledAction, because now in case score or filter contexts are used the request needs to be redirected to a node that has an active IndexService for the index being referenced (a node with a shard copy for that index).
Today it is unclear what guarantees are offered by the search preference feature, and we claim a guarantee that is stronger than what we really offer:

> A custom value will be used to guarantee that the same shards will be used for the same custom value.

This commit clarifies this documentation. Forward-port of #32098 to `master`.
@nik9000 the fix for ScriptContexts in modules is now in, so I think this is ready for re-review.
LGTM
}

@Override
@SuppressWarnings("rawtypes") // TODO ScriptPlugin needs to change this to pass precommit?
Did you mean to leave this TODO?
I think it's a backwards breaking change? `ScriptPlugin#getContexts()` needs to change its return value from `List<ScriptContext>` to `List<ScriptContext<?>>`. Which ought to be done, but it's a separate change. I'll open an issue.
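To sketch why this is source-incompatible for plugin authors (hypothetical stubs, not the real `ScriptPlugin` interface): a `List<ScriptContext>` (raw type) override does not satisfy a `List<ScriptContext<?>>` (wildcard) declaration, so every existing implementation would have to change its signature at the same time:

```java
import java.util.Collections;
import java.util.List;

// Stand-in for org.elasticsearch.script.ScriptContext
class ScriptContext<T> { }

// Current shape: the raw return type is what trips the precommit check
interface OldScriptPlugin {
    @SuppressWarnings("rawtypes")
    List<ScriptContext> getContexts();
}

// Proposed shape: wildcard return type, no raw-type warning
interface NewScriptPlugin {
    List<ScriptContext<?>> getContexts();
}

public class ContextsDemo {
    public static void main(String[] args) {
        // A plugin compiled against the raw signature cannot simply be
        // recompiled against the wildcard one: List<ScriptContext> is not
        // a subtype of List<ScriptContext<?>>, so each override must be
        // updated too, which is why this is a breaking (separate) change.
        NewScriptPlugin p = Collections::emptyList;
        System.out.println(p.getContexts().size());
    }
}
```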
Opening an issue makes sense. Yeah, it is a separate thing.
This allows token filters to be applied selectively, depending on the status of the current token in the token stream. The filter takes a scripted predicate, and only applies its subfilter when the predicate returns true.
* master:
  * Fix deprecated setting specializations (elastic#33412)
  * HLRC: split cluster request converters (elastic#33400)
  * HLRC: Add ML get influencers API (elastic#33389)
  * Add conditional token filter to elasticsearch (elastic#31958)
  * Build: Merge xpack checkstyle config into core (elastic#33399)
  * Disable IndexRecoveryIT.testRerouteRecovery.
  * INGEST: Implement Drop Processor (elastic#32278)
  * [ML] Add field stats to log structure finder (elastic#33351)
  * Add interval response parameter to AutoDateInterval histogram (elastic#33254)
  * MINOR+CORE: Remove Dead Methods ClusterService (elastic#33346)
Lucene has a `ConditionalTokenFilter` which allows you to selectively apply token filters, depending on the state of the current token in the token stream. This commit exposes this functionality in elasticsearch, adding a new `AnalysisPredicateScript` context.