Add conditional token filter to elasticsearch #31958
Changes from 110 commits
@@ -0,0 +1,43 @@
[[painless-analysis-predicate-context]]
=== Analysis Predicate Context

Use a Painless script to determine whether or not the current token in an
analysis chain matches a predicate.

*Variables*

`params` (`Map`, read-only)::
User-defined parameters passed in as part of the query.

`token.term` (`CharSequence`, read-only)::
The characters of the current token.

`token.position` (`int`, read-only)::
The position of the current token.

`token.positionIncrement` (`int`, read-only)::
The position increment of the current token.

`token.positionLength` (`int`, read-only)::
The position length of the current token.

`token.startOffset` (`int`, read-only)::
The start offset of the current token.

`token.endOffset` (`int`, read-only)::
The end offset of the current token.

`token.type` (`String`, read-only)::
The type of the current token.

`token.keyword` (`boolean`, read-only)::
Whether or not the current token is marked as a keyword.

*Return*

`boolean`::
Whether or not the current token matches the predicate.

*API*

The standard <<painless-api-reference, Painless API>> is available.
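The shape of a predicate over these variables can be sketched in plain Java. The `Token` holder below is a hypothetical stand-in for the read-only `token.*` variables the script context exposes, and `matches` mirrors a predicate such as `token.getTerm().length() < 5`; this is an illustrative sketch, not the Painless runtime itself.

```java
// Hypothetical stand-in for the documented read-only token variables.
final class Token {
    final CharSequence term;
    final int position;
    final String type;

    Token(CharSequence term, int position, String type) {
        this.term = term;
        this.position = position;
        this.type = type;
    }
}

public class PredicateSketch {
    // Example predicate: match tokens shorter than 5 characters,
    // analogous to the Painless source "token.getTerm().length() < 5".
    static boolean matches(Token token) {
        return token.term.length() < 5;
    }

    public static void main(String[] args) {
        System.out.println(matches(new Token("what", 0, "<ALPHANUM>")));       // true
        System.out.println(matches(new Token("flapdoodle", 1, "<ALPHANUM>"))); // false
    }
}
```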
@@ -0,0 +1,90 @@
[[analysis-condition-tokenfilter]]
=== Conditional Token Filter

The conditional token filter takes a predicate script and a list of subfilters, and
only applies the subfilters to the current token if it matches the predicate.

[float]
=== Options
[horizontal]
filter:: a chain of token filters to apply to the current token if the predicate
matches. These can be any token filters defined elsewhere in the index mappings.

script:: a predicate script that determines whether or not the filters will be applied
to the current token. Note that only inline scripts are supported.

[float]
=== Settings example

You can set it up like this:

[source,js]
--------------------------------------------------
PUT /condition_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : [ "my_condition" ]
                }
            },
            "filter" : {
                "my_condition" : {
                    "type" : "condition",
                    "filter" : [ "lowercase" ],
                    "script" : {
                        "source" : "token.getTerm().length() < 5" <1>
                    }
                }
            }
        }
    }
}
--------------------------------------------------
// CONSOLE

<1> This will only apply the lowercase filter to terms that are less than 5
characters in length.

And test it like this:

[source,js]
--------------------------------------------------
POST /condition_example/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "What Flapdoodle"
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

It responds with:

[source,js]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "what", <1>
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "Flapdoodle", <2>
      "start_offset": 5,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
--------------------------------------------------
// TESTRESPONSE
<1> The term `What` has been lowercased, because it is only 4 characters long.
<2> The term `Flapdoodle` has been left in its original case, because it doesn't pass
the predicate.
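The conditional behaviour shown above can be mimicked in a few lines of plain Java: apply the "subfilter" (here, lowercasing) only to tokens the predicate accepts, and pass the rest through unchanged. `ConditionSketch` and `conditionalLowercase` are hypothetical names for this sketch; the real filter operates on a Lucene `TokenStream`, not a `List<String>`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class ConditionSketch {
    // Apply the lowercasing "subfilter" only where the predicate matches,
    // leaving non-matching tokens in their original case.
    static List<String> conditionalLowercase(List<String> tokens, Predicate<String> predicate) {
        return tokens.stream()
            .map(t -> predicate.test(t) ? t.toLowerCase() : t)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Same predicate and input as the docs example above.
        List<String> out = conditionalLowercase(
            Arrays.asList("What", "Flapdoodle"),
            t -> t.length() < 5);
        System.out.println(out); // [what, Flapdoodle]
    }
}
```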
@@ -0,0 +1,40 @@
/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.analysis.common;

import org.elasticsearch.painless.spi.PainlessExtension;
import org.elasticsearch.painless.spi.Whitelist;
import org.elasticsearch.painless.spi.WhitelistLoader;
import org.elasticsearch.script.ScriptContext;

import java.util.Collections;
import java.util.List;
import java.util.Map;

public class AnalysisPainlessExtension implements PainlessExtension {

    private static final Whitelist WHITELIST =
        WhitelistLoader.loadFromResourceFiles(AnalysisPainlessExtension.class, "painless_whitelist.txt");

    @Override
    public Map<ScriptContext<?>, List<Whitelist>> getContextWhitelists() {
        return Collections.singletonMap(AnalysisPredicateScript.CONTEXT, Collections.singletonList(WHITELIST));
    }
}
@@ -0,0 +1,87 @@
/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.analysis.common;

import org.elasticsearch.script.ScriptContext;

/**
 * A predicate based on the current token in a TokenStream
 */
public abstract class AnalysisPredicateScript {

    /**
     * Encapsulation of the state of the current token
     */
    public static class Token {
        public CharSequence term;
        public int pos;
        public int posInc;
        public int posLen;
        public int startOffset;
        public int endOffset;
        public String type;
        public boolean isKeyword;

        public CharSequence getTerm() {
            return term;
        }

        public int getPositionIncrement() {
            return posInc;
        }

        public int getPosition() {
            return pos;
        }

        public int getPositionLength() {
            return posLen;
        }

        public int getStartOffset() {
            return startOffset;
        }

        public int getEndOffset() {
            return endOffset;
        }

        public String getType() {
            return type;
        }

        public boolean isKeyword() {
            return isKeyword;
        }
    }

    /**
     * Returns {@code true} if the current term matches the predicate
     */
    public abstract boolean execute(Token token);

    public interface Factory {
        AnalysisPredicateScript newInstance();
    }

    public static final String[] PARAMETERS = new String[]{ "token" };
    public static final ScriptContext<Factory> CONTEXT = new ScriptContext<>("analysis", Factory.class);

}
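A compiled script is just a concrete subclass of this abstract class, so its use can be sketched with a hand-written anonymous subclass. The trimmed re-declaration below is for a standalone illustration only; in the plugin, Painless generates the subclass from the script source and hands it out through the `Factory`.

```java
// Trimmed, hypothetical re-declaration for a self-contained sketch; the real
// class is org.elasticsearch.analysis.common.AnalysisPredicateScript.
abstract class AnalysisPredicateScript {
    public static class Token {
        public CharSequence term;
        public boolean isKeyword;
    }

    public abstract boolean execute(Token token);
}

public class ScriptSketch {
    public static void main(String[] args) {
        // A hand-written stand-in for a compiled script: match short,
        // non-keyword tokens.
        AnalysisPredicateScript script = new AnalysisPredicateScript() {
            @Override
            public boolean execute(Token token) {
                return !token.isKeyword && token.term.length() < 5;
            }
        };

        AnalysisPredicateScript.Token token = new AnalysisPredicateScript.Token();
        token.term = "what";
        token.isKeyword = false;
        System.out.println(script.execute(token)); // true
    }
}
```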
@@ -111,9 +111,16 @@
import org.apache.lucene.analysis.tr.ApostropheFilter;
import org.apache.lucene.analysis.tr.TurkishAnalyzer;
import org.apache.lucene.analysis.util.ElisionFilter;
import org.apache.lucene.util.SetOnce;
import org.elasticsearch.client.Client;
import org.elasticsearch.cluster.service.ClusterService;
import org.elasticsearch.common.io.stream.NamedWriteableRegistry;
import org.elasticsearch.common.logging.DeprecationLogger;
import org.elasticsearch.common.logging.Loggers;
import org.elasticsearch.common.regex.Regex;
import org.elasticsearch.common.xcontent.NamedXContentRegistry;
import org.elasticsearch.env.Environment;
import org.elasticsearch.env.NodeEnvironment;
import org.elasticsearch.index.analysis.AnalyzerProvider;
import org.elasticsearch.index.analysis.CharFilterFactory;
import org.elasticsearch.index.analysis.PreBuiltAnalyzerProviderFactory;

@@ -127,20 +134,44 @@
import org.elasticsearch.indices.analysis.PreBuiltCacheFactory.CachingStrategy;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.plugins.ScriptPlugin;
import org.elasticsearch.script.ScriptContext;
import org.elasticsearch.script.ScriptService;
import org.elasticsearch.threadpool.ThreadPool;
import org.elasticsearch.watcher.ResourceWatcherService;
import org.tartarus.snowball.ext.DutchStemmer;
import org.tartarus.snowball.ext.FrenchStemmer;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import static org.elasticsearch.plugins.AnalysisPlugin.requiresAnalysisSettings;

public class CommonAnalysisPlugin extends Plugin implements AnalysisPlugin, ScriptPlugin {

    private static final DeprecationLogger DEPRECATION_LOGGER = new DeprecationLogger(Loggers.getLogger(CommonAnalysisPlugin.class));

    private final SetOnce<ScriptService> scriptService = new SetOnce<>();

    @Override
    public Collection<Object> createComponents(Client client, ClusterService clusterService, ThreadPool threadPool,
                                               ResourceWatcherService resourceWatcherService, ScriptService scriptService,
                                               NamedXContentRegistry xContentRegistry, Environment environment,
                                               NodeEnvironment nodeEnvironment, NamedWriteableRegistry namedWriteableRegistry) {
        this.scriptService.set(scriptService);
        return Collections.emptyList();
    }

    @Override
    @SuppressWarnings("rawtypes") // TODO ScriptPlugin needs to change this to pass precommit?
    public List<ScriptContext> getContexts() {
        return Collections.singletonList(AnalysisPredicateScript.CONTEXT);
    }

    @Override
    public Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
        Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> analyzers = new TreeMap<>();

@@ -202,6 +233,8 @@ public Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
        filters.put("classic", ClassicFilterFactory::new);
        filters.put("czech_stem", CzechStemTokenFilterFactory::new);
        filters.put("common_grams", requiresAnalysisSettings(CommonGramsTokenFilterFactory::new));
        filters.put("condition",
            requiresAnalysisSettings((i, e, n, s) -> new ScriptedConditionTokenFilterFactory(i, n, s, scriptService.get())));
        filters.put("decimal_digit", DecimalDigitFilterFactory::new);
        filters.put("delimited_payload_filter", LegacyDelimitedPayloadTokenFilterFactory::new);
        filters.put("delimited_payload", DelimitedPayloadTokenFilterFactory::new);

Review comments on the `@SuppressWarnings("rawtypes")` TODO:

Did you mean to leave this TODO?

I think it's a backwards breaking change?

Opening an issue makes sense. Yeah, it is a separate thing.

Review comment on the `SetOnce` field:

We're never very happy with using `SetOnce` like this. It gets the job done, but it reeks of guice's "everything depends on everything"-ness that we've worked so hard to remove over the years. Not that I have anything better though.