Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add conditional token filter to elasticsearch #31958

Merged
merged 111 commits into from
Sep 5, 2018
Merged
Show file tree
Hide file tree
Changes from 110 commits
Commits
Show all changes
111 commits
Select commit Hold shift + click to select a range
e5b20de
WIP
romseygeek Jun 29, 2018
da0fd1e
WIP
romseygeek Jul 2, 2018
df9bffc
WIP
romseygeek Jul 4, 2018
402ed36
WIP
romseygeek Jul 11, 2018
fb7c21d
WIP
romseygeek Jul 11, 2018
d8f0170
docs
romseygeek Jul 11, 2018
bcee3f0
tests
romseygeek Jul 11, 2018
bba5939
d'oh
romseygeek Jul 11, 2018
21cd02f
class name change in SPI
romseygeek Jul 11, 2018
4315682
docs
romseygeek Jul 11, 2018
52955df
Broekn
romseygeek Jul 13, 2018
dd139c7
nuke unit test
romseygeek Jul 17, 2018
1d5deff
Merge branch 'master' into scripted-analysis
romseygeek Jul 17, 2018
57a73f2
feedback
romseygeek Jul 17, 2018
609951b
Term -> Token; move ScriptContext into module
romseygeek Jul 17, 2018
801a704
Re-instate link in StringFunctionUtils javadocs
Jul 13, 2018
f923d9c
Docs: Change formatting of Cloud options
clintongormley Jul 13, 2018
a21fb82
Docs: Restyled cloud link in getting started
clintongormley Jul 13, 2018
62fea58
[Rollup] Use composite's missing_bucket (#31402)
polyfractal Jul 13, 2018
35a6774
Test: Fix a second case of bad watch creation
hub-cap Jul 13, 2018
9203e50
Remove deprecated AnalysisPlugin#requriesAnalysisSettings method (#32…
romseygeek Jul 13, 2018
849e690
Add second level of field collapsing (#31808)
mayya-sharipova Jul 13, 2018
de213a8
Mute ML AutodetectMemoryLimitIT#testTooManyPartitions on Windows (#32…
Jul 13, 2018
3bbc8c6
Watcher: cleanup ensureWatchExists use (#31926)
hub-cap Jul 13, 2018
4440df5
Add secure setting for watcher email password (#31620)
hub-cap Jul 13, 2018
2ab7db3
HLRC: Add xpack usage api (#31975)
rjernst Jul 13, 2018
2183fff
Adds a new auto-interval date histogram (#28993)
colings86 Jul 13, 2018
0e7a6b4
lazy snapshot repository initialization (#31606)
vladimirdolzhenko Jul 13, 2018
315999d
Watcher: Make settings reloadable (#31746)
hub-cap Jul 13, 2018
b959534
fix typo
Jul 13, 2018
94d3311
Clean Up Snapshot Create Rest API (#31779)
jdconrad Jul 13, 2018
2945c3a
[Rollup] Histo group config should support scaled_floats (#32048)
polyfractal Jul 13, 2018
e48de6a
Mute failing tests
Jul 13, 2018
ec470f2
Replace Ingest ScriptContext with Custom Interface (#32003)
original-brownbear Jul 13, 2018
f446f91
Add nio http transport to security plugin (#32018)
Tim-Brooks Jul 13, 2018
bd20e99
Fix compile issues introduced by merge (#32058)
Tim-Brooks Jul 14, 2018
71cd43b
SCRIPTING: Remove unused MultiSearchTemplateRequestBuilder (#32049)
original-brownbear Jul 14, 2018
e045ad6
Cleanup Duplication in `PainlessScriptEngine` (#31991)
original-brownbear Jul 14, 2018
dabbba1
Fix broken OpenLDAP Vagrant QA test
tvernum Jul 16, 2018
82e8fce
Turn off real-mem breaker in single node tests
danielmitterdorfer Jul 16, 2018
9d48815
Turn off real-mem breaker in REST tests
danielmitterdorfer Jul 16, 2018
040bc9d
[Test] Mute MlJobIT#testDeleteJobAfterMissingAliases
jimczi Jul 16, 2018
142d24a
Remove unused params from SSource and Walker (#31935)
Jul 16, 2018
ce8b3e3
[Tests] Fix failure due to changes exception message (#32036)
Jul 16, 2018
a1ad7a1
Fix BWC check after backport
jimczi Jul 16, 2018
bbe1b7c
Unmute field collapsing rest tests
jimczi Jul 16, 2018
391641c
Ensure only parent breaker trips in unit test
danielmitterdorfer Jul 16, 2018
6ec52fe
[Rollup] Fix duplicate field names in test (#32075)
jimczi Jul 16, 2018
ced669b
[TEST] Consistent algorithm usage (#32077)
jkakavas Jul 16, 2018
e38e69c
[Rollup] Replace RollupIT with a ESRestTestCase version (#31977)
polyfractal Jul 16, 2018
4a9fbe7
Scripting: Remove dead code from painless module (#32064)
original-brownbear Jul 16, 2018
5f130a2
Painless: Separate PainlessLookup into PainlessLookup and PainlessLoo…
jdconrad Jul 16, 2018
1dd0279
Use correct formatting for links (#29460)
dedemorton Jul 16, 2018
53f029b
Watcher: Store username on watch execution (#31873)
hub-cap Jul 16, 2018
e325526
Tweaked Elasticsearch Service links for SEO
debadair Jul 16, 2018
e514ad0
Tweaked Elasticsearch Service links for SEO
debadair Jul 16, 2018
50d8fa8
[test] turn on host io cache for opensuse (#32053)
andyb-elastic Jul 16, 2018
c0ffec7
DOCS: put LIMIT 10 to the SQL query (#32065)
ahmedakef Jul 16, 2018
ee4ef86
SQL: allow LEFT and RIGHT as function names (#32066)
costin Jul 16, 2018
9371e77
Revert "[test] disable packaging tests for suse boxes"
andyb-elastic Jul 16, 2018
780697f
[Rollup] Add new capabilities endpoint for concrete rollup indices (#…
polyfractal Jul 16, 2018
1106355
Switch non-x-pack to new style requests (#32106)
nik9000 Jul 16, 2018
7ce9926
Bypass highlight query terms extraction on empty fields (#32090)
jimczi Jul 16, 2018
8ff5735
Painless: Move and Rename Several Methods in the lookup package (#32105)
jdconrad Jul 16, 2018
97fbe49
Add Index UUID to `/_stats` Response (#31871)
original-brownbear Jul 17, 2018
ca2844f
[Test] Modify assert statement for ssl handshake (#32072)
bizybot Jul 17, 2018
8bad2c6
Add exclusion option to `keep_types` token filter (#32012)
Jul 17, 2018
89bce93
Fix put mappings java API documentation (#31955)
Jul 17, 2018
c2ee07b
Enable testing in FIPS140 JVM (#31666)
jkakavas Jul 17, 2018
24547e8
Check that client methods match API defined in the REST spec (#31825)
javanna Jul 17, 2018
5a383c2
Mute :qa:mixed-cluster indices.stats/10_index/Index - all’
davidkyle Jul 17, 2018
5bad3a8
Updates the build to gradle 4.9 (#32087)
alpar-t Jul 17, 2018
94330d8
Relax TermVectors API to work with textual fields other than TextFiel…
markharwood Jul 17, 2018
3d0854d
Handle TokenizerFactory TODOs (#32063)
original-brownbear Jul 17, 2018
70d2db3
Ensure to release translog snapshot in primary-replica resync (#32045)
dnhatn Jul 17, 2018
5afea06
[ML] Move analyzer dependencies out of categorization config (#32123)
droberts195 Jul 17, 2018
bb9fae0
[ML] Wait for aliases in multi-node tests (#32086)
davidkyle Jul 17, 2018
b31dc36
Docs: Fix missing example script quote (#32010)
aptxx Jul 17, 2018
a7c8e07
Re-disable packaging tests on suse boxes
andyb-elastic Jul 17, 2018
7490ec6
Remove empty @param from Javadoc
jkakavas Jul 17, 2018
a481ef6
Painless: Fix Bug with Duplicate PainlessClasses (#32110)
jdconrad Jul 17, 2018
346edfa
Build: Move shadow customizations into common code (#32014)
nik9000 Jul 17, 2018
cdf5c8a
Disable C2 from using AVX-512 on JDK 10 (#32138)
jasontedor Jul 17, 2018
a835503
Build: Make additional test deps of check (#32015)
rjernst Jul 17, 2018
ff7ff36
Painless: Add PainlessClassBuilder (#32141)
jdconrad Jul 17, 2018
101458b
Build: Skip jar tests if jar disabled
nik9000 Jul 17, 2018
88c4f6c
Switch distribution to new style Requests (#30595)
nik9000 Jul 18, 2018
8486f24
Remove versionType from translog (#31945)
dnhatn Jul 18, 2018
413d211
ESIndexLevelReplicationTestCase doesn't support replicated failures b…
bleskes Jul 17, 2018
fa16875
[DOCS] Update TLS on Docker for 6.3 (#32114)
ninaspitfire Jul 18, 2018
688deeb
Fix `range` queries on `_type` field for singe type indices (#31756)
Jul 18, 2018
ace3771
Term -> Token; deps
romseygeek Jul 18, 2018
84ee20e
Fix CP for namingConventions when gradle home has spaces (#31914)
alpar-t Jul 18, 2018
947fa2e
Fix Java 11 javadoc compile problem
Jul 18, 2018
e67cede
A replica can be promoted and started in one cluster state update (#3…
bleskes Jul 18, 2018
8924ac3
Add EC2 credential test for repository-s3 (#31918)
vladimirdolzhenko Jul 18, 2018
b79dc6a
Add more contexts to painless execute api (#30511)
martijnvg Jul 18, 2018
d225048
use before instead of onOrBefore
martijnvg Jul 18, 2018
997ebe8
Improve docs for search preferences (#32159)
DaveCTurner Jul 18, 2018
f591095
Fix BwC Tests looking for UUID Pre 6.4 (#32158)
original-brownbear Jul 18, 2018
67a4dcb
Call setReferences() on custom referring tokenfilters in _analyze (#3…
romseygeek Jul 18, 2018
93ecf1d
Merge conflicts
romseygeek Jul 18, 2018
945fadf
more docs
romseygeek Jul 18, 2018
303de4f
tests for all script variables
romseygeek Jul 18, 2018
33b9da4
Merge branch 'master' into scripted-analysis
romseygeek Jul 18, 2018
b775f71
Merge branch 'master' into scripted-analysis
romseygeek Aug 21, 2018
1dff0f6
merge error
romseygeek Aug 21, 2018
546aa11
checkstyle
romseygeek Aug 21, 2018
701fbf2
headers
romseygeek Aug 21, 2018
396843f
Use actual painless syntax, not my own made-up syntax
romseygeek Aug 21, 2018
8bc3d65
Merge branch 'master' into scripted-analysis
romseygeek Sep 5, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions docs/painless/painless-contexts/index.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,8 @@ include::painless-metric-agg-reduce-context.asciidoc[]

include::painless-bucket-agg-context.asciidoc[]

include::painless-analysis-predicate-context.asciidoc[]

include::painless-watcher-condition-context.asciidoc[]

include::painless-watcher-transform-context.asciidoc[]
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
[[painless-analysis-predicate-context]]
=== Analysis Predicate Context

Use a painless script to determine whether or not the current token in an
analysis chain matches a predicate.

*Variables*

`params` (`Map`, read-only)::
User-defined parameters passed in as part of the query.

`token.term` (`CharSequence`, read-only)::
The characters of the current token

`token.position` (`int`, read-only)::
The position of the current token

`token.positionIncrement` (`int`, read-only)::
The position increment of the current token

`token.positionLength` (`int`, read-only)::
The position length of the current token

`token.startOffset` (`int`, read-only)::
The start offset of the current token

`token.endOffset` (`int`, read-only)::
The end offset of the current token

`token.type` (`String`, read-only)::
The type of the current token

`token.keyword` ('boolean`, read-only)::
Whether or not the current token is marked as a keyword

*Return*

`boolean`::
Whether or not the current token matches the predicate

*API*

The standard <<painless-api-reference, Painless API>> is available.
2 changes: 2 additions & 0 deletions docs/reference/analysis/tokenfilters.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,8 @@ include::tokenfilters/word-delimiter-graph-tokenfilter.asciidoc[]

include::tokenfilters/multiplexer-tokenfilter.asciidoc[]

include::tokenfilters/condition-tokenfilter.asciidoc[]

include::tokenfilters/stemmer-tokenfilter.asciidoc[]

include::tokenfilters/stemmer-override-tokenfilter.asciidoc[]
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
[[analysis-condition-tokenfilter]]
=== Conditional Token Filter

The conditional token filter takes a predicate script and a list of subfilters, and
only applies the subfilters to the current token if it matches the predicate.

[float]
=== Options
[horizontal]
filter:: a chain of token filters to apply to the current token if the predicate
matches. These can be any token filters defined elsewhere in the index mappings.

script:: a predicate script that determines whether or not the filters will be applied
to the current token. Note that only inline scripts are supported

[float]
=== Settings example

You can set it up like:

[source,js]
--------------------------------------------------
PUT /condition_example
{
"settings" : {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : [ "my_condition" ]
}
},
"filter" : {
"my_condition" : {
"type" : "condition",
"filter" : [ "lowercase" ],
"script" : {
"source" : "token.getTerm().length() < 5" <1>
}
}
}
}
}
}
--------------------------------------------------
// CONSOLE

<1> This will only apply the lowercase filter to terms that are less than 5
characters in length

And test it like:

[source,js]
--------------------------------------------------
POST /condition_example/_analyze
{
"analyzer" : "my_analyzer",
"text" : "What Flapdoodle"
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

And it'd respond:

[source,js]
--------------------------------------------------
{
"tokens": [
{
"token": "what", <1>
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "Flapdoodle", <2>
"start_offset": 5,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 1
}
]
}
--------------------------------------------------
// TESTRESPONSE
<1> The term `What` has been lowercased, because it is only 4 characters long
<2> The term `Flapdoodle` has been left in its original case, because it doesn't pass
the predicate
9 changes: 9 additions & 0 deletions modules/analysis-common/build.gradle
Original file line number Diff line number Diff line change
Expand Up @@ -20,4 +20,13 @@
esplugin {
description 'Adds "built in" analyzers to Elasticsearch.'
classname 'org.elasticsearch.analysis.common.CommonAnalysisPlugin'
extendedPlugins = ['lang-painless']
}

dependencies {
compileOnly project(':modules:lang-painless')
}

integTestCluster {
module project(':modules:lang-painless')
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.elasticsearch.analysis.common;

import org.elasticsearch.painless.spi.PainlessExtension;
import org.elasticsearch.painless.spi.Whitelist;
import org.elasticsearch.painless.spi.WhitelistLoader;
import org.elasticsearch.script.ScriptContext;

import java.util.Collections;
import java.util.List;
import java.util.Map;

public class AnalysisPainlessExtension implements PainlessExtension {

private static final Whitelist WHITELIST =
WhitelistLoader.loadFromResourceFiles(AnalysisPainlessExtension.class, "painless_whitelist.txt");

@Override
public Map<ScriptContext<?>, List<Whitelist>> getContextWhitelists() {
return Collections.singletonMap(AnalysisPredicateScript.CONTEXT, Collections.singletonList(WHITELIST));
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.elasticsearch.analysis.common;

import org.elasticsearch.script.ScriptContext;

/**
* A predicate based on the current token in a TokenStream
*/
public abstract class AnalysisPredicateScript {

/**
* Encapsulation of the state of the current token
*/
public static class Token {
public CharSequence term;
public int pos;
public int posInc;
public int posLen;
public int startOffset;
public int endOffset;
public String type;
public boolean isKeyword;

public CharSequence getTerm() {
return term;
}

public int getPositionIncrement() {
return posInc;
}

public int getPosition() {
return pos;
}

public int getPositionLength() {
return posLen;
}

public int getStartOffset() {
return startOffset;
}

public int getEndOffset() {
return endOffset;
}

public String getType() {
return type;
}

public boolean isKeyword() {
return isKeyword;
}
}

/**
* Returns {@code true} if the current term matches the predicate
*/
public abstract boolean execute(Token token);

public interface Factory {
AnalysisPredicateScript newInstance();
}

public static final String[] PARAMETERS = new String[]{ "token" };
public static final ScriptContext<Factory> CONTEXT = new ScriptContext<>("analysis", Factory.class);

}
Original file line number Diff line number Diff line change
Expand Up @@ -111,9 +111,16 @@
import org.apache.lucene.analysis.tr.ApostropheFilter;
import org.apache.lucene.analysis.tr.TurkishAnalyzer;
import org.apache.lucene.analysis.util.ElisionFilter;
import org.apache.lucene.util.SetOnce;
import org.elasticsearch.client.Client;
import org.elasticsearch.cluster.service.ClusterService;
import org.elasticsearch.common.io.stream.NamedWriteableRegistry;
import org.elasticsearch.common.logging.DeprecationLogger;
import org.elasticsearch.common.logging.Loggers;
import org.elasticsearch.common.regex.Regex;
import org.elasticsearch.common.xcontent.NamedXContentRegistry;
import org.elasticsearch.env.Environment;
import org.elasticsearch.env.NodeEnvironment;
import org.elasticsearch.index.analysis.AnalyzerProvider;
import org.elasticsearch.index.analysis.CharFilterFactory;
import org.elasticsearch.index.analysis.PreBuiltAnalyzerProviderFactory;
Expand All @@ -127,20 +134,44 @@
import org.elasticsearch.indices.analysis.PreBuiltCacheFactory.CachingStrategy;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.plugins.ScriptPlugin;
import org.elasticsearch.script.ScriptContext;
import org.elasticsearch.script.ScriptService;
import org.elasticsearch.threadpool.ThreadPool;
import org.elasticsearch.watcher.ResourceWatcherService;
import org.tartarus.snowball.ext.DutchStemmer;
import org.tartarus.snowball.ext.FrenchStemmer;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import static org.elasticsearch.plugins.AnalysisPlugin.requiresAnalysisSettings;

public class CommonAnalysisPlugin extends Plugin implements AnalysisPlugin {
public class CommonAnalysisPlugin extends Plugin implements AnalysisPlugin, ScriptPlugin {

private static final DeprecationLogger DEPRECATION_LOGGER = new DeprecationLogger(Loggers.getLogger(CommonAnalysisPlugin.class));

private final SetOnce<ScriptService> scriptService = new SetOnce<>();

@Override
public Collection<Object> createComponents(Client client, ClusterService clusterService, ThreadPool threadPool,
ResourceWatcherService resourceWatcherService, ScriptService scriptService,
NamedXContentRegistry xContentRegistry, Environment environment,
NodeEnvironment nodeEnvironment, NamedWriteableRegistry namedWriteableRegistry) {
this.scriptService.set(scriptService);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where're never very happy with using SetOnce like this. It gets the job done but it reaks guice's "everything depends on everything"-ness that we've worked so hard to remove over the years. Not that I have anything better though.

return Collections.emptyList();
}

@Override
@SuppressWarnings("rawtypes") // TODO ScriptPlugin needs to change this to pass precommit?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to leave this TODO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a backwards breaking change? ScriptPlugin#getContexts() needs to change return value from List<ScriptContext> to List<ScriptContext<?>>. Which ought to be done, but it's a separate change. I'll open an issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opening an issue makes sense. Yeah, it is a separate thing.

public List<ScriptContext> getContexts() {
return Collections.singletonList(AnalysisPredicateScript.CONTEXT);
}

@Override
public Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> analyzers = new TreeMap<>();
Expand Down Expand Up @@ -202,6 +233,8 @@ public Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
filters.put("classic", ClassicFilterFactory::new);
filters.put("czech_stem", CzechStemTokenFilterFactory::new);
filters.put("common_grams", requiresAnalysisSettings(CommonGramsTokenFilterFactory::new));
filters.put("condition",
requiresAnalysisSettings((i, e, n, s) -> new ScriptedConditionTokenFilterFactory(i, n, s, scriptService.get())));
filters.put("decimal_digit", DecimalDigitFilterFactory::new);
filters.put("delimited_payload_filter", LegacyDelimitedPayloadTokenFilterFactory::new);
filters.put("delimited_payload", DelimitedPayloadTokenFilterFactory::new);
Expand Down
Loading