Multiplexing token filter #31208

romseygeek · 2018-06-08T15:37:38Z

This adds a multiplexer token filter to Elasticsearch, which allows you to run tokens through multiple different token filters and stack the results. For example, you can now easily index the original form of a token, its lowercase form and a stemmed form all at the same position, allowing you to search for stemmed and unstemmed tokens in the same field.
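To make the stacking idea concrete, here is a minimal conceptual sketch in plain Java. It is not the code from this PR: plain strings stand in for Lucene tokens, the class and method names are made up for illustration, and `toyStem` is a hypothetical stand-in for a real stemmer. The index of each inner list plays the role of the token position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Conceptual sketch only: plain strings stand in for Lucene tokens.
final class MultiplexSketch {

    // Hypothetical toy stemmer (strips one trailing "s"); NOT porter_stem.
    static String toyStem(String term) {
        return term.length() > 1 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term;
    }

    // Three chains: original form, lowercase, lowercase followed by the toy stemmer.
    static List<Function<String, String>> exampleChains() {
        Function<String, String> lower = s -> s.toLowerCase(Locale.ROOT);
        List<Function<String, String>> chains = new ArrayList<>();
        chains.add(Function.identity());
        chains.add(lower);
        chains.add(lower.andThen(MultiplexSketch::toyStem));
        return chains;
    }

    // For each input term, collect one variant per chain.
    // The index in the outer list is the shared position of those variants.
    static List<List<String>> multiplex(List<String> terms,
                                        List<Function<String, String>> chains) {
        List<List<String>> stacked = new ArrayList<>();
        for (String term : terms) {
            List<String> variants = new ArrayList<>();
            for (Function<String, String> chain : chains) {
                variants.add(chain.apply(term));
            }
            stacked.add(variants);
        }
        return stacked;
    }

    public static void main(String[] args) {
        System.out.println(multiplex(List.of("Going", "Homes"), exampleChains()));
    }
}
```

For the input `Going Homes`, position 0 holds `Going`, `going`, `going` and position 1 holds `Homes`, `homes`, `home`. The repeated `going` shows that chains which leave a term unchanged produce identical variants at the same position.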

elasticmachine · 2018-06-08T15:37:40Z

Pinging @elastic/es-search-aggs

romseygeek · 2018-06-08T15:38:29Z

This would solve the use case mentioned in #22478

jpountz

I'm wondering whether the API should take 2D arrays rather than parse comma-delimited strings. Out of curiosity does it work well with filters that modify the graph like stop or synonym filters?

jpountz · 2018-06-08T16:05:16Z

docs/reference/analysis/tokenfilters/multiplexer-tokenfilter.asciidoc

+            "filter" : {
+                "my_multiplexer" : {
+                    "type" : "multiplexer",
+                    "filters" : [ "identity", "lowercase", "lowercase, porter_stem" ]


Thinking about the API, I wonder whether this should be an array of arrays, e.g. [ [], [ "lowercase" ], [ "lowercase", "porter_stem" ] ].

I thought about that, but I think it will be far more common to apply single filters, in which case the extra square brackets are just noise.

Then maybe make the square brackets optional for single-element lists?
[ [], "lowercase", [ "lowercase", "porter_stem" ] ]

How easy is that to do with the Settings API? I've had a quick look, and it seems to expect either single keys or lists of strings, whereas here we'd need a list of variant types.

Ok to keep a comma-separated list then. The other thing that worried me is what happens if the user also has a filter whose name is identity. Even though I like it a bit less from an API perspective, maybe it would be better to add a preserve_original setting or something like that to the multiplexer type so that we do not have to add identity as a reserved filter name.
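As an illustration of the two API shapes discussed in this thread, the sketch below turns comma-separated entries such as `"lowercase, porter_stem"` into the equivalent array-of-arrays form, and models a `preserve_original`-style flag as an empty chain. The class and method names are hypothetical, and this is not the parsing code from the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: maps the comma-separated "filters" entries
// onto the array-of-arrays shape suggested in the review.
final class FilterSpecParser {

    // Split one entry, e.g. "lowercase, porter_stem", into trimmed filter names.
    static List<String> parseChain(String spec) {
        List<String> names = new ArrayList<>();
        for (String part : spec.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    // Build one chain per entry. When preserveOriginal is true, an empty chain
    // (no filters applied) is added first, so no reserved "identity" name is needed.
    static List<List<String>> parseChains(List<String> specs, boolean preserveOriginal) {
        List<List<String>> chains = new ArrayList<>();
        if (preserveOriginal) {
            chains.add(new ArrayList<>());
        }
        for (String spec : specs) {
            chains.add(parseChain(spec));
        }
        return chains;
    }
}
```

Under these assumptions, `parseChains(List.of("lowercase", "lowercase, porter_stem"), true)` yields `[[], [lowercase], [lowercase, porter_stem]]`, which is the nested form proposed earlier in the thread.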

romseygeek · 2018-06-09T08:15:25Z

Out of curiosity does it work well with filters that modify the graph like stop or synonym filters?

The multiplexing filter will only pass one token at a time to its child filters, so filters that need to read ahead, like shingle or synonym filters, won't work. A stop filter would work, in that if you chained a stop filter and then a stemmer, the stop filter could prevent terms being sent to the stemmer.
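The following plain-Java sketch illustrates that constraint. Each stage is modelled as a function that sees exactly one term and may drop or rewrite it. The names are hypothetical and `toyStem` is a stand-in for a real stemmer; this is not the PR's implementation.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

// Conceptual sketch: a child chain only ever sees a single term at a time.
final class SingleTokenChainSketch {

    // Run one term through the stages in order; an empty result means it was dropped.
    static Optional<String> runChain(String term,
                                     List<Function<String, Optional<String>>> stages) {
        Optional<String> current = Optional.of(term);
        for (Function<String, Optional<String>> stage : stages) {
            current = current.flatMap(stage); // later stages are skipped once a term is dropped
        }
        return current;
    }

    // Stop-filter stand-in: drops any term in the stop set.
    static Function<String, Optional<String>> stop(Set<String> stopWords) {
        return t -> stopWords.contains(t) ? Optional.empty() : Optional.of(t);
    }

    // Hypothetical toy stemmer (strips one trailing "s"); NOT a real stemmer.
    static Function<String, Optional<String>> toyStem() {
        return t -> Optional.of(
                t.length() > 1 && t.endsWith("s") ? t.substring(0, t.length() - 1) : t);
    }
}
```

With the chain `[stop({"this"}), toyStem()]`, the term `this` is dropped before it reaches the stemmer, while `cats` becomes `cat`. A shingle or multi-word synonym stage cannot be expressed in this shape at all, because it would need the following term, which a single-term chain never receives.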

martijnvg

Looks great. I left 2 comments.

martijnvg · 2018-06-11T06:41:42Z

server/src/main/java/org/elasticsearch/index/analysis/MultiplexingTokenFilterFactory.java

@@ -0,0 +1,177 @@
+package org.elasticsearch.index.analysis;


I think we should move this token filter and its tests to the org.elasticsearch.analysis.common package in the analysis-common module with all the other migrated analysis components.

Instead of wiring this token filter up in AnalysisRegistry you will need to do that in CommonAnalysisPlugin.

martijnvg · 2018-06-11T06:42:59Z

server/src/test/java/org/elasticsearch/index/analysis/AnalysisRegistryTests.java

@@ -239,4 +243,126 @@ public void testEnsureCloseInvocationProperlyDelegated() throws IOException {
        registry.close();
        verify(mock).close();
    }
+
+    private final class TruncateTokenFilter extends TokenFilter {


Maybe use Lucene's TruncateTokenFilter / UpperCaseFilter?

romseygeek · 2018-06-11T12:29:48Z

I moved things to the common analysis plugin. I still need to do a bit of wiring in AnalysisRegistry to pass the list of registered filters back to the multiplexer so it can resolve things properly. I added a new interface for this, but another alternative would be to add a method to TokenFilterFactory with a default no-op implementation.

martijnvg

Thanks @romseygeek, I left one suggestion.

martijnvg · 2018-06-11T13:58:18Z

server/src/main/java/org/elasticsearch/index/analysis/ReferringFilterFactory.java

+    /**
+     * Called with a map of all registered filter factories
+     */
+    void addReferences(Map<String, TokenFilterFactory> factories);


This should be invoked once per token filter factory? In that case I would rename this method to setReference(...) to highlight this.
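A rough sketch of the callback pattern under discussion, using stand-in types: plain `Function<String, String>` values replace `TokenFilterFactory`, and the class name is hypothetical. The idea is that the multiplexer keeps only filter names at construction time and resolves them once, when the registry hands over the full map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Stand-in sketch of a "referring" filter; not the Elasticsearch interface itself.
final class ReferenceResolutionSketch {

    private final List<String> filterNames;
    private List<Function<String, String>> resolved = new ArrayList<>();

    ReferenceResolutionSketch(List<String> filterNames) {
        this.filterNames = new ArrayList<>(filterNames);
    }

    // Intended to be called once, after all filters are registered.
    // Unknown names fail fast instead of surfacing later at analysis time.
    void setReferences(Map<String, Function<String, String>> factories) {
        List<Function<String, String>> out = new ArrayList<>();
        for (String name : filterNames) {
            Function<String, String> filter = factories.get(name);
            if (filter == null) {
                throw new IllegalArgumentException("unknown filter [" + name + "]");
            }
            out.add(filter);
        }
        resolved = out;
    }

    // Apply the resolved filters in order to a single term.
    String applyAll(String term) {
        String current = term;
        for (Function<String, String> filter : resolved) {
            current = filter.apply(current);
        }
        return current;
    }
}
```

The alternative mentioned above, a default no-op method on the factory interface, would move `setReferences` onto every filter rather than a separate interface; the resolution logic itself would look the same.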

jpountz · 2018-06-11T14:41:59Z

...s-common/src/main/java/org/elasticsearch/analysis/common/MultiplexingTokenFilterFactory.java

+            @Override
+            public void reset() throws IOException {
+                super.reset();
+                selector = 0;


should it set selector = filterCount - 1 and merge the two if statements from incrementToken?
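The `incrementToken` body is not shown in this excerpt, so the following is only a hypothetical plain-Java sketch of the idea behind the suggestion. If the selector starts on the last slot, the first call wraps around and pulls the first input term through the same branch that handles every later wrap-around, so no separate "first call" check is needed. All names here are illustrative.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of selector cycling; strings stand in for tokens.
final class SelectorCycleSketch {

    private final Iterator<String> input;
    private final List<Function<String, String>> filters;
    private int selector;
    private String current; // term most recently pulled from the input

    SelectorCycleSketch(Iterator<String> input, List<Function<String, String>> filters) {
        if (filters.isEmpty()) {
            throw new IllegalArgumentException("at least one filter is required");
        }
        this.input = input;
        this.filters = filters;
        // Start on the last slot so the first next() call wraps and pulls from the input.
        this.selector = filters.size() - 1;
    }

    // Returns the next output term, or null once the input is exhausted.
    String next() {
        if (selector == filters.size() - 1) {
            // All variants of the current term have been emitted: advance the input.
            if (!input.hasNext()) {
                return null; // state is unchanged, so further calls also return null
            }
            current = input.next();
            selector = 0;
        } else {
            selector++;
        }
        return filters.get(selector).apply(current);
    }
}
```

With input `Ab`, `Cd` and filters `[identity, lowercase]`, successive calls return `Ab`, `ab`, `Cd`, `cd`, then `null`.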

jpountz · 2018-06-11T14:49:11Z

docs/reference/analysis/tokenfilters/multiplexer-tokenfilter.asciidoc

+            "filter" : {
+                "my_multiplexer" : {
+                    "type" : "multiplexer",
+                    "filters" : [ "identity", "lowercase", "lowercase, porter_stem" ]


Ok to keep a comma-separated list then. The other thing that worried me is what happens if the user also has a filter whose name is identity. Even though I like it a bit less from an API perspective, maybe it would be better to add a preserve_original setting or something like that to the multiplexer type so that we do not have to add identity as a reserved filter name.

jpountz · 2018-06-11T14:51:27Z

The multiplexing filter will only pass one token at a time to its child filters, so filters that need to read ahead, like shingle or synonym filters, won't work.

Let's document this?

romseygeek · 2018-06-11T15:39:11Z

I added preserve_original as an option and updated the docs accordingly

nik9000 · 2018-06-11T16:19:50Z

docs/reference/analysis/tokenfilters/multiplexer-tokenfilter.asciidoc

+A token filter of type `multiplexer` will emit multiple tokens at the same position,
+each version of the token having been run through a different filter.
+
+Note that the child filters will in effect be passed a mock tokenstream consisting


I might put this as a WARNING: in the Options section. And I'd flip the sentence around to something like "Shingle or multi-word synonym token filters will not function normally when they are declared in the filters array because they read ahead internally, which is unsupported by the multiplexer."

nik9000 · 2018-06-11T16:20:27Z

docs/reference/analysis/tokenfilters/multiplexer-tokenfilter.asciidoc

+
+[source,js]
+--------------------------------------------------
+POST /multiplexer_example/_analyze


This is seriously one of my favorite APIs in Elasticsearch.

nik9000 · 2018-06-11T16:22:41Z

docs/reference/analysis/tokenfilters/multiplexer-tokenfilter.asciidoc

+      "position": 2
+    },
+    {
+      "token": "home",


Hmmmm. It might be nice to add a callout (looks like <1>) to explain the duplicate. I presume we get the duplicate because we don't deduplicate.

I was going to add a deduplication step as well to remove this confusion, but then I noticed that we don't seem to have the deduplication filter exposed in ES anywhere. I'll open another issue for that, as I think it will be very useful combined with a multiplexer

I wonder whether we might want to remove duplicates by default (or even enforce it). Otherwise, e.g., terms that are not modified through lowercasing or stemming will artificially get higher term freqs?

I'd prefer to document the duplicates and have folks use the deduplicating filter that @romseygeek proposed today. I like documenting it so folks understand what costs they are paying.

Also, if the token stream comes with duplicates on the way into this token filter, then adding a deduplicating filter by default would deduplicate the existing duplicates as a side effect, right?

Sorry, I just saw this message. I don't think the documentation path gives the best user experience as I can't think of a use-case to retain duplicates if multiple filters produce the same token. That said, I agree that simply adding a deduplicating token filter feels wrong if the original stream has duplicates, so maybe this is something that needs to be implemented directly into this new token filter.

nik9000 · 2018-06-11T16:30:17Z

...s-common/src/main/java/org/elasticsearch/analysis/common/MultiplexingTokenFilterFactory.java

+    }
+
+    @Override
+    public void setReferences(Map<String, TokenFilterFactory> factories) {


I wonder if it'd be cleaner if create took the Map? It'd certainly make the change larger though.

Maybe a thing to do in a followup because it's large but almost entirely mechanical.

It would also be backwards breaking, so would have to be a master-only change, I think? TokenFilterFactory is an API you're encouraged to use via analysis plugins

We have a special tag for things that break the plugin or java client APIs: "breaking-java". We allow ourselves to use it much more liberally. One day we will have a properly semver-ed plugin API but we don't have one now.

jpountz

It looks good to me. I think @nik9000 raised a good point about removing duplicates: if some filters do not modify a token, or if they modify it the same way, then this term will have its frequency artificially bumped up, which might hurt relevance. So maybe we should always remove duplicates?
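A small plain-Java sketch of the frequency concern, with strings standing in for tokens and hypothetical names throughout. It compares the variants produced for one input term with and without deduplication. In this sketch the deduplication is scoped to the variants of a single incoming term, so duplicates already present in the incoming stream would not be touched; whether the merged filter behaves exactly this way is not something this excerpt confirms.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.function.Function;

// Conceptual sketch: duplicates arise when several chains leave a term unchanged.
final class DedupSketch {

    // Produce one variant per chain for a single input term.
    // With dedup enabled, identical variants are collapsed, keeping first-seen order.
    static List<String> variants(String term,
                                 List<Function<String, String>> chains,
                                 boolean dedup) {
        List<String> all = new ArrayList<>();
        for (Function<String, String> chain : chains) {
            all.add(chain.apply(term));
        }
        return dedup ? new ArrayList<>(new LinkedHashSet<>(all)) : all;
    }
}
```

For the already-lowercase term `home` with an identity chain and a lowercase chain, the result without deduplication is `[home, home]`, which would count the term twice at that position; with deduplication it is `[home]`.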

jpountz · 2018-06-12T15:06:14Z

docs/reference/analysis/tokenfilters/multiplexer-tokenfilter.asciidoc

+      "position": 2
+    },
+    {
+      "token": "home",


I wonder whether we might want to remove duplicates by default (or even enforce it). Otherwise, e.g., terms that are not modified through lowercasing or stemming will artificially get higher term freqs?

jpountz · 2018-06-12T15:07:39Z

...s-common/src/main/java/org/elasticsearch/analysis/common/MultiplexingTokenFilterFactory.java

+    public MultiplexingTokenFilterFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) throws IOException {
+        super(indexSettings, name, settings);
+        this.filterNames = settings.getAsList("filters");
+        this.preserveOriginal = settings.getAsBoolean("preserveOriginal", true);


use underscore case instead?

The `multiplexer` filter emits multiple tokens at the same position, each version of the token having been passed through a different filter chain. Identical tokens at the same position are removed. This allows users to, for example, index lowercase and original-case tokens, or stemmed and unstemmed versions, in the same field, so that they can search for a stemmed term within x positions of an unstemmed term.

* 6.x: [DOCS] Omit shard failures assertion for incompatible responses (#31430) [DOCS] Move licensing APIs to docs (#31445) backport of: add is-write-index flag to aliases (#30942) (#31412) backport of: Add rollover-creation-date setting to rolled over index (#31144) (#31413) [Docs] Extend Homebrew installation instructions (#28902) [Docs] Mention ip_range datatypes on ip type page (#31416) Multiplexing token filter (#31208) Fix use of time zone in date_histogram rewrite (#31407) Revert "Mute DefaultShardsIT#testDefaultShards test" [DOCS] Fixes code snippet testing for machine learning (#31189) Security: fix joining cluster with production license (#31341) [DOCS] Updated version in Info API example [DOCS] Moves the info API to docs (#31121) Revert "Increasing skip version for failing test on 6.x" Preserve response headers on cluster update task (#31421) [DOCS] Add code snippet testing for more ML APIs (#31404) Docs: Advice for reindexing many indices (#31279)

* master: [DOCS] Omit shard failures assertion for incompatible responses (#31430) [DOCS] Move licensing APIs to docs (#31445) Add Delete Snapshot High Level REST API Remove QueryCachingPolicy#ALWAYS_CACHE (#31451) [Docs] Extend Homebrew installation instructions (#28902) Choose JVM options ergonomically [Docs] Mention ip_range datatypes on ip type page (#31416) Multiplexing token filter (#31208) Fix use of time zone in date_histogram rewrite (#31407) Core: Remove index name resolver from base TransportAction (#31002) [DOCS] Fixes code snippet testing for machine learning (#31189) [DOCS] Removed and params from MLT. Closes #28128 (#31370) Security: fix joining cluster with production license (#31341) Unify http channels and exception handling (#31379) [DOCS] Moves the info API to docs (#31121) Preserve response headers on cluster update task (#31421) [DOCS] Add code snippet testing for more ML APIs (#31404) Do not preallocate bytes for channel buffer (#31400) Docs: Advice for reindexing many indices (#31279) Mute HttpExporterTests#testHttpExporterShutdown test Tracked by #31433 Docs: Add note about removing prepareExecute from the java client (#31401) Make release notes ignore the `>test-failure` label. (#31309)

romseygeek added 2 commits June 8, 2018 13:12

Multiplexing filter

b3275b7

Allow chaining; docs

0f6598b

romseygeek added >feature :Search Relevance/Analysis How text is split into tokens v7.0.0 v6.4.0 labels Jun 8, 2018

romseygeek self-assigned this Jun 8, 2018

romseygeek requested review from nik9000 and martijnvg June 8, 2018 15:37

checkstyle

721de2c

jpountz reviewed Jun 8, 2018

martijnvg reviewed Jun 11, 2018

romseygeek added 2 commits June 11, 2018 13:25

Move multiplexer to common analysis plugin

ac86ce3

tidy up

7ad7d9d

martijnvg approved these changes Jun 11, 2018

jpountz requested changes Jun 11, 2018

romseygeek added 4 commits June 11, 2018 16:21

addRef -> setRefs

3cc89b2

simplify

f367fef

Add preserve_original settings

692542c

docs

24de7ad

nik9000 reviewed Jun 11, 2018

docs

65064d9

romseygeek mentioned this pull request Jun 12, 2018

Expose lucene's RemoveDuplicatesTokenFilter #31275

Merged

jpountz approved these changes Jun 12, 2018

romseygeek added 6 commits June 18, 2018 10:02

Merge branch 'master' into multiplexing-token-filter

7d17964

Add deduplication to the multiplexer example docs

0aa4bc1

snake_case; make class names consistent

f936171

docs

de07870

Always remove duplicate tokens

b497fe2

checkstyle

024d67a

romseygeek merged commit 5683bc6 into elastic:master Jun 20, 2018

romseygeek mentioned this pull request Jun 20, 2018

Ngram/Edgengram filters don't work with keyword repeat filters #22478

Closed

Mpdreamz mentioned this pull request Sep 21, 2018

[meta] 6.4.0 release elastic/elasticsearch-net#3397

Closed

89 tasks

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Sep 28, 2018

Add support for multiplexer token filter (elastic/elasticsearch#31208)

c748605

Mpdreamz mentioned this pull request Sep 28, 2018

Feature/multiplexing token filter elastic/elasticsearch-net#3425

Merged

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
