-
Notifications
You must be signed in to change notification settings - Fork 24.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multiplexing token filter #31208
Multiplexing token filter #31208
Conversation
Pinging @elastic/es-search-aggs |
This would solve the usecase mentioned in #22478 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering whether the API should take 2D arrays rather than parse comma-delimited strings. Out of curiosity does it work well with filters that modify the graph like stop or synonym filters?
"filter" : { | ||
"my_multiplexer" : { | ||
"type" : "multiplexer", | ||
"filters" : [ "identity", "lowercase", "lowercase, porter_stem" ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thinking about the API I wonder whether this should be an array of arrays, eg. [ [], [ "lowercase" ], [ "lowercase", "porter_stem" ] ]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that, but I think it will be far more common to apply single filters, in which case the extra square parens are just noise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then maybe make squares optional for single-element lists?
[ [], "lowercase", [ "lowercase", "porter_stem" ] ]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How easy is that to do with the Settings API? I've had a quick look, and it seems to expect either single keys or lists of strings, whereas here we'd need a list of variant types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok to keep a comma-separated list then. The other thing that worried me is what happens if the user also has a filter whose name is identity
. Even though I like it a bit less from an API perspective, maybe it would be better to add a preserve_original
setting or something like that to the multiplexer
type so that we do not have to add identity
as a reserved filter name.
The multiplexing filter will only pass one token at a time to its child filters, so filters that need to read ahead, like shingle or synonym filters, won't work. A stop filter would work, in that if you chained a stop filter and then a stemmer, the stop filter could prevent terms being sent to the stemmer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. I left 2 comments.
@@ -0,0 +1,177 @@ | |||
package org.elasticsearch.index.analysis; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should move this token filter and its tests to the org.elasticsearch.analysis.common
package in the analysis-common
module with all the other migrated analysis components.
Instead of wiring this token filter up in AnalysisRegistry
you will need to do that in CommonAnalysisPlugin
.
@@ -239,4 +243,126 @@ public void testEnsureCloseInvocationProperlyDelegated() throws IOException { | |||
registry.close(); | |||
verify(mock).close(); | |||
} | |||
|
|||
private final class TruncateTokenFilter extends TokenFilter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use Lucene's TruncateTokenFilter
/ UpperCaseFilter
?
I moved things to the common analysis plugin. I still need to do a bit of wiring in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @romseygeek, I left one suggestion.
/** | ||
* Called with a map of all registered filter factories | ||
*/ | ||
void addReferences(Map<String, TokenFilterFactory> factories); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be invoked once per token filter factory? In that case I would rename this method to setReference(...)
to highlight this.
@Override | ||
public void reset() throws IOException { | ||
super.reset(); | ||
selector = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it set selector = filterCount - 1
and merge the two if statements from incrementToken
?
"filter" : { | ||
"my_multiplexer" : { | ||
"type" : "multiplexer", | ||
"filters" : [ "identity", "lowercase", "lowercase, porter_stem" ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok to keep a comma-separated list then. The other thing that worried me is what happens if the user also has a filter whose name is identity
. Even though I like it a bit less from an API perspective, maybe it would be better to add a preserve_original
setting or something like that to the multiplexer
type so that we do not have to add identity
as a reserved filter name.
Let's document this? |
I added |
A token filter of type `multiplexer` will emit multiple tokens at the same position, | ||
each version of the token having been run through a different filter. | ||
|
||
Note that the child filters will in effect be passed a mock tokenstream consisting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might put this as a WARNING:
in the Options
section. And I'd flip the sentence around to something like "Shingle or multi-word synonym token filters will not function normally when they are declared in the filters
array because they read ahead internally which is unsupported by the multiplexer
.
|
||
[source,js] | ||
-------------------------------------------------- | ||
POST /multiplexer_example/_analyze |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is seriously one of my favorite APIs in Elasticsearch.
"position": 2 | ||
}, | ||
{ | ||
"token": "home", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm. It might be nice to add a callout (looks like <1>
) to explain the duplicate. I presume we get the duplicate because we don't deduplicate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to add a deduplication step as well to remove this confusion, but then I noticed that we don't seem to have the deduplication filter exposed in ES anywhere. I'll open another issue for that, as I think it will be very useful combined with a multiplexer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder that we might want to remove duplicates by default (or even enforce it). Otherwise eg. terms that are not modified through lowercasing or stemming will artificially get higher term freqs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to to document the duplicates and have folks use the deduplicating filter that @romseygeek proposed today. I like documenting it so folks understand what costs they are paying.
Also, if the token steam comes with duplicates on the way into this token filter then adding deduplicating filter by default would deduplicate the existing duplicates as a side effect, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I just saw this message. I don't think the documentation path gives the best user experience as I can't think of a use-case to retain duplicates if multiple filters produce the same token. That said, I agree that simply adding a deduplicating token filter feels wrong if the original stream has duplicates, so maybe this is something that needs to be implemented directly into this new token filter.
} | ||
|
||
@Override | ||
public void setReferences(Map<String, TokenFilterFactory> factories) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if It'd be cleaner if create
took the Map
? It'd certainly make the change larger though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a thing to do in a followup because its large but almost entirely mechanical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would also be backwards breaking, so would have to be a master-only change, I think? TokenFilterFactory is an API you're encouraged to use via analysis plugins
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a special tag for things that break the plugin or java client APIs: "breaking-java". We allow ourselves to use it much more liberally. One day we will have a properly semver-ed plugin API but we don't have one now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good to me. I think @nik9000 raised a good point about removing duplicates: if some filters do not modify a token, or if they modify it the same way, then this term will have its frequency artificially bumped up, which might hurt relevance. So maybe we should always remove duplicates?
"position": 2 | ||
}, | ||
{ | ||
"token": "home", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder that we might want to remove duplicates by default (or even enforce it). Otherwise eg. terms that are not modified through lowercasing or stemming will artificially get higher term freqs?
public MultiplexingTokenFilterFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) throws IOException { | ||
super(indexSettings, name, settings); | ||
this.filterNames = settings.getAsList("filters"); | ||
this.preserveOriginal = settings.getAsBoolean("preserveOriginal", true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use underscore case instead?
The `multiplexer` filter emits multiple tokens at the same position, each version of the token haivng been passed through a different filter chain. Identical tokens at the same position are removed. This allows users to, for example, index lowercase and original-case tokens, or stemmed and unstemmed versions, in the same field, so that they can search for a stemmed term within x positions of an unstemmed term.
* 6.x: [DOCS] Omit shard failures assertion for incompatible responses (#31430) [DOCS] Move licensing APIs to docs (#31445) backport of: add is-write-index flag to aliases (#30942) (#31412) backport of: Add rollover-creation-date setting to rolled over index (#31144) (#31413) [Docs] Extend Homebrew installation instructions (#28902) [Docs] Mention ip_range datatypes on ip type page (#31416) Multiplexing token filter (#31208) Fix use of time zone in date_histogram rewrite (#31407) Revert "Mute DefaultShardsIT#testDefaultShards test" [DOCS] Fixes code snippet testing for machine learning (#31189) Security: fix joining cluster with production license (#31341) [DOCS] Updated version in Info API example [DOCS] Moves the info API to docs (#31121) Revert "Increasing skip version for failing test on 6.x" Preserve response headers on cluster update task (#31421) [DOCS] Add code snippet testing for more ML APIs (#31404) Docs: Advice for reindexing many indices (#31279)
* master: [DOCS] Omit shard failures assertion for incompatible responses (#31430) [DOCS] Move licensing APIs to docs (#31445) Add Delete Snapshot High Level REST API Remove QueryCachingPolicy#ALWAYS_CACHE (#31451) [Docs] Extend Homebrew installation instructions (#28902) Choose JVM options ergonomically [Docs] Mention ip_range datatypes on ip type page (#31416) Multiplexing token filter (#31208) Fix use of time zone in date_histogram rewrite (#31407) Core: Remove index name resolver from base TransportAction (#31002) [DOCS] Fixes code snippet testing for machine learning (#31189) [DOCS] Removed and params from MLT. Closes #28128 (#31370) Security: fix joining cluster with production license (#31341) Unify http channels and exception handling (#31379) [DOCS] Moves the info API to docs (#31121) Preserve response headers on cluster update task (#31421) [DOCS] Add code snippet testing for more ML APIs (#31404) Do not preallocate bytes for channel buffer (#31400) Docs: Advice for reindexing many indices (#31279) Mute HttpExporterTests#testHttpExporterShutdown test Tracked by #31433 Docs: Add note about removing prepareExecute from the java client (#31401) Make release notes ignore the `>test-failure` label. (#31309)
This adds a
multiplexer
token filter to elasticsearch, which allows you to run tokens through multiple different tokenfilters and stack the results. For example, you can now easily index the original form of a token, its lowercase form and a stemmed form all at the same position, allowing you to search for stemmed and unstemmed tokens in the same field.