Multiplexing token filter #31208

Merged Jun 20, 2018 (16 commits)

Conversation

romseygeek
Contributor

This adds a multiplexer token filter to Elasticsearch, which allows you to run tokens through multiple different token filters and stack the results. For example, you can now easily index the original form of a token, its lowercase form, and a stemmed form all at the same position, allowing you to search for stemmed and unstemmed tokens in the same field.
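
For illustration, index settings along these lines should produce that behaviour; this is only a sketch that mirrors the API discussed in this thread, and the index, analyzer and filter names are made up:

[source,js]
--------------------------------------------------
PUT /multiplexer_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "my_multiplexer" ]
        }
      },
      "filter": {
        "my_multiplexer": {
          "type": "multiplexer",
          "filters": [ "lowercase", "lowercase, porter_stem" ]
        }
      }
    }
  }
}
--------------------------------------------------

Each entry in `filters` is either a single filter name or a comma-separated chain; with the `preserve_original` flag added later in this thread left at its default of `true`, the unmodified token is emitted as well, all at the same position.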

@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@romseygeek
Contributor Author

This would solve the use case mentioned in #22478

Contributor

@jpountz jpountz left a comment

I'm wondering whether the API should take 2D arrays rather than parse comma-delimited strings. Out of curiosity, does it work well with filters that modify the graph, like stop or synonym filters?

"filter" : {
"my_multiplexer" : {
"type" : "multiplexer",
"filters" : [ "identity", "lowercase", "lowercase, porter_stem" ]
Contributor

Thinking about the API, I wonder whether this should be an array of arrays, e.g. [ [], [ "lowercase" ], [ "lowercase", "porter_stem" ] ].

Contributor Author

I thought about that, but I think it will be far more common to apply single filters, in which case the extra square brackets are just noise.

Contributor

Then maybe make the square brackets optional for single-element lists?
[ [], "lowercase", [ "lowercase", "porter_stem" ] ]

Contributor Author

How easy is that to do with the Settings API? I've had a quick look, and it seems to expect either single keys or lists of strings, whereas here we'd need a list of variant types.

Contributor

OK to keep a comma-separated list then. The other thing that worried me is what happens if the user also has a filter whose name is identity. Even though I like it a bit less from an API perspective, maybe it would be better to add a preserve_original setting or something like that to the multiplexer type, so that we do not have to make identity a reserved filter name.
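
For concreteness, the suggested shape of the settings would be something like this (a sketch only; the flag and filter names follow the discussion above):

[source,js]
--------------------------------------------------
"filter" : {
  "my_multiplexer" : {
    "type" : "multiplexer",
    "preserve_original" : true,
    "filters" : [ "lowercase", "lowercase, porter_stem" ]
  }
}
--------------------------------------------------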

@romseygeek
Contributor Author

Out of curiosity does it work well with filters that modify the graph like stop or synonym filters?

The multiplexing filter will only pass one token at a time to its child filters, so filters that need to read ahead, like shingle or synonym filters, won't work. A stop filter would work, in that if you chained a stop filter and then a stemmer, the stop filter could prevent terms from being sent to the stemmer.
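
For example, a chain along these lines should work inside a multiplexer branch, since a stop filter only drops tokens rather than reading ahead (a sketch; the filter names are illustrative):

[source,js]
--------------------------------------------------
"filter" : {
  "english_stop" : {
    "type" : "stop",
    "stopwords" : "_english_"
  },
  "my_multiplexer" : {
    "type" : "multiplexer",
    "filters" : [ "lowercase", "lowercase, english_stop, porter_stem" ]
  }
}
--------------------------------------------------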

Member

@martijnvg martijnvg left a comment

Looks great. I left 2 comments.

@@ -0,0 +1,177 @@
package org.elasticsearch.index.analysis;
Member

I think we should move this token filter and its tests to the org.elasticsearch.analysis.common package in the analysis-common module with all the other migrated analysis components.

Instead of wiring this token filter up in AnalysisRegistry you will need to do that in CommonAnalysisPlugin.

@@ -239,4 +243,126 @@ public void testEnsureCloseInvocationProperlyDelegated() throws IOException {
    registry.close();
    verify(mock).close();
}

private final class TruncateTokenFilter extends TokenFilter {
Member

Maybe use Lucene's TruncateTokenFilter / UpperCaseFilter?

@romseygeek
Contributor Author

I moved things to the common analysis plugin. I still need to do a bit of wiring in AnalysisRegistry to pass the list of registered filters back to the multiplexer so it can resolve things properly. I added a new interface for this, but another alternative would be to add a method to TokenFilterFactory with a default no-op implementation.

Member

@martijnvg martijnvg left a comment

Thanks @romseygeek, I left one suggestion.

/**
 * Called with a map of all registered filter factories
 */
void addReferences(Map<String, TokenFilterFactory> factories);
Member

This should be invoked once per token filter factory? In that case I would rename this method to setReference(...) to highlight this.

@Override
public void reset() throws IOException {
    super.reset();
    selector = 0;
Contributor

Should it set selector = filterCount - 1 and merge the two if statements from incrementToken?

"filter" : {
"my_multiplexer" : {
"type" : "multiplexer",
"filters" : [ "identity", "lowercase", "lowercase, porter_stem" ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok to keep a comma-separated list then. The other thing that worried me is what happens if the user also has a filter whose name is identity. Even though I like it a bit less from an API perspective, maybe it would be better to add a preserve_original setting or something like that to the multiplexer type so that we do not have to add identity as a reserved filter name.

@jpountz
Contributor

jpountz commented Jun 11, 2018

The multiplexing filter will only pass one token at a time to its child filters, so filters that need to read ahead, like shingle or synonym filters, won't work.

Let's document this?

@romseygeek
Contributor Author

I added `preserve_original` as an option and updated the docs accordingly.

A token filter of type `multiplexer` will emit multiple tokens at the same position,
each version of the token having been run through a different filter.

Note that the child filters will in effect be passed a mock tokenstream consisting
Member

I might put this as a WARNING: in the Options section. And I'd flip the sentence around to something like "Shingle or multi-word synonym token filters will not function normally when they are declared in the filters array, because they read ahead internally, which is unsupported by the multiplexer."


[source,js]
--------------------------------------------------
POST /multiplexer_example/_analyze
Member

This is seriously one of my favorite APIs in Elasticsearch.

"position": 2
},
{
"token": "home",
Member

Hmmmm. It might be nice to add a callout (looks like <1>) to explain the duplicate. I presume we get the duplicate because we don't deduplicate.

Contributor Author

I was going to add a deduplication step as well to remove this confusion, but then I noticed that we don't seem to have a deduplication filter exposed in ES anywhere. I'll open another issue for that, as I think it will be very useful combined with a multiplexer.
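
Once such a deduplicating filter is exposed, it could presumably be chained straight after the multiplexer to drop the duplicates, along the lines of the sketch below (the `remove_duplicates` name is hypothetical here; no such filter was exposed in ES at the time of this discussion):

[source,js]
--------------------------------------------------
"analyzer" : {
  "my_analyzer" : {
    "tokenizer" : "standard",
    "filter" : [ "my_multiplexer", "remove_duplicates" ]
  }
}
--------------------------------------------------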

Contributor

I wonder whether we might want to remove duplicates by default (or even enforce it). Otherwise, e.g. terms that are not modified by lowercasing or stemming will artificially get higher term frequencies?

Member

I'd prefer to document the duplicates and have folks use the deduplicating filter that @romseygeek proposed today. I like documenting it so folks understand what costs they are paying.

Also, if the token stream comes with duplicates on the way into this token filter, then adding a deduplicating filter by default would deduplicate the existing duplicates as a side effect, right?

Contributor

Sorry, I just saw this message. I don't think the documentation path gives the best user experience, as I can't think of a use case for retaining duplicates when multiple filters produce the same token. That said, I agree that simply adding a deduplicating token filter feels wrong if the original stream has duplicates, so maybe this is something that needs to be implemented directly in this new token filter.

}

@Override
public void setReferences(Map<String, TokenFilterFactory> factories) {
Member

I wonder if it'd be cleaner if create took the Map? It'd certainly make the change larger though.

Member

Maybe a thing to do in a follow-up, because it's large but almost entirely mechanical.

Contributor Author

It would also be backwards-breaking, so it would have to be a master-only change, I think? TokenFilterFactory is an API you're encouraged to use via analysis plugins.

Member

We have a special tag for things that break the plugin or Java client APIs: "breaking-java". We allow ourselves to use it much more liberally. One day we will have a properly semver-ed plugin API, but we don't have one now.

Contributor

@jpountz jpountz left a comment

It looks good to me. I think @nik9000 raised a good point about removing duplicates: if some filters do not modify a token, or if they modify it the same way, then this term will have its frequency artificially bumped up, which might hurt relevance. So maybe we should always remove duplicates?

"position": 2
},
{
"token": "home",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder that we might want to remove duplicates by default (or even enforce it). Otherwise eg. terms that are not modified through lowercasing or stemming will artificially get higher term freqs?

public MultiplexingTokenFilterFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) throws IOException {
    super(indexSettings, name, settings);
    this.filterNames = settings.getAsList("filters");
    this.preserveOriginal = settings.getAsBoolean("preserveOriginal", true);
Contributor

Use underscore case instead?

@romseygeek romseygeek merged commit 5683bc6 into elastic:master Jun 20, 2018
romseygeek added a commit that referenced this pull request Jun 20, 2018
The `multiplexer` filter emits multiple tokens at the same position, each
version of the token having been passed through a different filter chain.
Identical tokens at the same position are removed.

This allows users to, for example, index lowercase and original-case tokens,
or stemmed and unstemmed versions, in the same field, so that they can search
for a stemmed term within x positions of an unstemmed term.
dnhatn added a commit that referenced this pull request Jun 20, 2018
* 6.x:
  [DOCS] Omit shard failures assertion for incompatible responses  (#31430)
  [DOCS] Move licensing APIs to docs (#31445)
  backport of: add is-write-index flag to aliases (#30942) (#31412)
  backport of: Add rollover-creation-date setting to rolled over index (#31144) (#31413)
  [Docs] Extend Homebrew installation instructions (#28902)
  [Docs] Mention ip_range datatypes on ip type page (#31416)
  Multiplexing token filter (#31208)
  Fix use of time zone in date_histogram rewrite (#31407)
  Revert "Mute DefaultShardsIT#testDefaultShards test"
  [DOCS] Fixes code snippet testing for machine learning (#31189)
  Security: fix joining cluster with production license (#31341)
  [DOCS] Updated version in Info API example
  [DOCS] Moves the info API to docs (#31121)
  Revert "Increasing skip version for failing test on 6.x"
  Preserve response headers on cluster update task (#31421)
  [DOCS] Add code snippet testing for more ML APIs (#31404)
  Docs: Advice for reindexing many indices (#31279)
dnhatn added a commit that referenced this pull request Jun 20, 2018
* master:
  [DOCS] Omit shard failures assertion for incompatible responses  (#31430)
  [DOCS] Move licensing APIs to docs (#31445)
  Add Delete Snapshot High Level REST API
  Remove QueryCachingPolicy#ALWAYS_CACHE (#31451)
  [Docs] Extend Homebrew installation instructions (#28902)
  Choose JVM options ergonomically
  [Docs] Mention ip_range datatypes on ip type page (#31416)
  Multiplexing token filter (#31208)
  Fix use of time zone in date_histogram rewrite (#31407)
  Core: Remove index name resolver from base TransportAction (#31002)
  [DOCS] Fixes code snippet testing for machine learning (#31189)
  [DOCS] Removed  and  params from MLT. Closes #28128 (#31370)
  Security: fix joining cluster with production license (#31341)
  Unify http channels and exception handling (#31379)
  [DOCS] Moves the info API to docs (#31121)
  Preserve response headers on cluster update task (#31421)
  [DOCS] Add code snippet testing for more ML APIs (#31404)
  Do not preallocate bytes for channel buffer (#31400)
  Docs: Advice for reindexing many indices (#31279)
  Mute HttpExporterTests#testHttpExporterShutdown test Tracked by #31433
  Docs: Add note about removing prepareExecute from the java client (#31401)
  Make release notes ignore the `>test-failure` label. (#31309)
Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Sep 28, 2018