-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use preconfigured filters correctly in Analyze API #43568
Use preconfigured filters correctly in Analyze API #43568
Conversation
Pinging @elastic/es-search |
This is an interesting failure: the pre-configured EdgeNGramTokenizer uses the lucene default values for min and max, which are both 1; however, the elasticsearch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main change in AnalysisRegistry makes sense to me looking at the underlying issue. When reading the test I was wondering if it would be better (and/or easier) to start testing AnalysisRegistry#buildCustomAnalyzer() more directly in AnalysisRegistryTests instead of going through TransportAnalyzeActionTests. Not a heard requirement for this PR but maybe there are things that could be pulled out or added easily.
+1, that makes sense, I'll open a follow-up issue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a minor comment around testing, feel free to disregard if you don't like the idea. Rest LGTM
VersionUtils.randomVersionBetween(random(), Version.V_7_0_0, VersionUtils.getPreviousVersion(Version.V_7_3_0))) | ||
.put("index.analysis.analyzer.my_analyzer.tokenizer", "edge_ngram") | ||
.build(); | ||
IndexSettings idxSettings = IndexSettingsModule.newIndexSettings("index", indexSettings); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would make sense to factor out the settings creation into a private helper that takes the version and the tokenizer name as arguments. Its more or less the same in all four cases and takes up a bit of space.
.put("index.analysis.analyzer.my_analyzer.tokenizer", "standard") | ||
.putList("index.analysis.analyzer.my_analyzer.filter", "word_delimiter_graph") | ||
.build(); | ||
IndexSettings idxSettings = IndexSettingsModule.newIndexSettings("index", indexSettings); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
When a named token filter or char filter is passed as part of an Analyze API request with no index, we currently try and build the relevant filter using no index settings. However, this can miss cases where there is a pre-configured filter defined in the analysis registry. One example here is the elision filter, which has a pre-configured version built with the french elision set; when used as part of normal analysis, this preconfigured set is used, but when used as part of the Analyze API we end up with NPEs because it tries to instantiate the filter with no index settings. This commit changes the Analyze API to check for pre-configured filters in the case that the request has no index defined, and is using a name rather than a custom definition for a filter. It also changes the pre-configured `word_delimiter_graph` filter and `edge_ngram` tokenizer to make their settings consistent with the defaults used when creating them with no settings Closes #43002 Closes #43621 Closes #43582
…r is used (#43684) #26625 deprecated delimited_payload_filter and added tests to check that warnings would be emitted when both a normal and pre-configured filter were used. Unfortunately, due to a bug in the Analyze API, the pre- configured filter check was never actually triggered, and it turns out that the deprecation warning was not in fact being emitted in this case. #43568 fixed the Analyze API bug, which then surfaced this on backport. This commit ensures that the preconfigured filter also emits the warnings and triggers an error if a new index tries to use a preconfigured delimited_payload_filter
edgeNGram and NGram tokenizers and token filters were deprecated. They have not been supported in indices created from 8.0, hence their support can entirely be removed from main. The version related logic around the min grams can also be removed as it refers to 7.x which we no longer need to support. Relates to elastic#50376, elastic#50862, elastic#43568
edgeNGram and NGram tokenizers and token filters were deprecated. They have not been supported in indices created from 8.0, hence their support can entirely be removed from main. The version related logic around the min grams can also be removed as it refers to 7.x which we no longer need to support. Relates to #50376, #50862, #43568
…c#113009) edgeNGram and NGram tokenizers and token filters were deprecated. They have not been supported in indices created from 8.0, hence their support can entirely be removed from main. The version related logic around the min grams can also be removed as it refers to 7.x which we no longer need to support. Relates to elastic#50376, elastic#50862, elastic#43568
When a named token filter or char filter is passed as part of an Analyze API
request with no index, we currently try and build the relevant filter using no
index settings. However, this can miss cases where there is a pre-configured
filter defined in the analysis registry. One example here is the elision filter, which
has a pre-configured version built with the french elision set; when used as part
of normal analysis, this preconfigured set is used, but when used as part of the
Analyze API we end up with NPEs because it tries to instantiate the filter with
no index settings.
This commit changes the Analyze API to check for pre-configured filters in the case
that the request has no index defined, and is using a name rather than a custom
definition for a filter.
Relates to #43002