-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add limits for ngram and shingle settings #27211
Add limits for ngram and shingle settings #27211
Conversation
e463ca4
to
e961c03
Compare
retest this please |
e961c03
to
412e589
Compare
retest this please |
@mayya-sharipova, can you drop the PR template when you open the PR? The checklist is great to go over but not super useful to leave in the description. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good @mayya-sharipova
Can you add a small note in the documentation regarding these limits ?
For the shingle for instance it would be in:
docs/reference/analysis/tokenfilters/shingle-tokenfilter.asciidoc`
final Settings settings = newAnalysisSettingsBuilder().put("min_gram", min_gram).put("max_gram", max_gram).build(); | ||
try { | ||
new NGramTokenizerFactory(indexProperties, null, name, settings).create(); | ||
fail(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use LuceneTestCase#expectThrows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to use LuceneTestCase#expectThrows
, but it requires that a constructorpublic NGramTokenizerFactory(..)
throw a checked exception. We can add this to the constructor, but then we need to modify all other classes that use NGramTokenizerFactory
to catch this exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works fine with a runtime exception, you don't need to change the constructor:
IllegalArgumentException exc = expectThrows(IllegalArgumentException.class, () ->
new NGramTokenizerFactory(indexProperties, null, name, settings).create());
.. and then you can check the message of the exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks Jim, will try this
Integer maxShingleSize = settings.getAsInt("max_shingle_size", ShingleFilter.DEFAULT_MAX_SHINGLE_SIZE); | ||
Integer minShingleSize = settings.getAsInt("min_shingle_size", ShingleFilter.DEFAULT_MIN_SHINGLE_SIZE); | ||
int shingleDiff = maxShingleSize - minShingleSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add output_unigrams in the formula:
maxGram - minGram + (outputUnigram ? 1 : 0) ?
This way the default diff is 1 since output_unigrams defaults to true.
* the maximum difference between | ||
* max_shingle_size and min_shingle_size. | ||
* The default value is 3 is defensive as it prevents generating too many tokens. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes too many tokens is problematic. I also don't like that we build invalid boolean queries when the diff is greater than 0. For instance with unigram set to true and shingles of size 2, the input foo bar foobar
is analyzed as:
position 1: (foo_bar, foo), position 2: (bar, bar_foobar) ...
which is problematic for any position query.
3 seems reasonable especially for fields that don't index positions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jimczi thanks for your comment. I don't think I understand how diff in shingle size greater that 0 makes boolean queries invalid. Does it mean that if we have tokens: {"token" : "foo_bar", "position" : 1}, {"token" : "foo", "position" : 1}, {"token" : "bar", "position" : 2}, {"token" : "bar_foobar", "position" : 2},
, we can't have a phrase query say: "foo bar"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
foo bar
would return the correct document but it would build an invalid phrase query:
"(foo_bar foo) bar"
... trying to find document with foo_bar bar
as a phrase query which could be simplified in foo_bar
. For boolean query it would not consider that foo_bar
is enough to match foo
AND bar
so the bigram would be useless for matching this type of query. We have solutions on the query side to build the correct query (foo_bar
OR (foo
AND bar
)) but this is currently deactivated because it can explode the number of clauses in the query if you have a large diff between min and max.
For simplicity I prefer when a single shingle size is used for a shingle field. In this case a phrase query like "foo bar foobar baz"
could be optimized to "foo_bar foobar_baz"
with shingles of size 2 (bar_foobar
can be safely ignored since we know that the previous match can only be foo_bar
).
Integer maxShingleSize = settings.getAsInt("max_shingle_size", ShingleFilter.DEFAULT_MAX_SHINGLE_SIZE); | ||
Integer minShingleSize = settings.getAsInt("min_shingle_size", ShingleFilter.DEFAULT_MIN_SHINGLE_SIZE); | ||
int shingleDiff = maxShingleSize - minShingleSize; | ||
if (shingleDiff > maxAllowedShingleDiff) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that this would make the upgrade experience really difficult for indices that break this limit already. They would have to close their index before the upgrade, change the limit in the new version and open the index again. That's a bit too much I believe so to be safe we should check this setting only for indices created before it existed:
if (indexSettings.getIndexVersionCreated().onOrAfter(Version.V_7_0_0_alpha1))
... and only log a deprecation warning if the filter breaks the limit on an index that was created an a previous version (see DeprecationLogger and its usage to see how to do that).
This way we could also backport this new setting in 6.x and makes the upgrade easier.
Create index-level settings: max_ngram_diff - maximum allowed difference between max_gram and min_gram in NGramTokenFilter/NGramTokenizer. Default is 1. max_shingle_diff - maximum allowed difference between max_shingle_size and min_shingle_size in ShingleTokenFilter. Default is 3. Throw an IllegalArgumentException when trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter where difference between max_size and min_size exceeds the settings value. Closes elastic#25887
81baa06
to
13303f1
Compare
Create index-level settings: max_ngram_diff - maximum allowed difference between max_gram and min_gram in NGramTokenFilter/NGramTokenizer. Default is 1. max_shingle_diff - maximum allowed difference betweenmax_shingle_size and min_shingle_size in ShingleTokenFilter. Default is 3. Throw an IllegalArgumentException from v7.X when trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter where difference between max_size and min_size exceeds the settings value. Create a deprecated warning for versions < 7.X for this case. Closes elastic#25887
13303f1
to
e26f6f1
Compare
@jimczi Jim, thank you for your feedback. I have addressed your comments in the commit. Could you please review when you have time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @mayya-sharipova
* master: (556 commits) Fix find remote when building BWC Remove colons from task and configuration names Add unreleased 5.6.5 version number testCreateSplitIndexToN: do not set `routing_partition_size` to >= `number_of_routing_shards` Snapshot/Restore: better handle incorrect chunk_size settings in FS repo (elastic#26844) Add limits for ngram and shingle settings (elastic#27211) (elastic#27318) Correct comment in index shard test Roll translog generation on primary promotion ObjectParser: Replace IllegalStateException with ParsingException (elastic#27302) scripted_metric _agg parameter disappears if params are provided (elastic#27159) Update discovery-ec2.asciidoc Update shrink's bwc version to 6.1.0 and enabled bwc tests Add limits for ngram and shingle settings (elastic#27211) Disable bwc tests in preparation of backporting elastic#26931 TemplateUpgradeService should only run on the master (elastic#27294) Die with dignity while merging Fix profiling naming issues (elastic#27133) Correctly encode warning headers Fixed references to Multi Index Syntax (elastic#27283) Add an active Elasticsearch WordPress plugin link (elastic#27279) ...
* master: (22 commits) Update Tika version to 1.15 Aggregations: bucket_sort pipeline aggregation (#27152) Introduce templating support to timezone/locale in DateProcessor (#27089) Increase logging on qa:mixed-cluster tests Update to AWS SDK 1.11.223 (#27278) Improve error message for parse failures of completion fields (#27297) Ensure external refreshes will also refresh internal searcher to minimize segment creation (#27253) Remove optimisations to reuse objects when applying a new `ClusterState` (#27317) Decouple `ChannelFactory` from Tcp classes (#27286) Fix find remote when building BWC Remove colons from task and configuration names Add unreleased 5.6.5 version number testCreateSplitIndexToN: do not set `routing_partition_size` to >= `number_of_routing_shards` Snapshot/Restore: better handle incorrect chunk_size settings in FS repo (#26844) Add limits for ngram and shingle settings (#27211) (#27318) Correct comment in index shard test Roll translog generation on primary promotion ObjectParser: Replace IllegalStateException with ParsingException (#27302) scripted_metric _agg parameter disappears if params are provided (#27159) Update discovery-ec2.asciidoc ...
This commit removes conditionals for logic that no longer needs to be version dependent, in that all indices in the cluster will match such condition. Relates to elastic#27211,elastic#34331
Create index-level settings:
max_ngram_diff - maximum allowed difference between max_gram and min_gram in
NGramTokenFilter/NGramTokenizer. Default is 1.
max_shingle_diff - maximum allowed difference between max_shingle_size and
min_shingle_size in ShingleTokenFilter. Default is 3.
Throw an IllegalArgumentException when
trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter
where difference between max_size and min_size exceeds the settings value.
Closes #25887