Add limits for ngram and shingle settings #27211

mayya-sharipova · 2017-11-01T16:23:34Z

Create index-level settings:
max_ngram_diff - maximum allowed difference between max_gram and min_gram in
NGramTokenFilter/NGramTokenizer. Default is 1.
max_shingle_diff - maximum allowed difference between max_shingle_size and
min_shingle_size in ShingleTokenFilter. Default is 3.

Throw an IllegalArgumentException when
trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter
where difference between max_size and min_size exceeds the settings value.

Closes #25887

mayya-sharipova · 2017-11-01T18:40:54Z

retest this please

mayya-sharipova · 2017-11-02T02:08:17Z

retest this please

nik9000 · 2017-11-02T03:05:54Z

@mayya-sharipova, can you drop the PR template when you open the PR? The checklist is great to go over but not super useful to leave in the description.

jimczi

The changes look good @mayya-sharipova
Can you add a small note in the documentation regarding these limits ?
For the shingle for instance it would be in: docs/reference/analysis/tokenfilters/shingle-tokenfilter.asciidoc`

jimczi · 2017-11-02T15:08:56Z

...lysis-common/src/test/java/org/elasticsearch/analysis/common/NGramTokenizerFactoryTests.java

+        final Settings settings = newAnalysisSettingsBuilder().put("min_gram", min_gram).put("max_gram", max_gram).build();
+        try {
+            new NGramTokenizerFactory(indexProperties, null, name, settings).create();
+            fail();


Maybe use LuceneTestCase#expectThrows

I was trying to use LuceneTestCase#expectThrows, but it requires that a constructorpublic NGramTokenizerFactory(..) throw a checked exception. We can add this to the constructor, but then we need to modify all other classes that use NGramTokenizerFactory to catch this exception

It works fine with a runtime exception, you don't need to change the constructor:

IllegalArgumentException exc = expectThrows(IllegalArgumentException.class, () -> new NGramTokenizerFactory(indexProperties, null, name, settings).create());

.. and then you can check the message of the exception.

thanks Jim, will try this

jimczi · 2017-11-02T15:09:20Z

core/src/main/java/org/elasticsearch/index/analysis/ShingleTokenFilterFactory.java

        Integer maxShingleSize = settings.getAsInt("max_shingle_size", ShingleFilter.DEFAULT_MAX_SHINGLE_SIZE);
        Integer minShingleSize = settings.getAsInt("min_shingle_size", ShingleFilter.DEFAULT_MIN_SHINGLE_SIZE);
+        int shingleDiff = maxShingleSize - minShingleSize;


Can you add output_unigrams in the formula:
maxGram - minGram + (outputUnigram ? 1 : 0) ?
This way the default diff is 1 since output_unigrams defaults to true.

jimczi · 2017-11-02T15:29:26Z

core/src/main/java/org/elasticsearch/index/IndexSettings.java

+     * the maximum difference between
+     * max_shingle_size and min_shingle_size.
+     * The default value is 3 is defensive as it prevents generating too many tokens.
+     */


Yes too many tokens is problematic. I also don't like that we build invalid boolean queries when the diff is greater than 0. For instance with unigram set to true and shingles of size 2, the input foo bar foobar is analyzed as:
position 1: (foo_bar, foo), position 2: (bar, bar_foobar) ... which is problematic for any position query.
3 seems reasonable especially for fields that don't index positions.

@jimczi thanks for your comment. I don't think I understand how diff in shingle size greater that 0 makes boolean queries invalid. Does it mean that if we have tokens: {"token" : "foo_bar", "position" : 1}, {"token" : "foo", "position" : 1}, {"token" : "bar", "position" : 2}, {"token" : "bar_foobar", "position" : 2},, we can't have a phrase query say: "foo bar"?

foo bar would return the correct document but it would build an invalid phrase query:

"(foo_bar foo) bar"

... trying to find document with foo_bar bar as a phrase query which could be simplified in foo_bar. For boolean query it would not consider that foo_bar is enough to match foo AND bar so the bigram would be useless for matching this type of query. We have solutions on the query side to build the correct query (foo_bar OR (foo AND bar)) but this is currently deactivated because it can explode the number of clauses in the query if you have a large diff between min and max.
For simplicity I prefer when a single shingle size is used for a shingle field. In this case a phrase query like "foo bar foobar baz" could be optimized to "foo_bar foobar_baz" with shingles of size 2 (bar_foobar can be safely ignored since we know that the previous match can only be foo_bar).

jimczi · 2017-11-02T15:58:22Z

core/src/main/java/org/elasticsearch/index/analysis/ShingleTokenFilterFactory.java

        Integer maxShingleSize = settings.getAsInt("max_shingle_size", ShingleFilter.DEFAULT_MAX_SHINGLE_SIZE);
        Integer minShingleSize = settings.getAsInt("min_shingle_size", ShingleFilter.DEFAULT_MIN_SHINGLE_SIZE);
+        int shingleDiff = maxShingleSize - minShingleSize;
+        if (shingleDiff > maxAllowedShingleDiff) {


I just realized that this would make the upgrade experience really difficult for indices that break this limit already. They would have to close their index before the upgrade, change the limit in the new version and open the index again. That's a bit too much I believe so to be safe we should check this setting only for indices created before it existed:

if (indexSettings.getIndexVersionCreated().onOrAfter(Version.V_7_0_0_alpha1))

... and only log a deprecation warning if the filter breaks the limit on an index that was created an a previous version (see DeprecationLogger and its usage to see how to do that).
This way we could also backport this new setting in 6.x and makes the upgrade easier.

Create index-level settings: max_ngram_diff - maximum allowed difference between max_gram and min_gram in NGramTokenFilter/NGramTokenizer. Default is 1. max_shingle_diff - maximum allowed difference between max_shingle_size and min_shingle_size in ShingleTokenFilter. Default is 3. Throw an IllegalArgumentException when trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter where difference between max_size and min_size exceeds the settings value. Closes elastic#25887

Create index-level settings: max_ngram_diff - maximum allowed difference between max_gram and min_gram in NGramTokenFilter/NGramTokenizer. Default is 1. max_shingle_diff - maximum allowed difference betweenmax_shingle_size and min_shingle_size in ShingleTokenFilter. Default is 3. Throw an IllegalArgumentException from v7.X when trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter where difference between max_size and min_size exceeds the settings value. Create a deprecated warning for versions < 7.X for this case. Closes elastic#25887

mayya-sharipova · 2017-11-04T21:14:14Z

@jimczi Jim, thank you for your feedback. I have addressed your comments in the commit. Could you please review when you have time.

jimczi

LGTM, thanks @mayya-sharipova

Relates to elastic#25887

Relates to #25887

* master: (556 commits) Fix find remote when building BWC Remove colons from task and configuration names Add unreleased 5.6.5 version number testCreateSplitIndexToN: do not set `routing_partition_size` to >= `number_of_routing_shards` Snapshot/Restore: better handle incorrect chunk_size settings in FS repo (elastic#26844) Add limits for ngram and shingle settings (elastic#27211) (elastic#27318) Correct comment in index shard test Roll translog generation on primary promotion ObjectParser: Replace IllegalStateException with ParsingException (elastic#27302) scripted_metric _agg parameter disappears if params are provided (elastic#27159) Update discovery-ec2.asciidoc Update shrink's bwc version to 6.1.0 and enabled bwc tests Add limits for ngram and shingle settings (elastic#27211) Disable bwc tests in preparation of backporting elastic#26931 TemplateUpgradeService should only run on the master (elastic#27294) Die with dignity while merging Fix profiling naming issues (elastic#27133) Correctly encode warning headers Fixed references to Multi Index Syntax (elastic#27283) Add an active Elasticsearch WordPress plugin link (elastic#27279) ...

* master: (22 commits) Update Tika version to 1.15 Aggregations: bucket_sort pipeline aggregation (#27152) Introduce templating support to timezone/locale in DateProcessor (#27089) Increase logging on qa:mixed-cluster tests Update to AWS SDK 1.11.223 (#27278) Improve error message for parse failures of completion fields (#27297) Ensure external refreshes will also refresh internal searcher to minimize segment creation (#27253) Remove optimisations to reuse objects when applying a new `ClusterState` (#27317) Decouple `ChannelFactory` from Tcp classes (#27286) Fix find remote when building BWC Remove colons from task and configuration names Add unreleased 5.6.5 version number testCreateSplitIndexToN: do not set `routing_partition_size` to >= `number_of_routing_shards` Snapshot/Restore: better handle incorrect chunk_size settings in FS repo (#26844) Add limits for ngram and shingle settings (#27211) (#27318) Correct comment in index shard test Roll translog generation on primary promotion ObjectParser: Replace IllegalStateException with ParsingException (#27302) scripted_metric _agg parameter disappears if params are provided (#27159) Update discovery-ec2.asciidoc ...

Relates: elastic/elasticsearch#27211

This commit removes conditionals for logic that no longer needs to be version dependent, in that all indices in the cluster will match such condition. Relates to elastic#27211,elastic#34331

This commit removes conditionals for logic that no longer needs to be version dependent, in that all indices in the cluster will match such condition. Relates to #27211,#34331

mayya-sharipova force-pushed the add-limits-ngram-shingle-settings branch from e463ca4 to e961c03 Compare November 1, 2017 16:25

mayya-sharipova force-pushed the add-limits-ngram-shingle-settings branch from e961c03 to 412e589 Compare November 1, 2017 20:17

jimczi reviewed Nov 2, 2017

View reviewed changes

mayya-sharipova force-pushed the add-limits-ngram-shingle-settings branch 2 times, most recently from 81baa06 to 13303f1 Compare November 3, 2017 15:26

mayya-sharipova force-pushed the add-limits-ngram-shingle-settings branch from 13303f1 to e26f6f1 Compare November 3, 2017 16:12

jimczi approved these changes Nov 6, 2017

View reviewed changes

mayya-sharipova merged commit 148376c into elastic:master Nov 7, 2017

mayya-sharipova added >breaking :Search Relevance/Analysis How text is split into tokens >enhancement v7.0.0 labels Nov 7, 2017

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this pull request Nov 8, 2017

Add limits for ngram and shingle settings (elastic#27211)

9a4c34a

Relates to elastic#25887

mayya-sharipova mentioned this pull request Nov 8, 2017

Add limits for ngram and shingle settings #27318

Merged

mayya-sharipova added a commit that referenced this pull request Nov 8, 2017

Add limits for ngram and shingle settings (#27211) (#27318)

abbe853

Relates to #25887

mayya-sharipova removed the >enhancement label Nov 13, 2017

mayya-sharipova mentioned this pull request Nov 16, 2017

Add limits for ngram and shingle settings #27411

Merged

andyb-elastic mentioned this pull request Mar 19, 2018

Fuzzy match very slow on indices with shingle filter #23594

Closed

gwbrown mentioned this pull request Nov 29, 2018

Update Deprecation Info API for 7.0 breaking changes #36024

Closed

26 tasks

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jun 20, 2019

Add updateable max* index settings

24b87c5

Relates: elastic/elasticsearch#27211

russcam mentioned this pull request Jun 20, 2019

Add updateable max* index settings elastic/elasticsearch-net#3838

Merged

Mpdreamz pushed a commit to elastic/elasticsearch-net that referenced this pull request Jun 20, 2019

Add updateable max* index settings (#3838)

66010ff

Relates: elastic/elasticsearch#27211

javanna mentioned this pull request Sep 17, 2024

Remove version dependent logic from ShingleTokenFilterFactory #113019

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add limits for ngram and shingle settings #27211

Add limits for ngram and shingle settings #27211

mayya-sharipova commented Nov 1, 2017 •

edited by nik9000

Loading

mayya-sharipova commented Nov 1, 2017

mayya-sharipova commented Nov 2, 2017

nik9000 commented Nov 2, 2017

jimczi left a comment

jimczi Nov 2, 2017

mayya-sharipova Nov 2, 2017 •

edited

Loading

jimczi Nov 3, 2017

mayya-sharipova Nov 3, 2017

jimczi Nov 2, 2017

jimczi Nov 2, 2017

mayya-sharipova Nov 2, 2017

jimczi Nov 3, 2017

jimczi Nov 2, 2017

mayya-sharipova commented Nov 4, 2017

jimczi left a comment

Add limits for ngram and shingle settings #27211

Add limits for ngram and shingle settings #27211

Conversation

mayya-sharipova commented Nov 1, 2017 • edited by nik9000 Loading

mayya-sharipova commented Nov 1, 2017

mayya-sharipova commented Nov 2, 2017

nik9000 commented Nov 2, 2017

jimczi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mayya-sharipova Nov 2, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mayya-sharipova commented Nov 4, 2017

jimczi left a comment

Choose a reason for hiding this comment

mayya-sharipova commented Nov 1, 2017 •

edited by nik9000

Loading

mayya-sharipova Nov 2, 2017 •

edited

Loading