Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add limits for ngram and shingle settings #27211

Conversation

mayya-sharipova
Copy link
Contributor

@mayya-sharipova mayya-sharipova commented Nov 1, 2017

Create index-level settings:
max_ngram_diff - maximum allowed difference between max_gram and min_gram in
NGramTokenFilter/NGramTokenizer. Default is 1.
max_shingle_diff - maximum allowed difference between max_shingle_size and
min_shingle_size in ShingleTokenFilter. Default is 3.

Throw an IllegalArgumentException when
trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter
where difference between max_size and min_size exceeds the settings value.

Closes #25887

@mayya-sharipova mayya-sharipova force-pushed the add-limits-ngram-shingle-settings branch from e463ca4 to e961c03 Compare November 1, 2017 16:25
@mayya-sharipova
Copy link
Contributor Author

retest this please

@mayya-sharipova mayya-sharipova force-pushed the add-limits-ngram-shingle-settings branch from e961c03 to 412e589 Compare November 1, 2017 20:17
@mayya-sharipova
Copy link
Contributor Author

retest this please

@nik9000
Copy link
Member

nik9000 commented Nov 2, 2017

@mayya-sharipova, can you drop the PR template when you open the PR? The checklist is great to go over but not super useful to leave in the description.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good @mayya-sharipova
Can you add a small note in the documentation regarding these limits ?
For the shingle for instance it would be in: docs/reference/analysis/tokenfilters/shingle-tokenfilter.asciidoc`

final Settings settings = newAnalysisSettingsBuilder().put("min_gram", min_gram).put("max_gram", max_gram).build();
try {
new NGramTokenizerFactory(indexProperties, null, name, settings).create();
fail();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use LuceneTestCase#expectThrows

Copy link
Contributor Author

@mayya-sharipova mayya-sharipova Nov 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to use LuceneTestCase#expectThrows, but it requires that a constructorpublic NGramTokenizerFactory(..) throw a checked exception. We can add this to the constructor, but then we need to modify all other classes that use NGramTokenizerFactory to catch this exception

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works fine with a runtime exception, you don't need to change the constructor:

IllegalArgumentException exc = expectThrows(IllegalArgumentException.class, () ->
                new NGramTokenizerFactory(indexProperties, null, name, settings).create());

.. and then you can check the message of the exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks Jim, will try this

Integer maxShingleSize = settings.getAsInt("max_shingle_size", ShingleFilter.DEFAULT_MAX_SHINGLE_SIZE);
Integer minShingleSize = settings.getAsInt("min_shingle_size", ShingleFilter.DEFAULT_MIN_SHINGLE_SIZE);
int shingleDiff = maxShingleSize - minShingleSize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add output_unigrams in the formula:
maxGram - minGram + (outputUnigram ? 1 : 0) ?
This way the default diff is 1 since output_unigrams defaults to true.

* the maximum difference between
* max_shingle_size and min_shingle_size.
* The default value is 3 is defensive as it prevents generating too many tokens.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes too many tokens is problematic. I also don't like that we build invalid boolean queries when the diff is greater than 0. For instance with unigram set to true and shingles of size 2, the input foo bar foobar is analyzed as:
position 1: (foo_bar, foo), position 2: (bar, bar_foobar) ... which is problematic for any position query.
3 seems reasonable especially for fields that don't index positions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi thanks for your comment. I don't think I understand how diff in shingle size greater that 0 makes boolean queries invalid. Does it mean that if we have tokens: {"token" : "foo_bar", "position" : 1}, {"token" : "foo", "position" : 1}, {"token" : "bar", "position" : 2}, {"token" : "bar_foobar", "position" : 2},, we can't have a phrase query say: "foo bar"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

foo bar would return the correct document but it would build an invalid phrase query:

"(foo_bar foo) bar"

... trying to find document with foo_bar bar as a phrase query which could be simplified in foo_bar. For boolean query it would not consider that foo_bar is enough to match foo AND bar so the bigram would be useless for matching this type of query. We have solutions on the query side to build the correct query (foo_bar OR (foo AND bar)) but this is currently deactivated because it can explode the number of clauses in the query if you have a large diff between min and max.
For simplicity I prefer when a single shingle size is used for a shingle field. In this case a phrase query like "foo bar foobar baz" could be optimized to "foo_bar foobar_baz" with shingles of size 2 (bar_foobar can be safely ignored since we know that the previous match can only be foo_bar).

Integer maxShingleSize = settings.getAsInt("max_shingle_size", ShingleFilter.DEFAULT_MAX_SHINGLE_SIZE);
Integer minShingleSize = settings.getAsInt("min_shingle_size", ShingleFilter.DEFAULT_MIN_SHINGLE_SIZE);
int shingleDiff = maxShingleSize - minShingleSize;
if (shingleDiff > maxAllowedShingleDiff) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that this would make the upgrade experience really difficult for indices that break this limit already. They would have to close their index before the upgrade, change the limit in the new version and open the index again. That's a bit too much I believe so to be safe we should check this setting only for indices created before it existed:

if (indexSettings.getIndexVersionCreated().onOrAfter(Version.V_7_0_0_alpha1))

... and only log a deprecation warning if the filter breaks the limit on an index that was created an a previous version (see DeprecationLogger and its usage to see how to do that).
This way we could also backport this new setting in 6.x and makes the upgrade easier.

Create index-level settings:
max_ngram_diff - maximum allowed difference between max_gram and min_gram in
NGramTokenFilter/NGramTokenizer. Default is 1.
max_shingle_diff - maximum allowed difference between max_shingle_size and
 min_shingle_size in ShingleTokenFilter.  Default is 3.

Throw an IllegalArgumentException when
trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter
where difference between max_size and min_size exceeds the settings value.

Closes elastic#25887
@mayya-sharipova mayya-sharipova force-pushed the add-limits-ngram-shingle-settings branch 2 times, most recently from 81baa06 to 13303f1 Compare November 3, 2017 15:26
Create index-level settings:
max_ngram_diff - maximum allowed difference between max_gram and min_gram in
NGramTokenFilter/NGramTokenizer. Default is 1.
max_shingle_diff - maximum allowed difference betweenmax_shingle_size and
 min_shingle_size in ShingleTokenFilter.  Default is 3.

Throw an IllegalArgumentException from v7.X when
trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter
where difference between max_size and min_size exceeds the settings value.
Create a deprecated warning for versions < 7.X for this case.

Closes elastic#25887
@mayya-sharipova mayya-sharipova force-pushed the add-limits-ngram-shingle-settings branch from 13303f1 to e26f6f1 Compare November 3, 2017 16:12
@mayya-sharipova
Copy link
Contributor Author

@jimczi Jim, thank you for your feedback. I have addressed your comments in the commit. Could you please review when you have time.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @mayya-sharipova

@mayya-sharipova mayya-sharipova merged commit 148376c into elastic:master Nov 7, 2017
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this pull request Nov 8, 2017
jasontedor added a commit to glefloch/elasticsearch that referenced this pull request Nov 9, 2017
* master: (556 commits)
  Fix find remote when building BWC
  Remove colons from task and configuration names
  Add unreleased 5.6.5 version number
  testCreateSplitIndexToN: do not set `routing_partition_size` to >= `number_of_routing_shards`
  Snapshot/Restore: better handle incorrect chunk_size settings in FS repo (elastic#26844)
  Add limits for ngram and shingle settings (elastic#27211) (elastic#27318)
  Correct comment in index shard test
  Roll translog generation on primary promotion
  ObjectParser: Replace IllegalStateException with ParsingException (elastic#27302)
  scripted_metric _agg parameter disappears if params are provided (elastic#27159)
  Update discovery-ec2.asciidoc
  Update shrink's bwc version to 6.1.0 and enabled bwc tests
  Add limits for ngram and shingle settings (elastic#27211)
  Disable bwc tests in preparation of backporting elastic#26931
  TemplateUpgradeService should only run on the master (elastic#27294)
  Die with dignity while merging
  Fix profiling naming issues (elastic#27133)
  Correctly encode warning headers
  Fixed references to Multi Index Syntax (elastic#27283)
  Add an active Elasticsearch WordPress plugin link (elastic#27279)
  ...
jasontedor added a commit that referenced this pull request Nov 9, 2017
* master: (22 commits)
  Update Tika version to 1.15
  Aggregations: bucket_sort pipeline aggregation (#27152)
  Introduce templating support to timezone/locale in DateProcessor (#27089)
  Increase logging on qa:mixed-cluster tests
  Update to AWS SDK 1.11.223 (#27278)
  Improve error message for parse failures of completion fields (#27297)
  Ensure external refreshes will also refresh internal searcher to minimize segment creation (#27253)
  Remove optimisations to reuse objects when applying a new `ClusterState` (#27317)
  Decouple `ChannelFactory` from Tcp classes (#27286)
  Fix find remote when building BWC
  Remove colons from task and configuration names
  Add unreleased 5.6.5 version number
  testCreateSplitIndexToN: do not set `routing_partition_size` to >= `number_of_routing_shards`
  Snapshot/Restore: better handle incorrect chunk_size settings in FS repo (#26844)
  Add limits for ngram and shingle settings (#27211) (#27318)
  Correct comment in index shard test
  Roll translog generation on primary promotion
  ObjectParser: Replace IllegalStateException with ParsingException (#27302)
  scripted_metric _agg parameter disappears if params are provided (#27159)
  Update discovery-ec2.asciidoc
  ...
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jun 20, 2019
Mpdreamz pushed a commit to elastic/elasticsearch-net that referenced this pull request Jun 20, 2019
javanna added a commit to javanna/elasticsearch that referenced this pull request Sep 17, 2024
This commit removes conditionals for logic that no longer needs to be version
dependent, in that all indices in the cluster will match such condition.

Relates to elastic#27211,elastic#34331
javanna added a commit that referenced this pull request Sep 18, 2024
This commit removes conditionals for logic that no longer needs to be version
dependent, in that all indices in the cluster will match such condition.

Relates to #27211,#34331
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants