Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add minimal sanity checks to custom/scripted similarities. #33564

Merged
merged 5 commits into from
Sep 19, 2018

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Sep 10, 2018

Lucene 8 introduced more constraints on similarities, in particular:

  • scores must not be negative,
  • scores must not decrease when term freq increases,
  • scores must not increase when norm (interpreted as an unsigned long)
    increases.

We can't check every single case, but could at least run some sanity checks.

Relates #33309

Lucene 8 introduced more constraints on similarities, in particular:
 - scores must not be negative,
 - scores must not decrease when term freq increases,
 - scores must not increase when norm (interpreted as an unsigned long)
   increases.

We can't check every single case, but could at least run some sanity checks.

Relates elastic#33309
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@jimczi jimczi mentioned this pull request Sep 10, 2018
5 tasks
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I wonder if this is sufficient for custom similarities though. Should we also check scores at runtime to catch all negative scores ?

@romseygeek
Copy link
Contributor

Can we check at runtime without slowing things down massively though?

@jpountz
Copy link
Contributor Author

jpountz commented Sep 12, 2018

@jimczi @romseygeek I added runtime checks for negative scores. This is only done for custom types and scripted similarities to keep the overhead contained, and only adds one float comparison otherwise. I also added an escape hatch which truncates scores so that you are not doomed to reindex in order to change your similarity in case you happen to discover corner-cases eg. in a scripted similarity.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jpountz, the escape hatch is a great idea.

@elasticdog
Copy link
Contributor

This job triggered CI during a migration of the master. Kicking off an additional build for you manually...

Jenkins, test this please.

@jpountz
Copy link
Contributor Author

jpountz commented Sep 18, 2018

Jenkins, test this please.

@jpountz jpountz merged commit c4261ba into elastic:master Sep 19, 2018
@jpountz jpountz deleted the fix/check_similarity_scores branch September 19, 2018 07:19
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Sep 19, 2018
* master: (46 commits)
  Fixing assertions in integration test (elastic#33833)
  [CCR] Rename idle_shard_retry_delay to poll_timout in auto follow patterns (elastic#33821)
  HLRC: Delete ML calendar (elastic#33775)
  Move DocsStats into Engine (elastic#33835)
  [Docs] Clarify accessing Date methods in painless (elastic#33560)
  add elasticsearch-shard tool (elastic#32281)
  Cut over to unwrap segment reader (elastic#33843)
  SQL: Fix issue with options for QUERY() and MATCH(). (elastic#33828)
  Emphasize that filesystem-level backups don't work (elastic#33102)
  Use the global doc id to generate a random score (elastic#33599)
  Add minimal sanity checks to custom/scripted similarities. (elastic#33564)
  Profiler: Don’t profile NEXTDOC for ConstantScoreQuery. (elastic#33196)
  [CCR] Change FollowIndexAction.Request class to be more user friendly (elastic#33810)
  SQL: day and month name functions tests locale providers enforcement (elastic#33653)
  TESTS: Set SO_LINGER = 0 for MockNioTransport (elastic#32560)
  Test: Relax jarhell gradle test (elastic#33787)
  [CCR] Fail with a descriptive error if leader index does not exist (elastic#33797)
  Add ES version 6.4.2 (elastic#33831)
  MINOR: Remove Some Dead Code in Scripting (elastic#33800)
  Ensure realtime `_get` and `_termvectors` don't run on the network thread (elastic#33814)
  ...
@jpountz jpountz removed the v6.5.0 label Sep 20, 2018
jpountz added a commit that referenced this pull request Sep 25, 2018
…#33893)

Add minimal sanity checks to custom/scripted similarities.

Lucene 8 introduced more constraints on similarities, in particular:
 - scores must not be negative,
 - scores must not decrease when term freq increases,
 - scores must not increase when norm (interpreted as an unsigned long)
   increases.

We can't check every single case, but could at least run some sanity checks.

Backport of #33564
@feherbbj
Copy link

Regarding the first and last statement i completely understand why you want to avoid those cases, but regarding 'scores must not decrease when term freq increases' can i ask why it shall be checked. Is there any issue with the performance if this case happens? You states that Lucence 8 forced this sanity check, so even though if somehow i managed to skip this specific sanity check, then will throw some exception?

If i understand well now if you search for 'abc' on specific field and you find a document, field of which value is 'abc' and another document with 'abcabc' , then 'abcabc' must always have to be declared more relevant than 'abc' by the custom similarity scoring process. This does not seems rights for some cases. For example i use ElasticSearch for text analysis and in most of the cases text analysis prefers to find the most matching/fitting term, so it prefers to find (has a higher score) 'abc' over 'abcabc'.

How shall i overcome this issue? Is there any way to configure it in the settings to skip this particular check? Can you add a settings to skip this check? I want to upgrade from 6.7 to 7.0, but this way it seems impossible. Or maybe i misunderstand something?

@jpountz
Copy link
Contributor Author

jpountz commented May 27, 2019

regarding 'scores must not decrease when term freq increases' can i ask why it shall be checked. Is there any issue with the performance if this case happens?

It has to do with the fact that we try to record the maximum score per block since 8.0. Since we can't actually store scores, which depend on index statistics that we don't have at index time such as doc frequency (total number of documents that contain a term across all segments), we actually store pairs of (term_frequency, norm). But then we don't want to record every unique pair that occurs within a block, so we automatically remove those that trigger less competitive scores than others and introduced the rule that scores can't decrease when term_frequency increases of norm decreases for that purpose. This way if one document has a (term_frequency, norm) pair of (3, 12) and another one of (2, 12) in the same block, then we know that the score of (2, 12) is less than the score of (3, 12) and we just store the latter.

If you don't meet this requirement, Elasticsearch might fail to match some hits, because it will assume that their score is not competitive.

If i understand well now if you search for 'abc' on specific field and you find a document, field of which value is 'abc' and another document with 'abcabc' , then 'abcabc' must always have to be declared more relevant than 'abc' by the custom similarity scoring process.

If your term is abc, one document is abc and the other one is abc abc, the latter does indeed have a term frequency of 2, but it also has a greater norm, so their scores are not comparable. The above statement that "score must not decrease when term frequency increases" assumes document that have the same value for the normalization factor. Similarities are free to rank abc over abc abc or vice-versa since their (term_frequency,norm) pairs are not directly comparable: if they were to both occur in the same block, we'd store them both.

Can you add a settings to skip this check?

Are you actually hitting this check? Can you share the definition of your scripted similarity?

@feherbbj
Copy link

Sorry for the late reply and thank you for the information!

So you try to store pairs of (term_frequency, norm). According to this article (https://www.elastic.co/guide/en/elasticsearch/reference/7.1/text.html) at section 'norms' you can disable norms for a field. What happens if i disable a norms for a field and i use a similarity on that field where score increases when norm increases (so it fails at your sanity check). Can i update the schema this way? Or it will throw this sanity error if i try to put the index settings and mappings?

Since if you disabled norms, then i assume ES/Lucene can not calculate store these pairs, so this sanity check should not happen in that case.

You asked whether i am hitting this check. Reinstalled ES 7.0.0 and tried to put the schema again and it turned out i was wrong, i bumped into the 3rd check : "scores must not increase when norm (interpreted as an unsigned long) increases.".

Regarding the similarity script:

  • The field is n-gram tokenized, on which the script is defined
  • I boost the query with a special number (call it 'sn')
  • If sn > doc.length, then i return doc.length/sn
  • Otherwise i return sn/return doc.length
  • So technically till a specific length score should increase after that length score should decrease

@jpountz
Copy link
Contributor Author

jpountz commented Jun 18, 2019

What happens if i disable a norms for a field and i use a similarity on that field where score increases when norm increases (so it fails at your sanity check). Can i update the schema this way? Or it will throw this sanity error if i try to put the index settings and mappings?

The latter, this will be rejected. Disabled norms are treated as if all documents had a norm value of 1. In that case postings only record a single pair, which is (max(termFreq), 1).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants