Add minimal sanity checks to custom/scripted similarities. #33564

jpountz · 2018-09-10T12:43:58Z

Lucene 8 introduced more constraints on similarities, in particular:

scores must not be negative,
scores must not decrease when term freq increases,
scores must not increase when norm (interpreted as an unsigned long)
increases.

We can't check every single case, but could at least run some sanity checks.

Lucene 8 introduced more constraints on similarities, in particular: - scores must not be negative, - scores must not decrease when term freq increases, - scores must not increase when norm (interpreted as an unsigned long) increases. We can't check every single case, but could at least run some sanity checks. Relates elastic#33309

elasticmachine · 2018-09-10T12:44:00Z

Pinging @elastic/es-search-aggs

jimczi

LGTM. I wonder if this is sufficient for custom similarities though. Should we also check scores at runtime to catch all negative scores ?

romseygeek · 2018-09-10T13:27:08Z

Can we check at runtime without slowing things down massively though?

jpountz · 2018-09-12T14:34:20Z

@jimczi @romseygeek I added runtime checks for negative scores. This is only done for custom types and scripted similarities to keep the overhead contained, and only adds one float comparison otherwise. I also added an escape hatch which truncates scores so that you are not doomed to reindex in order to change your similarity in case you happen to discover corner-cases eg. in a scripted similarity.

jimczi

Thanks @jpountz, the escape hatch is a great idea.

elasticdog · 2018-09-14T15:27:44Z

This job triggered CI during a migration of the master. Kicking off an additional build for you manually...

Jenkins, test this please.

jpountz · 2018-09-18T08:57:11Z

Jenkins, test this please.

* master: (46 commits) Fixing assertions in integration test (elastic#33833) [CCR] Rename idle_shard_retry_delay to poll_timout in auto follow patterns (elastic#33821) HLRC: Delete ML calendar (elastic#33775) Move DocsStats into Engine (elastic#33835) [Docs] Clarify accessing Date methods in painless (elastic#33560) add elasticsearch-shard tool (elastic#32281) Cut over to unwrap segment reader (elastic#33843) SQL: Fix issue with options for QUERY() and MATCH(). (elastic#33828) Emphasize that filesystem-level backups don't work (elastic#33102) Use the global doc id to generate a random score (elastic#33599) Add minimal sanity checks to custom/scripted similarities. (elastic#33564) Profiler: Don’t profile NEXTDOC for ConstantScoreQuery. (elastic#33196) [CCR] Change FollowIndexAction.Request class to be more user friendly (elastic#33810) SQL: day and month name functions tests locale providers enforcement (elastic#33653) TESTS: Set SO_LINGER = 0 for MockNioTransport (elastic#32560) Test: Relax jarhell gradle test (elastic#33787) [CCR] Fail with a descriptive error if leader index does not exist (elastic#33797) Add ES version 6.4.2 (elastic#33831) MINOR: Remove Some Dead Code in Scripting (elastic#33800) Ensure realtime `_get` and `_termvectors` don't run on the network thread (elastic#33814) ...

…#33893) Add minimal sanity checks to custom/scripted similarities. Lucene 8 introduced more constraints on similarities, in particular: - scores must not be negative, - scores must not decrease when term freq increases, - scores must not increase when norm (interpreted as an unsigned long) increases. We can't check every single case, but could at least run some sanity checks. Backport of #33564

feherbbj · 2019-05-27T12:16:57Z

Regarding the first and last statement i completely understand why you want to avoid those cases, but regarding 'scores must not decrease when term freq increases' can i ask why it shall be checked. Is there any issue with the performance if this case happens? You states that Lucence 8 forced this sanity check, so even though if somehow i managed to skip this specific sanity check, then will throw some exception?

If i understand well now if you search for 'abc' on specific field and you find a document, field of which value is 'abc' and another document with 'abcabc' , then 'abcabc' must always have to be declared more relevant than 'abc' by the custom similarity scoring process. This does not seems rights for some cases. For example i use ElasticSearch for text analysis and in most of the cases text analysis prefers to find the most matching/fitting term, so it prefers to find (has a higher score) 'abc' over 'abcabc'.

How shall i overcome this issue? Is there any way to configure it in the settings to skip this particular check? Can you add a settings to skip this check? I want to upgrade from 6.7 to 7.0, but this way it seems impossible. Or maybe i misunderstand something?

jpountz · 2019-05-27T13:17:13Z

regarding 'scores must not decrease when term freq increases' can i ask why it shall be checked. Is there any issue with the performance if this case happens?

It has to do with the fact that we try to record the maximum score per block since 8.0. Since we can't actually store scores, which depend on index statistics that we don't have at index time such as doc frequency (total number of documents that contain a term across all segments), we actually store pairs of (term_frequency, norm). But then we don't want to record every unique pair that occurs within a block, so we automatically remove those that trigger less competitive scores than others and introduced the rule that scores can't decrease when term_frequency increases of norm decreases for that purpose. This way if one document has a (term_frequency, norm) pair of (3, 12) and another one of (2, 12) in the same block, then we know that the score of (2, 12) is less than the score of (3, 12) and we just store the latter.

If you don't meet this requirement, Elasticsearch might fail to match some hits, because it will assume that their score is not competitive.

If i understand well now if you search for 'abc' on specific field and you find a document, field of which value is 'abc' and another document with 'abcabc' , then 'abcabc' must always have to be declared more relevant than 'abc' by the custom similarity scoring process.

If your term is abc, one document is abc and the other one is abc abc, the latter does indeed have a term frequency of 2, but it also has a greater norm, so their scores are not comparable. The above statement that "score must not decrease when term frequency increases" assumes document that have the same value for the normalization factor. Similarities are free to rank abc over abc abc or vice-versa since their (term_frequency,norm) pairs are not directly comparable: if they were to both occur in the same block, we'd store them both.

Can you add a settings to skip this check?

Are you actually hitting this check? Can you share the definition of your scripted similarity?

feherbbj · 2019-06-18T08:48:35Z

Sorry for the late reply and thank you for the information!

So you try to store pairs of (term_frequency, norm). According to this article (https://www.elastic.co/guide/en/elasticsearch/reference/7.1/text.html) at section 'norms' you can disable norms for a field. What happens if i disable a norms for a field and i use a similarity on that field where score increases when norm increases (so it fails at your sanity check). Can i update the schema this way? Or it will throw this sanity error if i try to put the index settings and mappings?

Since if you disabled norms, then i assume ES/Lucene can not calculate store these pairs, so this sanity check should not happen in that case.

You asked whether i am hitting this check. Reinstalled ES 7.0.0 and tried to put the schema again and it turned out i was wrong, i bumped into the 3rd check : "scores must not increase when norm (interpreted as an unsigned long) increases.".

Regarding the similarity script:

The field is n-gram tokenized, on which the script is defined
I boost the query with a special number (call it 'sn')
If sn > doc.length, then i return doc.length/sn
Otherwise i return sn/return doc.length
So technically till a specific length score should increase after that length score should decrease

jpountz · 2019-06-18T09:38:49Z

What happens if i disable a norms for a field and i use a similarity on that field where score increases when norm increases (so it fails at your sanity check). Can i update the schema this way? Or it will throw this sanity error if i try to put the index settings and mappings?

The latter, this will be rejected. Disabled norms are treated as if all documents had a norm value of 1. In that case postings only record a single pair, which is (max(termFreq), 1).

jpountz added >enhancement >breaking :Search Relevance/Ranking Scoring, rescoring, rank evaluation. v7.0.0 v6.5.0 labels Sep 10, 2018

jimczi mentioned this pull request Sep 10, 2018

Forbid negative query scores #33309

Closed

5 tasks

jimczi approved these changes Sep 10, 2018

View reviewed changes

jpountz added 2 commits September 12, 2018 15:19

Merge branch 'master' into fix/check_similarity_scores

6f3d0b7

Assert that scores are not negative.

efcd827

jimczi approved these changes Sep 12, 2018

View reviewed changes

Merge branch 'master' into fix/check_similarity_scores

a948eea

Merge branch 'master' into fix/check_similarity_scores

dec27dd

jpountz merged commit c4261ba into elastic:master Sep 19, 2018

jpountz deleted the fix/check_similarity_scores branch September 19, 2018 07:19

jpountz removed the v6.5.0 label Sep 20, 2018

jpountz mentioned this pull request Sep 20, 2018

Add minimal sanity checks to custom/scripted similarities. (backport) #33893

Merged

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add minimal sanity checks to custom/scripted similarities. #33564

Add minimal sanity checks to custom/scripted similarities. #33564

jpountz commented Sep 10, 2018

elasticmachine commented Sep 10, 2018

jimczi left a comment

romseygeek commented Sep 10, 2018

jpountz commented Sep 12, 2018

jimczi left a comment

elasticdog commented Sep 14, 2018

jpountz commented Sep 18, 2018

feherbbj commented May 27, 2019

jpountz commented May 27, 2019

feherbbj commented Jun 18, 2019

jpountz commented Jun 18, 2019

Add minimal sanity checks to custom/scripted similarities. #33564

Add minimal sanity checks to custom/scripted similarities. #33564

Conversation

jpountz commented Sep 10, 2018

elasticmachine commented Sep 10, 2018

jimczi left a comment

Choose a reason for hiding this comment

romseygeek commented Sep 10, 2018

jpountz commented Sep 12, 2018

jimczi left a comment

Choose a reason for hiding this comment

elasticdog commented Sep 14, 2018

jpountz commented Sep 18, 2018

feherbbj commented May 27, 2019

jpountz commented May 27, 2019

feherbbj commented Jun 18, 2019

jpountz commented Jun 18, 2019