-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid negative scores with cross_fields type #89016
Conversation
Hi @jtibshirani, I've created a changelog YAML for you. |
Hi @jtibshirani, I've updated the changelog YAML for you. |
The cross_fields scoring type can produce negative scores when some documents are missing fields. When blending term document frequencies, we take the maximum document frequency across all fields. If one field appears in fewer documents than another, this means that its IDF can become negative. This is because IDF is calculated as `Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))` This change adjusts the docFreq for each field to `Math.min(docCount, docFreq)` so that the IDF can never become negative. It makes sense that the term document frequency should never exceed the number of documents containing the field.
40d9164
to
78d81e7
Compare
Pinging @elastic/es-search (Team:Search) |
Hi @jtibshirani, I've created a changelog YAML for you. |
@@ -148,7 +148,10 @@ protected int compare(int i, int j) { | |||
if (prev > current) { | |||
actualDf++; | |||
} | |||
contexts[i] = ctx = adjustDF(reader.getContext(), ctx, Math.min(maxDoc, actualDf)); | |||
|
|||
int docCount = reader.getDocCount(terms[i].field()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This query's score calculation is complex and can be hard to reason about. I liked this change because it doesn't affect the non-broken case at all. It only changes the score if docCount < docFreq
, which would have resulted in a negative IDF and a broken score anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @jpountz will want to have a look as well but this makes sense to me.
Thanks for the review @romseygeek. |
@jpountz would you be up for reviewing this? Thanks in advance! |
Thanks @jtibshirani for the ping. This looks like the smallest change we could do that would fix the bug, but I worry that it goes against the intent of Would it make more sense to adjust |
@jpountz thanks for looking. I considered adjusting
We could attempt a larger refactor, but it seems risky and not the right investment. As you mentioned, we overlooked the sparse case so the statement "Treats fields with the same analyzer as though they were one big field" was never accurate. I think our options are (1) consider a bigger, principled change to fix the sparse case, (2) throw an exception in the sparse case saying it's not supported, or (3) make a targeted fix like this one. I would vote for (3) as the most user friendly and practical option. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jtibshirani for the additional context. Now that I better understand what it would take to implement a correct fix, I agree with your suggestion of going with option 3, essentially further redirecting users from the cross_fields
mode of multi_match
to the combined_fields
query.
<<query-dsl-combined-fields-query,`combined_fields`>> query, which is also | ||
term-centric but combines field statistics in a more robust way. | ||
WARNING: The `cross_fields` type blends field statistics in a complex way that | ||
can be hard to interpret. You should consider the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the discussion on fixing docCount vs. docFreq, I wonder if we should make this warning stronger and point out that scores may not only be hard to interpret (which covers e.g. the way it tries to prefer the most likely field) but also incorrect in case of fields that have different statistics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
The cross_fields scoring type can produce negative scores when some documents are missing fields. When blending term document frequencies, we take the maximum document frequency across all fields. If one field appears in fewer documents than another, this means that its IDF can become negative. This is because IDF is calculated as `Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))` This change adjusts the docFreq for each field to `Math.min(docCount, docFreq)` so that the IDF can never become negative. It makes sense that the term document frequency should never exceed the number of documents containing the field.
In #89016 we adjusted the `cross_fields` scoring formula to prevent negative scores. This fix accidentally dropped another important fix that was added in #41938. Specifically, we need to make sure to take the minimum between the document frequency (`actualDf`) and the minimum total term frequency (`minTTF`). Otherwise, we can produce invalid term statistics where the total term frequency is less than the document frequency. Fixes #90275
In #89016 we adjusted the `cross_fields` scoring formula to prevent negative scores. This fix accidentally dropped another important fix that was added in #41938. Specifically, we need to make sure to take the minimum between the document frequency (`actualDf`) and the minimum total term frequency (`minTTF`). Otherwise, we can produce invalid term statistics where the total term frequency is less than the document frequency. Fixes #90275
In elastic#89016 we adjusted the `cross_fields` scoring formula to prevent negative scores. This fix accidentally dropped another important fix that was added in elastic#41938. Specifically, we need to make sure to take the minimum between the document frequency (`actualDf`) and the minimum total term frequency (`minTTF`). Otherwise, we can produce invalid term statistics where the total term frequency is less than the document frequency. Fixes elastic#90275
In #89016 we adjusted the `cross_fields` scoring formula to prevent negative scores. This fix accidentally dropped another important fix that was added in #41938. Specifically, we need to make sure to take the minimum between the document frequency (`actualDf`) and the minimum total term frequency (`minTTF`). Otherwise, we can produce invalid term statistics where the total term frequency is less than the document frequency. Fixes #90275
In #89016 we adjusted the `cross_fields` scoring formula to prevent negative scores. This fix accidentally dropped another important fix that was added in document frequency (`actualDf`) and the minimum total term frequency (`minTTF`). Otherwise, we can produce invalid term statistics where the total term frequency is less than the document frequency. Fixes #90275
The cross_fields scoring type can produce negative scores when some documents
are missing fields. When blending term document frequencies, we take the maximum
document frequency across all fields. If one field appears in fewer documents
than another, this means that its IDF can become negative. This is because IDF
is calculated as
Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
This change adjusts the docFreq for each field to
Math.min(docCount, docFreq)
so that the IDF can never become negative. It makes sense that the term document
frequency should never exceed the number of documents containing the field.
Fixes #44700