Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid negative scores with cross_fields type #89016

Merged
merged 6 commits into from
Sep 6, 2022

Conversation

jtibshirani
Copy link
Contributor

@jtibshirani jtibshirani commented Aug 1, 2022

The cross_fields scoring type can produce negative scores when some documents
are missing fields. When blending term document frequencies, we take the maximum
document frequency across all fields. If one field appears in fewer documents
than another, this means that its IDF can become negative. This is because IDF
is calculated as Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))

This change adjusts the docFreq for each field to Math.min(docCount, docFreq)
so that the IDF can never become negative. It makes sense that the term document
frequency should never exceed the number of documents containing the field.

Fixes #44700

@jtibshirani jtibshirani added >bug :Search Relevance/Ranking Scoring, rescoring, rank evaluation. v8.4.0 v8.5.0 labels Aug 1, 2022
@elasticsearchmachine
Copy link
Collaborator

Hi @jtibshirani, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Hi @jtibshirani, I've updated the changelog YAML for you.

The cross_fields scoring type can produce negative scores when some documents
are missing fields. When blending term document frequencies, we take the maximum
document frequency across all fields. If one field appears in fewer documents
than another, this means that its IDF can become negative. This is because IDF
is calculated as `Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))`

This change adjusts the docFreq for each field to `Math.min(docCount, docFreq)`
so that the IDF can never become negative. It makes sense that the term document
frequency should never exceed the number of documents containing the field.
@jtibshirani jtibshirani marked this pull request as ready for review August 4, 2022 17:36
@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Aug 4, 2022
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine
Copy link
Collaborator

Hi @jtibshirani, I've created a changelog YAML for you.

@@ -148,7 +148,10 @@ protected int compare(int i, int j) {
if (prev > current) {
actualDf++;
}
contexts[i] = ctx = adjustDF(reader.getContext(), ctx, Math.min(maxDoc, actualDf));

int docCount = reader.getDocCount(terms[i].field());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query's score calculation is complex and can be hard to reason about. I liked this change because it doesn't affect the non-broken case at all. It only changes the score if docCount < docFreq, which would have resulted in a negative IDF and a broken score anyways.

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @jpountz will want to have a look as well but this makes sense to me.

docs/reference/query-dsl/multi-match-query.asciidoc Outdated Show resolved Hide resolved
@jtibshirani
Copy link
Contributor Author

Thanks for the review @romseygeek.

@jtibshirani
Copy link
Contributor Author

@jpountz would you be up for reviewing this? Thanks in advance!

@jpountz
Copy link
Contributor

jpountz commented Aug 30, 2022

Thanks @jtibshirani for the ping. This looks like the smallest change we could do that would fix the bug, but I worry that it goes against the intent of cross_fields. cross_fields's documentation says Treats fields with the same analyzer as though they were one big field and my understanding is that the implementation tries to achieve this by forging term statistics in a way that gives terms (almost) the same IDF on all fields. If we start adjusting docFreq then we're possibly giving very different IDFs to different fields for the same term.

Would it make more sense to adjust docCount to take the max across all fields, which should in-turn help guarantee that docFreq is always less than docCount? I know that different fields may be getting very different IDFs today because they have different docCounts but my understanding is that it is due to the fact that the cross_fields implementation overlooked sparse fields and it was not intentional.

@jtibshirani
Copy link
Contributor Author

jtibshirani commented Aug 30, 2022

@jpountz thanks for looking. I considered adjusting docCount instead, but I could not find a way to change that value, since it is not part of TermStates and is loaded from IndexReader. Only docFreq can be adjusted... here's the javadoc from TermQuery:

  /**
   * Expert: constructs a TermQuery that will use the provided docFreq instead of looking up the
   * docFreq against the searcher.
   */
  public TermQuery(Term t, TermStates states) { ... }

We could attempt a larger refactor, but it seems risky and not the right investment.

As you mentioned, we overlooked the sparse case so the statement "Treats fields with the same analyzer as though they were one big field" was never accurate. I think our options are (1) consider a bigger, principled change to fix the sparse case, (2) throw an exception in the sparse case saying it's not supported, or (3) make a targeted fix like this one. I would vote for (3) as the most user friendly and practical option.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jtibshirani for the additional context. Now that I better understand what it would take to implement a correct fix, I agree with your suggestion of going with option 3, essentially further redirecting users from the cross_fields mode of multi_match to the combined_fields query.

<<query-dsl-combined-fields-query,`combined_fields`>> query, which is also
term-centric but combines field statistics in a more robust way.
WARNING: The `cross_fields` type blends field statistics in a complex way that
can be hard to interpret. You should consider the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the discussion on fixing docCount vs. docFreq, I wonder if we should make this warning stronger and point out that scores may not only be hard to interpret (which covers e.g. the way it tries to prefer the most likely field) but also incorrect in case of fields that have different statistics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jtibshirani jtibshirani added v8.4.2 and removed v8.4.1 labels Sep 6, 2022
@jtibshirani jtibshirani merged commit 3c1b070 into elastic:main Sep 6, 2022
@jtibshirani jtibshirani deleted the cross-fields-score branch September 6, 2022 20:02
jtibshirani added a commit that referenced this pull request Sep 6, 2022
The cross_fields scoring type can produce negative scores when some documents
are missing fields. When blending term document frequencies, we take the maximum
document frequency across all fields. If one field appears in fewer documents
than another, this means that its IDF can become negative. This is because IDF
is calculated as `Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))`

This change adjusts the docFreq for each field to `Math.min(docCount, docFreq)`
so that the IDF can never become negative. It makes sense that the term document
frequency should never exceed the number of documents containing the field.
jtibshirani added a commit that referenced this pull request Sep 23, 2022
In #89016 we adjusted the `cross_fields` scoring formula to prevent negative
scores. This fix accidentally dropped another important fix that was added in
#41938. Specifically, we need to make sure to take the minimum between the
document frequency (`actualDf`) and the minimum total term frequency
(`minTTF`). Otherwise, we can produce invalid term statistics where the total
term frequency is less than the document frequency.

Fixes #90275
jtibshirani added a commit that referenced this pull request Sep 23, 2022
In #89016 we adjusted the `cross_fields` scoring formula to prevent negative
scores. This fix accidentally dropped another important fix that was added in
#41938. Specifically, we need to make sure to take the minimum between the
document frequency (`actualDf`) and the minimum total term frequency
(`minTTF`). Otherwise, we can produce invalid term statistics where the total
term frequency is less than the document frequency.

Fixes #90275
jtibshirani added a commit to jtibshirani/elasticsearch that referenced this pull request Sep 23, 2022
In elastic#89016 we adjusted the `cross_fields` scoring formula to prevent negative
scores. This fix accidentally dropped another important fix that was added in
elastic#41938. Specifically, we need to make sure to take the minimum between the
document frequency (`actualDf`) and the minimum total term frequency
(`minTTF`). Otherwise, we can produce invalid term statistics where the total
term frequency is less than the document frequency.

Fixes elastic#90275
jtibshirani added a commit that referenced this pull request Sep 23, 2022
In #89016 we adjusted the `cross_fields` scoring formula to prevent negative
scores. This fix accidentally dropped another important fix that was added in
#41938. Specifically, we need to make sure to take the minimum between the
document frequency (`actualDf`) and the minimum total term frequency
(`minTTF`). Otherwise, we can produce invalid term statistics where the total
term frequency is less than the document frequency.

Fixes #90275
jtibshirani added a commit that referenced this pull request Sep 23, 2022
In #89016 we adjusted the `cross_fields` scoring formula to prevent negative
scores. This fix accidentally dropped another important fix that was added in
document frequency (`actualDf`) and the minimum total term frequency
(`minTTF`). Otherwise, we can produce invalid term statistics where the total
term frequency is less than the document frequency.

Fixes #90275
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search Relevance/Ranking Scoring, rescoring, rank evaluation. Team:Search Meta label for search team v7.17.7 v8.4.2 v8.5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cross-field type creates broken scores when not all fields have the same docCount
5 participants