[Backport 2.x] Fix negative scores returned from multi_match query with cross_fields #13983

Merged

@msfroh msfroh commented Jun 5, 2024

Manual backport of #13829

Under specific circumstances, when using cross_fields scoring on a multi_match query, we can end up with negative scores from the inverse document frequency calculation in the BM25 formula.

Specifically, the IDF is calculated as:

log(1 + (N - n + 0.5) / (n + 0.5))

where N is the number of documents containing the field and n is the number of documents containing the given term in the field. Obviously, n should always be less than or equal to N.
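The formula can be simplified to make the role of that bound explicit. This is only a rearrangement of the expression above, using the same N and n:

```latex
\mathrm{IDF}(N, n)
  = \log\!\left(1 + \frac{N - n + 0.5}{n + 0.5}\right)
  = \log\!\left(\frac{N + 1}{n + 0.5}\right)
```

The logarithm is negative exactly when n + 0.5 > N + 1. For integer counts that means n ≥ N + 1, so the IDF stays positive whenever n ≤ N.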

Unfortunately, cross_fields makes up a new value for n and tries to use it across all fields. For any field whose N is smaller than that value, n exceeds N and the IDF goes negative.

This change finds the (nonzero) value of N for each field and uses that as an upper bound for the new value of n.
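A minimal sketch of the bounding idea, not the code from this PR. The class and method names, the field names, and the counts are all hypothetical, and it assumes the shared n is capped per field at that field's own nonzero N:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: names are hypothetical and not taken from the OpenSearch code base. */
final class CrossFieldsIdfSketch {

    /** BM25 IDF as quoted in the description: log(1 + (N - n + 0.5) / (n + 0.5)). */
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Caps the shared docFreq (n) at the field's docCount (N). */
    static long clampDocFreq(long sharedDocFreq, long fieldDocCount) {
        if (fieldDocCount <= 0) {
            // Assumption: a field that no document contains gives no usable bound.
            return sharedDocFreq;
        }
        return Math.min(sharedDocFreq, fieldDocCount);
    }

    public static void main(String[] args) {
        // Made-up per-field document counts (N) and a made-up shared n.
        Map<String, Long> docCountByField = new LinkedHashMap<>();
        docCountByField.put("title", 1000L);
        docCountByField.put("nickname", 10L);
        long sharedDocFreq = 40L;

        for (Map.Entry<String, Long> e : docCountByField.entrySet()) {
            long n = clampDocFreq(sharedDocFreq, e.getValue());
            System.out.printf("%s: unclamped idf=%.4f, clamped idf=%.4f%n",
                e.getKey(), idf(e.getValue(), sharedDocFreq), idf(e.getValue(), n));
        }
    }
}
```

With these numbers the `nickname` field has N = 10 but receives n = 40, so its unclamped IDF is negative. Clamping n to 10 makes it positive again, and the `title` field is unaffected because 40 is already below its N.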

Signed-off-by: Michael Froh <froh@amazon.com>
(cherry picked from commit fffd101)

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • API changes companion pull request created.
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot commented Jun 5, 2024

❌ Gradle check result for 82f3d20: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot commented Jun 5, 2024

✅ Gradle check result for 82f3d20: SUCCESS

codecov bot commented Jun 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.31%. Comparing base (0dd892c) to head (82f3d20).
Report is 301 commits behind head on 2.x.

Additional details and impacted files
@@             Coverage Diff              @@
##                2.x   #13983      +/-   ##
============================================
+ Coverage     71.28%   71.31%   +0.02%     
- Complexity    60145    61322    +1177     
============================================
  Files          4957     5039      +82     
  Lines        282799   288411    +5612     
  Branches      41409    42135     +726     
============================================
+ Hits         201591   205676    +4085     
- Misses        64189    65398    +1209     
- Partials      17019    17337     +318     

