Avoid negative scores with cross_fields type #89016

Merged: 6 commits, Sep 6, 2022
Changes from 3 commits
6 changes: 6 additions & 0 deletions docs/changelog/89016.yaml
@@ -0,0 +1,6 @@
pr: 89016
summary: Avoid negative scores with `cross_fields` type
area: Ranking
type: bug
issues:
- 44700
9 changes: 4 additions & 5 deletions docs/reference/query-dsl/multi-match-query.asciidoc
@@ -388,11 +388,10 @@ explanation:
Also, accepts `analyzer`, `boost`, `operator`, `minimum_should_match`,
`lenient` and `zero_terms_query`.

WARNING: The `cross_fields` type blends field statistics in a way that does
not always produce well-formed scores (for example scores can become
negative). As an alternative, you can consider the
<<query-dsl-combined-fields-query,`combined_fields`>> query, which is also
term-centric but combines field statistics in a more robust way.
WARNING: The `cross_fields` type blends field statistics in a complex way that
can be hard to interpret. You should consider the
Contributor (review comment):

Given the discussion on fixing docCount vs. docFreq, I wonder if we should make this warning stronger and point out that scores may not only be hard to interpret (which covers e.g. the way it tries to prefer the most likely field) but may also be incorrect in the case of fields that have different statistics.

Contributor Author:

👍

<<query-dsl-combined-fields-query,`combined_fields`>> query as an alternative,
which is also term-centric but combines field statistics in a more robust way.
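For context (not part of the diff): the `combined_fields` query recommended above is term-centric like `cross_fields` but scores against merged field statistics. A minimal request body, with illustrative field names, looks like this:

```json
{
  "query": {
    "combined_fields": {
      "query": "quick brown fox",
      "fields": ["title", "abstract", "body"]
    }
  }
}
```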

[[cross-field-analysis]]
===== `cross_field` and analysis
@@ -148,7 +148,10 @@ protected int compare(int i, int j) {
if (prev > current) {
actualDf++;
}
contexts[i] = ctx = adjustDF(reader.getContext(), ctx, Math.min(maxDoc, actualDf));

int docCount = reader.getDocCount(terms[i].field());
Contributor Author (review comment):
This query's score calculation is complex and can be hard to reason about. I liked this change because it doesn't affect the non-broken case at all. It only changes the score if docCount < docFreq, which would have resulted in a negative IDF and a broken score anyways.
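The comment above can be checked directly against Lucene's BM25 IDF formula, `log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))`. A minimal sketch (the class and method names here are ours, not Lucene's API): once the blended docFreq exceeds a field's docCount, the argument of the log drops below 1 and the IDF goes negative, which is exactly the broken case the clamp removes.

```java
// Standalone demo of why docFreq > docCount breaks the score.
public class IdfDemo {
    // Lucene's BM25 IDF: log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    public static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // A sparse field: the blended docFreq (10) exceeds the field's
        // docCount (2), so the IDF contribution is negative.
        System.out.println(idf(10, 2) < 0);
        // Clamping docFreq to the field's docCount keeps the IDF non-negative.
        System.out.println(idf(Math.min(10, 2), 2) >= 0);
    }
}
```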

int newDocFreq = Math.min(actualDf, docCount);
contexts[i] = ctx = adjustDF(reader.getContext(), ctx, newDocFreq);
prev = current;
sumTTF += ctx.totalTermFreq();
}
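The blending loop is easier to follow outside of Lucene's types. The following is a loose, hypothetical rendition of it with the PR's clamp applied: fields are visited in descending docFreq order, each strict drop bumps the blended frequency by one, and (the fix) the result is capped at the field's docCount. The array-based stats and all names are illustrative, not the real BlendedTermQuery code.

```java
import java.util.Arrays;

public class BlendDemo {
    // docFreqs[i]: the term's docFreq in field i;
    // docCounts[i]: number of docs that have field i.
    public static int[] blend(int[] docFreqs, int[] docCounts, int maxDoc) {
        Integer[] order = new Integer[docFreqs.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Visit fields in descending docFreq order, like the sorter above.
        Arrays.sort(order, (a, b) -> Integer.compare(docFreqs[b], docFreqs[a]));
        int actualDf = Math.min(maxDoc, docFreqs[order[0]]);
        int prev = docFreqs[order[0]];
        int[] blended = new int[docFreqs.length];
        for (int i : order) {
            int current = docFreqs[i];
            if (prev > current) {
                actualDf++; // each strict drop bumps the blended frequency
            }
            // The fix: cap the blended docFreq at the field's docCount so
            // the per-field IDF can never go negative.
            blended[i] = Math.min(actualDf, docCounts[i]);
            prev = current;
        }
        return blended;
    }

    public static void main(String[] args) {
        // Dense field (df=10, docCount=10) vs. sparse field (df=2, docCount=2).
        int[] blended = blend(new int[] { 10, 2 }, new int[] { 10, 2 }, 10);
        System.out.println(Arrays.toString(blended)); // [10, 2]
    }
}
```

Without the cap, the sparse field would receive a blended docFreq of 11 against a docCount of 2, reproducing the negative-IDF case shown earlier.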
@@ -248,6 +248,39 @@ public void testMinTTF() throws IOException {
dir.close();
}

public void testMissingFields() throws IOException {
Directory dir = newDirectory();
IndexWriter w = new IndexWriter(dir, newIndexWriterConfig(new MockAnalyzer(random())));
FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
ft.freeze();

for (int i = 0; i < 10; i++) {
Document d = new Document();
d.add(new TextField("id", Integer.toString(i), Field.Store.YES));
d.add(new Field("dense", "foo", ft));
// Add a sparse field with high totalTermFreq but low docCount
if (i % 5 == 0) {
d.add(new Field("sparse", "foo", ft));
d.add(new Field("sparse", "one two three four five six", ft));
}
w.addDocument(d);
}
w.commit();

DirectoryReader reader = DirectoryReader.open(w);
IndexSearcher searcher = setSimilarity(newSearcher(reader));

String[] fields = new String[] { "dense", "sparse" };
Query query = BlendedTermQuery.dismaxBlendedQuery(toTerms(fields, "foo"), 0.1f);
TopDocs search = searcher.search(query, 10);
ScoreDoc[] scoreDocs = search.scoreDocs;
assertEquals(Integer.toString(0), reader.document(scoreDocs[0].doc).getField("id").stringValue());

reader.close();
w.close();
dir.close();
}

public void testEqualsAndHash() {
String[] fields = new String[1 + random().nextInt(10)];
for (int i = 0; i < fields.length; i++) {