Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Under certain conditions, sort values for a hit come from an unrelated document #31554

Closed
dbevacqua opened this issue Jun 25, 2018 · 17 comments · Fixed by #32204
Closed

Under certain conditions, sort values for a hit come from an unrelated document #31554

dbevacqua opened this issue Jun 25, 2018 · 17 comments · Fixed by #32204
Labels
>bug :Search/Search Search-related issues that do not fall into other categories

Comments

@dbevacqua
Copy link
Contributor

dbevacqua commented Jun 25, 2018

Elasticsearch version (bin/elasticsearch --version):

6.3.0 (from docker.elastic.co/elasticsearch/elasticsearch-oss:6.3.0)

Plugins installed: []

JVM version (java -version):

openjdk version "10.0.1" 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10)
OpenJDK 64-Bit Server VM (build 10.0.1+10, mixed mode)

OS version (uname -a if on a Unix-like system):

Linux f076156ce113 4.4.0-128-generic #154-Ubuntu SMP Fri May 25 14:15:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Under certain conditions, sort values for a hit come from an unrelated document. The specific case illustrated below involves two documents on the same shard, one of which is PUT both before and after the other without a refresh in between.

Expected behaviour: a search with multi-level nested sort targeting that document returns "missing" sort value.

Actual behaviour: a search with multi-level nested sort targeting that document returns the sort value from the other document.

We believe the problem is wider-reaching than this as we have observed the behaviour with many (or most) of our searches which use multi-level nested sorts.

Steps to reproduce:

Execute following bash script.

#!/bin/bash

function http {
  curl -Ss -H 'Content-Type: application/json' -X "$@"
  echo
}

http DELETE "localhost:9200/test"

http PUT "localhost:9200/test"  -d '{
    "mappings": {
      "test-type": {
        "dynamic": "strict",
        "properties": {
          "nested1": {
            "type": "nested",
            "properties": {
              "nested2": {
                "type": "nested",
                "properties": {
                  "nested2_keyword": {
                    "type": "keyword"
                  },
                  "sortScore": {
                    "type": "integer"
                  }
                }
              },
              "nested1_keyword": {
                "type": "keyword"
              }
            }
          },
          "key": {
            "type": "keyword"
          }
        }
      }
    },
    "settings": {
      "index": {
        "number_of_shards": "1"
      }
    }
}
'

http PUT "localhost:9200/test/test-type/1" -d '{ "key": 1 }'

http PUT "localhost:9200/test/test-type/2" -d '{
  "key": 2,
  "nested1": [      
    {
      "nested2": [
        {
          "nested2_keyword": "nested2_bar",
          "sortScore": 1234
        }
      ],
      "nested1_keyword": "nested1_foo"
    }      
 ]
}
'

# works with refresh here (but not earlier)
#http POST "localhost:9200/test/_refresh"

#works with update instead of PUT
#http POST "localhost:9200/test/test-type/1/_update" -H 'Content-Type: application/json' -d '{ "key": 1 }'

http PUT "localhost:9200/test/test-type/1" -d '{ "key": 1 }'

http POST "localhost:9200/test/_refresh"

http GET "localhost:9200/test/_search?pretty" -d '{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
              "key":  1
          }
        }
      ]
    }
  },
  "sort": [
    {
      "nested1.nested2.sortScore": {
        "order": "desc",
        "mode": "max",
          "missing" : -1,
          "nested": {
          "path": "nested1",
          "filter": {
            "term": {
              "nested1.nested1_keyword": "nested1_foo"
            }
          },
          "nested": {
            "path": "nested1.nested2",
            "filter": {
              "term": {
                "nested1.nested2.nested2_keyword": "nested2_bar"
              }
            }
          }
        }
      }
    }
  ]
}
'

The search returns something like this:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : null,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "test-type",
        "_id" : "1",
        "_score" : null,
        "_source" : {
          "key" : 1
        },
        "sort" : [
          1234
        ]
      }
    ]
  }
}

the sort value of 1234 is defined in doc 2, not doc 1.

Provide logs (if relevant):

@colings86 colings86 added the :Search/Search Search-related issues that do not fall into other categories label Jun 25, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@luciansabo
Copy link

I can confirm this happening on 6.2 and 6.3. If you reindex all items that need to be sorted soring works correctly. I think it uses some kind of cache because if you update a single document, the first non-matching document appears on the same position, shifting the items.
A possible workaround is to use function_score.

@mayya-sharipova
Copy link
Contributor

@luciansabo Thank you for reporting the issue!

Unfortunately I was NOT able to reproduce your issue. I followed all the steps on 6.3: put 1st doc, put 2nd doc, put 1st doc again, refresh, search. I am always getting "sort": [-1]". Can you confirm the steps are correct?

Can you reproduce this issue on an empty index?
Have you modified any index settings (refresh_interval etc)?

@polyfractal
Copy link
Contributor

Oops, I thought I had replied to this earlier but looks like I forgot to hit enter.

I managed to reproduce this locally, although it wasn't consistent. Sometimes I would get -1, sometimes 1234. It may have been timing related. It felt like executing the steps slowly would return the correct sort value. But if I ran them all in one go it returned the bad value. This wasn't super consistent though, so it may just have been coincidence.

I was able to drop a breakpoint and see it collect the incorrect nested doc's values for the sort, but I'm not too familiar with how nested sorted works so didn't make any further progress.

Not super helpful, sorry :( This was on master fwiw.

@mayya-sharipova
Copy link
Contributor

thanks @polyfractal, good to know that it is reproducible and worth investigating.

@luciansabo
Copy link

@mayya-sharipova I am not the issue author, @dbevacqua is, but I can confirm a similar thing happening here. We were seeing this bug on multiple docker setups and on our staging environment, both on empty index and older indexes. For us it is very easy to reproduce.
If you don't manage to reproduce I will find the time to prepare a basic mapping and detail the steps.

We sort scores with a condition to sort only scores of a user.
This happens for us only if there are nested documents not having scores. If you make any update to a single document with a score, the first document having no scores takes his place in the sort order. This issue does not happen if you reindex all documents that are retrieved.

@dbevacqua
Copy link
Contributor Author

The original bash script I posted, if run as-is, allows me to reproduce the issue every time.

As per my comments in the bash script, calling _refresh between second and third PUT, or using _update instead of the third PUT, gives the correct results. In addition, forcing refresh by performing all PUTs of documents with ?refresh has the same effect, i.e. no erroneous sort values. I guess it's possible that on a slower machine, the default refresh interval of 1s comes into play and means that the issue does not manifest itself on every invocation of the bash script. Setting the refresh_interval to a value greater than the default may help in reproducing the issue.

Another couple of observations I have made:

  • the issue is reproducible with the first level of nesting in the sort removed, i.e. we can further simplify the conditions required to reproduce the issue
  • the issue occurs when I use an equivalent script-based sort referencing doc values
  • with an appropriate inner_hits specification, I can extract the correct (i.e. expected) values for the sortScore field from doc values, i.e. the doc values retrieved for inner hits are correct, whereas the doc values for sorting appear not to be

I will post scripts which demonstrate these points at the earliest opportunity

@polyfractal
Copy link
Contributor

Paging @martijnvg :) Could the nested sort filter be getting confused and collecting the wrong docs or something?

@dbevacqua
Copy link
Contributor Author

dbevacqua commented Jul 2, 2018

I've created a gist which uses a simpler mapping and simpler queries to illustrate the points in my previous comment:

https://gist.github.com/dbevacqua/0ea04a3ad472da8e9b2c08559df874e9

Responses from all three search requests only contain doc 1, which has nested1.nested2.sortValue of 1, but the hit has "sort" : [ 2 ]

@dbevacqua
Copy link
Contributor Author

After some more experimenting I've discovered that the first PUT is unnecessary. Have revised the gist to reflect this.

https://gist.github.com/dbevacqua/0ea04a3ad472da8e9b2c08559df874e9

@dbevacqua
Copy link
Contributor Author

The problem may be in the BitSet passed into MultiValueMode.select(SortedNumericDocValues, long, BitSet, DocIdSetIterator, int). It always seems to have cardinality 0, which means that prevParentDoc is always -1, so the call to childDocs.advance(...) always returns the first child doc in childDocs, regardless of whether it's related to the current parent doc.

@dbevacqua
Copy link
Contributor Author

I believe this fixes the issue:

diff --git a/server/src/main/java/org/elasticsearch/search/sort/SortBuilder.java b/server/src/main/java/org/elasticsearch/search/sort/SortBuilder.java
index 9537e28..e161cb0 100644
--- a/server/src/main/java/org/elasticsearch/search/sort/SortBuilder.java
+++ b/server/src/main/java/org/elasticsearch/search/sort/SortBuilder.java
@@ -238,8 +238,7 @@ public abstract class SortBuilder<T extends SortBuilder<T>> implements NamedWrit
 
         // apply filters from the previous nested level
         if (nested != null) {
-            parentQuery = Queries.filtered(parentQuery,
-                new ToParentBlockJoinQuery(nested.getInnerQuery(), nested.getRootFilter(), ScoreMode.None));
+            parentQuery = new ToParentBlockJoinQuery(nested.getInnerQuery(), nested.getRootFilter(), ScoreMode.None);
 
             if (objectMapper != null) {
                 childQuery = Queries.filtered(childQuery,

All tests pass, including the (previously failing) one I added based on the gist. I'll prepare a pull request, but perhaps someone can comment on what the intention was with forcing the ToParentBlockJoinQuery to be a filter. Is it an optimisation of some kind?

dbevacqua added a commit to treatwell/elasticsearch that referenced this issue Jul 3, 2018
dbevacqua added a commit to treatwell/elasticsearch that referenced this issue Jul 3, 2018
@martijnvg
Copy link
Member

@dbevacqua Thanks for reporting this bug and coming up with a fix!

It looks like the conjunction between parentQuery and new ToParentBlockJoinQuery(nested.getInnerQuery(), nested.getRootFilter(), ScoreMode.None) is too strict in the case when the nested fields do not exist in all documents.

but perhaps someone can comment on what the intention was with forcing the ToParentBlockJoinQuery to be a filter. Is it an optimisation of some kind?

The idea was that for an intermediate nested parent doc both the parentQuery and ToParentBlockJoinQuery must match (that nested doc is both a parent and a child), but that is not true in case when nested fields are missing. Whether a must or filter clause if used for that is not really important. In fact for for optimization purposes, both query clauses should be filter, because scoring is not needed.

I think your fix is good and I think it makes sense to open a PR to iterate further.

@dbevacqua
Copy link
Contributor Author

Thanks @martijnvg. PR here: #31776

@JulienColin
Copy link

Hello @dbevacqua , I recently authored #32130 exposing a comparable problem (however my case is a bit difference because it's reproducible 100% of the times and not depending on timing. gist with kibana commands here https://gist.github.com/cbuescher/f9c8c2132d2667d3e907a6283d3f171a).

Would you have a simple way to check wether the fix you provided here would also resolve this other issue ?

If not , i will build and run your PR myself.

Thank you very much in advance

@dbevacqua
Copy link
Contributor Author

dbevacqua commented Jul 19, 2018

@JulienColin running your gist against a server built from 6.2.3 with my PR applied, I believe the correct response is returned (family with id=2 has sort value 30)

  "hits": {
    "total": 4,
    "max_score": null,
    "hits": [
      {
        "_index": "tree",
        "_type": "family",
        "_id": "2",
        "_score": null,
        "_source": {
          "name": "Simpson",
          "members": [
            {
              "firstName": "Homer",
              "color": "brown",
              "levels": {
                "strength": 30
              }
            },
            {
              "firstName": "Lisa",
              "color": "brown",
              "levels": {
                "strength": 40
              }
            },
            {
              "firstName": "Marge",
              "color": "brown",
              "levels": {
                "strength": 60
              }
            }
          ]
        },
        "sort": [
          30
        ]
      },
...

@JulienColin
Copy link

@dbevacqua thank you very much for your reactivity ! it's amazing , thank you very much for your fix. I hope we can enjoy it in a release soon .

jimczi added a commit that referenced this issue Jul 20, 2018
The parent filter for nested sort should always match **all** parents regardless
of the child queries. It is used to find the boundaries of a single parent and we use
the child query to match all the filters set in the nested tree so there is no need to
repeat the nested filters.
With this change we ensure that we build bitset filters
only to find the root docs (or the docs at the level where the sort applies) that can be reused
among queries.

Closes #31554
Closes #32130
Closes #31783

Co-authored-by: Dominic Bevacqua <bev@treatwell.com>
jimczi added a commit that referenced this issue Jul 20, 2018
The parent filter for nested sort should always match **all** parents regardless
of the child queries. It is used to find the boundaries of a single parent and we use
the child query to match all the filters set in the nested tree so there is no need to
repeat the nested filters.
With this change we ensure that we build bitset filters
only to find the root docs (or the docs at the level where the sort applies) that can be reused
among queries.

Closes #31554
Closes #32130
Closes #31783

Co-authored-by: Dominic Bevacqua <bev@treatwell.com>
jimczi added a commit that referenced this issue Jul 20, 2018
The parent filter for nested sort should always match **all** parents regardless
of the child queries. It is used to find the boundaries of a single parent and we use
the child query to match all the filters set in the nested tree so there is no need to
repeat the nested filters.
With this change we ensure that we build bitset filters
only to find the root docs (or the docs at the level where the sort applies) that can be reused
among queries.

Closes #31554
Closes #32130
Closes #31783

Co-authored-by: Dominic Bevacqua <bev@treatwell.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants