Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not compute hit counts by default #33028

Closed
jpountz opened this issue Aug 21, 2018 · 19 comments
Closed

Do not compute hit counts by default #33028

jpountz opened this issue Aug 21, 2018 · 19 comments
Labels
>breaking >enhancement release highlight :Search/Search Search-related issues that do not fall into other categories v7.0.0-beta1

Comments

@jpountz
Copy link
Contributor

jpountz commented Aug 21, 2018

Context
Lucene 8 introduces optimizations that allow to compute top hits more efficiently by skipping documents that do not produce competitive scores. We would like to enable this behavior by default so that users can opt in if they need accurate total hit counts, which are costly, rather than the other way around.

Not returning a hit count at all is problematic for traditional search UIs with pagination: say that you want to display up to 5 pages with 10 hits per page, you need to know whether the hit count is between 0 and 10, between 11 and 20, between 21 and 30, between 31 and 40, or greater than 40 in order to know how many pages need to be displayed. In order to address this issue, Lucene takes a configurable threshold: if the hit count is less than this threshold, then you will get an accurate hit count, otherwise you will get a lower bound of the hit count.

We don't want to discuss backward compatibility for now, let's focus what we want to have eventually, and only then discuss how we get there to make the change easy to digest for users. That's fine if we need multiple steps and the whole change is only available in 8 rather than 7.

Response format
We need a way to tell users whether the hit count is accurate or a lower bound. Multiple ideas have been mentioned:

  1. not modify the response format: if the user asked to count up to X and the hit count is greater than X, just return X as the hit count
  2. use a string that ends with a + as a way to say that the hit count is a lower bound, eg. { "hits": { "total": "1234+" } } when the hit count is a lower bound and { "hits": { "total": 1234 } } like today otherwise
  3. use another field that tells whether the hit count is accurate or a lower bound, eg. { "hits": { "total": 1234, "total_hits_relation": "gte" } }
  4. make hits.total an object with a value and a relation, eg. { "hits": { "total": { "value": 1234, "relation": "gte" } } }
  5. make hits.total an object that has two possible keys but only one is ever set, eg. { "hits": { "total": { "gte": 1234 } } } or { "hits": { "total": { "eq": 1234 } } }
  6. don't reuse hits.total at all and return a different field if ie. hits.min or some better name.

When we discussed these options, we felt like 1 would make parsing more complicated, and with 2 it would be too easy to miss the fact that you need to look up another field in order to know how to interpret the hit count.

Implementation options

Option 1: Make track_total_hits take a number

We already have a track_total_hits switch which we added for index sorting, but currently take a boolean. It would be easy to make it take a number instead that would be the minimum number of hits to count accurately.

We could ease transition to the new response format by using the current format when track_total_hits is unset or set to a boolean, and the new format when track_total_hits is set to a number, and then deprecate support for booleans.

Users who want accurate hit counts could set a very high value for this parameter, we could potentially allow using a special value like -1 as a way to mean "be accurate".

Option 2: Hardcoded number of hits to count

Always count accurately up to index.max_result_window hits. If users need accurate hits, they will need to use an aggregation (we need to add such an aggregation that counts docs).

Ping @elastic/es-clients to get opinions about the above thoughts, especially the response format.

@jpountz jpountz added >enhancement >breaking release highlight :Search/Search Search-related issues that do not fall into other categories v7.0.0 labels Aug 21, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@jimczi
Copy link
Contributor

jimczi commented Aug 23, 2018

Another question that popped up is whether we should return "precise" lower bounds or just indicate the lower lower bound. If track_total_hits is set to 100 for instance, returning 123+ doesn't bring much more than returning 100+. The lower bound computed on each shard is not an approximation so any result above 100 can be misleading, a simple term query that can skip a lot of documents may return a small lower bound even if this term is present in all documents.

@s1monw
Copy link
Contributor

s1monw commented Aug 23, 2018

I don't like changing the meaning of total or make it a string. We already return "total" : -1 if track_total_hits is false. We can add an additional parameter at_least to inform the user how many hits we can have at least. We can also add this easily to 6.x. From my perspective this would only guarantee best effort. On the request end I am in favor of option 1.

@polyfractal
Copy link
Contributor

I'm also not a fan of the options where the meaning of total is context-dependent. I'd rather have a new field. In both cases you need to conditionally check two things, but at least this way incorrect assumptions are quickly noticed (total: -1) and it doesn't change the meaning of total.

It's also a bit more graceful upgrade path. UIs might be showing a bad value (-1) but it doesn't break code outright and users can implement the check for at_least/min/whatever when they get a chance.

@Mpdreamz
Copy link
Member

Mpdreamz commented Aug 23, 2018

Also prefer an additional property, total can already be skewed in the case of partial results (if allow_partial_search_results is not true (default false)) which throws people. A new property on the response is more explicit.

On the request side can we create a new property e.g track_total_hits_upto=NUMBER or total_accuracy=NUMBER and deprecatetrack_total_hits=BOOL instead of reusing track_total_hits ?

@jpountz
Copy link
Contributor Author

jpountz commented Sep 25, 2018

We had a long discussion about this issue, here are some arguments that have been made:

  • we already have a big breaking change coming: the removal of types, maybe we should try to not introduce another major breaking change at the same time
  • it's too late for breaking changes in 7.0, anything that would require clients to change behavior by default would have to wait for 8.0 anyway
  • we want to enable this optimization by default eventually, because users should opt in to get more info at the expense of more CPU usage
  • option 3 is similar to 4 but is easier to parse and feels better
  • in general option 1 didn't get much support
  • option 2 raised concerns around the fact that one needs to check the value of another field (out of many) to know how to interpret the value of hits.total
  • option 0 and 2 have the benefit of not being breaking for apps that don't check hit counts
  • often apps check the hit count to know whether they need to display something, so we need to be careful with not breaking apps that work like that

Even though we couldn't make a decision, we narrowed down to two main paths:

A. Keep same format

  • In 7.0 we make track_total_hits accept numbers in addition to boolean, which acts as a threshold for the number of hits to count. If hits.total is greater than or equal to this number then its value is a lower bound of the actual hit count. hits.total is still -1 if track_total_hits is set to false like in 6.x.
  • In 7.0 we issue deprecation warnings when not setting track_total_hits
  • In 8.0 we change the default of track_total_hits to false.

One modified version of this plan that has been discussed is to not return hits.total at all when track_total_hits is set to false in 8.0, to be consistent with eg. aggs which don't have any value when not requested, and not confuse users starting with Elasticsearch that something may be broken if hits.total is not a number.

This path has the downside of making the transition easier but the response is less explicit: you need to know the threshold that was set in the request in order to know how to interpret hits.total in the response.

B. Make hits.total an object

  • In 7.0 we make track_total_hits accept numbers in addition to boolean, which acts as a threshold for the number of hits to count. If set to a number then hits.total is an object (it remains an integer by default for bw compat) that looks like { "hits": { "total": { "value": 1234, "relation": "gte" } } }
  • In version 7.0 we issue deprecation warnings when track_total_hits is not set to a number to force users to opt in for the new response format.
  • In version 8.0 we can change the default value to potentially 1000 like Lucene or 0 to disable the tracking of hit counts entirely

This path requires more changes on client side and makes it harder to perform generic processing of search responses due to the fact that the format of hits.total will depend on the request in 7.x. Yet it makes responses more self-contained.

@stacey-gammon
Copy link
Contributor

Just want to confirm my understanding of the latest plan:

Kibana will not have to do anything for 7.0 and everything will continue to work the same except we may see deprecation warnings.

If that changes (e.g. the decision to switch the default in 7.0 changes), or is incorrect, please let me know!

@jimczi
Copy link
Contributor

jimczi commented Oct 22, 2018

Sorry I thought that I commented in this issue but for some reasons my comment is not there anymore.
So the current status after the internal discussion we had is that we'll disable track_total_hits by default in 7.0 and make hits.total and object with total (an integer that contains the number of hits) and relation, a string that defines how the total should be interpreted. When track_total_hits is false (default behavior) the total is filled with -1. We also discussed a way to get a response with the old format and decided that the client could send the major version in the request (as an header for instance), 7 would mean new response format and would default to "track_total_hits":false whereas 6 would behave like a 6x cluster (track_total_hits:true and hits.total returned as a simple integer). For Kibana this means that you'll have to explicitly set track_total_hits to true or use this new client versioning. The former can be tested now since the track_total_hits option is already available in 6x.

jimczi added a commit to jimczi/elasticsearch that referenced this issue Nov 6, 2018
In Lucene 8 searches can skip non-competitive hits if the total hit count is not requested.
It is also possible to track the number of hits up to a certain threshold. This is a trade off to speed up searches
while still being able to know a lower bound of the total hit count. This change adds the ability to set this threshold directly in the `track_total_hits` search option. A boolean value (`true`, `false`) indicates whether the total hit count should be tracked in the response. When set as an integer this option allows to compute a lower bound of the total hits while preserving the ability to skip non-competitive hits when enough hits have been collected.
In order to ensure that the result is correctly interpreted this commit also adds a new section in the search response
that indicates the number of tracked hits and whether the value is a lower bound (`gte`)  or the exact count (`eq`):
```
GET /_search
{
    "track_total_hits": 100,
    "query": {
        "term": {
            "title": "fast"
        }
    }
}
```
... will return:
```
{
  "_shards": ...
   "hits" : {
      "total" : -1,
      "tracked_total": {
        "value": 100,
        "relation": "gte"
      },
      "max_score" : 0.42,
      "hits" : []
  }
}
```

Relates elastic#33028
@jimczi
Copy link
Contributor

jimczi commented Nov 6, 2018

I opened #35291 to discuss another alternative that doesn't require to change the hits.total. The pr adds the ability to set track_total_hits as a number that indicates the number of hits to count accurately. The hits.total in the search response is set to -1 in this case and a new section called tracked_total indicates the number of tracked hits and whether the number is a lower bound or accurate:

{
  "_shards": ...
   "hits" : {
      "total" : -1,
      "tracked_total": {
        "value": 100,
        "relation": "gte"
      },
      "max_score" : 0.42,
      "hits" : []
  }
}

I prefer this option because it doesn't require a breaking change in the search response format so users that want exact count (or no count at all) are not impacted and we don't need to handle backward compatibility.

@Mpdreamz
Copy link
Member

Mpdreamz commented Nov 6, 2018

If the current thinking is to disable track_total_hits by default in 7.0, returning -1 might be just as breaking for applications as returning total as an object.

Raising this only to make sure we don't phrasing this as backwards compatible. I can get behind both options (an additional field or total as an object) as long as we feel its the cleanest option.

I personally much prefer the object approach since we can just have one location for the count and an additional field to indicate how to interpret this value.

@jimczi
Copy link
Contributor

jimczi commented Nov 6, 2018

Raising this only to make sure we don't phrasing this as backwards compatible.

I was just talking about the format of the response no matter if you disable hit counts or not. But I agree that disabling total hit count by default is a breaking change ;).

I personally much prefer the object approach since we can just have one location for the count and an additional field to indicate how to interpret this value.

Personally I prefer option A where we keep the same format. The only scenario where an object is needed is when a user explicitly sets the track_total_hits threshold and this is only possible in ES 7.
I agree that disabling the hit count by default is a breaking change but changing the response format is another one. I opened #35291 to avoid the latter with the downside of having two location for the count depending on the request options.

@Mpdreamz
Copy link
Member

Mpdreamz commented Nov 6, 2018

I prefer the object approach because its explicit. Typed languages will get a heads up on the change in defaults.

Adittionally you can write pagination over hits.total.value no matter what setting is used for the search.

We need a tiebreaker on this one I reckon :)

@jimczi
Copy link
Contributor

jimczi commented Nov 15, 2018

We had a discussion internally and we've decided to make hits.total an object in 7.0:

{ "hits": { "total": { "value": 1234, "relation": "gte" } } }

This is inlined with option B above. The revised plan to make this change in 7.0 is the following:

  • In 7.0 hits.total is an object. Users can opt out from this format with a request parameter.
  • In 7.0 we make track_total_hits accept numbers in addition to boolean, which acts as a threshold for the number of hits to count.
  • In 6x we issue deprecation warnings when track_total_hits is not set
  • In version 7.0 we change the default value to 1000 like Lucene or 0 to disable the tracking of hit counts entirely. Users can opt out from this new behavior with a request parameter.

We'll use API versioning to opt out from the new behaviors (format change and default value to track total hits). Official clients in 6x and 7 will send their version in the request automatically which will allow clients in 6x to search in a cluster with mixed versions (6x and 7).

@karmi
Copy link
Contributor

karmi commented Nov 15, 2018

Great to see that we're changing to the object!

Official clients in 6x and 7 will send their version in the request automatically which will allow clients in 6x to search in a cluster with mixed versions (6x and 7).

Can you provide more info how that will specifically work, @jimczi ? Does it only mean that a 6.x client will send a different header than a 7.x client, and it's the reponsibility of the user to use two separate clients? How would a 6.x client "know" which header to send otherwise?

@jimczi
Copy link
Contributor

jimczi commented Nov 15, 2018

Does it only mean that a 6.x client will send a different header than a 7.x client, and it's the reponsibility of the user to use two separate clients? How would a 6.x client "know" which header to send otherwise?

A 6.x client should send an header indicating its version (6) so that nodes in the cluster can adapt their defaults and responses to the version. Clients are forward compatible so during a rolling upgrade to 7 users can still use a 6x client and with this mechanism they will not see the breaking change until they upgrade their client to 7x. There are still open questions regarding this versioning but they should be tracked on the main issue.

jimczi added a commit to jimczi/elasticsearch that referenced this issue Nov 23, 2018
This commit changes the format of the `hits.total` in the search response to be an object with
a `value` and a `relation`. The `value` indicates the number of hits that match the query and the
`relation` indicates whether the number is accurate (in which case the relation is equals to `eq`)
or a lower bound of the total (in which case it is equals to `gte`).
This change also adds a parameter called `rest_total_hits_as_int` that can be used in the
search APIs to opt out from this change (retrieve the total hits as a number in the rest response).
Note that currently all search responses are accurate (`track_total_hits: true`) or they don't contain
`hits.total` (`track_total_hits: true`). We'll add a way to get a lower bound of the total hits in a
follow up (to allow numbers to be passed to `track_total_hits`).

Relates elastic#33028
jimczi added a commit to jimczi/elasticsearch that referenced this issue Nov 29, 2018
The support for rest_total_hits_as_int has already been merged to 6x
in elastic#35848 so this change adds this new option to master. The plan was
to add this new option as part of elastic#35848 but we've decided to wait a few
days before merging this breaking change so this commit just handles
the new option as a noop exactly like 6x for now. This will allow
users to migrate to this parameter before elastic#35848 is merged.

Relates elastic#33028
jimczi added a commit that referenced this issue Nov 29, 2018
The support for rest_total_hits_as_int has already been merged to 6x
in #35848 so this change adds this new option to master. The plan was
to add this new option as part of #35848 but we've decided to wait a few
days before merging this breaking change so this commit just handles
the new option as a noop exactly like 6x for now. This will allow
users to migrate to this parameter before #35848 is merged.

Relates #33028
jimczi added a commit that referenced this issue Dec 5, 2018
This commit changes the format of the `hits.total` in the search response to be an object with
a `value` and a `relation`. The `value` indicates the number of hits that match the query and the
`relation` indicates whether the number is accurate (in which case the relation is equals to `eq`)
or a lower bound of the total (in which case it is equals to `gte`).
This change also adds a parameter called `rest_total_hits_as_int` that can be used in the
search APIs to opt out from this change (retrieve the total hits as a number in the rest response).
Note that currently all search responses are accurate (`track_total_hits: true`) or they don't contain
`hits.total` (`track_total_hits: true`). We'll add a way to get a lower bound of the total hits in a
follow up (to allow numbers to be passed to `track_total_hits`).

Relates #33028
jimczi added a commit to jimczi/elasticsearch that referenced this issue Dec 9, 2018
In Lucene 8 searches can skip non-competitive hits if the total hit count is not requested.
It is also possible to track the number of hits up to a certain threshold. This is a trade off to speed up searches while still being able to know a lower bound of the total hit count. This change adds the ability to set this threshold directly in the track_total_hits search option. A boolean value (true, false) indicates whether the total hit count should be tracked in the response. When set as an integer this option allows to compute a lower bound of the total hits while preserving the ability to skip non-competitive hits when enough matches have been collected.

Relates elastic#33028
jimczi added a commit that referenced this issue Jan 4, 2019
In Lucene 8 searches can skip non-competitive hits if the total hit count is not requested.
It is also possible to track the number of hits up to a certain threshold. This is a trade off to speed up searches while still being able to know a lower bound of the total hit count. This change adds the ability to set this threshold directly in the track_total_hits search option. A boolean value (true, false) indicates whether the total hit count should be tracked in the response. When set as an integer this option allows to compute a lower bound of the total hits while preserving the ability to skip non-competitive hits when enough matches have been collected.

Relates #33028
jimczi added a commit to jimczi/elasticsearch that referenced this issue Jan 15, 2019
This commit changes the default for the `track_total_hits` option of the search request
to `10,000`. This means that by default search requests will accurately track the total hit count
up to `10,000` documents, requests that match more than this value will set the `"total.relation"`
to `"gte"` (e.g. greater than or equals) and the `"total.value"` to `10,000` in the search response.
Scroll queries are not impacted, they will continue to count the total hits accurately.
The default is set back to `true` (accurate hit count) if `rest_total_hits_as_int` is set in the search request.
I choose `10,000` as the default because that's also the number we use to limit pagination. This means that
users will be able to know how far they can jump (up to 10,000) even if the total number of hits is not accurate.

Closes elastic#33028
jimczi added a commit that referenced this issue Jan 25, 2019
This commit changes the default for the `track_total_hits` option of the search request
to `10,000`. This means that by default search requests will accurately track the total hit count
up to `10,000` documents, requests that match more than this value will set the `"total.relation"`
to `"gte"` (e.g. greater than or equals) and the `"total.value"` to `10,000` in the search response.
Scroll queries are not impacted, they will continue to count the total hits accurately.
The default is set back to `true` (accurate hit count) if `rest_total_hits_as_int` is set in the search request.
I choose `10,000` as the default because that's also the number we use to limit pagination. This means that users will be able to know how far they can jump (up to 10,000) even if the total number of hits is not accurate.

Closes #33028
@camerondavison
Copy link

whatever happened here. I like

A 6.x client should send an header indicating its version (6) so that nodes in the cluster can adapt their defaults and responses to the version. Clients are forward compatible so during a rolling upgrade to 7 users can still use a 6x client and with this mechanism

But it does not look like this was ever implemented for the java es client? I am trying with the high level rest client in 6.8.1 and I never get back any total hits since its an object but the rest client is expecting it to be a long?

@jimczi
Copy link
Contributor

jimczi commented Jul 30, 2019

But it does not look like this was ever implemented for the java es client? I am trying with the high level rest client in 6.8.1 and I never get back any total hits since its an object but the rest client is expecting it to be a long?

This was not implemented (we target the next major version for this) so the error is expected. Instead we introduced a new parameter called rest_total_hits_as_int in order to allow 6.x rest client to talk to a 7.x cluster but the HLRC does not have this option yet. There is an issue open to add the support for it here: #43925 so you can follow the progress in the issue.

@camerondavison
Copy link

@jimczi thanks for the quick reply

@NelsonBurton
Copy link

In 7.x, will adding the rest_total_hits_as_int negatively affect performance of a query as opposed to leaving it off? In the discussion above, there is a note that to calculate the accurate hits number another aggregation query would need to be run...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>breaking >enhancement release highlight :Search/Search Search-related issues that do not fall into other categories v7.0.0-beta1
Projects
None yet
Development

No branches or pull requests