[ML] Consider using search_after instead of scroll in datafeeds #29781

Open
elasticmachine opened this issue Jul 10, 2017 · 4 comments

Comments

@elasticmachine
Collaborator

Original comment by @droberts195:

@dimitris-athanasiou tested scroll vs. search_after on @dolaru's 6-node QA cluster (though those instances are quite small, t2.medium):

  • in this scenario data was pulled from a 5-shard index
  • ~15M docs
  • it took exactly [2min 45sec] every single time for the scroll version
  • it took ~[3min 3sec] on average when doing search_after
  • that’s roughly a 10% slowdown with search_after
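
For reference, the slowdown figure can be checked from the two timings quoted above. This is just the arithmetic, with a helper name (`slowdown_percent`) made up for illustration:

```python
def slowdown_percent(baseline_seconds: float, candidate_seconds: float) -> float:
    """Relative slowdown of the candidate versus the baseline, as a percentage."""
    return (candidate_seconds - baseline_seconds) / baseline_seconds * 100.0


scroll_seconds = 2 * 60 + 45       # 2min 45sec -> 165 s
search_after_seconds = 3 * 60 + 3  # 3min 3sec  -> 183 s

# Prints 10.9%, which is where the "10%" in the list above comes from.
print(f"{slowdown_percent(scroll_seconds, search_after_seconds):.1f}%")
```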

However, search_after does have some benefits for ML, like not being at risk of broken scrolls.

@droberts195
Contributor

One more consideration with switching to search_after would be the need for a fully deterministic sort order in datafeeds. At present we just sort on the time field, which is not completely deterministic.

Sorting on time and _id is possible, though very inefficient, because _id does not have doc values. Since we do not control the source indices, we cannot be sure that sorting on _id would not cause serious performance problems for the cluster, so we should not do it.
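
To illustrate why the sort order matters, here is a toy in-memory model of search_after-style paging. It is not Elasticsearch code: `paginate`, the `Doc` tuple, and the sample data are all made up for this sketch, and the second tuple element simply stands in for some unique per-document tiebreaker. The only behaviour it assumes is that each page starts strictly after the last sort key of the previous page.

```python
from typing import Callable, List, Tuple

# (time, tiebreaker): the tiebreaker stands in for any unique per-document value.
Doc = Tuple[int, int]


def paginate(docs: List[Doc], key: Callable[[Doc], tuple], page_size: int) -> List[Doc]:
    """Emulate search_after paging: every page contains the first `page_size`
    docs whose sort key is strictly greater than the last key already returned."""
    ordered = sorted(docs, key=key)
    seen: List[Doc] = []
    last = None
    while True:
        page = [d for d in ordered if last is None or key(d) > last][:page_size]
        if not page:
            return seen
        seen.extend(page)
        last = key(page[-1])


# Three documents share time == 2, so a page boundary can fall inside the tie.
docs = [(1, 0), (2, 1), (2, 2), (2, 3), (3, 4)]

time_only = paginate(docs, key=lambda d: (d[0],), page_size=2)
time_and_tiebreaker = paginate(docs, key=lambda d: (d[0], d[1]), page_size=2)

print(len(time_only), len(time_and_tiebreaker))
```

With the time-only key, the two documents that tie with the last hit of the first page are silently skipped; adding a unique tiebreaker returns every document.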

#39187 (comment) contains an idea for a fully deterministic datafeed sort order. However, the comment below shows that we decided to step back from it in case it also caused performance problems.

@droberts195
Contributor

Switching to search_after would prevent us hitting problems due to exceeding search.max_open_scroll_context - see #40772.

@droberts195
Contributor

droberts195 commented Sep 13, 2019

> One more consideration with switching to search_after would be the need for a fully deterministic sort order in datafeeds

The work being done in #61062 to implement #26472 solves this. Once #61062 is merged we can switch from scroll to search_after, but we'll be doing the search_after in a point-in-time view, so we will be able to use a sort order of (time field, _doc) to get a unique ordering within each chunk.
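
As a rough sketch of what such a request could look like (not the datafeed implementation), the helper below builds a search body that pairs a point-in-time ID with a (time field, _doc) sort and an optional search_after cursor. The function name `build_chunk_search`, the `timestamp` field, the placeholder PIT ID, and the `1m` keep-alive are assumptions for illustration; the PIT ID would come from opening a point in time on the source index first.

```python
from typing import Any, Dict, List, Optional


def build_chunk_search(
    pit_id: str,
    time_field: str,
    start_ms: int,
    end_ms: int,
    size: int,
    search_after: Optional[List[Any]] = None,
) -> Dict[str, Any]:
    """Build a search body for one page of one time chunk in a point-in-time view."""
    body: Dict[str, Any] = {
        "size": size,
        # Restrict the page to the chunk's half-open time range [start_ms, end_ms).
        "query": {
            "range": {
                time_field: {"gte": start_ms, "lt": end_ms, "format": "epoch_millis"}
            }
        },
        # A PIT search names no index; the PIT ID pins the view being searched.
        "pit": {"id": pit_id, "keep_alive": "1m"},
        # Time first, then _doc as the tiebreaker within the point-in-time view.
        "sort": [{time_field: {"order": "asc"}}, {"_doc": {"order": "asc"}}],
    }
    if search_after is not None:
        # Sort values of the last hit on the previous page.
        body["search_after"] = search_after
    return body


first_page = build_chunk_search("example-pit-id", "timestamp", 0, 3_600_000, 1000)
# The cursor values here are invented; in practice they are copied from the
# "sort" array of the last hit returned for the previous page.
next_page = build_chunk_search(
    "example-pit-id", "timestamp", 0, 3_600_000, 1000, search_after=[1_234_567, 42]
)
```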

@droberts195
Contributor

We need to be very careful about implementing this. #68833 was only merged into 7.12, and it is likely to be required by any change from scroll to point-in-time search_after in ML. So when we make the switch we'll break the ability to do CCS against clusters on versions older than 7.12. Therefore we should definitely not make this change in 7.x. Even if we made the change at some point during the 8.x series, it would break CCS compatibility with 7.0 to 7.11 for ML anomaly detectors. Maybe that isn't so bad, as it should be much easier to upgrade a 7.0 to 7.11 cluster to 7.last than to upgrade it to 8.x. We would still need to document the limitation and should still wait until a year or so after the 8.0 release. We should also be mindful of any stack-wide CCS compatibility policies that get defined over the coming weeks.
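
The compatibility boundary described above can be written down as a small check. This is only a hypothetical illustration: `supports_pit_search_after` and `parse_version` are made-up helpers, and the 7.12 threshold is taken from the statement that #68833 was only merged into 7.12.

```python
from typing import Tuple

# From the comment above: #68833 was only merged into 7.12.
MIN_PIT_SEARCH_AFTER_VERSION = (7, 12, 0)


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'major.minor.patch', ignoring a qualifier such as '-SNAPSHOT'."""
    core = version.split("-", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return (major, minor, patch)


def supports_pit_search_after(remote_version: str) -> bool:
    """True if a remote cluster is new enough for point-in-time search_after via CCS."""
    return parse_version(remote_version) >= MIN_PIT_SEARCH_AFTER_VERSION


for version in ["7.0.0", "7.11.2", "7.12.0", "8.0.0-SNAPSHOT"]:
    print(version, supports_pit_search_after(version))
```

Remote clusters on 7.0 to 7.11 fail the check, which is the range whose CCS compatibility would break for ML anomaly detectors.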
