[ML] Consider using search_after instead of scroll in datafeeds #29781

elasticmachine · 2017-07-10T14:33:27Z

Original comment by @droberts195:

@dimitris-athanasiou tested scroll vs. search_after on @dolaru's 6-node QA cluster (though those instances are quite small: t2.medium).

In this scenario data was pulled from a 5-shard index containing ~15M docs:
it took exactly 2min 45sec every single time with scroll
it took ~3min 3sec on average with search_after
That's roughly an 11% slowdown (about 18 seconds) with search_after.
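
For reference, the relative slowdown implied by the two timings above can be worked out as follows. This is only arithmetic on the numbers quoted in this comment; the helper name is made up for the illustration.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds duration to whole seconds."""
    return minutes * 60 + seconds


scroll_s = to_seconds(2, 45)       # scroll: 2min 45sec
search_after_s = to_seconds(3, 3)  # search_after: ~3min 3sec on average

# Relative slowdown of search_after compared with scroll, in percent.
slowdown_pct = (search_after_s - scroll_s) / scroll_s * 100
print(f"{search_after_s - scroll_s}s slower, {slowdown_pct:.1f}%")
```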

However, search_after does have some benefits for ML, like not being at risk of broken scrolls.

droberts195 · 2019-04-04T10:29:09Z

One more consideration with switching to search_after would be the need for a fully deterministic sort order in datafeeds. At present we just sort on the time field, which is not completely deterministic.

Sorting on time and _id is possible, though very inefficient, because _id does not have doc values. Since we do not control the source indices, we cannot be sure that sorting on _id would not cause serious performance problems for the cluster, so we should not do it.

#39187 (comment) contains an idea for a fully deterministic datafeed sort order. However, the comment below shows that we decided to step back from it in case it also caused performance problems.
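
To illustrate why the sort order matters, here is a toy simulation of search_after-style paging over an in-memory list. It is a sketch, not Elasticsearch code: the `page` and `paginate` helpers and the sample documents are hypothetical. It only assumes the documented search_after behaviour of returning hits whose sort values come strictly after the supplied ones.

```python
from typing import Callable, List, Optional, Tuple

Doc = Tuple[int, str]  # (timestamp, id): hypothetical toy documents


def page(docs: List[Doc], sort_key: Callable[[Doc], tuple],
         after: Optional[tuple], size: int) -> List[Doc]:
    """Return the next `size` docs whose sort key is strictly after `after`."""
    ordered = sorted(docs, key=sort_key)
    if after is not None:
        ordered = [d for d in ordered if sort_key(d) > after]
    return ordered[:size]


def paginate(docs: List[Doc], sort_key: Callable[[Doc], tuple],
             size: int) -> List[Doc]:
    """Page through all docs, feeding the last hit's sort key into the next page."""
    seen: List[Doc] = []
    after: Optional[tuple] = None
    while True:
        hits = page(docs, sort_key, after, size)
        if not hits:
            return seen
        seen.extend(hits)
        after = sort_key(hits[-1])


# Three documents share timestamp 2, so the time field alone is not unique.
docs = [(1, "a"), (2, "b"), (2, "c"), (2, "d"), (3, "e")]

time_only = paginate(docs, lambda d: (d[0],), size=2)
time_and_id = paginate(docs, lambda d: (d[0], d[1]), size=2)

print(len(time_only), len(time_and_id))
```

With the time-only sort, the documents that tie with the last hit of a page are skipped. Adding a unique tiebreaker returns every document, which is why a fully deterministic order is needed before switching.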

droberts195 · 2019-04-04T10:33:05Z

Switching to search_after would prevent us hitting problems due to exceeding search.max_open_scroll_context - see #40772.

droberts195 · 2019-09-13T11:59:38Z

> One more consideration with switching to search_after would be the need for a fully deterministic sort order in datafeeds

The work being done in #61062 to implement #26472 solves this. Once #61062 is merged we can switch from scroll to search_after, but we'll be doing the search_after in a point-in-time view, so will be able to use a sort order of (time field, _doc) to get a unique ordering within each chunk.
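
A rough sketch of what a point-in-time search_after request body with a (time field, _doc) sort might look like is shown below. It only builds the JSON bodies as Python dicts and does not call a cluster. The function names, field name, page size and keep-alive value are assumptions for illustration, not the datafeed implementation.

```python
from typing import Any, Dict, List, Optional


def build_search_body(pit_id: str, time_field: str, start_ms: int, end_ms: int,
                      size: int,
                      search_after: Optional[List[Any]] = None) -> Dict[str, Any]:
    """Build a search body for one page of one chunk [start_ms, end_ms)."""
    body: Dict[str, Any] = {
        "size": size,
        "query": {"range": {time_field: {"gte": start_ms, "lt": end_ms,
                                         "format": "epoch_millis"}}},
        # With a PIT the target index is not named in the request path;
        # the PIT id identifies the view being searched.
        "pit": {"id": pit_id, "keep_alive": "1m"},  # keep_alive is an assumed value
        # Time field first, then _doc as the tiebreaker within the PIT view.
        "sort": [{time_field: {"order": "asc"}}, "_doc"],
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body


def next_search_after(response: Dict[str, Any]) -> Optional[List[Any]]:
    """Take the sort values of the last hit, or None when the page is empty."""
    hits = response.get("hits", {}).get("hits", [])
    return hits[-1]["sort"] if hits else None


# Usage with a hand-written (fake) response, just to show the paging loop's shape.
first = build_search_body("hypothetical-pit-id", "timestamp", 0, 3_600_000, 1000)
fake_response = {"hits": {"hits": [{"_id": "x", "sort": [1_500, 42]}]}}
second = build_search_body("hypothetical-pit-id", "timestamp", 0, 3_600_000, 1000,
                           search_after=next_search_after(fake_response))
```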

droberts195 · 2021-02-25T13:19:09Z

We need to be very careful about implementing this. #68833 was only merged into 7.12, and it is likely to be required by any change from scroll to point-in-time search_after in ML. So when we make the switch we'll break the ability to do CCS against clusters on versions older than 7.12. Therefore we should definitely not make this change in 7.x.

Even if we made the change at some point during the 8.x series, it would break CCS compatibility with 7.0 to 7.11 for ML anomaly detectors. Maybe that isn't so bad, as it should be much easier to upgrade a 7.0 to 7.11 cluster to 7.last than to upgrade it to 8.x. We would still need to document the limitation, and should still wait until a year or so after the 8.0 release. We should also be mindful of any stack-wide CCS compatibility policies that get defined over the coming weeks.
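
The compatibility boundary described in this comment could be expressed as a simple version gate. This is a hypothetical sketch: the function and constant names are invented here, and the only fact taken from the thread is that remote clusters older than 7.12 would not be supported.

```python
from typing import Tuple

# Per the comment above: #68833 was only merged into 7.12.
MIN_REMOTE_VERSION: Tuple[int, int] = (7, 12)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Parse 'major.minor[.patch]' into an integer (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def remote_supports_pit_search_after(remote_version: str) -> bool:
    """True if a remote cluster is new enough for the hypothetical switch."""
    # Compare as integer tuples: a string comparison would rank "7.9" above "7.12".
    return parse_major_minor(remote_version) >= MIN_REMOTE_VERSION


for v in ("7.9.3", "7.11.2", "7.12.0", "8.0.0"):
    print(v, remote_supports_pit_search_after(v))
```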

elasticmachine added :ml Machine learning >enhancement high hanging fruit labels Apr 25, 2018
