
ML: Add support for rollup Indexes in Datafeeds #34654

Merged
merged 26 commits into elastic:master from benwtrent:feature/datafeed-rollup-support on Nov 1, 2018

Conversation

benwtrent
Member

@benwtrent benwtrent commented Oct 19, 2018

This adds rollup index support for data extraction in datafeeds.

There are some changes on the Rollup side of the house, so a review from that team may be necessary for the following:

  • Updating RollupSearchAction so that a previously crafted search can be injected via the constructor
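For illustration, a minimal sketch of what that constructor injection could look like (class and member shapes are assumed, not the PR's exact code):

```java
// Sketch: let a previously crafted SearchRequest be injected directly,
// instead of the request only ever being parsed from the REST layer.
public static class Request extends ActionRequest {

    private SearchRequest searchRequest;

    public Request() {
        // no-arg constructor retained for wire deserialization
    }

    public Request(SearchRequest searchRequest) {
        // e.g. a search already built by the ML datafeed's data extractor
        this.searchRequest = searchRequest;
    }
}
```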

@elasticmachine
Collaborator

Pinging @elastic/ml-core

@benwtrent
Member Author

Jenkins retest this please

@benwtrent
Member Author

One current issue is that a "rollup index ONLY" datafeed does not work; I am looking into that. Rollups don't really have timestamp fields, so I am unsure how to reconcile that with the need to support rollup-only lookbacks.

@benwtrent benwtrent added the WIP label Oct 22, 2018
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment


Leaving some of the comments I had before you park this.

@@ -43,32 +44,33 @@
private static ParseField INDEX_PATTERN = new ParseField("index_pattern");
private static ParseField FIELDS = new ParseField("fields");

private final RollupJobConfig rollupJobConfig;
Contributor

This is missing from serialization (both wire and xContent). Is that by design?

DataExtractorFactory.create(client,
previewDatafeed.build(),
job,
new Auditor(client, clusterService.nodeName()),
Contributor

You should be able to inject the auditor in the action, store it as a member and use that instead of constructing a new one.
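A minimal sketch of that pattern (surrounding transport-action boilerplate and constructor parameters are elided/assumed):

```java
// Construct the Auditor once when the action is created and reuse it,
// rather than newing one up on every request.
public class TransportPreviewDatafeedAction /* extends HandledTransportAction<...> */ {

    private final Client client;
    private final Auditor auditor;

    @Inject
    public TransportPreviewDatafeedAction(Client client, ClusterService clusterService) {
        this.client = client;
        this.auditor = new Auditor(client, clusterService.nodeName());
    }

    // ...later, in the request handler:
    // DataExtractorFactory.create(client, previewDatafeed.build(), job, auditor, listener);
}
```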

response -> {
if (response.getJobs().isEmpty()) { // This means no rollup indexes are in the config
if (datafeed.hasAggregations()) {
auditor.info(job.getId(), "Creating aggregated data extractor for datafeed [" + datafeed.getId() + "]");
Contributor

Why are those messages audited? They look more suitable for logging rather than auditing. The auditor will create job notifications and the user will see those in the job messages tab in the UI. We've been keeping those messages at a higher level for operational awareness. The type of extractor seems like a technical detail the user shouldn't really be aware of. That is my take at least, but I'm open for discussion if I missed something.


GetRollupIndexCapsAction.Request request = new GetRollupIndexCapsAction.Request(datafeed.getIndices().toArray(new String[0]));

ClientHelper.<GetRollupIndexCapsAction.Request, GetRollupIndexCapsAction.Response, GetRollupIndexCapsAction.RequestBuilder>
Contributor

You probably don't need the explicit type parameters here, which will make this line more readable.
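Assuming the call here is ClientHelper.executeAsyncWithOrigin (arguments illustrative), dropping the witness looks like:

```java
// Before: a three-type witness that dominates the line
// ClientHelper.<GetRollupIndexCapsAction.Request, GetRollupIndexCapsAction.Response,
//         GetRollupIndexCapsAction.RequestBuilder>executeAsyncWithOrigin(
//                 client, ClientHelper.ML_ORIGIN, GetRollupIndexCapsAction.INSTANCE, request, listener);

// After: the compiler infers the type arguments from the action, request and listener
ClientHelper.executeAsyncWithOrigin(client, ClientHelper.ML_ORIGIN,
        GetRollupIndexCapsAction.INSTANCE, request, listener);
```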

@benwtrent benwtrent removed the WIP label Oct 23, 2018
if (datafeed.hasAggregations()) { // Rollup indexes require aggregations
RollupDataExtractorFactory.create(client, datafeed, job, response.getJobs(), factoryHandler);
} else {
throw new IllegalArgumentException("Aggregations are required when using Rollup indices");


use the listener?: factoryHandler.onFailure(new IAE(...));

There is a messages class which centrally bundles all messages as constants, might be good to use it here.
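Putting both suggestions together, a sketch (the Messages constant name is hypothetical):

```java
if (datafeed.hasAggregations()) { // Rollup indexes require aggregations
    RollupDataExtractorFactory.create(client, datafeed, job, response.getJobs(), factoryHandler);
} else {
    // Fail via the listener instead of throwing from an async callback, and
    // take the text from the central Messages class (constant name assumed).
    factoryHandler.onFailure(new IllegalArgumentException(
            Messages.getMessage(Messages.DATAFEED_AGGREGATIONS_REQUIRED_FOR_ROLLUP)));
}
```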

@@ -424,7 +424,7 @@ private TimeValue defaultFrequencyTarget(TimeValue bucketSpan) {

private static final TimeValue MIN_DEFAULT_QUERY_DELAY = TimeValue.timeValueMinutes(1);
private static final TimeValue MAX_DEFAULT_QUERY_DELAY = TimeValue.timeValueMinutes(2);
- private static final int DEFAULT_AGGREGATION_CHUNKING_BUCKETS = 1000;
+ public static final int DEFAULT_AGGREGATION_CHUNKING_BUCKETS = 1000;


nit: push to the top as it is now public


@Override
public void cancel() {
LOGGER.trace("[{}] Data extractor received cancel request", context.jobId);


nit: debug?

Contributor

This was trace originally, but I agree it is worth taking the chance and changing it to debug.

import org.elasticsearch.xpack.core.rollup.action.RollupJobCaps.RollupFieldCaps;
import org.elasticsearch.xpack.core.rollup.job.DateHistogramGroupConfig;
import org.elasticsearch.xpack.ml.datafeed.extractor.DataExtractorFactory;
import org.joda.time.DateTimeZone;


There is a push to move away from Joda; is it possible to use java.time instead?
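If the Joda import is only there for DateTimeZone.UTC (an assumption; the diff context doesn't show the usage), the swap is mechanical:

```java
// before (org.joda.time):
//   import org.joda.time.DateTimeZone;
//   ... DateTimeZone.UTC ...

// after (java.time):
import java.time.ZoneOffset;
// ... ZoneOffset.UTC ...
```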

private static boolean hasAggregations(ParsedRollupCaps rollupCaps, List<AggregationBuilder> datafeedAggregations) {
for (AggregationBuilder aggregationBuilder : datafeedAggregations) {
String type = aggregationBuilder.getType();
String field = ((ValuesSourceAggregationBuilder) aggregationBuilder).field();


I think it would be cleaner if the parameter of this method were already List<ValuesSourceAggregationBuilder>, with the calling code downcasting as soon as it has verified that the AggregationBuilder is a ValuesSourceAggregationBuilder (I assume anything else would be a bug anyway).
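A sketch of the suggested shape (the per-field validation against the caps is elided):

```java
// The caller verifies/downcasts once, so this method can assume the stronger type.
private static boolean hasAggregations(ParsedRollupCaps rollupCaps,
                                       List<ValuesSourceAggregationBuilder> datafeedAggregations) {
    for (ValuesSourceAggregationBuilder aggregationBuilder : datafeedAggregations) {
        String type = aggregationBuilder.getType();
        String field = aggregationBuilder.field(); // no cast needed any more
        // ...check type and field against rollupCaps as before...
    }
    return true;
}
```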

@benwtrent
Member Author

This is now blocked by #34815

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment


Leaving a first batch of comments. Still working through the last 2/5 of the PR :-)

listener.onFailure(e);
}
});
DataExtractorFactory.create(client,
Contributor

nit: since there is now no change in this file, you could remove the file entirely from the commit.

factory -> listener.onResponse(datafeed.getChunkingConfig().isEnabled()
? new ChunkedDataExtractorFactory(client, datafeed, job, factory) : factory)
, listener::onFailure
factory -> {
Contributor

this change also seems unnecessary now


*
* @param <T> The request builder type for getting data from ElasticSearch
*/
public abstract class AbstractAggregationDataExtractor<T extends ActionRequestBuilder<SearchRequest, SearchResponse>>
Contributor

make package private?

return ClientHelper.executeWithHeaders(context.headers, ClientHelper.ML_ORIGIN, client, searchRequestBuilder::get);
}

protected abstract T buildSearchRequest();
Contributor

template method FTW! :-)

// For derivative aggregations the first bucket will always be null
// so query one extra histogram bucket back and hope there is data
// in that bucket
long histogramSearchStartTime = Math.max(0, context.start - ExtractorUtils.getHistogramIntervalMillis(context.aggs));
Contributor

Most of this method is still common between the 2 implementations. There are some flaky bits here, like the -1 histogram bucket for the derivative aggregations. Would be nice to keep those in a single place.

What if we make the abstract method buildSearchRequest(SearchSourceBuilder)? Then we'd have the common logic in the super class, and we can set the types in the AggregationDataExtractor while not doing so in the rollup one.
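A sketch of that refactor (context fields and ExtractorUtils helpers as they appear elsewhere in this diff; the rest is assumed):

```java
abstract class AbstractAggregationDataExtractor<T extends ActionRequestBuilder<SearchRequest, SearchResponse>> {

    protected final AggregationDataExtractorContext context; // shared context, as in the PR

    AbstractAggregationDataExtractor(AggregationDataExtractorContext context) {
        this.context = context;
    }

    // Subclasses only convert the shared source into their concrete request type;
    // AggregationDataExtractor can set document types, the rollup one does not.
    protected abstract T buildSearchRequest(SearchSourceBuilder searchSourceBuilder);

    private T buildBaseSearchRequest() {
        // The fiddly shared logic stays in one place: query one extra histogram
        // bucket back so derivative aggregations have a prior bucket with data.
        long histogramSearchStartTime = Math.max(0,
                context.start - ExtractorUtils.getHistogramIntervalMillis(context.aggs));
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder()
                .size(0)
                .query(ExtractorUtils.wrapInTimeRangeQuery(
                        context.query, context.timeField, histogramSearchStartTime, context.end));
        context.aggs.getAggregatorFactories().forEach(searchSourceBuilder::aggregation);
        return buildSearchRequest(searchSourceBuilder);
    }
}
```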

return false;
}
try {
long jobInterval = validateAndGetCalendarInterval(rollupJobGroupConfig.getInterval());
Contributor

nit: I find that when using static methods like in this case, it makes the code more readable to explicitly call ExtractorUtils.validateAndGetCalendarInterval(...) rather than statically import the method. It means I don't have to wonder where this method comes from. I suppose if a method was used many times in a file it could be otherwise. Anyway, not something you need to change, just a thought.

private final Set<String> supportedMetrics;
private final Set<String> supportedTerms;
private final Map<String, Object> datehistogramAgg;
private static List<String> aggsToIgnore = Arrays.asList(HistogramAggregationBuilder.NAME, DateHistogramAggregationBuilder.NAME);
Contributor

make final

private static ParsedRollupCaps fromJobFieldCaps(Map<String, RollupFieldCaps> rollupFieldCaps, String timeField) {
Map<String, Object> datehistogram = null;
RollupFieldCaps timeFieldCaps = rollupFieldCaps.get(timeField);
if ((timeFieldCaps == null) == false) {
Contributor

simplify: if (timeFieldCaps != null)?

Member Author

I am fine with that simplification (my preference is to use != rather than == false), but the prevailing pattern I have seen throughout Elastic is == false.

Contributor

{expression} == false is preferred over !{expression}. Other than that, != is used normally.
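In other words (names taken from the snippet above):

```java
// null checks and comparisons use != / == as normal:
if (timeFieldCaps != null) {
    // ...
}
// but boolean negation spells out == false, since a lone '!' is easy to miss:
if (rollupCaps.hasDatehistogram() == false) {
    // preferred over: if (!rollupCaps.hasDatehistogram())
}
```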

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment


OK, rest went quickly. Only one comment.

* @throws IOException when timefield range search fails
*/
private DataSummary buildDataSummary() throws IOException {
if (context.hasAggregations) {
Contributor

How about extracting both branches into their own methods here? i.e. createDataSummaryForAggregations and createDataSummaryForScroll.
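i.e., roughly (the helper bodies would be the two existing branches, moved out):

```java
private DataSummary buildDataSummary() throws IOException {
    return context.hasAggregations
            ? createDataSummaryForAggregations()
            : createDataSummaryForScroll();
}
```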

Member Author

For sure.

@benwtrent
Member Author

This is directly blocked by #34831

if (rollupJobGroupConfig.hasDatehistogram() == false) {
return false;
}
if ("UTC".equalsIgnoreCase(rollupJobGroupConfig.getTimezone()) == false) {


out of curiosity: is that something for a follow-up? If the bucket span is compatible with shifting, it should not be a big deal to support other timezones?

Contributor

++

I imagine it only really matters for bucket spans of more than 30 minutes. For ML 1 day bucket spans are always 24 hours and UTC, whereas for a date histogram with 1 day buckets (a) they could be offset from the UTC buckets the analytics expects to see and (b) if there is daylight saving then one bucket per year will be 23 hours and one will be 25 hours. Also, for 1 hour bucket spans the offset problem could occur if the timezone is N.5 hours different from UTC, e.g. India at 5.5 hours ahead. But for bucket spans up to and including 30 minutes we should be able to cope with any rollup timezone.

Or am I missing some other problem here?

Contributor

I think your description is correct @droberts195. We have gone through this before and the restriction of UTC timezone is already in place for aggregations (without rollups). Ben has just copied the validation over to the rollup realm. Since things might go wrong when dealing with non-UTC, imposing the UTC restriction is a good solution. It only requires the user to NOT change the default timezone of the date histogram aggregation. It does not pose any requirements on the actual data and how they're indexed.

Contributor

OK cool. Let's leave any changes in this area for a separate PR then. Definitely don't make any changes for this in this PR.

Contributor

Ah, just realised my comment above is confused. This is checking the rollup was in UTC. It is worth finding out what happens if you search a rollup which uses timezone X by aggregating over timezone Y. We need that before making a call on how we deal with this.

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment


LGTM

}

protected SearchResponse executeSearchRequest(T searchRequestBuilder) {
return ClientHelper.executeWithHeaders(context.headers, ClientHelper.ML_ORIGIN, client, searchRequestBuilder::get);
Contributor

This will fail if a rollup search has been selected but the user who created the datafeed doesn't have permission to use rollups.

We have to decide how we want to handle this situation better. The two options are:

  1. If the index pattern includes a rollup index when the datafeed is created and the user does not have permission to run rollup searches then refuse to create the datafeed
  2. During the check on whether to use a normal search or rollup search, if the user doesn't have permission to run rollup searches then silently fall back to a normal search

I think option 1 is probably better because it avoids the possibility of a datafeed that the user thinks will use rollups silently not working as expected.

To implement that, look in TransportPutDatafeedAction.masterOperation. It currently does a HasPrivilegesRequest to check that the user can search the desired indices. If the index pattern provided includes a rollup index then it should also check privileges for rollup search (normal search privilege is still required as well because rollup search uses that internally).

There is still a theoretical flaw in this as an index pattern could be provided that does not match a rollup index at the time the datafeed is created, but does later on. For example, suppose a datafeed is created against farequote*. Initially this matches farequote-20181029, farequote-20181030, farequote-20181031, etc. But later it also matches farequote_rollup. With the current logic the datafeed would switch to rollup search at that time, and start to fail if the user did not have permission to use rollup search. But this is an edge case and I'm happy to just cover the more obvious case initially.
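For concreteness, a sketch of the extra check in TransportPutDatafeedAction.masterOperation, built on the existing HasPrivilegesRequest flow (the rollup-detection flag is hypothetical):

```java
RoleDescriptor.IndicesPrivileges.Builder indicesPrivilegesBuilder =
        RoleDescriptor.IndicesPrivileges.builder().indices(indices);
if (indicesHaveRollupCaps) { // hypothetical: derived from a GetRollupIndexCapsAction lookup
    // _rollup_search runs a normal search internally, so require both actions
    indicesPrivilegesBuilder.privileges(SearchAction.NAME, RollupSearchAction.NAME);
} else {
    indicesPrivilegesBuilder.privileges(SearchAction.NAME);
}
privRequest.indexPrivileges(indicesPrivilegesBuilder.build());
client.execute(HasPrivilegesAction.INSTANCE, privRequest, privResponseListener);
```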

Member Author

@droberts195 how is this behavior any different from when a user creates a datafeed against an index they don't have access to?

The reason I have it the way I do is behavior parity with how we handle permission issues when reading data from an index.

I will happily change it to verify the user has admin privileges for the indices. I would just note that the Rollup folks are working towards simply requiring regular search permissions for _rollup_search so that it is treated as a simple index read, which would give it parity with other datafeed read actions.

Contributor

> how is this behavior any different from when a user creates a datafeed against an index they don't have access to?

The problem is that there are many types of "access". Each action can be allowed or disallowed for a particular user. There are index privileges that group together multiple actions. But there's also some overlap in what actions the different privileges allow.

> the Rollup folks are working towards simply requiring regular search permissions for _rollup_search

They'll do this by adjusting the actions that are allowed by the READ_AUTOMATON. But our privileges check in TransportPutDatafeedAction.java is checking specifically for the ability to run the _search action, not for the "read" index privilege. I think we should also check for the ability to use the _rollup_search action at that same point in the code if the index pattern provided includes a rollup index (which will require an extra test to be done early in TransportPutDatafeedAction).

Member Author

I see what you are saying, @droberts195. I was thinking of the extraction layer, not the creation layer, of the datafeed. Will update shortly.

@benwtrent
Member Author

Jenkins retest this please

@benwtrent
Member Author

Jenkins retest this please

Contributor

@droberts195 droberts195 left a comment


LGTM

@droberts195
Contributor

Actually there's a checkstyle error in the latest commit:

[ant:checkstyle] [ERROR] /var/lib/jenkins/workspace/elastic+elasticsearch+pull-request/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportPutDatafeedAction.java:8:8: Unused import - org.elasticsearch.ResourceNotFoundException. [UnusedImports]

But I'm happy for you to merge this as soon as you get a green PR build.

@benwtrent benwtrent merged commit 2fadec5 into elastic:master Nov 1, 2018
@benwtrent benwtrent deleted the feature/datafeed-rollup-support branch November 1, 2018 15:02
benwtrent added a commit that referenced this pull request Nov 1, 2018
* Adding rollup support for datafeeds

* Fixing tests and adjusting formatting

* minor formatting change

* fixing some syntax and removing redundancies

* Refactoring and fixing failing test

* Refactoring, adding paranoid null check

* Moving rollup into the aggregation package

* making AggregationToJsonProcessor package private again

* Addressing test failure

* Fixing validations, chunking

* Addressing failing test

* rolling back RollupJobCaps changes

* Adding comment and cleaning up test

* Addressing review comments and test failures

* Moving builder logic into separate methods

* Addressing PR comments, adding test for rollup permissions

* Fixing test failure

* Adding rollup priv check on datafeed put

* Handling missing index when getting caps

* Fixing unused import