[Rollup] Only allow aggregating on multiples of configured interval (#32052)

We need to limit the search request aggregations to whole multiples
of the configured interval for both histogram and date_histogram.
Otherwise, agg buckets won't overlap with the rolled up buckets
and the results will be incorrect.

For histogram, the validation is very simple: the request interval must be >= the
configured interval and be evenly divisible by it.

Dates are more tricky.
- If both request and config are fixed intervals, we can convert them to millis
and treat them just like the histogram case
- If both are calendar, we make sure the request is >= the config with
a static lookup map that ranks the calendar values relatively.  All
calendar units are "singles", so they are evenly divisible already
- We disallow any other combination (one fixed, one calendar, etc)
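
The rules above can be sketched roughly as follows. This is a minimal Python illustration, not the code from this commit: the function names, the `(kind, value)` input shape, and the ordering in `CALENDAR_RANK` are all assumptions made for the example.

```python
# Hypothetical sketch of the validation rules described in the commit message.
# Names and the rank table are illustrative, not taken from the Elasticsearch source.

# Assumed relative ordering of single calendar units, smallest to largest.
CALENDAR_RANK = {"1m": 1, "1h": 2, "1d": 3, "1w": 4, "1M": 5, "1q": 6, "1y": 7}


def validate_histogram(request_interval, config_interval):
    """Numeric histogram: request must be >= config and a whole multiple of it."""
    return (request_interval >= config_interval
            and request_interval % config_interval == 0)


def validate_date_histogram(request, config):
    """Each argument is a (kind, value) pair.

    kind is "fixed" (value in milliseconds) or "calendar" (value is a unit like "1h").
    """
    req_kind, req_value = request
    cfg_kind, cfg_value = config
    if req_kind == "fixed" and cfg_kind == "fixed":
        # Fixed intervals are plain numbers once converted to millis.
        return validate_histogram(req_value, cfg_value)
    if req_kind == "calendar" and cfg_kind == "calendar":
        # Calendar units are singles, so only the relative rank matters.
        return CALENDAR_RANK[req_value] >= CALENDAR_RANK[cfg_value]
    # Mixed fixed/calendar combinations are rejected.
    return False
```

Under these assumptions, a fixed two-hour request against a fixed one-hour config passes, while a fixed `24h` request against a calendar `1h` config is rejected because the kinds differ.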
polyfractal committed Aug 29, 2018
1 parent 7d4895d commit a29af74
Showing 8 changed files with 380 additions and 84 deletions.
3 changes: 1 addition & 2 deletions x-pack/docs/build.gradle
@@ -686,9 +686,8 @@ setups['sensor_prefab_data'] = '''
page_size: 1000
groups:
date_histogram:
delay: "7d"
field: "timestamp"
interval: "1h"
interval: "7d"
time_zone: "UTC"
terms:
fields:
2 changes: 2 additions & 0 deletions x-pack/docs/en/rest-api/rollup/put-job.asciidoc
@@ -43,6 +43,8 @@ started with the <<rollup-start-job,Start Job API>>.
`metrics`::
(object) Defines the metrics that should be collected for each grouping tuple. See <<rollup-job-config,rollup job config>>.

For more details about the job configuration, see <<rollup-job-config>>.

==== Authorization

You must have `manage` or `manage_rollup` cluster privileges to use this API.
50 changes: 45 additions & 5 deletions x-pack/docs/en/rest-api/rollup/rollup-job-config.asciidoc
@@ -23,7 +23,7 @@ PUT _xpack/rollup/job/sensor
"groups" : {
"date_histogram": {
"field": "timestamp",
"interval": "1h",
"interval": "60m",
"delay": "7d"
},
"terms": {
@@ -99,7 +99,7 @@ fields will then be available later for aggregating into buckets. For example,
"groups" : {
"date_histogram": {
"field": "timestamp",
"interval": "1h",
"interval": "60m",
"delay": "7d"
},
"terms": {
@@ -133,9 +133,9 @@ The `date_histogram` group has several parameters:
The date field that is to be rolled up.

`interval` (required)::
The interval of time buckets to be generated when rolling up. E.g. `"1h"` will produce hourly rollups. This follows standard time formatting
syntax as used elsewhere in Elasticsearch. The `interval` defines the _minimum_ interval that can be aggregated only. If hourly (`"1h"`)
intervals are configured, <<rollup-search,Rollup Search>> can execute aggregations with 1hr or greater (weekly, monthly, etc) intervals.
The interval of time buckets to be generated when rolling up. E.g. `"60m"` will produce 60 minute (hourly) rollups. This follows standard time formatting
syntax as used elsewhere in Elasticsearch. The `interval` defines the _minimum_ interval that can be aggregated only. If hourly (`"60m"`)
intervals are configured, <<rollup-search,Rollup Search>> can execute aggregations with 60m or greater (weekly, monthly, etc) intervals.
So define the interval as the smallest unit that you wish to later query.

Note: smaller, more granular intervals take up proportionally more space.
@@ -154,6 +154,46 @@ The `date_histogram` group has several parameters:
to be stored with a specific timezone. By default, rollup documents are stored in `UTC`, but this can be changed with the `time_zone`
parameter.

.Calendar vs Fixed time intervals
**********************************
Elasticsearch understands both "calendar" and "fixed" time intervals. Fixed time intervals are fairly easy to understand;
`"60s"` means sixty seconds. But what does `"1M"` mean? One month of time depends on which month we are talking about;
some months are longer or shorter than others. This is an example of "calendar" time, and the duration of that unit
depends on context. Calendar units are also affected by leap-seconds, leap-years, etc.
This is important because the buckets generated by Rollup will be in either calendar or fixed intervals, and will limit
how you can query them later (see <<rollup-search-limitations-intervals, Requests must be multiples of the config>>).
We recommend sticking with "fixed" time intervals, since they are easier to understand and are more flexible at query
time. It will introduce some drift in your data during leap-events, and you will have to think about months in a fixed
quantity (30 days) instead of the actual calendar length... but it is often easier than dealing with calendar units
at query time.
Multiples of units are always "fixed" (e.g. `"2h"` is always the fixed quantity `7200` seconds). Single units can be
fixed or calendar depending on the unit:
[options="header"]
|=======
|Unit |Calendar |Fixed
|millisecond |NA |`1ms`, `10ms`, etc
|second |NA |`1s`, `10s`, etc
|minute |`1m` |`2m`, `10m`, etc
|hour |`1h` |`2h`, `10h`, etc
|day |`1d` |`2d`, `10d`, etc
|week |`1w` |NA
|month |`1M` |NA
|quarter |`1q` |NA
|year |`1y` |NA
|=======
For some units where there are both fixed and calendar, you may need to express the quantity in terms of the next
smaller unit. For example, if you want a fixed day (not a calendar day), you should specify `24h` instead of `1d`.
Similarly, if you want fixed hours, specify `60m` instead of `1h`. This is because the single quantity entails
calendar time, and limits you to querying by calendar time in the future.
**********************************
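
As a rough illustration of the table above, the following Python sketch classifies an interval string as calendar or fixed and converts fixed intervals to milliseconds. It is not how Elasticsearch parses intervals; `classify_interval` and the lookup tables are hypothetical names, and the sketch rejects multiples of week, month, quarter, and year only because the table lists no fixed form for them.

```python
import re

# Single units that the table lists as calendar intervals.
CALENDAR_UNITS = {"1m", "1h", "1d", "1w", "1M", "1q", "1y"}

# Milliseconds per unit, for the units the table allows as fixed intervals.
FIXED_UNIT_MILLIS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}

# A whole number followed by a unit suffix ("ms" is tried before "m" and "s").
_INTERVAL = re.compile(r"(\d+)(ms|s|m|h|d|w|M|q|y)")


def classify_interval(interval):
    """Return ("calendar", interval) or ("fixed", milliseconds), following the table."""
    if interval in CALENDAR_UNITS:
        return ("calendar", interval)
    match = _INTERVAL.fullmatch(interval)
    if match is None or match.group(2) not in FIXED_UNIT_MILLIS:
        raise ValueError(f"not a valid fixed or calendar interval: {interval!r}")
    return ("fixed", int(match.group(1)) * FIXED_UNIT_MILLIS[match.group(2)])
```

With this sketch, `"1h"` is classified as calendar time while `"60m"` is a fixed 3,600,000 ms, which is why the job examples in this change switch from `1h` to `60m`.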

===== Terms

The `terms` group can be used on `keyword` or numeric fields, to allow bucketing via the `terms` aggregation at a later point. The `terms`
124 changes: 62 additions & 62 deletions x-pack/docs/en/rollup/rollup-getting-started.asciidoc
@@ -37,8 +37,7 @@ PUT _xpack/rollup/job/sensor
"groups" : {
"date_histogram": {
"field": "timestamp",
"interval": "1h",
"delay": "7d"
"interval": "60m"
},
"terms": {
"fields": ["node"]
@@ -66,7 +65,7 @@ The `cron` parameter controls when and how often the job activates. When a roll
from where it left off after the last activation. So if you configure the cron to run every 30 seconds, the job will process the last 30
seconds worth of data that was indexed into the `sensor-*` indices.

If instead the cron was configured to run once a day at midnight, the job would process the last 24hours worth of data. The choice is largely
If instead the cron was configured to run once a day at midnight, the job would process the last 24 hours worth of data. The choice is largely
preference, based on how "realtime" you want the rollups, and if you wish to process continuously or move it to off-peak hours.

Next, we define a set of `groups` and `metrics`. The metrics are fairly straightforward: we want to save the min/max/sum of the `temperature`
@@ -79,7 +78,7 @@ It also allows us to run terms aggregations on the `node` field.
.Date histogram interval vs cron schedule
**********************************
You'll note that the job's cron is configured to run every 30 seconds, but the date_histogram is configured to
rollup at hourly intervals. How do these relate?
rollup at 60 minute intervals. How do these relate?
The date_histogram controls the granularity of the saved data. Data will be rolled up into hourly intervals, and you will be unable
to query with finer granularity. The cron simply controls when the process looks for new data to rollup. Every 30 seconds it will see
@@ -223,70 +222,71 @@ Which returns a corresponding response:
[source,js]
----
{
"took" : 93,
"timed_out" : false,
"terminated_early" : false,
"_shards" : ... ,
"hits" : {
"total" : 0,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"timeline" : {
"meta" : { },
"buckets" : [
{
"key_as_string" : "2018-01-18T00:00:00.000Z",
"key" : 1516233600000,
"doc_count" : 6,
"nodes" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "a",
"doc_count" : 2,
"max_temperature" : {
"value" : 202.0
},
"avg_voltage" : {
"value" : 5.1499998569488525
}
},
{
"key" : "b",
"doc_count" : 2,
"max_temperature" : {
"value" : 201.0
},
"avg_voltage" : {
"value" : 5.700000047683716
}
},
{
"key" : "c",
"doc_count" : 2,
"max_temperature" : {
"value" : 202.0
},
"avg_voltage" : {
"value" : 4.099999904632568
}
}
]
}
}
]
}
}
}
----
// TESTRESPONSE[s/"took" : 93/"took" : $body.$_path/]
// TESTRESPONSE[s/"_shards" : \.\.\. /"_shards" : $body.$_path/]

In addition to being more complicated (date histogram and a terms aggregation, plus an additional average metric), you'll notice
the date_histogram uses a `7d` interval instead of `1h`.
the date_histogram uses a `7d` interval instead of `60m`.

[float]
=== Conclusion
22 changes: 19 additions & 3 deletions x-pack/docs/en/rollup/rollup-search-limitations.asciidoc
@@ -80,9 +80,25 @@ The response will tell you that the field and aggregation were not possible, bec
[float]
=== Interval Granularity

Rollups are stored at a certain granularity, as defined by the `date_histogram` group in the configuration. If data is rolled up at hourly
intervals, the <<rollup-search>> API can aggregate on any time interval hourly or greater. Intervals that are less than an hour will throw
an exception, since the data simply doesn't exist for finer granularities.
Rollups are stored at a certain granularity, as defined by the `date_histogram` group in the configuration. This means you
can only search/aggregate the rollup data with an interval that is greater-than or equal to the configured rollup interval.

For example, if data is rolled up at hourly intervals, the <<rollup-search>> API can aggregate on any time interval
hourly or greater. Intervals that are less than an hour will throw an exception, since the data simply doesn't
exist for finer granularities.

[[rollup-search-limitations-intervals]]
.Requests must be multiples of the config
**********************************
Perhaps not immediately apparent, but the interval specified in an aggregation request must be a whole
multiple of the configured interval. If the job was configured to rollup on `3d` intervals, you can only
query and aggregate on multiples of three (`3d`, `6d`, `9d`, etc).
A non-multiple wouldn't work, since the rolled up data wouldn't cleanly "overlap" with the buckets generated
by the aggregation, leading to incorrect results.
For that reason, an error is thrown if a whole multiple of the configured interval isn't found.
**********************************
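
To see why a non-multiple cannot work, the Python sketch below checks whether any bucket boundary of a requested interval falls strictly inside a rolled-up bucket. It is only an illustration: the helper names are hypothetical, and it assumes fixed intervals with buckets aligned to the epoch (timestamp 0).

```python
DAY_MS = 86_400_000


def bucket_start(timestamp_ms, interval_ms):
    """Round a timestamp down to the start of its fixed-interval bucket."""
    return timestamp_ms - (timestamp_ms % interval_ms)


def splits_rollup_bucket(request_ms, config_ms, span_ms):
    """True if a request bucket boundary in [0, span_ms) cuts through a rolled-up bucket."""
    for boundary in range(0, span_ms, request_ms):
        # A boundary is safe only when it coincides with a rolled-up bucket start.
        if bucket_start(boundary, config_ms) != boundary:
            return True
    return False
```

Under these assumptions, a `6d` request over a `3d` rollup never cuts a rolled-up bucket, whereas a `4d` request has a boundary at day 4, in the middle of the rolled-up bucket covering days 3 to 6, so that bucket's documents could not be assigned correctly.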

Because the RollupSearch endpoint can "upsample" intervals, there is no need to configure jobs with multiple intervals (hourly, daily, etc).
It's recommended to just configure a single job with the smallest granularity that is needed, and allow the search endpoint to upsample