
chunk-max-stale / metrics-max-stale per retention #614

Closed
Dieterbe opened this issue Apr 24, 2017 · 7 comments
@Dieterbe
Contributor

Dieterbe commented Apr 24, 2017

(originally suggested here #557 (comment))

Currently, for every new deployment, we have to figure out the largest interval at which a customer will send data and tweak the max-stale settings accordingly; otherwise we run the risk of closing chunks prematurely and dropping valid points. At the same time, low resolutions benefit from lower max-stale settings.

We could make our lives easier and automate this step of the provisioning by simply tying these settings to the retention policy.
One approach could be making retention policies look like:

series-interval:retention[:chunkspan:numchunks:ready:chunk-max-stale:metric-max-stale]

OTOH, this would make our schema definitions even noisier. We could also just introduce extra attributes, in addition to the pattern and retentions fields, describing the value as a number of chunkspans.
e.g.

[apache_busyWorkers]
pattern = ^servers\.www.*\.workers\.busyWorkers$
retentions = 1s:1d:10min:1,1m:21d,15m:5y:2h:1:false
chunk-max-stale = 5 # persist chunk after 5x the chunkspan has passed
metric-max-stale = 6 # purge from memory after 6x the chunkspan has passed

But I don't think large and small chunkspans should be tied to the same factor, for two reasons:

  1. If a sender experiences an interruption in metric sending, after which it processes the backlog, the time it takes humans to resolve such an interruption is usually independent of the interval or chunkspan of the data. GC'd chunks also cause metricpersist messages, meaning incomplete chunks won't be overwritten if the full data comes in later. So the factor approach disadvantages high-res, small-chunkspan data; such data probably deserves proportionally higher max-stale settings.
  2. It also depends on Kafka retention. If Kafka retention is 12h and the chunkspan is 1h, then we should be able to wait roughly 11h before sealing a chunk and saving it, to allow as much time as possible to complete the chunk while still not waiting so long that data becomes unrecoverable in case of a primary crash.
    And while the max-stale here would correspond to the chunkspan, it's not as a factor.
@Dieterbe
Contributor Author

@woodsaj said we can simply base it off the chunkspan, but I don't think that'll work. E.g. if the chunkspan is 1h but we retain data in Kafka for 12h, then there's no need to set max-stale to ~1h; we could wait many hours for new data to come in and complete the chunk, which is still safe since if the primary crashes it'll be able to recreate the chunk. A chunk-max-stale value of something like Kafka retention minus chunkspan seems to make more sense to me.
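
Something like this minimal Go sketch (chunkMaxStale, kafkaRetention and chunkSpan are just illustrative names, not existing settings):

package main

import (
	"fmt"
	"time"
)

// chunkMaxStale derives the staleness threshold from the Kafka retention rather
// than from a factor of the chunkspan: an open chunk can safely stay in memory
// for as long as its data can still be replayed from Kafka.
func chunkMaxStale(kafkaRetention, chunkSpan time.Duration) time.Duration {
	return kafkaRetention - chunkSpan
}

func main() {
	// 12h Kafka retention with 1h chunks -> we can wait ~11h before sealing.
	fmt.Println(chunkMaxStale(12*time.Hour, 1*time.Hour)) // 11h0m0s
}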

@woodsaj
Member

woodsaj commented Apr 25, 2017

We don't need per-retention settings.

metrics-max-stale is only needed if users have a dynamic workload (most of our users are in this category).
Index pruning is necessary to remove series that are no longer being sent, so they don't show up in templateVars or the Grafana query editor. However, once grafana/grafana#8055 is implemented, we no longer need to prune from the index.

For chunk-max-stale we are trying to protect against data loss when MT restarts. For there to be no data loss, any unsaved chunk must still exist in Kafka. So really, we need a max-chunk-age setting, since we need to consider the time the first point was seen, not the time of the last point. If the chunk creation time is < (now - kafkaRetention), then we must save the chunk or data will be lost if MT restarts.
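
A rough Go sketch of that check, assuming a hypothetical chunk type that records when its first point was seen (not MT's actual GC code):

package main

import (
	"fmt"
	"time"
)

type chunk struct {
	firstWrite time.Time // when the first point landed in this chunk
	saved      bool
}

// mustSave reports whether the chunk has to be persisted now: once its first
// point is older than the Kafka retention, it can no longer be fully replayed
// from Kafka after a restart, so keeping it unsaved risks data loss.
func mustSave(c chunk, now time.Time, kafkaRetention time.Duration) bool {
	return !c.saved && c.firstWrite.Before(now.Add(-kafkaRetention))
}

func main() {
	c := chunk{firstWrite: time.Now().Add(-13 * time.Hour)}
	fmt.Println(mustSave(c, time.Now(), 12*time.Hour)) // true: first point has fallen out of Kafka
}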

@Dieterbe
Contributor Author

Too bad sarama/Kafka doesn't have a way to query the current retention settings for a topic at runtime,
so it's on us to make sure the MT setting corresponds to the Kafka setting in use. Not a big deal though, since we stick to the same settings consistently, with rare exceptions.

Note that we'll have to replace the current per-aggmetric lastWrite with a per-chunk firstWrite,
but that will make things safer: currently old chunks can get very stale as long as the latest one is being written to.
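
Roughly, the change would look something like this sketch (type and field names are illustrative, not MT's actual code):

package main

import "time"

// current: a single lastWrite per aggmetric, so an old, still-unsaved chunk is
// never considered stale as long as any newer chunk keeps receiving points.
type aggMetricToday struct {
	lastWrite time.Time
	chunks    []struct{ /* points */ }
}

// proposed: track firstWrite per chunk, so GC can decide for each chunk
// individually whether its data is still covered by Kafka retention.
type aggMetricProposed struct {
	chunks []struct {
		firstWrite time.Time
		// points
	}
}

func main() {}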

@Dieterbe
Contributor Author

Dieterbe commented Jan 26, 2018

Thinking a bit more about this, I think there are 3 additional important time windows:

  1. It's possible that a chunk becomes not-completely-Kafka-backed in between 2 GC runs,
    so at each GC run we should check whether a chunk will still be safe if we keep it until the next GC run.

  2. We have to take into account how much time a GC run needs between starting and actually saving the chunks to the store (this could potentially be measured automatically by MT, or just configured explicitly).

  3. The amount of time we would need to restart MT so that it can safely start replaying data again.

So the check becomes: chunk creation time < (now - kafkaRetention + GC-interval + safety-window), where safety-window addresses points 2 and 3. Some write queues take more than 1h to drain, so a safe default would be something like 2h, I think.
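
A sketch of that extended condition in Go; gcInterval and safetyWindow are hypothetical names here, not actual MT options:

package main

import "time"

// mustPersistThisGCRun reports whether a chunk must be saved during the current
// GC run: if we skipped it, then by the time the next GC run starts (plus the
// time needed to drain write queues and restart MT), its first point could
// already have fallen out of Kafka retention.
func mustPersistThisGCRun(firstWrite, now time.Time, kafkaRetention, gcInterval, safetyWindow time.Duration) bool {
	deadline := now.Add(gcInterval + safetyWindow - kafkaRetention)
	return firstWrite.Before(deadline)
}

func main() {}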

We should also ensure that normally, when we close & save a chunk, no new data comes in for that chunk, which could happen if the first data came in in real time but data towards the end is delayed.
So this gives us a formula for what kafkaRetention should be:

max-age = kafkaRetention - GC-interval - safety-window    # to assure no data loss in case of an MT restart
chunkspan + max-delay < max-age                            # to get all the data into the chunk before persisting
=>
chunkspan + max-delay < kafkaRetention - GC-interval - safety-window
kafkaRetention > chunkspan + max-delay + GC-interval + safety-window

So for example, if the coarsest rollups have 6h chunkspans, GC runs hourly, the safety-window is 2h, and the max permissible data lag is 3h, our minimum Kafka retention becomes 6h + 3h + 1h + 2h = 12h.

That's much more than what we currently use, but also much safer, I think.
This way we can define our max tolerances explicitly, and tolerate each of them being at their worst simultaneously.
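
A small Go sketch of the sizing formula above (minKafkaRetention, maxDelay, safetyWindow are illustrative names, not actual settings):

package main

import (
	"fmt"
	"time"
)

// minKafkaRetention returns the minimum Kafka retention needed so that a chunk
// can fill up completely (chunkSpan + maxDelay) and still remain replayable
// across one GC interval plus a safety window for draining write queues and
// restarting MT.
func minKafkaRetention(chunkSpan, maxDelay, gcInterval, safetyWindow time.Duration) time.Duration {
	return chunkSpan + maxDelay + gcInterval + safetyWindow
}

func main() {
	// coarsest rollup chunkspan 6h, max tolerated data lag 3h, hourly GC, 2h safety window
	fmt.Println(minKafkaRetention(6*time.Hour, 3*time.Hour, 1*time.Hour, 2*time.Hour)) // 12h0m0s
}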

I have to think a bit more about what the implications would be if an issue starts (e.g. being unable to save to Cassandra, trouble consuming from Kafka, etc). I'd like to come up with a formula to determine how quickly we must increase Kafka retention, and by how much, but I wanted to post this so we can start discussing. Also, the ROB will change the picture a bit; we'll probably need an extra term for that.

Another interesting thought: if you have a retention of say 10h but chunks of 30min, and the data stops, then the last chunk will be saved pretty late (e.g. after 6h or so), so you'd have to wait quite a while before that data shows up.

@Dieterbe
Contributor Author

@woodsaj does the above make sense to you?

@woodsaj
Member

woodsaj commented Feb 19, 2018

Yes, but we should probably also add the segmentSize to the retention, so that when #850 is merged we can safely start consuming from "retention - segmentSize".

@stale

stale bot commented Apr 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 4, 2020
@stale stale bot closed this as completed Apr 11, 2020