chunk-max-stale / metrics-max-stale per retention #614
@woodsaj said we can simply base it off the chunkspan, but I don't think that'll work. e.g. if the chunkspan is 1h but we retain data in kafka for 12h, then there's no need to set max-stale to ~1h; we could wait many hours for new data to come in and complete the chunk, which is still safe since if the primary crashes it'll be able to recreate the chunk. […]
we don't need per-retention settings. metrics-max-stale is only needed if users have a dynamic workload (most of our users are in this category). For chunk-max-stale we are trying to protect against data loss when MT restarts. For there to be no data loss, any chunk that is un-saved must still exist in Kafka. So really, we need a max-chunk-age setting, as we need to consider the time the first point was seen, not the time of the last point. If the […]
too bad sarama/kafka doesn't have a way to query the current retention settings for a topic at runtime. note that we'll have to replace the current per-aggmetric lastWrite with a per-chunk firstWrite.
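A minimal sketch of what that per-chunk firstWrite check could look like; all names here are hypothetical, not actual Metrictank code:

```go
package main

import (
	"fmt"
	"time"
)

type Chunk struct {
	FirstWrite time.Time // when the first point of this chunk was received
	Saved      bool      // whether the chunk has been persisted to cassandra
}

// mustPersist reports whether GC needs to persist this chunk now: once the
// chunk's *first* point is older than maxChunkAge, waiting any longer risks
// the data falling out of kafka retention, at which point an un-saved chunk
// could no longer be recreated after a primary crash.
func mustPersist(c *Chunk, now time.Time, maxChunkAge time.Duration) bool {
	return !c.Saved && now.Sub(c.FirstWrite) >= maxChunkAge
}

func main() {
	c := &Chunk{FirstWrite: time.Now().Add(-7 * time.Hour)}
	fmt.Println(mustPersist(c, time.Now(), 6*time.Hour)) // true: must save now
}
```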
thinking a bit more about this. I think there are 3 additional important time windows:

- the max permissible data lag (how late points may still come in)
- the interval at which GC runs
- the GC safety window
so the formula becomes: minimum kafka retention = coarsest chunkspan + max data lag + GC interval + GC safety window. we should also assure that normally, when we close & save a chunk, no new data comes in for that chunk anymore; that could happen if the first data arrived in real-time but data towards the end is being delayed.
so for example if the coarsest rollups have 6h spans, GC runs hourly with a safety window of 2h, and the max permissible data lag is 3h, our minimum kafka retention becomes 6h + 3h + 1h + 2h = 12h. that's much more than what we currently use, but also much safer I think.

i have to think a bit more about what the implications would be if an issue starts (e.g. unable to save to cassandra, trouble consuming from kafka, etc). I'd like to come up with a formula to determine how quickly we must increase kafka retention, and by how much, but I wanted to post this so we can start discussing. Also the ROB will change the picture a bit; we'll probably need to put in an extra term for that.

another interesting thought: if you have a retention of say 10h, but chunks of 30min, and the data stops, then the last chunk will be saved pretty late (e.g. after 6h or so), so you'd have to wait quite a while before that data shows up.
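Plugging the example numbers into the formula above, as a small sketch (variable names are mine, not actual config options):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	coarsestChunkSpan := 6 * time.Hour // largest rollup chunkspan
	maxDataLag := 3 * time.Hour        // max permissible data lag
	gcInterval := 1 * time.Hour        // GC runs hourly
	gcSafetyWindow := 2 * time.Hour    // GC safety window

	minKafkaRetention := coarsestChunkSpan + maxDataLag + gcInterval + gcSafetyWindow
	fmt.Println(minKafkaRetention) // 12h0m0s
}
```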
@woodsaj does the above make sense to you?
yes, but we should probably also add the segmentSize to the retention, so that when #850 is merged we can safely start consuming from "retention - segmentSize".
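Presumably the bound from above then gains an extra term (an extrapolation of this comment, not spelled out in the thread): minimum kafka retention = coarsest chunkspan + max data lag + GC interval + GC safety window + segmentSize, so that a consumer starting at retention - segmentSize still sees all the data it may need.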
(originally suggested here #557 (comment))
currently, for every new deployment, we have to figure out the largest interval at which a customer will send data and tweak the max-stale settings accordingly; otherwise we run the risk of closing chunks prematurely and dropping valid points. At the same time, low resolutions benefit from lower max-stale settings.
We could make our lives easier and automate this step of provisioning by simply tying these settings to the retention policy.
one approach could be making retention policies like:
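For instance, a hypothetical sketch (the exact syntax was left open here; names and values are illustrative), with per-retention max-stale values appended to each retention definition:

```ini
[default]
pattern = .*
# hypothetical: each retention carries its own max-stale settings
retentions = 10s:6h:10min:2:chunk-max-stale=2h:metrics-max-stale=4h,1h:2y:6h:2:chunk-max-stale=12h:metrics-max-stale=24h
```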
OTOH this will make our schema definitions even more noisy. we could also just introduce extra attributes, in addition to the pattern and retentions fields, describing the value as a number of chunkspans.
e.g.
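Again a hypothetical sketch; the attribute names and the chunkspan-multiple encoding are illustrative:

```ini
[default]
pattern = .*
retentions = 10s:6h:10min:2,1h:2y:6h:2
# hypothetical extra attributes, expressed as multiples of each retention's chunkspan
chunk-max-stale = 2
metrics-max-stale = 4
```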
but I don't think large and small chunkspans should be tied to the same factor for two reasons:
and while the max-stale here would still correspond to the chunkspan, it would not be expressed as a factor of it.