Keep track of last chunk save for each metric #452
Clarifications (correct me if I'm wrong):
This should also work if people are sending sparse data or, due to an issue on their end, have not sent data for a while, right? How would you say the "data seek back" (DSB) offset should correspond to the "metricpersist seek back" (MSB) offset? MSB should always be > DSB I think, but I need to think about it more.
this new approach would still require that we update the code to add a reference to the aggmetric inside of the CWR, right? that is what would allow us to lock the aggmetric and update the property?
so the problem described there is that e.g. with 30-minutely datapoints, when you go into a new chunk interval and you want those chunks saved, you may have to wait up to 30 minutes for new data to come in, which triggers the save of the older chunk. As that ticket states, you don't have this problem when you use the kafka offset to seek back in time, because you can wait as long as needed for all chunks to save, and then use kafka offset seeking to replay all data that came in while you waited and didn't save yet. (When going this route, another problem may appear: replaying all that data might take too long, rendering your instance unavailable for too long.) While the suggestion in the OP does make seeking back in time more viable, it IMHO does not fix 357. But then again we're not sure yet if we need to fix 357, so that's OK.
Also, I wonder if it makes sense to rely on kafka's log compaction for the metricpersist topic: it retains the most recent message for each key and removes the earlier ones.
Not really. Nodes really shouldn't be added back into rotation to receive queries until they have crossed the 6h (aggregation chunkspan) window. If a node is queried before this, the firstChunk in memory will be a partial chunk and so will be excluded from use (as we know it is missing data). https://github.com/raintank/metrictank/blob/master/mdata/aggmetric.go#L247-L250 . By excluding the chunk we force a lookup to Cassandra; however, as the chunk is incomplete there will be no data in Cassandra either. Additionally, the purpose of running 3 nodes is to provide N+2 redundancy to significantly reduce the possibility of data loss. If you have two instances running which cannot be promoted to primary, the cluster is vulnerable to data loss from a single MT crash or server reboot.
Not entirely.
yes
No. That would only be possible if we knew the data was being sent in realtime, which isn't true, as there are lots of scenarios resulting in data being sent with a lag.
We should always seek max chunkSpan + some buffer.
#357 asks for forcible chunk saving to reduce data loss when restarting single-node instances. However, that doesn't work. If you saved partial chunks on exit, then started up again with offset=latest, the saved chunks would be re-created without the previously seen data and then re-saved.
Keeping support for NSQ is not a requirement. If keeping NSQ prevents us from delivering the features needed for our own use case, then NSQ needs to go. Making Kafka a requirement for HA clusters seems completely reasonable to me.
This is an optimization we can look at later, but is not needed right now.
This can be split into at least 3 PRs.
Rather than tracking the state of each individual chunk, just keep a record of the most recent saveStart (add to write queue) / saveFinish (write to Cassandra) as properties of the AggMetric. When we save chunks we always save all unsaved chunks, so we don't lose anything by tracking the save state for all chunks in one variable. issue #452
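To make that concrete, here is a minimal Go sketch of what tracking the save state on the AggMetric could look like. The field and method names (lastSaveStart, lastSaveFinish, SaveStart, SaveFinish) are illustrative assumptions, not the actual metrictank code.

```go
package mdata

import "sync"

// Sketch: instead of a Saved flag on every chunk, the AggMetric keeps two
// timestamps that cover all of its chunks.
type AggMetric struct {
	sync.Mutex
	key            string
	lastSaveStart  uint32 // t0 of the newest chunk handed to the write queue
	lastSaveFinish uint32 // t0 of the newest chunk confirmed written to Cassandra
}

// SaveStart records that all chunks up to and including t0 have been added
// to the write queue.
func (a *AggMetric) SaveStart(t0 uint32) {
	a.Lock()
	defer a.Unlock()
	if t0 > a.lastSaveStart {
		a.lastSaveStart = t0
	}
}

// SaveFinish records that all chunks up to and including t0 have been
// written to Cassandra, e.g. driven by a metric_persist message.
func (a *AggMetric) SaveFinish(t0 uint32) {
	a.Lock()
	defer a.Unlock()
	if t0 > a.lastSaveFinish {
		a.lastSaveFinish = t0
	}
}
```

Because a save always covers all unsaved chunks up to some t0, two monotonically increasing timestamps are enough to answer "has this chunk been queued/saved yet" for every chunk of the metric.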
I haven't put as much thinking into this problem as you have; do you see a way to solve this? E.g. to avoid data loss in case people send data sparsely or with gaps. Maybe based on the lastupdate field in the index?
The only way to completely solve this problem is with a WAL (write-ahead log).
I thought a bit more about this and have a question: does the delay before a message goes into the metricpersist topic cause any trouble? For the node starting up and forcing itself to consume all persist messages before consuming metrics, the consequence is that when it starts consuming metrics, it may consume points that trigger persist calls, and it'll try to save chunks even though those chunks may be in the write queue on (or have just been saved by) another node (even if that node has been demoted, it'll still drain its write queues. Or do we require that an ex-primary for a shardset must be fully shut down before starting a new one? -- relevant #198). Is the solution to simply seek back in time a lot to make sure this effect can't manifest?
No.
This doesn't make sense. This is the consequence of decoupling the saves from metric ingestion and so has always been the case. Nothing proposed here changes that.
If you wanted to promote secondary nodes to be primaries, the only thing that has changed is that you can now promote secondary nodes that have not been online for long (as they will only save chunks that have not already been saved, because they know what chunks were saved before the node started). If you demote a node that has lots of chunks in its write queue and immediately promote a secondary node, things will work exactly as they always have and you run the risk of both nodes saving the chunks. But I think you are missing the point that having to manually demote/promote nodes is filled with problems, and the main purpose of these changes is to:
I had a call with AJ to clear things up.
Thinking about it more, I don't see a reason (in theory) to seek back further for MSB, since MSB "lags behind" DSB. So consuming both from the same point should always be enough.
MSB really only needs to be set to a value large enough to capture the last save for all chunks. So if the max chunkspan is 6 hours and writes can take up to 1 hour to complete, then we only need to seek back 7 hours. This is because we only need the latest metricPersist message for each metric.
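As a rough illustration of that rule, here is a hedged Go sketch that translates "now minus (max chunkspan + write buffer)" into a Kafka offset using sarama's Client.GetOffset. The 6h/1h durations and the function name are assumptions for the example, not metrictank configuration.

```go
package persist

import (
	"time"

	"github.com/Shopify/sarama"
)

// persistReplayOffset resolves the offset to start replaying metric_persist
// messages from: now minus the largest chunkspan, minus a buffer for writes
// that may still be in flight.
func persistReplayOffset(client sarama.Client, topic string, partition int32) (int64, error) {
	maxChunkSpan := 6 * time.Hour // largest configured chunkspan
	writeBuffer := 1 * time.Hour  // allowance for slow Cassandra writes
	seekBack := maxChunkSpan + writeBuffer

	// GetOffset resolves a millisecond timestamp to an offset to consume from.
	ts := time.Now().Add(-seekBack).UnixNano() / int64(time.Millisecond)
	return client.GetOffset(topic, partition, ts)
}
```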
A new MT instance that starts up has no knowledge of which chunks have been saved. Though we have metric_persist messages that are read from Kafka, they are consumed at a much faster rate than the data messages and are ignored if they are for a chunk that has not yet been started (i.e. the metric_persist is processed before the first metric for the chunk is seen, which is always the case during startup).
So if a new instance is started as primary, set to consume from Kafka from some point in the past, it will re-save chunks to Cassandra that have already been saved. The first chunk processed will almost always be a partial chunk, and this partial chunk will be saved to Cassandra, overwriting the previously saved chunk that had all the data.
We have worked around this by never starting a node as primary (except in single-instance deployments) and always promoting an instance that has been online for a long time to the primary role when the current primary is restarted. However, this is not a reliable approach.
To alleviate this, I would like to keep track of the last chunk save time for each metric. This index can be rebuilt at startup by consuming from the metric_persist Kafka topic. This will allow us to run dedicated write instances of MT that can be restarted without causing data loss.
The idea I am proposing is to:
Make the metric_persist messages partitioned using the same partition key as the metrics. So a metric_persist message for a metric goes to the same partition number as the data for that metric. This will allow each MT instance to only consume metric_persist messages for the metrics that it is processing.
Track LastSavedChunk as an attribute of mdata.AggMetric rather than using the current Chunk.Saved attribute.
At startup, if an instance is a primary, replay all metric_persist messages from the last N minutes, where N > the largest chunkspan used. These messages should be processed before we start consuming from the data topics.
When processing metric_persist messages, initialize missing AggMetric structs when they don't exist, rather than discarding the metric_persist message like we currently do (see the sketch below).
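A hedged sketch of the last two points, continuing the AggMetric sketch earlier in the thread: replay the metric_persist messages before consuming any data, creating AggMetric entries on the fly. PersistMessage, AggMetrics, and GetOrCreate are illustrative names, not the actual metrictank types.

```go
package mdata

import "sync"

// AggMetrics is a minimal registry of AggMetric structs.
type AggMetrics struct {
	sync.Mutex
	metrics map[string]*AggMetric
}

// GetOrCreate returns the AggMetric for key, creating it if it is missing.
func (ms *AggMetrics) GetOrCreate(key string) *AggMetric {
	ms.Lock()
	defer ms.Unlock()
	if a, ok := ms.metrics[key]; ok {
		return a
	}
	a := &AggMetric{key: key}
	ms.metrics[key] = a
	return a
}

// PersistMessage is the decoded payload of a metric_persist message,
// produced to the same partition as the metric's data.
type PersistMessage struct {
	Key string // metric key (same partition key as the data topic)
	T0  uint32 // start ts of the chunk that was saved
}

// primeSaveState replays metric_persist messages at startup, before any data
// is consumed, so that a newly started primary knows what has already been
// saved and will not re-save (and overwrite) those chunks.
func primeSaveState(msgs []PersistMessage, metrics *AggMetrics) {
	for _, m := range msgs {
		metrics.GetOrCreate(m.Key).SaveFinish(m.T0)
	}
}
```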
As well as providing a more robust and reliable startup sequence for nodes, this will simplify the codebase in mdata/aggmetric.go and fix:
#155
#357