add a metrics to indicate store replication job run #286
@@ -1236,6 +1238,7 @@ var metricDefs = map[ServiceIdx]map[int]metricDefinition{
StorageReplicationJobMaxConsecutiveFailures: {Gauge, "storage.replication-job.max-consecutive-failures"},
StorageReplicationJobCurrentFailures: {Gauge, "storage.replication-job.current-failures"},
StorageReplicationJobCurrentSuccess: {Gauge, "storage.replication-job.current-success"},
StorageReplicationJobRun: {Gauge, "storage.replication-job.run"},
I think you need a "counter" instead of a "gauge"?
@@ -267,6 +267,7 @@ func (runner *replicationJobRunner) run() {
runner.m3Client.UpdateGauge(metrics.ReplicateExtentScope, metrics.StorageReplicationJobCurrentFailures, int64(len(currentFailedJobs)))
runner.m3Client.UpdateGauge(metrics.ReplicateExtentScope, metrics.StorageReplicationJobMaxConsecutiveFailures, int64(maxConsecutiveFailures))
runner.m3Client.UpdateGauge(metrics.ReplicateExtentScope, metrics.StorageReplicationJobCurrentSuccess, int64(jobsStarted))
runner.m3Client.UpdateGauge(metrics.ReplicateExtentScope, metrics.StorageReplicationJobRun, 1)
This 'gauge' would be stuck at '1'. You need to increment a counter instead.
This routine runs only once every ten minutes. The '1' here indicates that the goroutine ran rather than getting stuck somewhere, so I can use something like 'movingSum 1h' to set up an alert.
I am still a little confused, though. With this, don't you need to reset the gauge to zero when the run has finished, and perhaps check for a "0" to indicate that the run is completing and restarting, etc.?
I still feel a 'counter' would probably help more: if the rate of change of the count is zero, that indicates a problem.
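For illustration, the counter-based check suggested here might look roughly like the sketch below. The `stalled` helper and the sample readings are hypothetical and are not part of this PR or of the metrics client; the sketch also ignores counter resets after a process restart.

```go
package main

import "fmt"

// stalled reports whether a cumulative counter failed to advance between the
// oldest and the newest reading (readings are ordered oldest first).
// Hypothetical helper for illustration only; counter resets are not handled.
func stalled(readings []int64) bool {
	if len(readings) < 2 {
		return false // not enough data to judge
	}
	return readings[len(readings)-1] == readings[0]
}

func main() {
	// The job keeps running: the count grows with every 10-minute run.
	fmt.Println(stalled([]int64{4, 5, 6, 7})) // false
	// The job is stuck: the count has not changed over the window.
	fmt.Println(stalled([]int64{7, 7, 7, 7})) // true
}
```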
No, I don't need to reset it. Every run emits a gauge value of 1, and I then use 'movingSum 1h' to get how many times the routine has run in the past hour.
Okay, as discussed offline, that should work, since "no report"/null corresponds to "0".
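The reasoning the thread settles on can be sketched as follows: each run reports a gauge value of 1, a missing report counts as 0, and a one-hour moving sum therefore equals the number of runs in that hour. The `movingSum` function below is a hypothetical stand-in for the dashboard's 'movingSum 1h' and is not its real implementation; the six-slot window assumes one slot per 10-minute run.

```go
package main

import "fmt"

// movingSum adds up the samples in the trailing window, treating a missing
// report (nil) as 0. Hypothetical stand-in for the dashboard function.
func movingSum(samples []*int64, window int) int64 {
	start := len(samples) - window
	if start < 0 {
		start = 0
	}
	var sum int64
	for _, s := range samples[start:] {
		if s != nil {
			sum += *s
		}
	}
	return sum
}

// one returns a pointer to a gauge sample of 1, as emitted once per run.
func one() *int64 {
	v := int64(1)
	return &v
}

func main() {
	// Six 10-minute slots cover one hour.
	healthy := []*int64{one(), one(), one(), one(), one(), one()}
	// The goroutine hangs after the third run, so later slots have no report.
	stuck := []*int64{one(), one(), one(), nil, nil, nil}

	fmt.Println(movingSum(healthy, 6)) // 6
	fmt.Println(movingSum(stuck, 6))   // 3
}
```

Under these assumptions, an alert could fire when the one-hour sum drops below the expected number of runs.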
Planning to add an alert, based on this metric, to make sure the job keeps running every 10 minutes.