This repository has been archived by the owner on Feb 18, 2021. It is now read-only.

add a metrics to indicate store replication job run #286

Merged: datoug merged 1 commit into master from store_counter on Sep 1, 2017

Conversation

Contributor

@datoug datoug commented Aug 31, 2017

I plan to add an alert, based on this metric, to make sure the job keeps running every 10 minutes.

@datoug datoug requested a review from kirg August 31, 2017 17:24

coveralls commented Aug 31, 2017

Coverage Status

Coverage decreased (-0.3%) to 67.527% when pulling d3ca71b on store_counter into dd03fe5 on master.

@@ -1236,6 +1238,7 @@ var metricDefs = map[ServiceIdx]map[int]metricDefinition{
 	StorageReplicationJobMaxConsecutiveFailures: {Gauge, "storage.replication-job.max-consecutive-failures"},
 	StorageReplicationJobCurrentFailures: {Gauge, "storage.replication-job.current-failures"},
 	StorageReplicationJobCurrentSuccess: {Gauge, "storage.replication-job.current-success"},
+	StorageReplicationJobRun: {Gauge, "storage.replication-job.run"},
Contributor

I think you need a "counter" instead of a "gauge"?

@@ -267,6 +267,7 @@ func (runner *replicationJobRunner) run() {
 	runner.m3Client.UpdateGauge(metrics.ReplicateExtentScope, metrics.StorageReplicationJobCurrentFailures, int64(len(currentFailedJobs)))
 	runner.m3Client.UpdateGauge(metrics.ReplicateExtentScope, metrics.StorageReplicationJobMaxConsecutiveFailures, int64(maxConsecutiveFailures))
 	runner.m3Client.UpdateGauge(metrics.ReplicateExtentScope, metrics.StorageReplicationJobCurrentSuccess, int64(jobsStarted))
+	runner.m3Client.UpdateGauge(metrics.ReplicateExtentScope, metrics.StorageReplicationJobRun, 1)
Contributor

This 'gauge' would be stuck at '1'. You need to increment a counter instead.

Contributor Author

This routine runs only once every ten minutes. The '1' here indicates that the goroutine ran rather than getting stuck somewhere, so I can use something like 'movingSum 1h' to set up an alert.

Contributor

I am still a little confused, though. With this, don't you need to reset the gauge to zero when the run has finished? And perhaps check for a "0" to indicate that the run is completing and restarting, etc.?

I still feel a 'counter' would probably help more: if the rate of change of the count is zero, that indicates a problem.
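
For illustration, the counter-based check suggested here could look roughly like the sketch below. It is not code from this PR: the `stalled` helper and the sample readings are hypothetical, and it assumes the counter is sampled once per expected run interval.

```go
package main

import "fmt"

// stalled reports whether a run counter failed to advance between two
// consecutive samples. Hypothetical helper: it assumes the samples are
// taken one expected run interval apart. A counter reset (curr < prev),
// for example after a process restart, is not treated as a stall here.
func stalled(prev, curr int64) bool {
	return curr == prev
}

func main() {
	// Hypothetical readings of a "runs completed" counter, sampled every
	// 10 minutes.
	samples := []int64{41, 42, 43, 43, 44}
	for i := 1; i < len(samples); i++ {
		if stalled(samples[i-1], samples[i]) {
			fmt.Printf("sample %d: no new run since the previous sample\n", i)
		}
	}
}
```

A zero rate of change between samples is the signal; the trade-off against the gauge approach is that the alert has to compare two readings rather than sum one series.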

Contributor Author

No, I don't need to reset it. After every run, the gauge reports a value of 1, and then I use 'movingSum 1h' to get how many times the routine has run in the past hour.

Contributor

Okay, as discussed offline, that should work, since "no report"/null corresponds to "0".
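
A minimal sketch of why this works, assuming one datapoint slot per 10-minute interval and a moving sum that treats a missing datapoint as 0. The `movingSum` function below is a hypothetical reimplementation for illustration, not the monitoring system's own function, and the alert threshold of zero runs per hour is likewise an assumption.

```go
package main

import "fmt"

// movingSum adds up the last `window` datapoints ending at index end
// (inclusive). A nil datapoint stands for an interval in which nothing
// was reported and contributes 0 to the sum.
func movingSum(points []*float64, end, window int) float64 {
	start := end - window + 1
	if start < 0 {
		start = 0
	}
	sum := 0.0
	for i := start; i <= end && i < len(points); i++ {
		if points[i] != nil {
			sum += *points[i]
		}
	}
	return sum
}

func main() {
	one := 1.0
	// Hypothetical series: the job reports 1 for three runs, then stops
	// reporting (for example because the goroutine is stuck).
	points := []*float64{&one, &one, &one, nil, nil, nil, nil, nil, nil}
	const window = 6 // six 10-minute slots = 1 hour
	for end := window - 1; end < len(points); end++ {
		s := movingSum(points, end, window)
		// Assumed alert rule: fire when no run was seen in the past hour.
		fmt.Printf("slot %d: runs in past hour = %.0f, alert = %v\n", end, s, s == 0)
	}
}
```

Because the gauge is only ever reported as 1, the sum over the window counts the runs in that window, and it decays to 0 once reports stop.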

@datoug datoug merged commit 9790e57 into master Sep 1, 2017
@datoug datoug deleted the store_counter branch September 1, 2017 20:31