This repository has been archived by the owner on Feb 18, 2021. It is now read-only.

add a metric to indicate store replication job runs #286

Merged 1 commit on Sep 1, 2017
3 changes: 3 additions & 0 deletions common/metrics/defs.go
@@ -955,6 +955,8 @@ const (
StorageReplicationJobCurrentFailures
// StorageReplicationJobCurrentSuccess is the number of success job in current run
StorageReplicationJobCurrentSuccess
// StorageReplicationJobRun indicates the replication job runs
StorageReplicationJobRun

// -- Controller metrics -- //

@@ -1236,6 +1238,7 @@ var metricDefs = map[ServiceIdx]map[int]metricDefinition{
StorageReplicationJobMaxConsecutiveFailures: {Gauge, "storage.replication-job.max-consecutive-failures"},
StorageReplicationJobCurrentFailures: {Gauge, "storage.replication-job.current-failures"},
StorageReplicationJobCurrentSuccess: {Gauge, "storage.replication-job.current-success"},
StorageReplicationJobRun: {Gauge, "storage.replication-job.run"},
Contributor:
I think you need a "counter" instead of a "gauge"?

},

// definitions for Controller metrics
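To make the reviewer's point above concrete: the choice is between a gauge (a value that is set each time it is reported) and a counter (a value that is incremented, so dashboards watch its rate of change). Below is a minimal, hypothetical Go sketch of the two alternatives, using illustrative stand-in types rather than the repo's actual metrics package:

```go
package main

import "fmt"

// Illustrative stand-ins; the real common/metrics package defines its own types.
type metricType int

const (
	Counter metricType = iota
	Gauge
)

type metricDefinition struct {
	metricType metricType
	metricName string
}

func main() {
	// As merged: a gauge, set to 1 on every run of the replication job.
	asMerged := metricDefinition{Gauge, "storage.replication-job.run"}

	// As the reviewer suggests: a counter, incremented on every run, so the
	// rate of change (rather than the reported value) is what gets monitored.
	asSuggested := metricDefinition{Counter, "storage.replication-job.run"}

	fmt.Println(asMerged, asSuggested)
}
```

The PR keeps the gauge; the thread on the runner change below explains why that still works for alerting.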
1 change: 1 addition & 0 deletions services/storehost/replicationJobRunner.go
@@ -267,6 +267,7 @@ func (runner *replicationJobRunner) run() {
runner.m3Client.UpdateGauge(metrics.ReplicateExtentScope, metrics.StorageReplicationJobCurrentFailures, int64(len(currentFailedJobs)))
runner.m3Client.UpdateGauge(metrics.ReplicateExtentScope, metrics.StorageReplicationJobMaxConsecutiveFailures, int64(maxConsecutiveFailures))
runner.m3Client.UpdateGauge(metrics.ReplicateExtentScope, metrics.StorageReplicationJobCurrentSuccess, int64(jobsStarted))
runner.m3Client.UpdateGauge(metrics.ReplicateExtentScope, metrics.StorageReplicationJobRun, 1)
Contributor:
This 'gauge' would be stuck at '1'; you need to increment a counter instead.

Contributor Author:
This is a routine that runs only every ten minutes. The '1' here indicates that the goroutine ran instead of being stuck somewhere, so I can use something like 'movingSum 1h' to set up an alert.

Contributor:
I am still a little confused, though. With this, don't you then need to reset it to zero when the run has finished, and perhaps check for a "0" to indicate that the run is completing and restarting, etc.?

I still feel a 'counter' would probably help more: if the rate of change of the count is zero, that indicates a problem, etc.

Contributor Author:
No, I don't need to reset it. After every run, the gauge metric will increase by 1, and then I use 'movingSum 1h' to get how many times the routine has run in the past hour.

Contributor:
Okay, as discussed offline, that should work, since "no report"/null corresponds to "0".


runner.logger.WithFields(bark.Fields{
`stats`: fmt.Sprintf(`total extents: %v, remote extents:%v, opened for replication: %v, primary: %v, secondary: %v, failed: %v, success: %v`,
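To summarize the resolution of the thread: the merged change is effectively a heartbeat. Each run of the goroutine reports a gauge value of 1, and an alert built on something like a moving sum over the last hour fires when the reports stop (since "no report"/null corresponds to 0). Below is a minimal, self-contained sketch of that pattern, with a hypothetical stand-in for the metrics client; only the UpdateGauge call shape mirrors the line added in the diff above:

```go
package main

import (
	"fmt"
	"time"
)

// fakeMetricsClient is a stand-in for the real m3Client; only the
// UpdateGauge call shape mirrors the line added in the diff above.
type fakeMetricsClient struct{}

func (fakeMetricsClient) UpdateGauge(scope, metric string, value int64) {
	fmt.Printf("gauge %s/%s = %d\n", scope, metric, value)
}

func main() {
	client := fakeMetricsClient{}

	// The real runner fires roughly every ten minutes; a short interval
	// keeps this sketch quick to run.
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for i := 0; i < 3; i++ {
		<-ticker.C

		// ... replication bookkeeping would happen here ...

		// Heartbeat: report 1 on every run. Downstream, an alert such as
		// movingSum(<metric>, '1h') < 1 catches a stuck goroutine, because
		// "no report"/null is treated as 0.
		client.UpdateGauge("ReplicateExtentScope", "storage.replication-job.run", 1)
	}
}
```

The key property, as agreed in the thread, is that the signal degrades to zero/null on its own when the runner stops, so no explicit reset is needed.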