-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metrics package to publish metrics #531
Conversation
Starting a WorkInProgress PR to get some feedback on the approach here to add metrics. Related to #84 Adding some code to the discussion on the issue. I will continue to add more metrics following the RED. |
@ashish-amarnath thanks for this PR. We've got a lot going on but at the moment (prepping for an 0.9 alpha with restic support), but we'll review it soon! |
pkg/cmd/server/server.go
Outdated
@@ -501,6 +509,13 @@ func (s *server) runControllers(config *api.Config) error { | |||
s.logger, | |||
) | |||
|
|||
go func() { | |||
http.Handle("/metrics", prometheus.Handler()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer not to use package-level http handler registration. Would you mind creating an http.ServeMux
, registering the handler with that, and then passing the ServeMux
as the 2nd parameter to http.ListenAndServe
?
pkg/metrics/metrics.go
Outdated
/****Metrics related to Ark server backup***********/ | ||
lastBackupStatus = prometheus.NewGauge( | ||
prometheus.GaugeOpts{ | ||
Namespace: "heptio-ark", // TODO: don't hard-code this value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that the namespace needs to be flexible, I think it might make more sense to have a struct that holds on to all the metrics, e.g.
type Metrics struct {
lastBackupStatus prometheus.Guage
}
func NewMetrics(namespace string) *Metrics {
return &Metrics{
lastBackupStatus: prometheus.NewGuage(
prometheus.GaugeOpts{
Namespace: namespace,
// ...
},
),
}
}
func (m *Metrics) UpdateLastBackupResult(success bool) {
if success {
m.lastBackupStatus.Set(1)
} else {
m.lastBackupStatus.Set(0)
}
}
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that makes sense. I would have gone down that route to get rid of that TODO
that I had.
Just did a very quick first pass |
@ncdc Thanks for the quick review! I've made changes to address your comments. I will now start adding some metrics |
pkg/metrics/metrics.go
Outdated
lastBackupStatus prometheus.Gauge | ||
} | ||
|
||
var metrics *serverMetrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, 1 more change here, please 😄. I'd prefer not to have any package level variables.
pkg/metrics/metrics.go
Outdated
// GetMetricsMux returns a mux with the metrics handle | ||
func GetMetricsMux() *http.ServeMux { | ||
mux := http.NewServeMux() | ||
mux.Handle("/metrics", prometheus.Handler()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the new-fangled way to do this is to use promhttp.Handler()
.
pkg/metrics/metrics.go
Outdated
lastBackupStatus prometheus.Gauge | ||
} | ||
|
||
var metrics *serverMetrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to avoid the package-level variable. Would you mind looking at https://github.com/heptio/gimbal/blob/master/discovery/pkg/metrics/metrics.go and basing this file on that one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's helpful!
pkg/cmd/server/server.go
Outdated
@@ -96,7 +98,14 @@ func NewCommand() *cobra.Command { | |||
} | |||
namespace := getServerNamespace(namespaceFlag) | |||
|
|||
s, err := newServer(namespace, fmt.Sprintf("%s-%s", c.Parent().Name(), c.Name()), pluginDir, logger) | |||
metricAddressFlag := c.Flag("metric-address") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a variable at the top of this function (where logLevelFlag
and pluginDir
are) and set it to the default address, and then please bind it similar to how we bind log-level
and plugin-dir
). Also please call the flag metrics-address
.
pkg/cmd/server/server.go
Outdated
@@ -123,8 +132,17 @@ func getServerNamespace(namespaceFlag *pflag.Flag) string { | |||
return api.DefaultNamespace | |||
} | |||
|
|||
func getMetricAddress(metricAddressFlag *pflag.Flag) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't need this function any more once you do what I commented above.
pkg/apis/ark/v1/constants.go
Outdated
@@ -37,4 +37,8 @@ const ( | |||
// NamespaceScopedDir is the name of the directory containing namespace-scoped | |||
// resource within an Ark backup. | |||
NamespaceScopedDir = "namespaces" | |||
|
|||
// DefaultMetricAddress is the port at which the metric server will listen on to make the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's call this DefaultMetricsAddress please. I'd also move the constant to pkg/cmd/server/server.go
.
pkg/metrics/metrics.go
Outdated
"github.com/prometheus/client_golang/prometheus/promhttp" | ||
) | ||
|
||
// ServerMetrics Prometheus metrics for the server |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ServerMetrics contains Prometheus metrics for the Ark server.
pkg/metrics/metrics.go
Outdated
|
||
// ServerMetrics Prometheus metrics for the server | ||
type ServerMetrics struct { | ||
/****Metrics related to Ark server backup***********/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure the comment is needed?
pkg/metrics/metrics.go
Outdated
} | ||
} | ||
|
||
// GetMetricsMux returns a mux with the metrics handle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get rid of this function and set up the mux where you spawn the goroutine in server.go so this package doesn't need to know anything about http.
pkg/cmd/server/server.go
Outdated
metricsMux := http.NewServeMux() | ||
metricsMux.Handle("/metrics", promhttp.Handler()) | ||
err := http.ListenAndServe(s.metricsAddress, metricsMux) | ||
s.logger.Fatalf("Failed to start metrics: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's only make the fatal call if err != nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please also verify that if you run ark server
by hand and you do ctrl-c
that it shuts down cleanly and this doesn't keep it running?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops! Thanks for catching that.
pkg/controller/backup_controller.go
Outdated
if err := controller.runBackup(backup, controller.bucket); err != nil { | ||
logContext.WithError(err).Error("backup failed") | ||
backup.Status.Phase = api.BackupPhaseFailed | ||
controller.metrics.RegisterBackupFailed(backupName, backupScheduleName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the RegisterBackupSuccess
case may need to be moved into an else
clause here, such that a backup doesn't register as failed and succeeded.
I tried to run a backup using a GCP serviceaccount that didn't have access to the bucket selected, and my metrics stated that the backup failed and succeeded.
ERRO[0436] Error uploading log file bucket=<BUCKET_REDACTED> error="rpc error: code = Unknown desc = googleapi: Error 403: <ACCOUNT_REDACTED>.iam.gserviceaccount.com does not have storage.objects.create access to <BUCKET_REDACTED>/default-backup/default-backup-logs.gz., forbidden" key=default-backup/default-backup-logs.gz logSource="pkg/cloudprovider/backup_service.go:148"
ark backup get
shows the backup as failed.
% ./_output/bin/linux/amd64/ark backup get
NAME STATUS CREATED EXPIRES SELECTOR
default-backup Failed 2018-06-12 09:52:38 -0400 EDT 29d <none>
These are the metrics:
ark_server_backup_attempt_total{backupName="heptio-ark/default-backup",schedule="UNSCHEDULED"} 1
ark_server_backup_failed_total{backupName="heptio-ark/default-backup",schedule="UNSCHEDULED"} 1
ark_server_backup_size_bytes{backupName="heptio-ark/default-backup",schedule="UNSCHEDULED"} 4865
ark_server_backup_success_total{backupName="heptio-ark/default-backup",schedule="UNSCHEDULED"} 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nrb thanks for running the test for me and catching this bug! Pushed a change with the fix.
pkg/controller/backup_controller.go
Outdated
func getBackupScheduleName(backup *api.Backup) string { | ||
scheduleName := kubeutil.LabelValue(backup, "ark-schedule") | ||
if scheduleName == "" { | ||
scheduleName = "UNSCHEDULED" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a Backup doesn't have a Schedule, we should simply not add the schedule
label to the metrics rather than adding an arbitrary value.. This will simplify the prometheus queries later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this allows users to collect metrics for all unscheduled backups easily. I don't have a strong preference on this though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How I've seen this work with Prometheus/Grafana is to use label matching operators:
ark_server_backup_attempt_total{ark-schedule=".+"}
--> Matches any backup with a schedule specified
ark_server_backup_attempt_total{ark-schedule=~"${Schedule}"
--> matches any backup with schedule using a variable, defaults to ".+" for "All", ignores any backups without a schedule
ark_server_backup_attempt_total{ark-schedule!=".+"}
--> Matches all backups without a schedule
--plugin-dir string directory containing Ark plugins (default "/plugins") | ||
-h, --help help for server | ||
--log-level the level at which to log. Valid values are debug, info, warning, error, fatal, panic. (default info) | ||
--metrics-address string the address to expose prometheus metrics (default ":8085") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be simplified to just port? Or does this allow us to specify a listen address & port (e.g. localhost:8085 or $public_ip:8085)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does allow you to specify an address. In this case, it will always be localhost:8085
or just :8085
.
pkg/cmd/server/server.go
Outdated
go func() { | ||
metricsMux := http.NewServeMux() | ||
metricsMux.Handle("/metrics", promhttp.Handler()) | ||
if err := http.ListenAndServe(s.metricsAddress, metricsMux); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nit: It's nice to log the address/port that the metrics are using. It's helpful in troubleshooting them not working.
pkg/cmd/server/server.go
Outdated
err := http.ListenAndServe(s.metricsAddress, metricsMux) | ||
s.logger.Fatalf("Failed to start metrics: %v", err) | ||
if err := http.ListenAndServe(s.metricsAddress, metricsMux); err != nil { | ||
s.logger.Fatalf("Failed to start metrics: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s.logger.WithError(err).Fatalf("Failed to start metrics")
pkg/metrics/metrics.go
Outdated
const ( | ||
metricNamespace = "ark_server" | ||
backupSizeBytesGauge = "backup_size_bytes" | ||
backupAttemptsCount = "backup_attempts_total" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it typical to append something like total
here - it seems weird to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ncdc it's a normal thing - it can help with understanding the units of metrics. See: https://prometheus.io/docs/practices/naming/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, prom newbie here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the convention I've seen for counters. The idea is we are counting 'events' and the value is the total number of events that have occurred, hence the 'total'.
pkg/metrics/metrics.go
Outdated
metricNamespace = "ark_server" | ||
backupSizeBytesGauge = "backup_size_bytes" | ||
backupAttemptsCount = "backup_attempts_total" | ||
backupSuccessCount = "backup_success_total" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we keep total, this name is fine as-is. Otherwise I'd maybe call it backup_successes
Also note that we may move away from success/failure for backups. For restores, the valid phases are new, in progress, failed validation, and completed. In each restore's status, we record the number of warnings and errors that occurred, but as long as we're able to restore something, the status is always completed. We are considering doing something similar for backups. (#286 is related)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we make that change, we can call this metric backup_completed_total
. Or even have a metric per phase of the backup. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, and we can discuss it if/when we make that change.
pkg/metrics/metrics.go
Outdated
backupSizeBytesGauge = "backup_size_bytes" | ||
backupAttemptsCount = "backup_attempts_total" | ||
backupSuccessCount = "backup_success_total" | ||
backupFailedCount = "backup_failed_total" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we keep total, I'd call this backup_failure_total
; otherwise, backup_failures
.
pkg/metrics/metrics.go
Outdated
|
||
const ( | ||
metricNamespace = "ark_server" | ||
backupSizeBytesGauge = "backup_size_bytes" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rosskukulinski please compare with your naming suggestions here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the backup_size_bytes
the size of the resulting tarbal? If so, I suggest ark_backup_tarball_size_bytes
because that's different than the PV-related sizes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed.
pkg/util/kube/utils.go
Outdated
@@ -28,6 +28,14 @@ import ( | |||
corev1listers "k8s.io/client-go/listers/core/v1" | |||
) | |||
|
|||
// LabelValue looks up and returns the value for a label on the object | |||
func LabelValue(objMeta metav1.Object, labelName string) string { | |||
if value, exists := objMeta.GetLabels()[labelName]; exists { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that labels are map[string]string
and the zero value of any undefined key is the empty string, do we need this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
pkg/controller/backup_controller.go
Outdated
@@ -269,9 +273,16 @@ func (controller *backupController) processBackup(key string) error { | |||
|
|||
logContext.Debug("Running backup") | |||
// execution & upload of backup | |||
backupName := kubeutil.NamespaceAndName(backup) | |||
backupScheduleName := kubeutil.LabelValue(backup, "ark-schedule") | |||
controller.metrics.RegisterBackupAttempt(backupName, backupScheduleName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's no schedule, won't this generate a new metric per backup, e.g.
ark_server_backup_attempts_total{backupName="foo1", schedule=""} 1
ark_server_backup_attempts_total{backupName="foo2", schedule=""} 1
ark_server_backup_attempts_total{backupName="foo3, schedule=""} 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it will. I thought that is the behaviour @rosskukulinski was preferring, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I take back my initial answer. I am not sure if it will create a metric per backup. trying to get the correct answer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.. after consulting with a collegue who knows more prom than me! we might just drop the backup name altogether and just have the schedule name.
pkg/metrics/metrics.go
Outdated
Name: backupSizeBytesGauge, | ||
Help: "Size, in bytes, of a backup", | ||
}, | ||
[]string{"schedule", "backupName"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use constants for these. Also please rename backupName
to backup
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/metrics/metrics.go
Outdated
} | ||
|
||
const ( | ||
metricNamespace = "ark_server" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want the _server
part of this? I'm leaning toward just ark
.
pkg/controller/backup_controller.go
Outdated
@@ -269,9 +273,15 @@ func (controller *backupController) processBackup(key string) error { | |||
|
|||
logContext.Debug("Running backup") | |||
// execution & upload of backup | |||
backupScheduleName, _ := backup.GetLabels()["ark-schedule"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need the , _
pkg/controller/backup_controller.go
Outdated
errs = append(errs, err) | ||
} | ||
|
||
backupScheduleName, _ := backup.GetLabels()["ark-schedule"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
pkg/metrics/metrics.go
Outdated
const ( | ||
metricNamespace = "ark_server" | ||
backupTarballSizeBytesGauge = "backup_tarball_size_bytes" | ||
backupAttemptsCount = "backup_attempts_total" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to be consistent with singular vs plural and our nomenclature. Should this be attempt or attempts?
success or successes?
failure or failures?
pkg/metrics/metrics.go
Outdated
} | ||
|
||
const ( | ||
metricNamespace = "ark_server" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're already in the metrics
package - can we call this namespace
instead?
* add a metrics package to handle metric registration and publishing * add a metricsAddress field to the server struct * make metrics a part of the server * start a metrics endpoint as part of starting the controllers * instrument backup_controller to report metrics * update cli-reference docs * update example deployments with prometheus annotations * update 'pkg/install' tooling with prometheus annotations Signed-off-by: Ashish Amarnath <ashish.amarnath@gmail.com>
I don't have any additional comments. Excited for this! |
LGTM. Just doing a final test. |
Add a metrics package to handle metric registration and publishing
Add a metricsAddress field to the server struct
Start a metrics endpoint as part of starting the controllers
Report backup status on backup completion