This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

Prometheus metrics for backups #2095

Open
jescarri opened this issue Jun 17, 2019 · 6 comments

Comments

@jescarri

Currently there's no exporter for the etcd-backup-operator.

Creating this issue to link it to a PR.

jescarri added a commit to jescarri/etcd-operator that referenced this issue Jun 17, 2019
Added a few metrics and an exporter to the backup operator.

Related to: coreos#2095
@jurgenweber

jurgenweber commented Jul 9, 2019

As a side note: I took your branch, merged it with mine, and built my own backup-operator image. Works great.

@jescarri
Author

jescarri commented Jul 9, 2019

@jurgenweber Yes, it's been running in our clusters for a few weeks without problems :)

Thanks for testing it!

@jurgenweber

jurgenweber commented Jul 10, 2019

Do you have any Prometheus alerts or Grafana dashboards you wouldn't mind sharing?

Also, I am finding that if the pod gets restarted, the metrics disappear until a new backup runs. You can see that the metrics endpoint no longer returns the etcd_operator_backup.* metrics, while others still do. I think it needs to return them all the time, even when they have no value yet. Thoughts?
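(For anyone hitting this before an exporter-side fix lands, a query-side workaround is to fall back to an explicit zero when the series is missing. This is a sketch: the metric name comes from the linked PR, but the `or vector(0)` fallback is my suggestion, not something in the PR.)

```promql
# If the counter series is absent after a restart, substitute a
# zero-valued vector so rate-based alerts keep evaluating instead
# of going silent
(rate(etcd_operator_backups_success_total[30m]) * 1800) or vector(0)
```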

@jescarri
Author

@rjtsdl Sure, I can do that.

I was planning to add readiness / liveness probes later, but you are right, simple handlers can do the trick.
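For reference, once such handlers exist, the Deployment probes could look like this. This is only a sketch: the `/healthz` path and port `8080` are assumptions for illustration, not the values in the PR.

```yaml
livenessProbe:
  httpGet:
    path: /healthz   # hypothetical endpoint served by the simple handler
    port: 8080       # assumed health/metrics port
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
```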

jescarri added a commit to jescarri/etcd-operator that referenced this issue Jul 11, 2019
Added a few metrics and an exporter to the backup operator.

Related to: coreos#2095

Removed gorilla mux
@jescarri
Author

@jurgenweber this is what we have right now:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
  labels:
    prometheus: k8s
    role: alert-rules
  name: etcd-backup
spec:
  groups:
  - name: etcd-backup
    rules:
    - alert: etcdBackupControllerDown
      annotations:
        summary: etcd-backup-operator has been down for 5 minutes
      expr: absent(up{app="etcd-backup-operator"}) == 1
      for: 5m
      labels:
        class: availability
        severity: p1
    - alert: etcdBackupsNOTAttempted
      annotations:
        summary: No etcd backup has been attempted in the past 30 min
      expr: rate(etcd_operator_backups_attempt_total[30m]) * 1800 < 2
      for: 5m
      labels:
        class: availability
        severity: p2
    - alert: etcdBackupsNOTSucceeding
      annotations:
        summary: No etcd backup has succeeded in the past 30 min
      expr: rate(etcd_operator_backups_success_total[30m]) * 1800 < 2
      for: 5m
      labels:
        class: availability
        severity: p2
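
For context on the expressions above: `rate(...[30m]) * 1800` scales the per-second rate by the window length (1800 s) to approximate the number of backups in the last 30 minutes, so `< 2` fires when fewer than two were seen. An equivalent, arguably clearer form (my rewording, not what we actually run) uses `increase()`:

```promql
# increase() already scales the per-second rate over the whole range,
# so this approximates "fewer than 2 successful backups in 30 minutes"
increase(etcd_operator_backups_success_total[30m]) < 2
```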

@jurgenweber

Yeah, my schedule is once an hour:

        - alert: VaultEtcdLastBackup
          annotations:
            summary: The last backup was more than 1 hour ago, please check it
            description: "vault etcd {{ $labels.instance }} backup too old"
          expr: time() - etcd_operator_backup_last_success{name="vault-etcd-cluster-backup",namespace="devops",release="amazing-dog"} > 3700
          for: 10m
          labels:
            severity: critical
        - alert: VaultEtcdBackupFailed
          annotations:
            summary: The backup has failed; we expect 3 successful backups over the last 3 hours. Check that backups are working.
            description: "vault etcd {{ $labels.instance }} backup has failed"
          expr: increase(etcd_operator_backups_success_total{name="vault-etcd-cluster-backup",namespace="devops",release="amazing-dog"}[3h]) < 3
          for: 10m
          labels:
            severity: critical
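
A recording rule can make the backup-age check reusable across alerts and dashboards. A sketch only: the record name `etcd_backup:last_success_age_seconds` is my own invention; only `etcd_operator_backup_last_success` comes from the PR.

```yaml
groups:
- name: etcd-backup-records
  rules:
  # Seconds since the last successful backup, per backup CR
  - record: etcd_backup:last_success_age_seconds
    expr: time() - etcd_operator_backup_last_success
```

With that in place, the `VaultEtcdLastBackup` expression reduces to `etcd_backup:last_success_age_seconds{...} > 3700` (one hour plus a 100 s buffer for the hourly schedule).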
