This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

Prometheus metrics for backups #2095

Open
jescarri opened this issue Jun 17, 2019 · 6 comments

Comments

@jescarri

Currently there's no exporter for the etcd-backup-operator.

Creating this issue to link it to a PR.

jescarri added a commit to jescarri/etcd-operator that referenced this issue Jun 17, 2019
Added a few metrics and an exporter to the backup operator.

Related to: coreos#2095
@jurgenweber

jurgenweber commented Jul 9, 2019

As a side note: I took your branch, merged it with mine, and built my own backup-operator image. Works great.

@jescarri
Author

jescarri commented Jul 9, 2019

@jurgenweber Yes, it's been running in our clusters for a few weeks without problems :)

Thanks for testing it!

@jurgenweber

jurgenweber commented Jul 10, 2019

Do you have any Prometheus alerts or Grafana dashboards you wouldn't mind sharing?

Also, I am finding that if the pod gets restarted, the metrics disappear until a new backup runs. You can see that the metrics endpoint no longer returns the etcd_operator_backup.* metrics, while others still do. I think it needs to return them all the time, even when they have no value yet. Thoughts?
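(For anyone hitting this before an exporter-side fix lands, a query-side workaround is to fall back to an explicit zero when the series is missing. This is a sketch: the metric name comes from the linked PR, but the `or vector(0)` fallback is my suggestion, not something in the PR.)

```promql
# If the counter series is absent after a restart, substitute a
# zero-valued vector so rate-based alerts keep evaluating instead
# of going silent
(rate(etcd_operator_backups_success_total[30m]) * 1800) or vector(0)
```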

@jescarri
Author

@rjtsdl Sure, I can do that.

I was planning to add readiness / liveness probes later, but you are right, simple handlers can do the trick.
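For reference, once such handlers exist, the Deployment probes could look like this. This is only a sketch: the `/healthz` path and port `8080` are assumptions for illustration, not the values in the PR.

```yaml
livenessProbe:
  httpGet:
    path: /healthz   # hypothetical endpoint served by the simple handler
    port: 8080       # assumed health/metrics port
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
```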

jescarri added a commit to jescarri/etcd-operator that referenced this issue Jul 11, 2019
Added a few metrics and an exporter to the backup operator.

Related to: coreos#2095

Removed gorilla mux
@jescarri
Author

@jurgenweber this is what we have right now:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
  labels:
    prometheus: k8s
    role: alert-rules
  name: etcd-backup
spec:
  groups:
  - name: etcd-backup
    rules:
    - alert: etcdBackupControllerDown
      annotations:
        summary: etcd-backup-operator has been down for 5 minutes
      expr: absent(up{app="etcd-backup-operator"}) == 1
      for: 5m
      labels:
        class: availability
        severity: p1
    - alert: etcdBackupsNOTAttempted
      annotations:
        summary: No etcd backup has been attempted in the past 30 min
      expr: rate(etcd_operator_backups_attempt_total[30m]) * 1800 < 2
      for: 5m
      labels:
        class: availability
        severity: p2
    - alert: etcdBackupsNOTSucceeding
      annotations:
        summary: No etcd backup has succeeded in the past 30 min
      expr: rate(etcd_operator_backups_success_total[30m]) * 1800 < 2
      for: 5m
      labels:
        class: availability
        severity: p2
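
For context on the expressions above: `rate(...[30m]) * 1800` scales the per-second rate by the window length (1800 s) to approximate the number of backups in the last 30 minutes, so `< 2` fires when fewer than two were seen. An equivalent, arguably clearer form (my rewording, not what we actually run) uses `increase()`:

```promql
# increase() already scales the per-second rate over the whole range,
# so this approximates "fewer than 2 successful backups in 30 minutes"
increase(etcd_operator_backups_success_total[30m]) < 2
```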

@jurgenweber

Yeah, my schedule is once an hour:

        - alert: VaultEtcdLastBackup
          annotations:
            summary: The last backup was more than 1 hour ago, please check it
            description: "vault etcd {{ $labels.instance }} backup too old"
          expr: time() - etcd_operator_backup_last_success{name="vault-etcd-cluster-backup",namespace="devops",release="amazing-dog"} > 3700
          for: 10m
          labels:
            severity: critical
        - alert: VaultEtcdBackupFailed
          annotations:
            summary: The backup has failed; we expect 3 successful backups over the last 3 hours. Check that backups are working.
            description: "vault etcd {{ $labels.instance }} backup has failed"
          expr: increase(etcd_operator_backups_success_total{name="vault-etcd-cluster-backup",namespace="devops",release="amazing-dog"}[3h]) < 3
          for: 10m
          labels:
            severity: critical
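
A recording rule can make the backup-age check reusable across alerts and dashboards. A sketch only: the record name `etcd_backup:last_success_age_seconds` is my own invention; only `etcd_operator_backup_last_success` comes from the PR.

```yaml
groups:
- name: etcd-backup-records
  rules:
  # Seconds since the last successful backup, per backup CR
  - record: etcd_backup:last_success_age_seconds
    expr: time() - etcd_operator_backup_last_success
```

With that in place, the `VaultEtcdLastBackup` expression reduces to `etcd_backup:last_success_age_seconds{...} > 3700` (one hour plus a 100 s buffer for the hourly schedule).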
