mixin: Fix alert about unhealthy sidecar
The alert was giving the wrong information as the $value contained
the number of pods that were failing to send heartbeats instead of the actual
number of seconds that each sidecar had been unhealthy.
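
To make the mismatch concrete, here is a minimal Python sketch of what the old expression effectively computed. The function name `old_alert_value` and the sample heartbeat data are illustrative assumptions, not code from the Thanos repository; only the 300-second threshold is taken from the old rule.

```python
def old_alert_value(last_heartbeat_by_pod, now, threshold=300):
    """Mimic: count(time() - max(...) by (job, pod) >= 300) > 0.

    last_heartbeat_by_pod maps a pod name to the Unix timestamp of its
    last successful heartbeat (hypothetical sample data).
    """
    # Keep only the pods whose lag reaches the threshold.
    lagging = [pod for pod, ts in last_heartbeat_by_pod.items()
               if now - ts >= threshold]
    # count() yields the number of remaining series; "> 0" drops a zero.
    count = len(lagging)
    return count if count > 0 else None


heartbeats = {"thanos-sidecar-pod-0": 0, "thanos-sidecar-pod-1": 0}
value = old_alert_value(heartbeats, now=360)
# value is 2 (two lagging pods), even though each pod lags by 360 seconds.
message = f"Thanos Sidecar is unhealthy for {value} seconds."
```

Under these assumptions the rendered message reports "2 seconds", which is the pod count rather than a duration.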

Also, the 5-minute interval is probably too low, as on large deployments
Prometheus could take much longer to come online and for the sidecar to
become actually useful.

As such, we can simply subtract the timestamp of the last heartbeat from
the current time and fire if we are lagging for more than 10 minutes.
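
The computation described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the rule itself: `sidecar_lag_by_job` and the sample data are hypothetical names, and the grouping mirrors the `max(...) by (job)` aggregation in the new expression.

```python
def sidecar_lag_by_job(last_heartbeat, now, threshold=600):
    """Mimic: time() - max(last_heartbeat) by (job) >= 600.

    last_heartbeat maps (job, pod) to the Unix timestamp of the last
    successful heartbeat (hypothetical sample data).
    """
    # max(...) by (job): keep the most recent heartbeat seen per job.
    newest = {}
    for (job, _pod), ts in last_heartbeat.items():
        newest[job] = max(ts, newest.get(job, ts))
    # Subtract from the current time and keep jobs at or past the threshold.
    return {job: now - ts for job, ts in newest.items()
            if now - ts >= threshold}


stale = {("thanos-sidecar", "pod-0"): 0, ("thanos-sidecar", "pod-1"): 0}
firing = sidecar_lag_by_job(stale, now=600)      # {"thanos-sidecar": 600}
not_yet = sidecar_lag_by_job(stale, now=599)     # {}
```

Because the aggregation takes the maximum per job, a job in this sketch only fires once its most recent heartbeat across all pods is at least 600 seconds old; a single healthy pod keeps the job below the threshold.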

Signed-off-by: Markos Chandras <markos@chandras.me>
hwoarang committed Aug 4, 2020
1 parent 040b69b commit 8960536
Showing 5 changed files with 30 additions and 38 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -23,6 +23,7 @@ We use *breaking* word for marking changes that are not backward compatible (rel
  - [#2936](https://github.com/thanos-io/thanos/pull/2936) Compact: Fix ReplicaLabelRemover panic when replicaLabels are not specified.
  - [#2956](https://github.com/thanos-io/thanos/pull/2956) Store: Fix fetching of chunks bigger than 16000 bytes.
  - [#2970](https://github.com/thanos-io/thanos/pull/2970) Store: Upgrade minio-go/v7 to fix slowness when running on EKS.
+ - [#2929](https://github.com/thanos-io/thanos/pull/2929) Mixin: Fix expression for 'unhealthy sidecar' alert and also increase the timeout for 10 minutes.

  ### Added

2 changes: 1 addition & 1 deletion examples/alerts/alerts.md
@@ -275,7 +275,7 @@ rules:
  message: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value
  }} seconds.
  expr: |
- count(time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job, pod) >= 300) > 0
+ time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job) >= 600
  labels:
  severity: critical
  ```
2 changes: 1 addition & 1 deletion examples/alerts/alerts.yaml
@@ -258,7 +258,7 @@ groups:
  message: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{
  $value }} seconds.
  expr: |
- count(time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job, pod) >= 300) > 0
+ time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job) >= 600
  labels:
  severity: critical
  - name: thanos-store.rules
61 changes: 26 additions & 35 deletions examples/alerts/tests.yaml
@@ -22,47 +22,35 @@ tests:
  exp_samples:
  - labels: '{}'
  value: 120
- - expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod)
+ - expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job)
  eval_time: 2m
  exp_samples:
- - labels: '{pod="thanos-sidecar-pod-0"}'
+ - labels: '{job="thanos-sidecar"}'
  value: 43
- - labels: '{pod="thanos-sidecar-pod-1"}'
- value: 42
- - expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod)
- eval_time: 5m
+ - expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job)
+ eval_time: 10m
  exp_samples:
- - labels: '{pod="thanos-sidecar-pod-0"}'
+ - labels: '{job="thanos-sidecar"}'
  value: 0
- - labels: '{pod="thanos-sidecar-pod-1"}'
- value: 0
- - expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod)
- eval_time: 6m
+ - expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job)
+ eval_time: 11m
  exp_samples:
- - labels: '{pod="thanos-sidecar-pod-0"}'
- value: 0
- - labels: '{pod="thanos-sidecar-pod-1"}'
+ - labels: '{job="thanos-sidecar"}'
  value: 0
- - expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod)
- eval_time: 5m
+ - expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job)
+ eval_time: 10m
  exp_samples:
- - labels: '{pod="thanos-sidecar-pod-0"}'
- value: 300
- - labels: '{pod="thanos-sidecar-pod-1"}'
- value: 300
- - expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod)
- eval_time: 6m
+ - labels: '{job="thanos-sidecar"}'
+ value: 600
+ - expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job)
+ eval_time: 11m
  exp_samples:
- - labels: '{pod="thanos-sidecar-pod-0"}'
- value: 360
- - labels: '{pod="thanos-sidecar-pod-1"}'
- value: 360
- - expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod) >= 300
+ - labels: '{job="thanos-sidecar"}'
+ value: 660
+ - expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job) >= 600
  eval_time: 12m
  exp_samples:
- - labels: '{pod="thanos-sidecar-pod-0"}'
- value: 720
- - labels: '{pod="thanos-sidecar-pod-1"}'
+ - labels: '{job="thanos-sidecar"}'
  value: 720
  alert_rule_test:
  - eval_time: 1m
@@ -71,24 +59,27 @@ tests:
  alertname: ThanosSidecarUnhealthy
  - eval_time: 3m
  alertname: ThanosSidecarUnhealthy
- - eval_time: 5m
+ - eval_time: 10m
  alertname: ThanosSidecarUnhealthy
  exp_alerts:
  - exp_labels:
  severity: critical
+ job: thanos-sidecar
  exp_annotations:
- message: 'Thanos Sidecar is unhealthy for 2 seconds.'
- - eval_time: 6m
+ message: 'Thanos Sidecar thanos-sidecar is unhealthy for 600 seconds.'
+ - eval_time: 11m
  alertname: ThanosSidecarUnhealthy
  exp_alerts:
  - exp_labels:
  severity: critical
+ job: thanos-sidecar
  exp_annotations:
- message: 'Thanos Sidecar is unhealthy for 2 seconds.'
+ message: 'Thanos Sidecar thanos-sidecar is unhealthy for 660 seconds.'
  - eval_time: 12m
  alertname: ThanosSidecarUnhealthy
  exp_alerts:
  - exp_labels:
  severity: critical
+ job: thanos-sidecar
  exp_annotations:
- message: 'Thanos Sidecar is unhealthy for 2 seconds.'
+ message: 'Thanos Sidecar thanos-sidecar is unhealthy for 720 seconds.'
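
The expected values in the updated tests follow from simple arithmetic, sketched below in Python. The sketch assumes that the test clock starts at zero, so that `time()` at an `eval_time` of N minutes equals N * 60 seconds, and that the last heartbeat timestamp is 0 at those evaluation times, as the `max(...)` test cases at 10m and 11m show. The helper name `expected_lag_seconds` is hypothetical.

```python
THRESHOLD_SECONDS = 600  # taken from the new ">= 600" comparison


def expected_lag_seconds(eval_minutes, last_heartbeat=0):
    """Return time() - last_heartbeat, assuming time() starts at 0."""
    eval_time = eval_minutes * 60  # assumed value of time() at eval_time
    return eval_time - last_heartbeat


# Lag and firing state at the three evaluation times used by the tests.
results = {
    minutes: (expected_lag_seconds(minutes),
              expected_lag_seconds(minutes) >= THRESHOLD_SECONDS)
    for minutes in (10, 11, 12)
}
# results == {10: (600, True), 11: (660, True), 12: (720, True)}
```

These lags match the 600, 660, and 720 seconds that the alert tests expect in the rendered message.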
2 changes: 1 addition & 1 deletion mixin/alerts/sidecar.libsonnet
@@ -27,7 +27,7 @@
  message: 'Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.',
  },
  expr: |||
- count(time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{%(selector)s}) by (job, pod) >= 300) > 0
+ time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{%(selector)s}) by (job) >= 600
  ||| % thanos.sidecar,
  labels: {
  severity: 'critical',
