Merge pull request grafana/cortex-jsonnet#405 from grafana/alert-on-stuck-rollout

Add CortexRolloutStuck alert
pracucci authored Oct 14, 2021
2 parents 4de2e29 + ea3274f commit fd975db
Showing 2 changed files with 70 additions and 0 deletions.
61 changes: 61 additions & 0 deletions jsonnet/mimir-mixin/alerts/alerts.libsonnet
@@ -412,6 +412,67 @@
      },
    ],
  },
  {
    name: 'cortex-rollout-alerts',
    rules: [
      {
        alert: 'CortexRolloutStuck',
        expr: |||
          (
            max without (revision) (
              kube_statefulset_status_current_revision
                unless
              kube_statefulset_status_update_revision
            )
              *
            (
              kube_statefulset_replicas
                !=
              kube_statefulset_status_replicas_updated
            )
          ) and (
            changes(kube_statefulset_status_replicas_updated[15m])
              ==
            0
          )
          * on(%s) group_left max by(%s) (cortex_build_info)
        ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
        'for': '15m',
        labels: {
          severity: 'warning',
        },
        annotations: {
          message: |||
            The {{ $labels.statefulset }} rollout is stuck in %(alert_aggregation_variables)s.
          ||| % $._config,
        },
      },
      {
        alert: 'CortexRolloutStuck',
        expr: |||
          (
            kube_deployment_spec_replicas
              !=
            kube_deployment_status_replicas_updated
          ) and (
            changes(kube_deployment_status_replicas_updated[15m])
              ==
            0
          )
          * on(%s) group_left max by(%s) (cortex_build_info)
        ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
        'for': '15m',
        labels: {
          severity: 'warning',
        },
        annotations: {
          message: |||
            The {{ $labels.deployment }} rollout is stuck in %(alert_aggregation_variables)s.
          ||| % $._config,
        },
      },
    ],
  },
  {
    name: 'cortex-provisioning',
    rules: [
9 changes: 9 additions & 0 deletions jsonnet/mimir-mixin/docs/playbooks.md
@@ -724,6 +724,15 @@ When an alertmanager cannot read the state for a tenant from storage it gets log
- The state could not be merged because it might be invalid and could not be decoded. This could indicate data corruption and therefore a bug in the reading or writing of the state, and would need further investigation.
- The state could not be read from storage. This could be due to a networking issue such as a timeout or an authentication and authorization issue with the remote object store.
### CortexRolloutStuck
This alert fires when a Cortex service rollout is stuck: the number of updated replicas doesn't match the expected number, and the rollout has made no progress in the last 15 minutes. The alert monitors services deployed as Kubernetes `StatefulSet` and `Deployment`.
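
As an illustration of the condition described above, here is a minimal Python sketch of the replica-count part of the alert expression. It is not part of the mixin: the function names `count_changes` and `rollout_looks_stuck` are hypothetical, the samples stand in for successive values of the "updated replicas" metric over the 15m window, and the sketch ignores both the StatefulSet revision comparison and the `for: 15m` clause.

```python
from typing import Sequence


def count_changes(samples: Sequence[int]) -> int:
    """Count how often consecutive samples differ, similar in spirit to PromQL changes()."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def rollout_looks_stuck(desired_replicas: int, updated_samples: Sequence[int]) -> bool:
    """Return True when the updated-replica count is off target and has not moved.

    desired_replicas: expected number of replicas (assumed input).
    updated_samples: updated-replica counts observed over the window, oldest first.
    """
    if not updated_samples:
        # No data in the window: nothing to conclude in this sketch.
        return False
    latest = updated_samples[-1]
    return desired_replicas != latest and count_changes(updated_samples) == 0


if __name__ == "__main__":
    print(rollout_looks_stuck(3, [1, 1, 1]))  # off target, no movement
    print(rollout_looks_stuck(3, [1, 2, 2]))  # off target, but progressing
    print(rollout_looks_stuck(3, [3, 3, 3]))  # rollout complete
```

A rollout that is still progressing changes the updated-replica count inside the window, so it does not match even though the counts still differ.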
How to **investigate**:
- Run `kubectl -n <namespace> get pods -l name=<statefulset|deployment>` to get a list of running pods
- Ensure there's no pod in a failing state (e.g. `Error`, `OOMKilled`, `CrashLoopBackOff`)
- Ensure there's no pod `NotReady` (the number of ready containers should match the total number of containers, e.g. `1/1` or `2/2`)
- Run `kubectl -n <namespace> describe statefulset <name>` or `kubectl -n <namespace> describe deployment <name>` and look at "Pod Status" and "Events" to get more information
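
The first three checks above can be sketched as a small Python helper that scans the text printed by `kubectl get pods`. This is only an illustration, not part of the playbook: the function name `find_problem_pods`, the `FAILING_STATUSES` set (limited to the examples named above), and the sample pod names are all assumptions, and the sketch assumes the default `NAME READY STATUS ...` column layout.

```python
from typing import List, Tuple

# Failing states named in the checks above; extend as needed (assumption).
FAILING_STATUSES = {"Error", "OOMKilled", "CrashLoopBackOff"}


def find_problem_pods(kubectl_output: str) -> List[Tuple[str, str]]:
    """Return (pod name, reason) pairs for pods that are failing or not ready."""
    problems = []
    lines = kubectl_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total_count = ready.partition("/")
        if status in FAILING_STATUSES:
            problems.append((name, status))
        elif ready_count != total_count:
            # Ready containers do not match total containers, e.g. 0/1.
            problems.append((name, "NotReady (%s)" % ready))
    return problems


# Hypothetical output, for illustration only.
SAMPLE_OUTPUT = """\
NAME         READY   STATUS             RESTARTS   AGE
ingester-0   1/1     Running            0          3d
ingester-1   0/1     CrashLoopBackOff   12         40m
ingester-2   0/1     Running            0          2m
"""

if __name__ == "__main__":
    for pod, reason in find_problem_pods(SAMPLE_OUTPUT):
        print(pod, reason)
```

Any pod the helper reports is a candidate for `kubectl describe` to find out why it is not becoming ready.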
## Cortex routes by path