Evicted pods can be in phase Failed, and I would expect this alert to catch them.
Happy to submit a PR to resolve this, but I wanted to know whether I am right to think that KubePodNotReady should catch {phase=Failed}, or whether there should be a new alert, KubePodFailed or similar?
Current alert:
{
  expr: |||
    sum by (namespace, pod) (kube_pod_status_phase{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s, phase=~"Pending|Unknown"}) > 0
  ||| % $._config,
  labels: {
    severity: 'critical',
  },
  annotations: {
    message: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than an hour.',
  },
  'for': '1h',
  alert: 'KubePodNotReady',
},
There is also a KubeJobFailed alert that may influence the decision for how to manage Failed pods.
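For reference, the first option (having KubePodNotReady also match Failed) might look roughly like the sketch below. This is only an illustration, not a merged change: the `_config` values are placeholder assumptions added so that the snippet evaluates on its own, and the only difference from the current rule is the extra `Failed` in the phase matcher.

```jsonnet
// Sketch only: the current rule with Failed added to the phase matcher.
{
  // Placeholder selector values (assumptions), so the snippet is self-contained.
  _config:: {
    prefixedNamespaceSelector: '',
    kubeStateMetricsSelector: 'job="kube-state-metrics"',
  },
  rules: [
    {
      alert: 'KubePodNotReady',
      expr: |||
        sum by (namespace, pod) (kube_pod_status_phase{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s, phase=~"Pending|Unknown|Failed"}) > 0
      ||| % $._config,
      'for': '1h',
      labels: {
        severity: 'critical',
      },
      annotations: {
        message: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than an hour.',
      },
    },
  ],
}
```

With this variant, a pod that stays in phase Failed for an hour would fire at the same critical severity as a pod stuck in Pending or Unknown.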
jdbaldry changed the title from "KubePodNotReady alert does not catch {phase=Failed}" to "KubePodNotReady alert does not catch kube_pod_status_phase{phase=Failed}" on Jun 13, 2019
From our perspective, this is not the right approach. Failed pods are not "not ready"; they are "failed". Currently, any failed pod is considered not ready, regardless of the reason for failure (which could be any non-zero exit code). This is especially annoying in a dev environment where many developers launch pods against the cluster, or in big data environments that run batch jobs. All of these failed pods now raise critical alerts.
I would revert the MR and instead propose a warning-level alert that fires whenever a pod fails.
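A separate warning-level rule along those lines could be sketched as follows. Treat it as a rough starting point rather than a finished proposal: the alert name KubePodFailed is the one floated in the original report, while the message text and the placeholder `_config` values are assumptions made up for illustration.

```jsonnet
// Sketch only: a warning-level alert for pods in phase Failed.
{
  // Placeholder selector values (assumptions), so the snippet is self-contained.
  _config:: {
    prefixedNamespaceSelector: '',
    kubeStateMetricsSelector: 'job="kube-state-metrics"',
  },
  rules: [
    {
      alert: 'KubePodFailed',
      expr: |||
        sum by (namespace, pod) (kube_pod_status_phase{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s, phase="Failed"}) > 0
      ||| % $._config,
      labels: {
        // Warning rather than critical, so failed dev or batch pods do not page anyone.
        severity: 'warning',
      },
      annotations: {
        // Hypothetical message text.
        message: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} is in phase Failed.',
      },
    },
  ],
}
```

There is no `'for'` clause here, so the rule fires as soon as a failed pod is observed; adding one would be a way to ignore pods that are cleaned up quickly.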