KubePodNotReady alert does not catch kube_pod_status_phase{phase=Failed} #215

jdbaldry · 2019-06-13T07:01:56Z

Evicted pods can be in phase failed and I would expect this alert to catch this.

Happy to submit a PR to resolve, but wanted to know whether I am right to think that KubePodNotReady should catch the {phase=Failed} or of there should be a new alert KubePodFailed or similar?

Current alert:

{
            expr: |||
              sum by (namespace, pod) (kube_pod_status_phase{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s, phase=~"Pending|Unknown"}) > 0
            ||| % $._config,
            labels: {
              severity: 'critical',
            },
            annotations: {
              message: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than an hour.',
            },
            'for': '1h',
            alert: 'KubePodNotReady',
},

There is also a KubeJobFailed alert that may influence the decision for how to manage Failed pods.

{
            alert: 'KubeJobFailed',
            expr: |||
              kube_job_status_failed{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}  > 0
            ||| % $._config,
            'for': '1h',
            labels: {
              severity: 'warning',
            },
            annotations: {
              message: 'Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.',
            },
},

The text was updated successfully, but these errors were encountered:

metalmatze · 2019-06-24T12:07:29Z

Happy to adjust the KubePodNotReady alert. Do you want to send a PR?

pascalwhoop · 2019-11-19T12:30:43Z

From our perspective, this is not the right approach. Failed pods are not "not ready" they are "failed". Currently, any failed pod will be considered not ready, independent of the reason for failure (could be any exit code != 0). This is especially annoying in a dev environment where a lot of devs are firing pods (or in bigdata environments batch jobs) against the cluster. All these failed pods are now critical.

I would revert the MR and instead propose a warning level alert that fires whenever a pod failed.

jdbaldry changed the title ~~KubePodNotReady alert does not catch {phase=Failed}~~ KubePodNotReady alert does not catch kube_pod_status_phase{phase=Failed} Jun 13, 2019

jdbaldry mentioned this issue Jun 25, 2019

Add phase=~"Failed" to KubePodNotReady alert #221

Merged

brancz closed this as completed in #221 Jun 25, 2019

pascalwhoop mentioned this issue Nov 19, 2019

revert failed pod marked as NotReady #296

Merged

aslafy-z mentioned this issue Jan 27, 2023

fix: revert failed pods marked as NotReady #821

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KubePodNotReady alert does not catch kube_pod_status_phase{phase=Failed} #215

KubePodNotReady alert does not catch kube_pod_status_phase{phase=Failed} #215

jdbaldry commented Jun 13, 2019 •

edited

Loading

metalmatze commented Jun 24, 2019

pascalwhoop commented Nov 19, 2019 •

edited

Loading

KubePodNotReady alert does not catch kube_pod_status_phase{phase=Failed} #215

KubePodNotReady alert does not catch kube_pod_status_phase{phase=Failed} #215

Comments

jdbaldry commented Jun 13, 2019 • edited Loading

metalmatze commented Jun 24, 2019

pascalwhoop commented Nov 19, 2019 • edited Loading

jdbaldry commented Jun 13, 2019 •

edited

Loading

pascalwhoop commented Nov 19, 2019 •

edited

Loading