Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KubePodNotReady alert does not catch kube_pod_status_phase{phase=Failed} #215

Closed
jdbaldry opened this issue Jun 13, 2019 · 2 comments · Fixed by #221
Closed

KubePodNotReady alert does not catch kube_pod_status_phase{phase=Failed} #215

jdbaldry opened this issue Jun 13, 2019 · 2 comments · Fixed by #221

Comments

@jdbaldry
Copy link
Contributor

jdbaldry commented Jun 13, 2019

Evicted pods can be in phase failed and I would expect this alert to catch this.

Happy to submit a PR to resolve, but wanted to know whether I am right to think that KubePodNotReady should catch the {phase=Failed} or of there should be a new alert KubePodFailed or similar?

Current alert:

{
            expr: |||
              sum by (namespace, pod) (kube_pod_status_phase{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s, phase=~"Pending|Unknown"}) > 0
            ||| % $._config,
            labels: {
              severity: 'critical',
            },
            annotations: {
              message: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than an hour.',
            },
            'for': '1h',
            alert: 'KubePodNotReady',
},

There is also a KubeJobFailed alert that may influence the decision for how to manage Failed pods.

{
            alert: 'KubeJobFailed',
            expr: |||
              kube_job_status_failed{%(prefixedNamespaceSelector)s%(kubeStateMetricsSelector)s}  > 0
            ||| % $._config,
            'for': '1h',
            labels: {
              severity: 'warning',
            },
            annotations: {
              message: 'Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.',
            },
},
@jdbaldry jdbaldry changed the title KubePodNotReady alert does not catch {phase=Failed} KubePodNotReady alert does not catch kube_pod_status_phase{phase=Failed} Jun 13, 2019
@metalmatze
Copy link
Member

Happy to adjust the KubePodNotReady alert. Do you want to send a PR?

@pascalwhoop
Copy link
Contributor

pascalwhoop commented Nov 19, 2019

From our perspective, this is not the right approach. Failed pods are not "not ready" they are "failed". Currently, any failed pod will be considered not ready, independent of the reason for failure (could be any exit code != 0). This is especially annoying in a dev environment where a lot of devs are firing pods (or in bigdata environments batch jobs) against the cluster. All these failed pods are now critical.

I would revert the MR and instead propose a warning level alert that fires whenever a pod failed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants