Add kube_pod_container_status_last_terminated_reason #535
Conversation
/assign @andyxning |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andyxning, jutley The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thanks for following up with this @jutley! 👍 |
I can see that #535 was merged, but there has been no release since August. Would it be OK to cut a release of kube-state-metrics soon? |
We are soon going to start cutting pre-releases of the new release. |
Hi @brancz, any updates on the new release? This new metric will be very handy to us! Thanks, Andre |
#577 should be the last thing needed before we can cut the first alpha release.
|
I've used the alpha version of 1.5.0 to try this out. Whilst it is useful for alerting on a single occurrence of a particular reason, e.g. OOMKilled, it makes it very difficult to do anything more complex like "only alert if there have been 2 OOMKills in the space of an hour". Sometimes a single kill is a valid scenario, so it isn't worth alerting on. To do this cleanly, there needs to be a restart counter per reason rather than a simple flag as there is now. It is possible to do more complex queries by multiplying the difference in restart count by the last reason flag (see the sketch below), but it makes the PromQL a lot trickier in general. Just my 2c |
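For reference, the multiplication workaround mentioned above might look roughly like this; this is only a sketch, assuming the standard kube-state-metrics metric names and that the reason label is the only label the two series do not share:

```
# Restart delta over the last hour, multiplied by the OOMKilled "last reason" flag (0 or 1).
# ignoring (reason) drops the extra label so the two vectors can be matched one-to-one.
increase(kube_pod_container_status_restarts_total[1h])
  * ignoring (reason)
    kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
>= 2
```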
@aiman-alsari I agree that having counts for each error type could be useful. Unfortunately, the Kubernetes API does not provide that type of information. All it offers is 1) current state, 2) last state, and 3) error count. Since this project creates Prometheus metrics that reflect data from the Kubernetes API, I don't think this suggestion fits within the project scope. If you'd like, you can still file an issue. I don't think posting on this PR will get much more attention since it has already been merged and is nearly released. As an alternative, you can create an alert that looks like:
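A sketch of one such expression, assuming the standard kube_pod_container_status_restarts_total counter and matching the two metrics on their shared namespace, pod, and container labels:

```
# At least 2 restarts in the last hour ...
increase(kube_pod_container_status_restarts_total[1h]) >= 2
# ... and the most recent termination reason is OOMKilled.
and on (namespace, pod, container)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```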
This will fire if there have been at least 2 restarts in the last hour, and the last_terminated_reason is OOMKilled. It's not perfect, but will probably get you what you want in most cases. If there are other regular causes for the container to terminate, you can change the latter expression to:
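One way that latter expression could look, again as a sketch under the same label assumptions, is to require the flag to have held for the whole window with min_over_time:

```
increase(kube_pod_container_status_restarts_total[1h]) >= 2
and on (namespace, pod, container)
# Require the OOMKilled flag to have been set for the entire hour.
min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[1h]) == 1
```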
This will cause the alert to only fire if the last terminated reason for the last hour is OOMKilled. It will take a minimum of an hour to fire, so in most cases I would probably prefer the first example. |
@jutley Am I right in thinking that to get the points to match up neatly between the |
I'm able to get almost what I need with this metric, but does anyone know if there is a metric that displays the actual container exit code? |
@EvilCreamsicle it is unlikely that kube-state-metrics will expose detail at that fine a granularity, as it would cause too much churn in time-series data. |
Only one remark: the two expressions return different results in terms of their labels, so Prometheus cannot match the two vectors produced by those expressions. I solved the issue with the ignoring keyword (see the sketch below).
Here is the reference link: https://prometheus.io/docs/prometheus/latest/querying/operators/#one-to-one-vector-matches |
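For example, applying ignoring to the earlier alert sketch, assuming the mismatch is only the extra reason label on kube_pod_container_status_last_terminated_reason:

```
increase(kube_pod_container_status_restarts_total[1h]) >= 2
# ignoring (reason) drops the extra reason label on the right-hand side so the vectors can match.
and ignoring (reason)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```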
I think, it is better Experimentally found that when using |
@Zubant The second vector is not ignoring the |
What this PR does / why we need it:
This PR introduces a new metric: kube_pod_container_status_last_terminated_reason. Currently, we have kube_pod_container_status_terminated_reason, but this always returns to 0 once a container starts back up. This means that we will only have a couple of data points, if any at all, around the reason for a container termination. As a result, we cannot alert when a container crashes for a specific reason (we'd like to alert based on OOMs). This is brought up in this issue: #344
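For instance, a minimal alert expression using the new metric might look like the following sketch; it fires whenever a container's most recently terminated state has reason OOMKilled:

```
# The metric exposes a value of 1 for the reason of the container's last terminated state.
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```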
Which issue(s) this PR fixes:
Fixes #344