
Duration of kube_pod_container_status_terminated_reason metrics #344

Closed
kutsav opened this issue Jan 15, 2018 · 16 comments · Fixed by #535

kutsav commented Jan 15, 2018

I understand that this metric keeps track of pods that were terminated and the reason for it, but for how long does the metric store data? For example, will the data cover the last two days or the last 12 hours?

brancz commented Jan 15, 2018

Prometheus metrics only ever reflect the current state. In this case, the metric reflects whatever the Pod object returned by the Kubernetes API says. The possible values for the reason label are OOMKilled|Error|Completed|ContainerCannotRun. Note that a series is always exported for every one of those reasons, but only the series with the value 1 reflects the current state.
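For illustration, the series exported for a single container might look something like this (the pod and container labels are made up for the example); a series exists for every reason, but only one of them carries the value 1:

    kube_pod_container_status_terminated_reason{container="my-app", pod="my-app-1234", reason="OOMKilled"} 1
    kube_pod_container_status_terminated_reason{container="my-app", pod="my-app-1234", reason="Error"} 0
    kube_pod_container_status_terminated_reason{container="my-app", pod="my-app-1234", reason="Completed"} 0
    kube_pod_container_status_terminated_reason{container="my-app", pod="my-app-1234", reason="ContainerCannotRun"} 0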

mheggeseth commented Jan 23, 2018

I find it really unlikely for a Prometheus scrape to catch a pod container while its current state is terminated. It's much more likely that lastState is terminated while the current state is either running or waiting.

jonaz commented Jan 24, 2018

I agree here. I'm trying to create an alert for when pods are OOMKilled, but a container is only in that state for milliseconds, so kube_pod_container_status_terminated_reason{reason="OOMKilled"} > 0 does not work. Is there a way I can check lastState instead? @brancz

brancz commented Jan 24, 2018

Yeah, we can absolutely add a metric that exposes the last state.

jonaz commented Jan 29, 2018

Is anyone working on this? We need it to send an alert to the correct team when their app OOMs in production...

I could give it a try if no one is already planning to add lastState.

What would be a good Prometheus metric structure for this?

    Last State:		Terminated
      Reason:		OOMKilled
      Exit Code:	137
      Started:		Mon, 29 Jan 2018 08:39:12 +0100
      Finished:		Mon, 29 Jan 2018 14:17:52 +0100

  • Using a 0/1 gauge (see the sketch after this list)
  • Using Finished as a Unix-timestamp gauge
  • Keeping local state and incrementing a counter whenever Finished changes
  • Something smart...
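For what it's worth, a minimal sketch of the 0/1-gauge option, using the metric name proposed later in this thread (the labels are illustrative, not necessarily the exact set kube-state-metrics would export):

    # HELP kube_pod_container_status_last_terminated_reason Describes the last reason the container was in terminated state.
    # TYPE kube_pod_container_status_last_terminated_reason gauge
    kube_pod_container_status_last_terminated_reason{namespace="production", pod="my-app-1234", container="my-app", reason="OOMKilled"} 1

This mirrors the existing kube_pod_container_status_terminated_reason metric but reads the reason from lastState instead of the current state, so it remains visible long after the container has restarted.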

andredantasrocha commented Apr 26, 2018

Hey guys, any updates on that? I am trying to implement a very similar alert... 😄

benoneb commented May 7, 2018

Hi guys, has no one solved this yet?

@ivelichkovich

You can turn it into a time series and use sum_over_time. For example, to catch whether anything has been OOMKilled in the last hour:
sum_over_time(kube_pod_container_status_terminated_reason{reason="OOMKilled",namespace="core-teams-dev"}[1h])
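A minimal sketch of how that could be turned into an alert expression (the grouping labels and the 1h window are assumptions about what you want to alert on):

    sum by (namespace, pod, container) (
      sum_over_time(kube_pod_container_status_terminated_reason{reason="OOMKilled"}[1h])
    ) > 0

Note this still relies on at least one scrape catching the terminated state, which, as mentioned above, is not guaranteed.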

akram commented Jun 3, 2018

+1

Where can we get this information in a durable manner?

@dingobaby

+1
Also very interested in being able to find the last state reason. Additionally, is there a reason why the kube_pod_container_status_* metrics don't include the node label? Being able to track whether particular nodes have higher-than-average pod failures is also something I'm interested in.

brancz commented Jul 19, 2018

@dingobaby you can join the kube_pod_info metric onto any other Pod metric and add all of that additional meta information that way. If that's too tedious to type, I recommend just writing a recording rule for it.
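A sketch of the kind of join described here, pulling the node label from kube_pod_info onto a container status metric (exact label names can vary between kube-state-metrics versions):

    kube_pod_container_status_terminated_reason{reason="OOMKilled"}
      * on (namespace, pod) group_left (node)
    kube_pod_info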

3h4x commented Aug 10, 2018

I would love to monitor OOMs with kube-state-metrics, but in its current state that's impossible.
🙏

@svitlanacs

I would like to add one more vote on this thread.

Here are the tasks I'd like to achieve:
(1) alert whenever a certain pod goes into an error state (either via waiting_reason or terminated_reason),
(2) build out a timestamp-based historical view of what was happening with a particular pod:
Container Creating -> ErrImagePull -> Running -> Terminating -> Running

While the last state will help with task (1), is there anything we could do to enable task (2) via kube-state-metrics as well?
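For reference, task (1) can already be approximated with the existing metrics; a rough sketch of such an alert expression (the reason regexes are assumptions about which states count as errors):

    kube_pod_container_status_waiting_reason{reason=~"CrashLoopBackOff|ErrImagePull|ImagePullBackOff"} == 1
      or
    kube_pod_container_status_terminated_reason{reason=~"OOMKilled|Error|ContainerCannotRun"} == 1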

mxinden commented Aug 23, 2018

I guess one could introduce a kube_pod_container_status_last_terminated_reason metric. It seems like @brancz is fine with accepting a pull request for this.

For anyone wanting to tackle this, I am happy to help. Changes would need to touch pod.go. cs.LastTerminationState should give you everything you need.

benoittgt commented Jun 8, 2022

Hello

I am curious: how do you monitor this metric? I use changes at the moment, but it doesn't properly report the first seen terminated reason.

sum by(container, reason) (changes(kube_pod_container_status_last_terminated_reason{container="xxxx"}[$__rate_interval])) > 0 


@CharlieC3

@benoittgt I was able to get your query to work as expected by just removing the > 0 at the end:
sum by(container, reason) (changes(kube_pod_container_status_last_terminated_reason{container="xxxx"}[$__rate_interval]))
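Another pattern that avoids counting changes() is to pair the restart counter with the last terminated reason, so the alert fires when a container has restarted recently and its last termination was an OOM kill (the 1h window and the matching labels are assumptions):

    increase(kube_pod_container_status_restarts_total{container="xxxx"}[1h]) > 0
      and on (namespace, pod, container)
    kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1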
