
Duration of kube_pod_container_status_terminated_reason metrics #344

Closed
kutsav opened this issue Jan 15, 2018 · 16 comments · Fixed by #535

kutsav commented Jan 15, 2018

I understand that this metric keeps track of pods that were terminated and the reason for it, but for how long does the metric store data? For example, will the data cover the last two days or the last 12 hours?

brancz commented Jan 15, 2018

Prometheus metrics only ever reflect the current state. In this case, the metric reflects whatever the Pod object returned by the Kubernetes API says. The possible values for the reason label are OOMKilled|Error|Completed|ContainerCannotRun. Note that a series is always exported for every one of those reasons, but only the series with the value 1 reflects the current state.
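For illustration, the series exported for a single container might look something like this (the pod and container labels are made up for the example); a series exists for every reason, but only one of them carries the value 1:

    kube_pod_container_status_terminated_reason{container="my-app", pod="my-app-1234", reason="OOMKilled"} 1
    kube_pod_container_status_terminated_reason{container="my-app", pod="my-app-1234", reason="Error"} 0
    kube_pod_container_status_terminated_reason{container="my-app", pod="my-app-1234", reason="Completed"} 0
    kube_pod_container_status_terminated_reason{container="my-app", pod="my-app-1234", reason="ContainerCannotRun"} 0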

mheggeseth commented Jan 23, 2018

I find it really unlikely for a Prometheus scrape to catch a pod container while its current state is terminated. It's much more likely that lastState is terminated while the current state is either running or waiting.

jonaz commented Jan 24, 2018

I agree here. I'm trying to create an alert for when pods are OOMKilled, but a container is only in that state for milliseconds, so kube_pod_container_status_terminated_reason{reason="OOMKilled"} > 0 does not work. Is there a way I can check lastState instead? @brancz

brancz commented Jan 24, 2018

Yeah, we can absolutely add a metric that exposes the last state.

jonaz commented Jan 29, 2018

Is anyone working on this? We need it to send an alert to the correct team when their app OOMs in production...

I could give it a try if no one is already planning to add lastState.

What would be a good Prometheus metric structure for this?

    Last State:		Terminated
      Reason:		OOMKilled
      Exit Code:	137
      Started:		Mon, 29 Jan 2018 08:39:12 +0100
      Finished:		Mon, 29 Jan 2018 14:17:52 +0100

  • Using a 0/1 gauge (see the sketch after this list)
  • Using Finished as a Unix-timestamp gauge
  • Keeping local state and incrementing a counter whenever Finished changes
  • Something smart...
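For what it's worth, a minimal sketch of the 0/1-gauge option, using the metric name proposed later in this thread (the labels are illustrative, not necessarily the exact set kube-state-metrics would export):

    # HELP kube_pod_container_status_last_terminated_reason Describes the last reason the container was in terminated state.
    # TYPE kube_pod_container_status_last_terminated_reason gauge
    kube_pod_container_status_last_terminated_reason{namespace="production", pod="my-app-1234", container="my-app", reason="OOMKilled"} 1

This mirrors the existing kube_pod_container_status_terminated_reason metric but reads the reason from lastState instead of the current state, so it remains visible long after the container has restarted.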

andredantasrocha commented Apr 26, 2018

Hey guys, any updates on that? I am trying to implement a very similar alert... 😄

benoneb commented May 7, 2018

Hi guys, has no one solved this yet?

@ivelichkovich

You can turn it into a time series and use sum_over_time. For example, to catch whether anything has been OOMKilled in the last hour:
sum_over_time(kube_pod_container_status_terminated_reason{reason="OOMKilled",namespace="core-teams-dev"}[1h])
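A minimal sketch of how that could be turned into an alert expression (the grouping labels and the 1h window are assumptions about what you want to alert on):

    sum by (namespace, pod, container) (
      sum_over_time(kube_pod_container_status_terminated_reason{reason="OOMKilled"}[1h])
    ) > 0

Note this still relies on at least one scrape catching the terminated state, which, as mentioned above, is not guaranteed.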

akram commented Jun 3, 2018

+1

Where can we get this information in a durable manner?

@dingobaby

+1
Also very interested in being able to find the last state reason. Additionally, is there a reason why the kube_pod_container_status_* metrics don't include the node label? Being able to track whether particular nodes have higher-than-average pod failures is also something I'm interested in.

brancz commented Jul 19, 2018

@dingobaby you can join the kube_pod_info metric onto any other Pod metric and add all of that additional meta information that way. If that's too tedious to type, I recommend just writing a recording rule for it.
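A sketch of the kind of join described here, pulling the node label from kube_pod_info onto a container status metric (exact label names can vary between kube-state-metrics versions):

    kube_pod_container_status_terminated_reason{reason="OOMKilled"}
      * on (namespace, pod) group_left (node)
    kube_pod_info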

3h4x commented Aug 10, 2018

I would love to monitor OOMs with kube-state-metrics, but in its current state that's impossible.
🙏

@svitlanacs

I would like to add one more vote on this thread.

Here are the tasks I'd like to achieve:
(1) alert whenever a certain pod goes into an error state (either via waiting_reason or terminated_reason),
(2) build out a timestamp-based historical view of what was happening with a particular pod:
Container Creating -> ErrImagePull -> Running -> Terminating -> Running

While the last state will help with task (1), is there anything we could do to enable task (2) via kube-state-metrics as well?
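For reference, task (1) can already be approximated with the existing metrics; a rough sketch of such an alert expression (the reason regexes are assumptions about which states count as errors):

    kube_pod_container_status_waiting_reason{reason=~"CrashLoopBackOff|ErrImagePull|ImagePullBackOff"} == 1
      or
    kube_pod_container_status_terminated_reason{reason=~"OOMKilled|Error|ContainerCannotRun"} == 1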

mxinden commented Aug 23, 2018

I guess one could introduce a kube_pod_container_status_last_terminated_reason metric. It seems like @brancz is fine with accepting a pull request for this.

For anyone wanting to tackle this, I am happy to help. Changes would need to touch pod.go. cs.LastTerminationState should give you everything you need.

benoittgt commented Jun 8, 2022

Hello

I am curious: how do you monitor this metric? I use changes at the moment, but it doesn't properly report the first seen terminated reason.

sum by(container, reason) (changes(kube_pod_container_status_last_terminated_reason{container="xxxx"}[$__rate_interval])) > 0 


@CharlieC3

@benoittgt I was able to get your query to work as expected by just removing the > 0 at the end:
sum by(container, reason) (changes(kube_pod_container_status_last_terminated_reason{container="xxxx"}[$__rate_interval]))
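Another pattern that avoids counting changes() is to pair the restart counter with the last terminated reason, so the alert fires when a container has restarted recently and its last termination was an OOM kill (the 1h window and the matching labels are assumptions):

    increase(kube_pod_container_status_restarts_total{container="xxxx"}[1h]) > 0
      and on (namespace, pod, container)
    kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1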
