Add kube_pod_container_status_last_terminated_reason #535
Conversation
/assign @andyxning |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andyxning, jutley The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thanks for following up with this @jutley! 👍 |
I can see that #535 was merged, but there has been no release since August. Would it be OK to cut a release of kube-state-metrics soon? |
We are soon going to start cutting pre-releases of the new release. |
Hi @brancz, any updates on the new release? This new metric will be very handy to us! Thanks, Andre |
#577 should be the last thing needed before we can cut the first alpha release.
|
I've used the alpha version of 1.5.0 to try this out. Whilst it is useful for alerting on a single occurrence of a particular reason, e.g. OOMKilled, it makes it very difficult to do anything more complex like "only alert if there have been 2 OOMKills in the space of an hour". Sometimes a single kill is a valid scenario, so it isn't worth alerting on. To do this cleanly, there needs to be a restart counter per reason rather than a simple flag as there is now. It is possible to do more complex queries by multiplying the difference in restart count by the last reason flag (see the sketch below), but it makes the PromQL a lot trickier in general. Just my 2c |
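For reference, the multiplication workaround mentioned above might look roughly like this; this is only a sketch, assuming the standard kube-state-metrics metric names and that the reason label is the only label the two series do not share:

```
# Restart delta over the last hour, multiplied by the OOMKilled "last reason" flag (0 or 1).
# ignoring (reason) drops the extra label so the two vectors can be matched one-to-one.
increase(kube_pod_container_status_restarts_total[1h])
  * ignoring (reason)
    kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
>= 2
```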
@aiman-alsari I agree that having counts for each error type could be useful. Unfortunately, the Kubernetes API does not provide that type of information. All it offers is 1) current state, 2) last state, and 3) error count. Since this project creates Prometheus metrics that reflect data from the Kubernetes API, I don't think this suggestion fits within the project scope. If you'd like, you can still file an issue. I don't think posting on this PR will get much more attention since it has already been merged and is nearly released. As an alternative, you can create an alert that looks like:
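A sketch of one such expression, assuming the standard kube_pod_container_status_restarts_total counter and matching the two metrics on their shared namespace, pod, and container labels:

```
# At least 2 restarts in the last hour ...
increase(kube_pod_container_status_restarts_total[1h]) >= 2
# ... and the most recent termination reason is OOMKilled.
and on (namespace, pod, container)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```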
This will fire if there have been at least 2 restarts in the last hour, and the last_terminated_reason is OOMKilled. It's not perfect, but will probably get you what you want in most cases. If there are other regular causes for the container to terminate, you can change the latter expression to:
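One way that latter expression could look, again as a sketch under the same label assumptions, is to require the flag to have held for the whole window with min_over_time:

```
increase(kube_pod_container_status_restarts_total[1h]) >= 2
and on (namespace, pod, container)
# Require the OOMKilled flag to have been set for the entire hour.
min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[1h]) == 1
```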
This will cause the alert to only fire if the last terminated reason for the last hour is OOMKilled. It will take a minimum of an hour to fire, so in most cases I would probably prefer the first example. |
@jutley Am I right in thinking that to get the points to match up neatly between the |
I'm able to get almost what I need with this metric, but does anyone know if there is a metric that displays the actual container exit code? |
@EvilCreamsicle it is unlikely that kube-state-metrics will expose detail at that fine a granularity, as it would cause too much churn in time-series data. |
Only one remark: the two expressions return different results in terms of their labels, so Prometheus cannot match the two vectors produced by those expressions. I solved the issue with the ignoring keyword (see the sketch below).
Here is the reference link: https://prometheus.io/docs/prometheus/latest/querying/operators/#one-to-one-vector-matches |
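For example, applying ignoring to the earlier alert sketch, assuming the mismatch is only the extra reason label on kube_pod_container_status_last_terminated_reason:

```
increase(kube_pod_container_status_restarts_total[1h]) >= 2
# ignoring (reason) drops the extra reason label on the right-hand side so the vectors can match.
and ignoring (reason)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```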
I think, it is better Experimentally found that when using |
@Zubant The second vector is not ignoring the |
What this PR does / why we need it:
This PR introduces a new metric: kube_pod_container_status_last_terminated_reason. Currently, we have kube_pod_container_status_terminated_reason, but this always returns to 0 once a container starts back up. This means that we will only have a couple of data points, if any at all, around the reason for a container termination. As a result, we cannot alert when a container crashes for a specific reason (we'd like to alert based on OOMs). This is brought up in this issue: #344
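For instance, a minimal alert expression using the new metric might look like the following sketch; it fires whenever a container's most recently terminated state has reason OOMKilled:

```
# The metric exposes a value of 1 for the reason of the container's last terminated state.
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```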
Which issue(s) this PR fixes:
Fixes #344