Log something about OOMKilled containers #69676

Open
sylr opened this issue Oct 11, 2018 · 84 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@sylr
Contributor

sylr commented Oct 11, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

What happened:

Container gets killed because it tries to use more memory than allowed.

What you expected to happen:

Have an OOMKilled event tied to the pod and logs about this

/sig node

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 11, 2018
@sylr
Contributor Author

sylr commented Oct 11, 2018

@dims
Member

dims commented Nov 14, 2018

Related discussion in the sig-scheduling Slack: https://kubernetes.slack.com/messages/C09TP78DV/

@sylr
Contributor Author

sylr commented Nov 14, 2018

https://kubernetes.slack.com/archives/C09TP78DV/p1539255149000100 to be exact, but it might soon fall outside the history allowed by Slack's free tier.

@nikopen
Contributor

nikopen commented Nov 15, 2018

FYI, there are related Prometheus metrics that can surface OOMKilled:
kube_pod_container_status_terminated_reason and kube_pod_container_status_last_terminated_reason

https://github.com/kubernetes/kube-state-metrics/blob/master/Documentation/pod-metrics.md

reason=<OOMKilled|Error|Completed|ContainerCannotRun>

It might be a bit tricky to implement as an alert; kube_pod_container_status_last_terminated_reason was created to capture the previous status, because an OOMKilled termination can be so quick that it is never scraped.
kubernetes/kube-state-metrics#344

Still, it would be nice if a log line were printed by k8s by default.

@nikopen
Contributor

nikopen commented Nov 15, 2018

kube_pod_container_status_last_terminated_reason is not yet released in any version; I'll ping to see if they can cut a release.

@anderson4u2

@nikopen Hi, I just tried kube_pod_container_status_last_terminated_reason in version 1.4.0 and it seems to work? i.e. use sum_over_time(kube_pod_container_status_terminated_reason{reason!="Completed"}[5m]) > 0 to detect a recent non-graceful termination.

@nikopen
Contributor

nikopen commented Nov 19, 2018

Indeed, it seems to work :)

@brancz do you know why this happens? I also tried it in 1.3.1.

    - alert: OOMKilled
      expr: sum_over_time(kube_pod_container_status_terminated_reason{reason="OOMKilled"}[5m]) > 0
      for: 1m
      labels:
        severity: warning
      annotations:
        description:  Pod {{$labels.pod}} in {{$labels.namespace}} got OOMKilled

@brancz
Member

brancz commented Nov 20, 2018

Because the kube_pod_container_status_terminated_reason metric was only introduced in 1.4: https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.4.0-rc.0

@nikopen
Contributor

nikopen commented Nov 20, 2018

@brancz my test is in 1.3.1 and kube_pod_container_status_terminated_reason works; was it already there, perhaps?

@brancz
Member

brancz commented Nov 20, 2018

Strange, you're right, looks like we have a mistake in the changelog. The code seems to be there in 1.3.1: https://github.com/kubernetes/kube-state-metrics/blob/v1.3.1/collectors/pod.go#L125

@boosty

boosty commented Nov 22, 2018

@anderson4u2 I am a bit confused by your last comment. You wrote:

just tried kube_pod_container_status_last_terminated_reason in version 1.4.0

But in the example below you use kube_pod_container_status_terminated_reason, not kube_pod_container_status_last_terminated_reason.

So as far as I see, the new (very useful) metric kube_pod_container_status_last_terminated_reason is still unreleased.

@BrianChristie

BrianChristie commented Nov 28, 2018

This has been discussed in #sig-instrumentation on Slack and was brought up on the sig-node call yesterday to determine a path forward.

There are two requests:

  1. Have an OOMKilled event tied to the Pod (as noted by @sylr)
  2. Have a count of termination reason by Pod in the Kubelet (or cAdvisor?), exposed to Prometheus as a monotonically increasing counter

To summarize what's currently available in kube-state-metrics:

  • kube_pod_container_status_terminated_reason
    This is a (binary) gauge which has a value of 1 for the current reason, and 0 for all other reasons. As soon as the Pod restarts, all reasons go to 0.

  • kube_pod_container_status_last_terminated_reason
    Same as above for the prior reason, so it's available after the Pod restarts.

  • kube_pod_container_status_restarts_total
    A count of the restarts, with no detail on the reason.

The issues are:

  1. There is no way to get a count of the reasons over time (for alerting and debugging).
  2. Some termination reasons will never be recorded by Prometheus when the reason changes before the next Prometheus scrape.

For example, given a Pod that is sometimes being OOMKilled, and sometimes crashing, it's desired to be able to view the historical termination reasons over time.

As a note: it was discussed, and it appears the design of kube-state-metrics prevents aggregating the reason gauge into counters, so it's preferred that this happen at the source.

Implementing both of the above requests will significantly improve the ability of cluster-users and monitoring vendors to debug when Pods are failing.

Can @kubernetes/sig-node-feature-requests provide some guidance on the next steps here?

CC: @dchen1107

@dims
Member

dims commented Nov 28, 2018

long-term-issue (note to self)

@brancz
Member

brancz commented Nov 28, 2018

cc @kubernetes/sig-node-feature-requests

@WanLinghao
Contributor

/cc

@hypnoglow
Contributor

This query combines container restart and termination reason:

sum by (pod, container, reason) (kube_pod_container_status_last_terminated_reason{})
* on (pod,container) group_left
sum by (pod, container) (changes(kube_pod_container_status_restarts_total{}[1m]))
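
As a minimal sketch (alert name, time window, and severity are illustrative), the same expression can be wrapped in an alerting rule so it only fires when a container actually restarted with OOMKilled as its last termination reason:

    - alert: ContainerLastTerminatedOOMKilled
      expr: |
        sum by (pod, container) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"})
        * on (pod, container) group_left
        sum by (pod, container) (changes(kube_pod_container_status_restarts_total{}[5m])) > 0
      labels:
        severity: warning
      annotations:
        description: Container {{$labels.container}} in pod {{$labels.pod}} was recently OOMKilled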

@boosty

boosty commented Jan 19, 2019

Our team came up with a custom controller to implement the idea of having an OOMKilled event tied to the Pod. Please find it here: https://github.com/xing/kubernetes-oom-event-generator

From the README:
The Controller listens to the Kubernetes API for "Container Started" events and searches for those claiming they were OOMKilled previously. For matching ones an Event is generated as Warning with the reason PreviousContainerWasOOMKilled.

We would be very happy to get feedback on it.
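
For anyone trying it out, the emitted events should be queryable with kubectl; the reason string is the one from the README above (a sketch, exact flags may vary by version):

    kubectl get events --all-namespaces --field-selector reason=PreviousContainerWasOOMKilled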

@bjhaid
Contributor

bjhaid commented Jan 22, 2019

Meta point: there have been several times where OOM events showed up in dmesg but cAdvisor missed them, so they did not end up as events...

@brancz
Member

brancz commented Jan 23, 2019

@bjhaid fwiw you can use mtail against dmesg to produce metrics about oomkill messages.
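
For reference, a minimal mtail program along those lines might look like this (a sketch, assuming the kernel log is fed to mtail as a log source; the metric name is illustrative):

    # oomkills.mtail: count kernel OOM kill messages
    counter oom_kills_total

    /Killed process \d+/ {
      oom_kills_total++
    }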

@bjhaid
Contributor

bjhaid commented Jan 23, 2019

@brancz thanks! I just wanted to point out that cadvisor misses OOM events and that should be considered when relying on it...

@brancz
Member

brancz commented Jan 23, 2019

I see. If it hasn't been already, that should be reported to cAdvisor :)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@dgrisonnet
Member

Is this still relevant after #108004? It seems to me that it is covering the gaps kube-state-metrics has with OOMKilled events.

@lukeschlather

lukeschlather commented Feb 5, 2023

The problem here is that a pod can disappear and there's no record of why. A metric is useful in that it lets you know something is wrong, but it doesn't actually tell you what is wrong. K8s shouldn't be killing pods without leaving, in an obvious place, a record of which pod it killed and why.

@dims
Member

dims commented Feb 5, 2023

@lukeschlather for the record, the kernel kills pods, not k8s. that's the whole problem with this issue :(

please google for ( "oom kill kernel" )

@lukeschlather

Are the memory requests and limits just cgroups under the hood?

@dims
Member

dims commented Feb 6, 2023

https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#how-pods-with-resource-limits-are-run

@lukeschlather ⬆️
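
In short, yes: the container's memory limit from the Pod spec is written into the container's cgroup by the runtime, and it is the kernel's cgroup OOM killer that enforces it. A rough sketch of the mapping (exact file names depend on cgroup v1 vs v2 and the runtime):

    resources:
      limits:
        memory: "256Mi"
        # ends up as the container cgroup's memory limit:
        #   cgroup v1: memory.limit_in_bytes = 268435456
        #   cgroup v2: memory.max            = 268435456
        # exceeding it makes the kernel OOM-kill a process in that cgroup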

@michelefa1988

Following, same issue here

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 13, 2023
@pierluigilenoci

/remove-lifecycle stale

@lukeschlather

In GKE, often the only evidence I can find that there's been a problem is a log like

Memory cgroup out of memory: Killed process 1472246 (node) total-vm:22290752kB, anon-rss:1000160kB, file-rss:47604kB, shmem-rss:0kB, UID:0 pgtables:17840kB oom_score_adj:985

Looking at the Kubernetes side I can see that the pod was restarted, and I'm sure it was restarted because the process was OOM-killed. But even in the metrics, I think it was killed before any metric could be created showing that it was using a concerning amount of memory. Can that "Memory cgroup out of memory" message be modified to emit a log that gives the actual pod name? Or does k8s need to look for "Memory cgroup out of memory" events? Or is there some other piece of data I'm missing that makes it clear why that specific container was restarted?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 17, 2024
@pierluigilenoci

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 19, 2024
@dgrisonnet
Member

I briefly looked through cadvisor's code responsible for reporting the metric and it is looking directly into /dev/kmsg to surface the OOM Kill events performed by the Kernel oom killer https://github.com/google/cadvisor/blob/master/utils/oomparser/oomparser.go.

As far as I can tell it should catch all the OOM kill events, so maybe you are facing a corner case? The parser relies on a couple of regexes that only work on Linux 5.0+ https://github.com/google/cadvisor/blob/master/utils/oomparser/oomparser.go#L30-L33, so it would be worth checking your kernel logs and verifying if and how the log message was produced there.
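
For example, a quick way to check (commands are illustrative; any kernel log viewer works):

    dmesg -T | grep -iE 'out of memory|killed process'
    # or, on systemd hosts:
    journalctl -k | grep -iE 'out of memory|killed process'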

@dgrisonnet
Member

dgrisonnet commented Apr 19, 2024

A metric is useful in that it lets you know something is wrong but it doesn't actually tell you what is wrong.

A log wouldn't point you to the culprit either; the OOM killer only knows about processes' oom scores and makes its decisions based on those when memory is overloaded.

If you want to know why the memory usage increased to the point that some processes had to be killed, there are other metrics you can look at. For example the container-level memory usage metrics can tell you which pods on a particular node had a spike in memory utilization.

The goal of the OOM killed metric is to tell you exactly which container got OOM killed in case you want to know what happened to your containers, nothing more. But there are other metrics that are meant to answer your other questions.
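
For completeness, a couple of PromQL sketches along those lines (assuming the per-container OOM counter added by #108004 is exposed as container_oom_events_total and that cAdvisor's container_memory_working_set_bytes is scraped; adjust the metric names to whatever your setup actually exposes):

    # containers that were OOM killed in the last hour
    increase(container_oom_events_total[1h]) > 0

    # containers with the highest memory usage leading up to an incident
    topk(5, max_over_time(container_memory_working_set_bytes{container!=""}[15m]))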

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 18, 2024
@frittentheke

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 18, 2024
@dgrisonnet
Member

@frittentheke are you seeing any gaps in how we report OOMKill that should still be covered by Kubernetes?

Based on my previous comments, I would be inclined to close this issue as fixed by #108004.

@frittentheke

frittentheke commented Jul 25, 2024

@frittentheke are you seeing any gaps in how we report OOMKill that should still be covered by Kubernetes?
Based on my previous comments, I would be inclined to close this issue as fixed by #108004.

Thanks a bunch for asking @dgrisonnet!

I believe the OP (@sylr) originally was asking for an event to be "logged" or rather emitted by the Kubelet:

What you expected to happen:
Have an OOMKilled event tied to the pod and logs about this

I somehow believe this is not an entirely unreasonable idea. Going OOM is quite an event for a container ;-)

But the request / idea was likely due to the reasons behind the issue kubernetes/kube-state-metrics#2153 and the ephemeral nature of the last terminated reason (being the source for the kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}).

Yes, I believe #108004 does indeed fix the issue of losing any OOM event by count.

But coming back to the idea of (also) having an event "logged", I believe it's still a good idea NOT to rely only on metrics and counters that would need to be scraped, but to also expose this via the API somehow so it is available for debugging via kubectl. This could be an event, as suggested, or some status field containing the timestamp of the last restart for each reason, maybe lastRestart_{REASON}.

Currently the Pod resource only contains the current state and lastState. Any information about an earlier incarnation that experienced an OOM event is lost and not available directly via the API.

  containerStatuses:                                                                                                                                                                                                                  
  - containerID: containerd://e667b1666416ab5a463bd0deb6cfcee36122fe2ff5ea0552cae0ac5f15338941                                                                                                                                        
    image: docker.io/polinux/stress:latest                                                                                                                                                                                            
    imageID: docker.io/polinux/stress@sha256:b6144f84f9c15dac80deb48d3a646b55c7043ab1d83ea0a697c09097aaad21aa                                                                                                                         
    lastState:                                                                                                                                                                                                                        
      terminated:                                                                                                                                                                                                                     
        containerID: containerd://7fb5dd4e73d6412472f9f1db5f18a3bec289a50b675922c63fcd68b1e2959e16                                                                                                                                    
        exitCode: 137                                                                                                                                                                                                                 
        finishedAt: "2024-07-25T16:42:47Z"                                                                                                                                                                                            
        reason: OOMKilled                                                                                                                                                                                                             
        startedAt: "2024-07-25T16:42:47Z"                                                                                                                                                                                             
    name: memory-demo-ctr                                                                                                                                                                                                             
    ready: false                                                                                                                                                                                                                      
    restartCount: 2                                                                                                                                                                                                                   
    started: false                                                                                                                                                                                                                    
    state:                                                                                                                                                                                                                            
      terminated:                                                                                                                                                                                                                     
        containerID: containerd://e667b1666416ab5a463bd0deb6cfcee36122fe2ff5ea0552cae0ac5f15338941                                                                                                                                    
        exitCode: 137                                                                                                                                                                                                                 
        finishedAt: "2024-07-25T16:43:02Z"                                                                                                                                                                                            
        reason: OOMKilled                                                                                                                                                                                                             
        startedAt: "2024-07-25T16:43:02Z"        
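
For what it's worth, the last-known reason can at least be pulled straight from the API today, e.g. (the pod name here is illustrative, matching the example above):

    kubectl get pod memory-demo -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

but, as described above, that history is only one incarnation deep.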

@kevinnoel-be

Or maybe it doesn't really (I didn't actually check, just following the links): google/cadvisor#3015

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 23, 2024
@frittentheke

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 23, 2024