
EKS nodes lose readiness when containers exhaust memory #1145

Open
dasrirez opened this issue Jan 9, 2023 · 23 comments

@dasrirez

dasrirez commented Jan 9, 2023

What happened:

When our applications consume too much memory, K8s nodes on EKS clusters lose readiness and become completely inoperable for extended periods of time. This means that instead of being rescheduled immediately, pods remain stuck in a pending state, resulting in noticeable downtime. This does not happen on GKE clusters.

What you expected to happen:

Nodes should never lose readiness; instead, the offending containers should be restarted and/or the pods OOMKilled.

How to reproduce it (as minimally and precisely as possible):

  1. Provision a single node EKS cluster running an EC2 instance type of m5.large.
  2. Apply the following deployment resource.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      annotations:
      labels:
        app: stress-ng
      name: stress-ng
    spec:
      selector:
        matchLabels:
          app: stress-ng
      template:
        metadata:
          labels:
            app: stress-ng
        spec:
          containers:
          - args:
            - -c
            - stress-ng --bigheap 0
            command:
            - /bin/bash
            image: alexeiled/stress-ng:latest-ubuntu
            name: stress-ng
    
  3. Observe the node lose readiness.
    $ k get nodes
    NAME                        STATUS   ROLES    AGE     VERSION
    ip-10-11-0-6.ec2.internal   NotReady   <none>   2m14s   v1.24.7-eks-fb459a0
    

Anything else we need to know?:

We've traced the cause of this problem to kubelet's memory reservation, which is set by the bootstrap script here:

'. += {kubeReserved: {"cpu": $cpu_millicores_to_reserve, "ephemeral-storage": "1Gi", "memory": $mebibytes_to_reserve}}' $KUBELET_CONFIG)" > $KUBELET_CONFIG

This is what kubeReserved is set to by default on this cluster.

  "kubeReserved": {
    "cpu": "70m",
    "ephemeral-storage": "1Gi",
    "memory": "574Mi"
  }

Note the memory reservation; on a GKE cluster this value would be 1.8Gi (https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#eviction_threshold).
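
For a sense of where both numbers come from, here is a quick sketch (assuming the 255 MiB + 11 MiB-per-pod formula from bootstrap.sh and the 29-pod ENI-based limit for m5.large; the GKE tiers are from the page linked above):

# EKS AL2 bootstrap.sh: 255 MiB plus 11 MiB per pod of the max-pods limit
max_pods=29                                        # ENI-based limit for m5.large
echo "$(( 255 + 11 * max_pods ))Mi"                # -> 574Mi

# GKE: 25% of the first 4 GiB plus 20% of the next 4 GiB on an 8 GiB node
echo "$(( 4096 * 25 / 100 + 4096 * 20 / 100 ))Mi"  # -> 1843Mi (~1.8Gi)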

When running with kubeReserved.memory="574Mi", kubelet logs indicate PLEG errors during memory exhaustion and the node loses readiness.

[root@ip-10-11-0-6 ~]# journalctl -u kubelet | grep -i pleg | grep -v SyncLoop | grep -v Generic
Jan 09 20:01:32 ip-10-11-0-6.ec2.internal kubelet[27632]: E0109 20:01:32.726177   27632 kubelet.go:2013] "Skipping pod synchronization" err="PLEG is not healthy: pleg has yet to be successful"
Jan 09 20:06:25 ip-10-11-0-6.ec2.internal kubelet[4077]: E0109 20:06:25.149411    4077 kubelet.go:2013] "Skipping pod synchronization" err="PLEG is not healthy: pleg has yet to be successful"
Jan 09 20:08:01 ip-10-11-0-6.ec2.internal kubelet[12256]: I0109 20:08:01.432398   12256 setters.go:546] "Node became not ready" node="ip-10-11-0-6.ec2.internal" condition={Type:Ready Status:False LastHeartbeatTime:2023-01-09 20:08:01.424711657 +0000 UTC m=+1.288302571 LastTransitionTime:2023-01-09 20:08:01.424711657 +0000 UTC m=+1.288302571 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
Jan 09 20:08:01 ip-10-11-0-6.ec2.internal kubelet[12256]: E0109 20:08:01.635341   12256 kubelet.go:2013] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"

The problem does not occur when the node is bootstrapped with kubeReserved.memory="1.8Gi". It also seems fine with kubeReserved.memory="1Gi", but that value is arbitrary and has not been tested thoroughly.
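
One way to bootstrap a node with a different reservation, without rebuilding the AMI, is to pass it through the bootstrap script's --kubelet-extra-args in the node user data. A minimal sketch (the cluster name and values are placeholders; kubelet flags should take precedence over the generated config file):

#!/bin/bash
# Override the computed kubeReserved memory at boot; adjust values to taste.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--kube-reserved=cpu=70m,memory=1843Mi,ephemeral-storage=1Gi'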

Environment:

  • AWS Region: us-east-1
  • Instance Type(s): m5.large
  • EKS Platform version: eks.3
  • Kubernetes version: 1.24
  • AMI Version: ami-0c84934009677b6d5
  • Kernel: Linux ip-10-11-0-6.ec2.internal 5.4.219-126.411.amzn2.x86_64 #1 SMP Wed Nov 2 17:44:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Release information:
BASE_AMI_ID="ami-0c90e1cacda57dac6"
BUILD_TIME="Sat Nov 12 04:15:25 UTC 2022"
BUILD_KERNEL="5.4.219-126.411.amzn2.x86_64"
ARCH="x86_64"
@dasrirez
Author

dasrirez commented Jan 9, 2023

Noting that the problem occurs in both containerd- and docker-based AMIs.

@cartermckinnon
Member

cartermckinnon commented Jan 9, 2023

Thanks for the detailed issue! We definitely need to revise our kubeReserved values; computing memory as a simple function of $MAX_PODS should probably be revisited.

The GKE values seem fairly conservative to me; reserving ~23% (up from ~7% currently) of the available memory on smaller instance types isn't a change we should make without our own testing.

@dasrirez
Author

Ack, thanks for taking a look at this issue! I think the GKE values can serve as a safe upper bound on the reservations, but they can probably be optimized further, as mentioned. The fact that the node was able to operate with just 1Gi of memory reserved instead of 1.8Gi is a quick hint that there is room to do so.

It looks like the GKE values were used until #419 was merged, but that was a long time ago and was tested on 1.14 clusters. It might be worthwhile to revert to the GKE values as a quick fix until a better model is tested/developed, though I say that without a good understanding of the timeframe for the latter.

@maximethebault

maximethebault commented Jan 10, 2023

Also seeing the issue: #1098

Will close my issue in favor of this more detailed one. It's good to see some attention on this. It's worrisome to see this happen on production workloads when you use all AWS-recommended/default settings and had assumed the EKS + EKS AMI combo would protect you against these kinds of situations thanks to the automated configuration of memory reservation. Even more so after reading the K8s docs about system-reserved memory, kube-reserved memory, soft eviction, and hard eviction: you feel confident there would be multiple barriers to breach before getting into that situation. Yet here we are.

Thanks for looking into this @cartermckinnon!

For what it's worth, here are the settings we ended up using to protect against this (Karpenter allows us to easily add these values when launching nodes):
systemReserved: 300Mi
evictionSoft: memory.available: 3%
evictionHard: memory.available: 2%

These values are completely empirical and far from scientifically determined, nor were they tested on many instance types / architectures / container runtimes / etc., but they did the job for us :)
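
For anyone expressing the same idea through Karpenter's API objects rather than raw kubelet flags, a rough sketch against the v1alpha5 Provisioner is below; the field names and the reading of the 300Mi above as systemReserved memory are assumptions to verify against your Karpenter version:

# Sketch only: other Provisioner fields (requirements, providerRef, ...) omitted.
cat <<'EOF' | kubectl apply -f -
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  kubeletConfiguration:
    systemReserved:
      memory: 300Mi
    evictionSoft:
      memory.available: 3%
    evictionSoftGracePeriod:
      memory.available: 2m
    evictionHard:
      memory.available: 2%
EOF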

@bryanasdev000

Same here, although in my scenario there are many small pods; every time there's an OOM outside a cgroup, the node goes down and generally takes half an hour to come back.

A good default value will be hard to set without sacrificing node capacity, but I think a good step may be to do what @maximethebault did with Karpenter: allow us to configure it via a launch template, similar to choosing containerd instead of docker or setting max pods (which, by the way, we set to 110 with the Cilium CNI).

Related aws/karpenter-provider-aws#1803.

@stevehipwell
Contributor

CC @bwagner5

@davidroth

davidroth commented Feb 28, 2023

Experienced the same problem with t3.medium instances.
Can somebody explain why this happens even though there is reserved memory for kubelet? It feels strange that although there is reserved memory for the system processes, a single pod can easily kill the whole node.

@stevehipwell
Contributor

@davidroth the node logic is tied to the old assumption that you can only run as many pods as there are ENI IPs, which was always slightly incorrect (host-port pods) but with the introduction of IP prefixes is now very much incorrect (in comparison to the accepted values everywhere else). I think this issue is most likely to impact smaller instances due to the under-provisioning of kube reserved. We've used custom node logic limiting all nodes to 110 pods and using the GKE memory calculation, and we haven't seen an issue since. You will also want to make sure your pods are setting valid requests/limits.
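
For anyone wanting to reproduce that approach, here is a rough user-data sketch of the GKE-style memory tiers (the tier percentages are from the GKE docs; the hand-off to bootstrap.sh and the placeholder cluster name are assumptions to adapt to your own setup):

#!/usr/bin/env bash
set -euo pipefail

mem_mib=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)

if (( mem_mib < 1024 )); then
  reserve_mib=255
else
  reserve_mib=0
  remaining=$mem_mib
  # GKE tiers: 25% of the first 4Gi, 20% of the next 4Gi, 10% of the next 8Gi,
  # 6% of the next 112Gi, 2% of anything above 128Gi
  for tier in 4096:25 4096:20 8192:10 114688:6 99999999:2; do
    size=${tier%%:*}; pct=${tier##*:}
    take=$(( remaining < size ? remaining : size ))
    reserve_mib=$(( reserve_mib + take * pct / 100 ))
    remaining=$(( remaining - take ))
    (( remaining > 0 )) || break
  done
fi

# Hand the value to the EKS bootstrap script (cluster name is a placeholder)
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args "--kube-reserved=memory=${reserve_mib}Mi"

For an 8 GiB m5.large this works out to 1843Mi, matching the 1.8Gi figure discussed above.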

@schmee-hg

schmee-hg commented Mar 3, 2023

Just experienced this as well today in our test environment, also with t3.medium instances. If this is not going to be solved in the near term, I think there should be a big warning somewhere in the EKS docs (or even in the console) that t3.medium and other instances affected by this are not suitable for production workloads, because this could have been a disaster had it happened in production.

@seyal84

seyal84 commented Jun 23, 2023

Already experiencing this on EKS v1.23 with the m5.8xlarge instance type. When is it actually going to be resolved? I still feel like we are not hitting the root cause of this issue.

@jortkoopmans

Also seeing this issue on t3.medium with stock settings using AWS managed node groups (1.24), without prefix delegation.
The discussion on how to deal with a user-defined --max-pods is very important, because this is gaining popularity (also through the adoption of Karpenter). There are several tickets on this, but generally we need an alternative to the fixed eni-max-pods.txt values.

But this issue is more critical because it also happens with stock settings: a t3.medium holding 17 pods, using the (latest) standard AL2 AMI. Under some memory pressure, it will flap readiness or even crash.

@davidroth

davidroth commented Feb 14, 2024

@jortkoopmans I discovered that it helps to check that all pods have their memory limits configured correctly.
It is important that the configured memory request is the same as the configured memory limit.

Example:

resources:
  requests:
    memory: "400Mi"
  limits:
    memory: "400Mi"

In the future, with cgroups v2 and the completion of the Quality of Service for Memory Resources, it will probably be possible to configure memory limits higher than the requested memory, as cgroup v2 memory throttling kicks in.

So far, I have only been able to achieve stable nodes by setting requests equal to limits.
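
A quick way to audit this: pods whose requests equal their limits for every container (CPU and memory) get the Guaranteed QoS class, which can be listed directly, e.g.:

# Pods reported as Burstable or BestEffort are the ones without matching requests/limits.
kubectl get pods -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,QOS:.status.qosClass'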

@stevehipwell
Contributor

@davidroth in a correctly configured Kubernetes cluster the pod resources shouldn't be able to make a node become unstable; that's what system and kube reserved should be handling.

At the same time, as memory is (currently) incompressible, it's good practice to use the same value for limits as for requests. A likely side effect of this is that all pods have additional overhead reserved, which the node can then use on top of its default reserved values. This is exactly what we saw while we were still using the default kube reserved values with prefix delegation supporting 110 pods per node: clusters with pods that had resources exactly specified were much less likely to have a node problem than those with unconstrained or highly burstable pods.

Our solution to this was to implement the GKE kube reserved memory calculation as part of our node configuration which completely stopped this issue (FYI AKS also uses this and EKS & AKS use the GKE CPU calculation). The GKE calculation takes a proportion of node resources and I've not seen a node become unstable from this cause since we made this change.

However, with Karpenter only supporting a fixed value for kube reserved, we've been struggling to find a solution so that we can get Karpenter into production. Around the same time, AKS announced a (still pending for v1.29) move towards a per-pod calculation of kube reserved, but with a much higher per-pod cost than the EKS one (20MiB vs 11MiB), which made me re-evaluate the problem.

So, going back to first principles, we get the following statements.

  • A per-pod kube reserved calculation (+ a fixed value) should be correct
    • This is expensive as a node supporting the K8s default 110 pods would need a significant memory reservation
  • The current EKS calculation isn't correct/safe
    • Pods have an overhead greater than 11MiB (based on node failures in ENI mode, and 20MiB according to AKS)
    • The high initial fixed value (255MiB) makes the calculation more robust for nodes with fewer pods
    • ENI mode is "more" correct by virtue of having a low pod density so the fixed value covers the gap better
    • Prefix mode is very incorrect
  • The GKE algorithm is robust but potentially wasteful
    • As it's not per-pod

My summary of the above is that for a "correct" solution the resources need to be configured per pod, but due to other considerations this is very hard, and the general approach has been to use second-order effects to build a good-enough solution. These second-order effects require certain constraints to be in place, and once they aren't there is the potential for nodes to become unstable; the examples here for the EKS calculation are small nodes supporting 110 pods with a large number of those pods deployed as burstable, and large nodes with more than about 25 pods using all of the available node resources.

As a number of things have changed since the above algorithms were created, I think we can solve this problem without needing to rely on second-order effects. If we combine a static system & kube reserved configuration for the nodes and then introduce a runtime class with pod overheads, we can directly model the system as it is rather than how it could be. This results in nodes which can make better use of their available resources, as resources are only reserved when they're needed. The issue here is that there isn't a concept of a default runtime class, so that would need to be added with a webhook (but mutating policy via CEL is WIP).
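
To make the runtime class idea concrete, here is a minimal sketch using the built-in RuntimeClass overhead field; the name and the 20Mi/10m figures are illustrative, and pods would still need spec.runtimeClassName set, hence the default/webhook point above:

# Sketch: overhead.podFixed is added to the pod's resource accounting by the
# scheduler and kubelet for any pod that selects this runtime class.
cat <<'EOF' | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: runc-with-overhead
handler: runc
overhead:
  podFixed:
    memory: 20Mi
    cpu: 10m
EOF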

@cartermckinnon could we get some numbers published about the node resource utilisation; ideally with no pods running and then per pod.

@zip-chanko

Experiencing the same. I also notice that there isn't anything set for systemReserved in the config. As a temporary workaround I am thinking of increasing evictionHard. Do we know if this is achievable without baking a new AMI?

kubelet-config.json
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "address": "0.0.0.0",
  "authentication": {
    "anonymous": {
      "enabled": false
    },
    "webhook": {
      "cacheTTL": "2m0s",
      "enabled": true
    },
    "x509": {
      "clientCAFile": "/etc/kubernetes/pki/ca.crt"
    }
  },
  "authorization": {
    "mode": "Webhook",
    "webhook": {
      "cacheAuthorizedTTL": "5m0s",
      "cacheUnauthorizedTTL": "30s"
    }
  },
  "clusterDomain": "cluster.local",
  "hairpinMode": "hairpin-veth",
  "readOnlyPort": 0,
  "cgroupDriver": "systemd",
  "cgroupRoot": "/",
  "featureGates": {
    "RotateKubeletServerCertificate": true,
    "KubeletCredentialProviders": true
  },
  "protectKernelDefaults": true,
  "serializeImagePulls": false,
  "serverTLSBootstrap": true,
  "tlsCipherSuites": [
    "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305",
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305",
    "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
    "TLS_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_RSA_WITH_AES_128_GCM_SHA256"
  ],
  "registryPullQPS": 20,
  "registryBurst": 40,
  "clusterDNS": [
    "172.20.0.10"
  ],
  "kubeAPIQPS": 10,
  "kubeAPIBurst": 20,
  "evictionHard": {
    "memory.available": "100Mi",
    "nodefs.available": "10%",
    "nodefs.inodesFree": "5%"
  },
  "kubeReserved": {
    "cpu": "110m",
    "ephemeral-storage": "1Gi",
    "memory": "2829Mi"
  },
  "maxPods": 234,
  "providerID": "aws:///ap-southeast-2c/i-123456789",
  "systemReservedCgroup": "/system",
  "kubeReservedCgroup": "/runtime"
}

@stevehipwell
Contributor

@zip-chanko as the limits aren't being enforced, the combined values for kube and system reserved can be treated as a single unit.
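
For anyone who still wants systemReserved set explicitly, the same jq pattern the bootstrap script uses can merge it into the generated config; a sketch, where the path is the AL2 default and the values are illustrative:

KUBELET_CONFIG=/etc/kubernetes/kubelet/kubelet-config.json
echo "$(jq '. += {systemReserved: {cpu: "100m", memory: "300Mi"}}' "$KUBELET_CONFIG")" > "$KUBELET_CONFIG"
systemctl restart kubelet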

@tooptoop4

dis a doozie

@irizzant

irizzant commented Sep 26, 2024

Our solution to this was to implement the GKE kube reserved memory calculation as part of our node configuration which completely stopped this issue (FYI AKS also uses this and EKS & AKS use the GKE CPU calculation). The GKE calculation takes a proportion of node resources and I've not seen a node become unstable from this cause since we made this change.

@stevehipwell I suppose that you used custom user data to do this, correct?

I see that it's possible to pass custom user data in Karpenter, even to Bottlerocket.
One could potentially use a bootstrap container like this:


[settings.bootstrap-containers.bear]
source = "<URI to ECR Repository for this Bootstrap Container>"
mode = "once"
user-data = "IyEvdXNyL2Jpbi9lbnYgc2gKc2V0IC1ldW8gcGlwZWZhaWwKCiMgQ3JlYXRlIHRoZSBkaXJlY3RvcnkKbWtkaXIgLXAgL3Zhci9saWIvbXlfZGlyZWN0b3J5CgojIFNldCBBUEkgY2xpZW50IGNvbmZpZ3VyYXRpb25zCmFwaWNsaWVudCBzZXQgLS1qc29uICd7InNldHRpbmdzIjogeyJvY2ktZGVmYXVsdHMiOiB7InJlc291cmNlLWxpbWl0cyI6IHsibWF4LW9wZW4tZmlsZXMiOiB7InNvZnQtbGltaXQiOiA0Mjk0OTY3Mjk2LCAiaGFyZC1saW1pdCI6IDg1ODk5MzQ1OTJ9fX19fScKCiMgTG9hZCBrZXJuZWwgbW9kdWxlCmlmIGNvbW1hbmQgLXYgbW9kcHJvYmUgPiAvZGV2L251bGwgMj4mMTsgdGhlbgogIG1vZHByb2JlIGR1bW15CmZpCgplY2hvICJVc2VyLWRhdGEgc2NyaXB0IGV4ZWN1dGVkLiIK"

So is setting hardcoded values for system and kube reserved resources a good approach?
Or is using a Bottlerocket bootstrap container the way to go?

@stevehipwell
Contributor

@irizzant sorry for the slow reply, I've been on annual leave. Initially we used custom user data to implement the GKE algorithm for AL2 nodes (pre-Bottlerocket) managed by ASGs. When Bottlerocket was released we pivoted to implementing this calculation in Terraform and passing in the kubelet args via user data; this is the pattern we currently use in production. To support Karpenter we pivoted again and are currently in the final stages of testing a fixed per-pod overhead algorithm; as all of our nodes currently support a max of 110 pods, this results in a fixed reserved value for all nodes.
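
For a rough sense of scale only (illustrative arithmetic using the 20 MiB per-pod and 255 MiB base figures mentioned earlier in the thread, not necessarily the exact algorithm described above), a fixed reservation for a 110-pod node would be:

max_pods=110
echo "$(( 255 + 20 * max_pods ))Mi"   # -> 2455Mi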

RE Bottlerocket, I think 1.22.0 introduced bootstrap commands, which should allow dynamic reserved calculations without requiring a custom OCI image, but there doesn't appear to be any documentation for this yet.

@dims
Member

dims commented Oct 4, 2024

@stevehipwell this one? https://github.com/bottlerocket-os/bottlerocket-core-kit/blob/develop/sources/bootstrap-commands/README.md

@stevehipwell
Contributor

Thanks @dims that's the one. I expected this to be in the main Bottlerocket docs (so I opened bottlerocket-os/bottlerocket#4232), but I didn't think to check back here. Should it also either be in the primary Bottlerocket docs or at least referenced there?

@irizzant

irizzant commented Oct 4, 2024

Wow, thanks @stevehipwell @dims!
By using bootstrap commands, it looks like you can send Bottlerocket API commands with apiclient, like
commands = [[ "apiclient", "set", "motd=helloworld"]]

So you can actually set the values for kube reserved, but I don't understand how this can be set dynamically based on the instance type in Karpenter NodeClasses.

@dims
Member

dims commented Oct 4, 2024

@stevehipwell ack, let's let the bottlerocket team review the issue when they get a chance!
