
EKS nodes lose readiness when containers exhaust memory #1145

Open
dasrirez opened this issue Jan 9, 2023 · 23 comments

@dasrirez

dasrirez commented Jan 9, 2023

What happened:

When our applications consume too much memory, K8s nodes on EKS clusters lose readiness and become completely inoperable for extended periods of time. This means that instead of being rescheduled immediately, pods remain stuck in a pending state, resulting in noticeable downtime. This does not happen on GKE clusters.

What you expected to happen:

Nodes should never lose readiness; instead, the offending containers should be restarted and/or the pods OOMKilled.

How to reproduce it (as minimally and precisely as possible):

  1. Provision a single node EKS cluster running an EC2 instance type of m5.large.
  2. Apply the following deployment resource.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      annotations:
      labels:
        app: stress-ng
      name: stress-ng
    spec:
      selector:
        matchLabels:
          app: stress-ng
      template:
        metadata:
          labels:
            app: stress-ng
        spec:
          containers:
          - args:
            - -c
            - stress-ng --bigheap 0
            command:
            - /bin/bash
            image: alexeiled/stress-ng:latest-ubuntu
            name: stress-ng
    
  3. Observe the node lose readiness.
    $ k get nodes
    NAME                        STATUS   ROLES    AGE     VERSION
    ip-10-11-0-6.ec2.internal   NotReady   <none>   2m14s   v1.24.7-eks-fb459a0
    

Anything else we need to know?:

We've traced the cause of this problem to kubelet's memory reservation, which is set by the bootstrap script here:

'. += {kubeReserved: {"cpu": $cpu_millicores_to_reserve, "ephemeral-storage": "1Gi", "memory": $mebibytes_to_reserve}}' $KUBELET_CONFIG)" > $KUBELET_CONFIG

This is what kubeReserved is set to by default on this cluster.

  "kubeReserved": {
    "cpu": "70m",
    "ephemeral-storage": "1Gi",
    "memory": "574Mi"
  }

Note the memory reservation; on a GKE cluster this value would be 1.8Gi (https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#eviction_threshold).
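
For a sense of where both numbers come from, here is a quick sketch (assuming the 255 MiB + 11 MiB-per-pod formula from bootstrap.sh and the 29-pod ENI-based limit for m5.large; the GKE tiers are from the page linked above):

# EKS AL2 bootstrap.sh: 255 MiB plus 11 MiB per pod of the max-pods limit
max_pods=29                                        # ENI-based limit for m5.large
echo "$(( 255 + 11 * max_pods ))Mi"                # -> 574Mi

# GKE: 25% of the first 4 GiB plus 20% of the next 4 GiB on an 8 GiB node
echo "$(( 4096 * 25 / 100 + 4096 * 20 / 100 ))Mi"  # -> 1843Mi (~1.8Gi)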

When running with kubeReserved.memory="574Mi", kubelet logs indicate PLEG errors during memory exhaustion and the node loses readiness.

[root@ip-10-11-0-6 ~]# journalctl -u kubelet | grep -i pleg | grep -v SyncLoop | grep -v Generic
Jan 09 20:01:32 ip-10-11-0-6.ec2.internal kubelet[27632]: E0109 20:01:32.726177   27632 kubelet.go:2013] "Skipping pod synchronization" err="PLEG is not healthy: pleg has yet to be successful"
Jan 09 20:06:25 ip-10-11-0-6.ec2.internal kubelet[4077]: E0109 20:06:25.149411    4077 kubelet.go:2013] "Skipping pod synchronization" err="PLEG is not healthy: pleg has yet to be successful"
Jan 09 20:08:01 ip-10-11-0-6.ec2.internal kubelet[12256]: I0109 20:08:01.432398   12256 setters.go:546] "Node became not ready" node="ip-10-11-0-6.ec2.internal" condition={Type:Ready Status:False LastHeartbeatTime:2023-01-09 20:08:01.424711657 +0000 UTC m=+1.288302571 LastTransitionTime:2023-01-09 20:08:01.424711657 +0000 UTC m=+1.288302571 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
Jan 09 20:08:01 ip-10-11-0-6.ec2.internal kubelet[12256]: E0109 20:08:01.635341   12256 kubelet.go:2013] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"

The problem does not occur when the node is bootstrapped with kubeReserved.memory="1.8Gi". It also seems fine with kubeReserved.memory="1Gi", but that value is arbitrary and has not been tested thoroughly.
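
One way to bootstrap a node with a different reservation, without rebuilding the AMI, is to pass it through the bootstrap script's --kubelet-extra-args in the node user data. A minimal sketch (the cluster name and values are placeholders; kubelet flags should take precedence over the generated config file):

#!/bin/bash
# Override the computed kubeReserved memory at boot; adjust values to taste.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--kube-reserved=cpu=70m,memory=1843Mi,ephemeral-storage=1Gi'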

Environment:

  • AWS Region: us-east-1
  • Instance Type(s): m5.large
  • EKS Platform version: eks.3
  • Kubernetes version: 1.24
  • AMI Version: ami-0c84934009677b6d5
  • Kernel: Linux ip-10-11-0-6.ec2.internal 5.4.219-126.411.amzn2.x86_64 #1 SMP Wed Nov 2 17:44:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Release information:
BASE_AMI_ID="ami-0c90e1cacda57dac6"
BUILD_TIME="Sat Nov 12 04:15:25 UTC 2022"
BUILD_KERNEL="5.4.219-126.411.amzn2.x86_64"
ARCH="x86_64"
@dasrirez
Author

dasrirez commented Jan 9, 2023

Noting that the problem occurs in both containerd- and docker-based AMIs.

@cartermckinnon
Member

cartermckinnon commented Jan 9, 2023

Thanks for the detailed issue! We definitely need to revise our kubeReserved values; computing memory as a simple function of $MAX_PODS should probably be revisited.

The GKE values seem fairly conservative to me; reserving ~23% (up from ~7% currently) of the available memory on smaller instance types isn't a change we should make without our own testing.

@dasrirez
Author

Ack, thanks for taking a look at this issue! I think the GKE values can serve as a safe upper bound on the reservations, but they can probably be optimized further, as mentioned. The fact that the node was able to operate with just 1Gi of memory reserved instead of 1.8Gi is a quick hint that there is room to do so.

It looks like the GKE values were used until #419 was merged, but that was a long time ago and was tested on 1.14 clusters. It might be worthwhile to revert to the GKE values as a quick fix until a better model is tested/developed, though I say that without a good understanding of the timeframe for the latter.

@maximethebault

maximethebault commented Jan 10, 2023

Also seeing the issue: #1098

Will close my issue in favor of this more detailed one. It's good to see some attention on this. It's worrisome to see this happen on production workloads when you use all AWS-recommended/default settings and had assumed the EKS + EKS AMI combo would protect you against these kinds of situations thanks to the automated configuration of memory reservation. Even more so after reading the K8s docs about system-reserved memory, kube-reserved memory, soft eviction, and hard eviction: you feel confident there would be multiple barriers to breach before getting into that situation. Yet here we are.

Thanks for looking into this @cartermckinnon!

For what it's worth, here are the settings we ended up using to protect against this (Karpenter allows us to easily add these values when launching nodes):
systemReserved: 300Mi
evictionSoft: memory.available: 3%
evictionHard: memory.available: 2%

These values are completely empirical and far from scientifically determined, nor were they tested on many instance types / architectures / container runtimes / etc., but they did the job for us :)
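
For anyone expressing the same idea through Karpenter's API objects rather than raw kubelet flags, a rough sketch against the v1alpha5 Provisioner is below; the field names and the reading of the 300Mi above as systemReserved memory are assumptions to verify against your Karpenter version:

# Sketch only: other Provisioner fields (requirements, providerRef, ...) omitted.
cat <<'EOF' | kubectl apply -f -
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  kubeletConfiguration:
    systemReserved:
      memory: 300Mi
    evictionSoft:
      memory.available: 3%
    evictionSoftGracePeriod:
      memory.available: 2m
    evictionHard:
      memory.available: 2%
EOF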

@bryanasdev000

Same here, although in my scenario there are many small pods; every time there's an OOM outside a cgroup, the node goes down and generally takes half an hour to come back.

A good default value will be hard to set without sacrificing node capacity, but I think a good step may be to do what @maximethebault did with Karpenter: allow us to configure it via a launch template, similar to choosing containerd instead of docker or setting max pods (which, by the way, we set to 110 with the Cilium CNI).

Related aws/karpenter-provider-aws#1803.

@stevehipwell
Contributor

CC @bwagner5

@davidroth

davidroth commented Feb 28, 2023

Experienced the same problem with t3.medium instances.
Can somebody explain why this happens even though there is reserved memory for kubelet? It feels strange that although there is reserved memory for the system processes, a single pod can easily kill the whole node.

@stevehipwell
Contributor

@davidroth the node logic is tied to the old assumption that you can only run as many pods as there are ENI IPs, which was always slightly incorrect (host-port pods) but with the introduction of IP prefixes is now very much incorrect (in comparison to the accepted values everywhere else). I think this issue is most likely to impact smaller instances due to the under-provisioning of kube reserved. We've used custom node logic limiting all nodes to 110 pods and using the GKE memory calculation, and we haven't seen an issue since. You will also want to make sure your pods are setting valid requests/limits.
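
For anyone wanting to reproduce that approach, here is a rough user-data sketch of the GKE-style memory tiers (the tier percentages are from the GKE docs; the hand-off to bootstrap.sh and the placeholder cluster name are assumptions to adapt to your own setup):

#!/usr/bin/env bash
set -euo pipefail

mem_mib=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)

if (( mem_mib < 1024 )); then
  reserve_mib=255
else
  reserve_mib=0
  remaining=$mem_mib
  # GKE tiers: 25% of the first 4Gi, 20% of the next 4Gi, 10% of the next 8Gi,
  # 6% of the next 112Gi, 2% of anything above 128Gi
  for tier in 4096:25 4096:20 8192:10 114688:6 99999999:2; do
    size=${tier%%:*}; pct=${tier##*:}
    take=$(( remaining < size ? remaining : size ))
    reserve_mib=$(( reserve_mib + take * pct / 100 ))
    remaining=$(( remaining - take ))
    (( remaining > 0 )) || break
  done
fi

# Hand the value to the EKS bootstrap script (cluster name is a placeholder)
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args "--kube-reserved=memory=${reserve_mib}Mi"

For an 8 GiB m5.large this works out to 1843Mi, matching the 1.8Gi figure discussed above.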

@schmee-hg

schmee-hg commented Mar 3, 2023

Just experienced this as well today in our test environment, also with t3.medium instances. If this is not going to be solved in the near term, I think there should be a big warning somewhere in the EKS docs (or even in the console) that t3.medium and other instances affected by this are not suitable for production workloads, because this could have been a disaster had it happened in production.

@seyal84

seyal84 commented Jun 23, 2023

Already experiencing this on EKS v1.23 with the m5.8xlarge instance type. When is it actually going to be resolved? I still feel like we are not hitting the root cause of this issue.

@jortkoopmans

Also seeing this issue on t3.medium with stock settings using AWS managed node groups (1.24), without prefix delegation.
The discussion on how to deal with a user-defined --max-pods is very important, because this is gaining popularity (also through the adoption of Karpenter). There are several tickets on this, but generally we need an alternative to the fixed eni-max-pods.txt values.

But this issue is more critical because it also happens with stock settings: a t3.medium holding 17 pods, using the (latest) standard AL2 AMI. Under some memory pressure, it will flap readiness or even crash.

@davidroth

davidroth commented Feb 14, 2024

@jortkoopmans I discovered that it helps to check that all pods have their memory limits configured correctly.
It is important that the configured memory request is the same as the configured memory limit.

Example:

resources:
  requests:
    memory: "400Mi"
  limits:
    memory: "400Mi"

In the future, with cgroups v2 and the completion of the Quality of Service for Memory Resources, it will probably be possible to configure memory limits higher than the requested memory, as cgroup v2 memory throttling kicks in.

So far, I have only been able to achieve stable nodes by setting requests equal to limits.
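
A quick way to audit this: pods whose requests equal their limits for every container (CPU and memory) get the Guaranteed QoS class, which can be listed directly, e.g.:

# Pods reported as Burstable or BestEffort are the ones without matching requests/limits.
kubectl get pods -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,QOS:.status.qosClass'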

@stevehipwell
Contributor

@davidroth in a correctly configured Kubernetes cluster the pod resources shouldn't be able to make a node become unstable; that's what system and kube reserved should be handling.

At the same time, as memory is (currently) incompressible, it's good practice to use the same value for limits as for requests. A likely side effect of this is that all pods have additional overhead reserved, which the node can then use on top of its default reserved values. This is exactly what we saw while we were still using the default kube reserved values with prefix delegation supporting 110 pods per node: clusters with pods that had resources exactly specified were much less likely to have a node problem than those with unconstrained or highly burstable pods.

Our solution to this was to implement the GKE kube reserved memory calculation as part of our node configuration which completely stopped this issue (FYI AKS also uses this and EKS & AKS use the GKE CPU calculation). The GKE calculation takes a proportion of node resources and I've not seen a node become unstable from this cause since we made this change.

However, with Karpenter only supporting a fixed value for kube reserved, we've been struggling to find a solution so that we can get Karpenter into production. Around the same time, AKS announced a (still pending for v1.29) move towards a per-pod calculation of kube reserved, but with a much higher per-pod cost than the EKS one (20MiB vs 11MiB), which made me re-evaluate the problem.

So, going back to first principles, we get the following statements.

  • A per-pod kube reserved calculation (+ a fixed value) should be correct
    • This is expensive as a node supporting the K8s default 110 pods would need a significant memory reservation
  • The current EKS calculation isn't correct/safe
    • Pods have an overhead greater than 11MiB (based on node failures in ENI mode, and 20MiB according to AKS)
    • The high initial fixed value (255MiB) makes the calculation more robust for nodes with fewer pods
    • ENI mode is "more" correct by virtue of having a low pod density so the fixed value covers the gap better
    • Prefix mode is very incorrect
  • The GKE algorithm is robust but potentially wasteful
    • As it's not per-pod

My summary of the above is that for a "correct" solution the resources need to be configured per pod, but due to other considerations this is very hard, and the general approach has been to use second-order effects to build a good-enough solution. These second-order effects require certain constraints to be in place, and once they aren't there is the potential for nodes to become unstable; the examples here for the EKS calculation are small nodes supporting 110 pods with a large number of those pods deployed as burstable, and large nodes with more than about 25 pods using all of the available node resources.

As a number of things have changed since the above algorithms were created, I think we can solve this problem without needing to rely on second-order effects. If we combine a static system & kube reserved configuration for the nodes and then introduce a runtime class with pod overheads, we can directly model the system as it is rather than how it could be. This results in nodes which can make better use of their available resources, as resources are only reserved when they're needed. The issue here is that there isn't a concept of a default runtime class, so that would need to be added with a webhook (but mutating policy via CEL is WIP).
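
To make the runtime class idea concrete, here is a minimal sketch using the built-in RuntimeClass overhead field; the name and the 20Mi/10m figures are illustrative, and pods would still need spec.runtimeClassName set, hence the default/webhook point above:

# Sketch: overhead.podFixed is added to the pod's resource accounting by the
# scheduler and kubelet for any pod that selects this runtime class.
cat <<'EOF' | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: runc-with-overhead
handler: runc
overhead:
  podFixed:
    memory: 20Mi
    cpu: 10m
EOF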

@cartermckinnon could we get some numbers published about the node resource utilisation; ideally with no pods running and then per pod.

@zip-chanko

Experiencing the same. I also notice that there isn't anything set for systemReserved in the config. As a temporary workaround I am thinking of increasing evictionHard. Do we know if this is achievable without baking a new AMI?

kubelet-config.json
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "address": "0.0.0.0",
  "authentication": {
    "anonymous": {
      "enabled": false
    },
    "webhook": {
      "cacheTTL": "2m0s",
      "enabled": true
    },
    "x509": {
      "clientCAFile": "/etc/kubernetes/pki/ca.crt"
    }
  },
  "authorization": {
    "mode": "Webhook",
    "webhook": {
      "cacheAuthorizedTTL": "5m0s",
      "cacheUnauthorizedTTL": "30s"
    }
  },
  "clusterDomain": "cluster.local",
  "hairpinMode": "hairpin-veth",
  "readOnlyPort": 0,
  "cgroupDriver": "systemd",
  "cgroupRoot": "/",
  "featureGates": {
    "RotateKubeletServerCertificate": true,
    "KubeletCredentialProviders": true
  },
  "protectKernelDefaults": true,
  "serializeImagePulls": false,
  "serverTLSBootstrap": true,
  "tlsCipherSuites": [
    "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305",
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305",
    "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
    "TLS_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_RSA_WITH_AES_128_GCM_SHA256"
  ],
  "registryPullQPS": 20,
  "registryBurst": 40,
  "clusterDNS": [
    "172.20.0.10"
  ],
  "kubeAPIQPS": 10,
  "kubeAPIBurst": 20,
  "evictionHard": {
    "memory.available": "100Mi",
    "nodefs.available": "10%",
    "nodefs.inodesFree": "5%"
  },
  "kubeReserved": {
    "cpu": "110m",
    "ephemeral-storage": "1Gi",
    "memory": "2829Mi"
  },
  "maxPods": 234,
  "providerID": "aws:///ap-southeast-2c/i-123456789",
  "systemReservedCgroup": "/system",
  "kubeReservedCgroup": "/runtime"
}

@stevehipwell
Contributor

@zip-chanko as the limits aren't being enforced, the combined values for kube and system reserved can be treated as a single unit.
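
For anyone who still wants systemReserved set explicitly, the same jq pattern the bootstrap script uses can merge it into the generated config; a sketch, where the path is the AL2 default and the values are illustrative:

KUBELET_CONFIG=/etc/kubernetes/kubelet/kubelet-config.json
echo "$(jq '. += {systemReserved: {cpu: "100m", memory: "300Mi"}}' "$KUBELET_CONFIG")" > "$KUBELET_CONFIG"
systemctl restart kubelet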

@tooptoop4

dis a doozie

@irizzant

irizzant commented Sep 26, 2024

Our solution to this was to implement the GKE kube reserved memory calculation as part of our node configuration which completely stopped this issue (FYI AKS also uses this and EKS & AKS use the GKE CPU calculation). The GKE calculation takes a proportion of node resources and I've not seen a node become unstable from this cause since we made this change.

@stevehipwell I suppose that you used custom user data to do this, correct?

I see that it's possible to pass custom user data in Karpenter, even to Bottlerocket.
One could potentially use a bootstrap container like this:


[settings.bootstrap-containers.bear]
source = "<URI to ECR Repository for this Bootstrap Container>"
mode = "once"
user-data = "IyEvdXNyL2Jpbi9lbnYgc2gKc2V0IC1ldW8gcGlwZWZhaWwKCiMgQ3JlYXRlIHRoZSBkaXJlY3RvcnkKbWtkaXIgLXAgL3Zhci9saWIvbXlfZGlyZWN0b3J5CgojIFNldCBBUEkgY2xpZW50IGNvbmZpZ3VyYXRpb25zCmFwaWNsaWVudCBzZXQgLS1qc29uICd7InNldHRpbmdzIjogeyJvY2ktZGVmYXVsdHMiOiB7InJlc291cmNlLWxpbWl0cyI6IHsibWF4LW9wZW4tZmlsZXMiOiB7InNvZnQtbGltaXQiOiA0Mjk0OTY3Mjk2LCAiaGFyZC1saW1pdCI6IDg1ODk5MzQ1OTJ9fX19fScKCiMgTG9hZCBrZXJuZWwgbW9kdWxlCmlmIGNvbW1hbmQgLXYgbW9kcHJvYmUgPiAvZGV2L251bGwgMj4mMTsgdGhlbgogIG1vZHByb2JlIGR1bW15CmZpCgplY2hvICJVc2VyLWRhdGEgc2NyaXB0IGV4ZWN1dGVkLiIK"

So is setting hardcoded values for system and kube reserved resources a good approach?
Or is using a Bottlerocket bootstrap container the way to go?

@stevehipwell
Contributor

@irizzant sorry for the slow reply, I've been on annual leave. Initially we used custom user data to implement the GKE algorithm for AL2 nodes (pre-Bottlerocket) managed by ASGs. When Bottlerocket was released we pivoted to implementing this calculation in Terraform and passing in the kubelet args via user data; this is the pattern we currently use in production. To support Karpenter we pivoted again and are currently in the final stages of testing a fixed per-pod overhead algorithm; as all of our nodes currently support a max of 110 pods, this results in a fixed reserved value for all nodes.
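
For a rough sense of scale only (illustrative arithmetic using the 20 MiB per-pod and 255 MiB base figures mentioned earlier in the thread, not necessarily the exact algorithm described above), a fixed reservation for a 110-pod node would be:

max_pods=110
echo "$(( 255 + 20 * max_pods ))Mi"   # -> 2455Mi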

RE Bottlerocket, I think 1.22.0 introduced bootstrap commands, which should allow dynamic reserved calculations without requiring a custom OCI image, but there doesn't appear to be any documentation for this yet.

@dims
Member

dims commented Oct 4, 2024

@stevehipwell this one? https://github.com/bottlerocket-os/bottlerocket-core-kit/blob/develop/sources/bootstrap-commands/README.md

@stevehipwell
Contributor

Thanks @dims that's the one. I expected this to be in the main Bottlerocket docs (so I opened bottlerocket-os/bottlerocket#4232), but I didn't think to check back here. Should it also either be in the primary Bottlerocket docs or at least referenced there?

@irizzant

irizzant commented Oct 4, 2024

Wow, thanks @stevehipwell @dims!
By using bootstrap commands, it looks like you can send Bottlerocket API commands with apiclient, like
commands = [[ "apiclient", "set", "motd=helloworld"]]

So you can actually set the values for kube reserved, but I don't understand how this can be set dynamically based on the instance type in Karpenter NodeClasses.

@dims
Member

dims commented Oct 4, 2024

@stevehipwell ack, let's let the bottlerocket team review the issue when they get a chance!
