EKS nodes lose readiness when containers exhaust memory #1145
Comments
Noting that the problem occurs in both
Thanks for the detailed issue! We definitely need to revise our kubeReserved defaults. The GKE values seem fairly conservative to me; reserving ~23% (up from ~7% currently) of the available memory on smaller instance types isn't a change we should make without our own testing.
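For reference, a rough sketch of the GKE memory reservation model being compared against here, based on the GKE cluster-architecture docs linked later in this thread (255 MiB for machines with less than 1 GiB of memory; otherwise 25% of the first 4 GiB, 20% of the next 4 GiB, 10% of the next 8 GiB, 6% of the next 112 GiB, and 2% of anything above 128 GiB). This is an illustrative sketch, not the amazon-eks-ami bootstrap logic.

```bash
#!/usr/bin/env bash
# Hedged sketch of the GKE kube-reserved memory model (values in MiB), per
# https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#eviction_threshold
# Illustrative only; not code from this repository.

gke_memory_reserved_mib() {
  local total_mib=$1 reserved=0
  if (( total_mib < 1024 )); then
    echo 255
    return
  fi
  local tiers=(4096 4096 8192 114688) rates=(25 20 10 6) remaining=$total_mib
  for i in "${!tiers[@]}"; do
    local tier=${tiers[$i]}
    local chunk=$(( remaining < tier ? remaining : tier ))
    reserved=$(( reserved + chunk * rates[i] / 100 ))
    remaining=$(( remaining - chunk ))
    (( remaining <= 0 )) && break
  done
  # 2% of anything above 128 GiB
  (( remaining > 0 )) && reserved=$(( reserved + remaining * 2 / 100 ))
  echo "$reserved"
}

# Example: an 8 GiB node (e.g. m5.large) -> ~1843 MiB (~23% of memory),
# versus the ~7% the current EKS default reserves on the same node.
gke_memory_reserved_mib 8192
```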
Ack, thanks for taking a look at this issue! I think the GKE values can serve as a safe upper bound on the reserve limits, but they can probably be better optimized, as mentioned. The fact that the node was able to operate with just 1Gi of memory reserved instead of 1.8Gi seems to be a quick hint of this. It looks like the GKE values were used until #419 was merged, but that was a long time ago and tested on 1.14 clusters. It might be worthwhile to revert to the GKE values as a quick fix until a better model is tested/developed, though I say that without a good understanding of the timeframe for the latter fix.
Also seeing the issue: #1098. I'll close my issue in favor of this more detailed one.

It's good to see some attention on this. It's worrisome to see this happen on a production workload when you use all AWS-recommended/default settings and always thought the EKS + EKS AMI combo would protect you against this kind of situation thanks to the automated configuration of memory reservation. Even more so after reading the K8s docs about system-reserved memory, kube-reserved memory, soft eviction, and hard eviction: you feel confident there would be multiple barriers to breach before getting into that situation. Yet here we are. Thanks for looking into this @cartermckinnon!

For what it's worth, here are the settings we ended up using to protect against this (Karpenter allows us to easily add these values when launching nodes):

These values are completely empirical and far from having been scientifically determined, nor were they tested on many instance types / architectures / container runtimes / etc., but they did the job for us :)
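The specific values from the comment above didn't survive in this copy of the thread, so as a general illustration only: kubelet's --kube-reserved and --eviction-hard flags can be overridden at bootstrap time via the AMI's --kubelet-extra-args (Karpenter exposes equivalent kubelet settings on its provisioner/node class). The cluster name and numbers below are hypothetical placeholders, not the values that commenter used.

```bash
#!/bin/bash
# Hypothetical user-data sketch: override the default kube-reserved memory and the
# hard eviction threshold at bootstrap time. Values and cluster name are placeholders,
# not the (omitted) settings from the comment above; tune and test for your own nodes.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--kube-reserved=memory=1Gi,cpu=80m --eviction-hard=memory.available=300Mi'
```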
Same here, although in my scenario there are many small pods. Every time there's an OOM outside a cgroup, the node goes down and generally takes half an hour to come back. A good default value will be hard to set without sacrificing node capacity, but I think a good step may be doing what @maximethebault did with Karpenter, allowing us to configure it with a launch template, similar to using containerd instead of Docker or setting max pods (which, by the way, we set to 110 with the Cilium CNI). Related: aws/karpenter-provider-aws#1803.
CC @bwagner5
There are some discussions about this on the following issues. |
Experienced the same problem with t3.medium instances.
@davidroth the node logic is tied to the old assumption that you can only run as many pods as there are ENI IPs, which was always slightly incorrect (host-port pods) but, with the introduction of IP prefixes, is now very much incorrect (in comparison to the accepted values everywhere else). I think this issue is most likely to impact smaller instances due to the under-provisioning of kube reserved. We've used custom node logic limiting all nodes to 110 pods and using the GKE memory calculation, and haven't seen an issue since. You will also want to make sure your pods are setting valid requests/limits.
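For the "110 pods" part of the above, a hedged sketch of how the per-node pod cap can be decoupled from the ENI-derived default on the AL2 AMI; the cluster name is a placeholder and the flags are illustrative rather than a recommendation.

```bash
#!/bin/bash
# Illustrative sketch: opt out of the ENI-IP-derived max-pods value and cap the node
# at the upstream-conventional 110 pods instead (relevant with prefix delegation or an
# alternative CNI such as Cilium). Cluster name is a placeholder.
/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args '--max-pods=110'
```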
Just experienced this as well today in our test environment, also with t3.medium instances. If this is not going to be solved in the near term, I think there should be a big warning somewhere in the EKS docs (or even in the console) that t3.medium and other instances affected by this are not suitable for production workloads, because this could have been a disaster if it were production.
Already experiencing this on EKS v1.23 and the m5.8xlarge instance type. When is it actually going to be resolved? I still feel like we are not hitting the root cause of this issue.
Also seeing this issue on t3.medium with stock settings using AWS managed node groups (1.24), without using prefix delegation. This makes the issue more critical, as it is happening even with stock settings: a t3.medium running 17 pods, using the (latest) standard AL2 AMI. Under some memory pressure, it will flap readiness or even crash.
@jortkoopmans I discovered that it helps to check that all pods have their memory limits configured correctly. Example:

```yaml
resources:
  requests:
    memory: "400Mi"
  limits:
    memory: "400Mi"
```

In the future, with cgroups v2 and the completion of Quality of Service for Memory Resources, it will probably be possible to configure memory limits higher than the requested memory, as cgroup v2 memory throttling kicks in. Until now, I have only been able to achieve stable nodes by setting requests equal to limits.
@davidroth in a correctly configured Kubernetes cluster the pod resources shouldn't be able to make a node become unstable; that's what system and kube reserved should be handling. At the same time, as memory is (currently) incompressible, it's good practice to use the same value for limits as requests. A likely side effect of this is that all pods have additional overhead reserved, which the node can then use to pad its default reserved values. This is exactly what we saw while we were still using the default kube reserved values with prefix delegation supporting 110 pods per node: clusters with pods that had resources exactly specified were much less likely to have a node problem than those with unconstrained or highly burstable pods.

Our solution to this was to implement the GKE kube reserved memory calculation as part of our node configuration, which completely stopped this issue (FYI, AKS also uses this, and EKS & AKS use the GKE CPU calculation). The GKE calculation takes a proportion of node resources, and I've not seen a node become unstable from this cause since we made this change. However, with Karpenter only supporting a fixed value for kube reserved, we've been struggling to find a solution so that we can get Karpenter into production. Around the same time, AKS announced a (still pending for v1.29) move towards a per-pod calculation of kube reserved, but with a much higher per-pod cost than the EKS one (20 MB vs 11 MB), which made me re-evaluate the problem. So back to first principles, we get the following statements.
My summary of the above is that for a "correct" solution the resources need to be configured per pod, but due to other considerations this is very hard, and the general solution has been to use second-order effects to build a good-enough solution. These second-order effects require certain constraints to be in place, and once they aren't, there is the potential for nodes to become unstable; the examples here for the EKS calculation are small nodes supporting 110 pods with a large number of those pods deployed as burstable, and large nodes with more than about 25 pods using all of the available node resources.

As a number of things have changed since the above algorithms were created, I think that we can solve this problem without needing to rely on second-order effects. If we combine a static system & kube reserved configuration for the nodes and then introduce a runtime class with pod overheads, we can directly model the system as it is rather than how it could be. This results in nodes which can better make use of their available resources, as resources are only reserved when they're needed. The issue here is that there isn't a concept of a default runtime class, so that would need to be added with a webhook (but mutating policy via CEL is WIP).

@cartermckinnon could we get some numbers published about the node resource utilisation; ideally with no pods running and then per pod.
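A hedged sketch of the RuntimeClass-with-pod-overhead idea described above, assuming the default containerd runc handler; the overhead numbers are purely illustrative, and a mutating webhook (or a future CEL-based mutating policy) would still be needed to make such a class the de facto default.

```bash
# Illustrative only: a RuntimeClass whose pod overhead is charged to every pod that
# uses it, so per-pod runtime cost is modelled explicitly instead of being folded
# into a node-level kube-reserved guess. Handler name and overhead values are assumptions.
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: runc-with-overhead
handler: runc
overhead:
  podFixed:
    memory: "20Mi"
    cpu: "10m"
EOF

# Pods then opt in via spec.runtimeClassName: runc-with-overhead, and the scheduler
# and kubelet include the overhead in the pod's resource accounting.
```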
Experiencing the same. I also notice that there isn't anything set for systemReserved.

kubelet-config.json:

```json
{
"kind": "KubeletConfiguration",
"apiVersion": "kubelet.config.k8s.io/v1beta1",
"address": "0.0.0.0",
"authentication": {
"anonymous": {
"enabled": false
},
"webhook": {
"cacheTTL": "2m0s",
"enabled": true
},
"x509": {
"clientCAFile": "/etc/kubernetes/pki/ca.crt"
}
},
"authorization": {
"mode": "Webhook",
"webhook": {
"cacheAuthorizedTTL": "5m0s",
"cacheUnauthorizedTTL": "30s"
}
},
"clusterDomain": "cluster.local",
"hairpinMode": "hairpin-veth",
"readOnlyPort": 0,
"cgroupDriver": "systemd",
"cgroupRoot": "/",
"featureGates": {
"RotateKubeletServerCertificate": true,
"KubeletCredentialProviders": true
},
"protectKernelDefaults": true,
"serializeImagePulls": false,
"serverTLSBootstrap": true,
"tlsCipherSuites": [
"TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
"TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
"TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305",
"TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
"TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305",
"TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
"TLS_RSA_WITH_AES_256_GCM_SHA384",
"TLS_RSA_WITH_AES_128_GCM_SHA256"
],
"registryPullQPS": 20,
"registryBurst": 40,
"clusterDNS": [
"172.20.0.10"
],
"kubeAPIQPS": 10,
"kubeAPIBurst": 20,
"evictionHard": {
"memory.available": "100Mi",
"nodefs.available": "10%",
"nodefs.inodesFree": "5%"
},
"kubeReserved": {
"cpu": "110m",
"ephemeral-storage": "1Gi",
"memory": "2829Mi"
},
"maxPods": 234,
"providerID": "aws:///ap-southeast-2c/i-123456789",
"systemReservedCgroup": "/system",
"kubeReservedCgroup": "/runtime"
}
```
@zip-chanko as the limits aren't being enforced the combined values for kube and system reserved can be treated as a single unit. |
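For anyone wondering what "not enforced" means here: kubelet's enforceNodeAllocatable setting defaults to only the pods cgroup, so kubeReserved and systemReserved normally just shrink the node's Allocatable rather than acting as hard limits. A hedged illustration follows; the cluster name is a placeholder and this is an explanation of the mechanism, not a recommendation to enable enforcement.

```bash
#!/bin/bash
# Illustration only: by default kubelet effectively runs with
# --enforce-node-allocatable=pods, so kubeReserved/systemReserved only reduce
# Allocatable. They become enforced cgroup limits only if explicitly listed, e.g.:
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--enforce-node-allocatable=pods,kube-reserved,system-reserved'
# (this requires kubeReservedCgroup/systemReservedCgroup to point at existing cgroups,
# as in the config above: "/runtime" and "/system")
```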
dis a doozie |
@stevehipwell I suppose that you used custom user data to do this, correct? I see that it's possible to pass custom user data in Karpenter, even to Bottlerocket.
So is it a good approach to set hardcoded values for system and kube reserved resources?
@irizzant sorry for the slow reply, I've been on annual leave.

Initially we used custom user data to implement the GKE algorithm for AL2 nodes (pre Bottlerocket) managed by ASGs. When Bottlerocket was released we pivoted to implementing this calculation in Terraform and passing in the kubelet args via user data; this is the pattern we currently use in production. To support Karpenter we pivoted again and are currently in the final stages of testing a fixed per-pod overhead algorithm; as all of our nodes currently support a max of 110 pods, this results in a fixed reserved value for all nodes.

RE Bottlerocket, I think 1.22.0 introduced bootstrap commands, which should allow dynamic reserved calculations without requiring a custom OCI image; but there doesn't appear to be any documentation for this yet.
Thanks @dims, that's the one. I expected this to be in the main Bottlerocket docs (so I opened bottlerocket-os/bottlerocket#4232), but I didn't think to check back here. Should it also be in the primary Bottlerocket docs, or at least referenced there?
Wow thanks @stevehipwell @dims ! So you can actually set the values for kube reserved, but I don't understand how this can be set dynamically based on the instance type in Karpenter NodeClasses. |
@stevehipwell ack, let's let the bottlerocket team review the issue when they get a chance! |
What happened:
When our applications consume too much memory, K8s nodes on EKS clusters lose readiness and become completely inoperable for extended periods of time. This means that instead of being rescheduled immediately, pods remain stuck in a pending state, resulting in noticeable downtime. This does not happen on GKE clusters.
What you expected to happen:
Nodes should never lose readiness; instead, the containers should be restarted and/or the pods should be OOMKilled.

How to reproduce it (as minimally and precisely as possible):
m5.large.

Anything else we need to know?:
We've traced the cause of this problem to the memory reservation of kubelet, which is set by the bootstrap script (amazon-eks-ami/files/bootstrap.sh, line 452 in eab112a).

This is what kubeReserved is set to by default on this cluster. Note the memory reservation; on a GKE cluster this value would be 1.8Gi (https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#eviction_threshold).
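For context, the bootstrap script's default reservation is a per-pod formula rather than a proportion of node memory. A sketch of that calculation follows (as I read the referenced code, so treat the exact constants as an assumption); it reproduces the 574Mi value reported for this m5.large.

```bash
# Sketch of the default kube-reserved memory calculation used by bootstrap.sh
# (as referenced above): a fixed base plus a per-pod increment, in MiB.
# Constants reflect my reading of the script; verify against your AMI version.
max_pods=29   # ENI-based max pods for an m5.large
echo "$(( 11 * max_pods + 255 ))Mi"   # -> 574Mi
```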
When running with kubeReserved.memory="574Mi", kubelet logs indicate PLEG errors during memory exhaustion and the node loses readiness. The problem does not occur when the node is bootstrapped with kubeReserved.memory="1.8Gi". It also seems to be fine running kubeReserved.memory="1Gi", but that value is arbitrary and not tested.

Environment:
- Region: us-east-1
- Instance type: m5.large
- EKS platform version: eks.3
- Kubernetes version: 1.24
- AMI: ami-0c84934009677b6d5
- Kernel: Linux ip-10-11-0-6.ec2.internal 5.4.219-126.411.amzn2.x86_64 #1 SMP Wed Nov 2 17:44:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux