
aws-node is restarting (crashing, exiting with 137) sporadically, which causes all pods on that node to get stuck in ContainerCreating state. #1425

Closed
daganida88 opened this issue Apr 13, 2021 · 101 comments
Labels
bug stale Issue or PR is stale
Comments

@daganida88

daganida88 commented Apr 13, 2021

What happened:
aws-node is restarting (crashing, exiting with 137) sporadically, which causes all pods on that node to get stuck in ContainerCreating state.

Attach logs
aws-node {"level":"info","ts":"2021-04-13T12:24:31.015Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
aws-node {"level":"info","ts":"2021-04-13T12:24:31.031Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
aws-node {"level":"info","ts":"2021-04-13T12:24:31.032Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}

aws-vpc-cni-init + PLUGIN_BINS='loopback portmap bandwidth aws-cni-support.sh'
aws-vpc-cni-init + for b in '$PLUGIN_BINS'
aws-vpc-cni-init + '[' '!' -f loopback ']'
aws-vpc-cni-init + for b in '$PLUGIN_BINS'
aws-vpc-cni-init + '[' '!' -f portmap ']'
aws-vpc-cni-init + for b in '$PLUGIN_BINS'
aws-vpc-cni-init + '[' '!' -f bandwidth ']'
aws-vpc-cni-init + for b in '$PLUGIN_BINS'
aws-vpc-cni-init + '[' '!' -f aws-cni-support.sh ']'
aws-vpc-cni-init Copying CNI plugin binaries ...
aws-vpc-cni-init + HOST_CNI_BIN_PATH=/host/opt/cni/bin
aws-vpc-cni-init + echo 'Copying CNI plugin binaries ... '
aws-vpc-cni-init + for b in '$PLUGIN_BINS'
aws-vpc-cni-init + install loopback /host/opt/cni/bin
aws-vpc-cni-init + for b in '$PLUGIN_BINS'
aws-vpc-cni-init + install portmap /host/opt/cni/bin
aws-vpc-cni-init + for b in '$PLUGIN_BINS'
aws-vpc-cni-init + install bandwidth /host/opt/cni/bin
aws-vpc-cni-init + for b in '$PLUGIN_BINS'
aws-vpc-cni-init + install aws-cni-support.sh /host/opt/cni/bin
aws-vpc-cni-init + echo 'Configure rp_filter loose... '
aws-vpc-cni-init Configure rp_filter loose...
aws-vpc-cni-init ++ curl -X PUT http://169.254.169.254/latest/api/token -H 'X-aws-ec2-metadata-token-ttl-seconds: 60'
aws-vpc-cni-init + TOKEN=AQAEABqkV8qO_waLfVT_6TITDHDBvmAy3jkblGe9YXSpR-irRxvwJQ==
aws-vpc-cni-init ++ curl -H 'X-aws-ec2-metadata-token: AQAEABqkV8qO_waLfVT_6TITDHDBvmAy3jkblGe9YXSpR-irRxvwJQ==' http://169.254.169.254/latest/meta-data/local-ipv4
aws-vpc-cni-init + HOST_IP=172.19.224.26
aws-vpc-cni-init ++ ip -4 -o a
aws-vpc-cni-init ++ awk '{print $2}'
aws-vpc-cni-init ++ grep 172.19.224.26/
aws-vpc-cni-init + PRIMARY_IF=eth0
aws-vpc-cni-init + sysctl -w net.ipv4.conf.eth0.rp_filter=2
aws-vpc-cni-init net.ipv4.conf.eth0.rp_filter = 2
aws-vpc-cni-init + cat /proc/sys/net/ipv4/conf/eth0/rp_filter
aws-vpc-cni-init 2
aws-vpc-cni-init + '[' false == true ']'
aws-vpc-cni-init + sysctl -e -w net.ipv4.tcp_early_demux=1
aws-vpc-cni-init net.ipv4.tcp_early_demux = 1
aws-vpc-cni-init CNI init container done
aws-vpc-cni-init + echo 'CNI init container done'

What you expected to happen:
I expected the pod not to crash.

How to reproduce it (as minimally and precisely as possible):
It happens sporadically. When it does, the node stays in a zombie state, with all pods stuck in ContainerCreating, until we kill the aws-node pod.
Anything else we need to know?:

Running on EKS 1.16.15
Linux
Cni version: 1.7.5

@daganida88 daganida88 added the bug label Apr 13, 2021
@jayanthvn
Contributor

Hi @daganida88

Exit code 137 usually indicates an OOM kill (out of memory). Does it happen on startup, or after aws-node has been running for several hours? Can you also check whether any other process is consuming a lot of memory? The memory allotted to Docker containers is limited by the total memory available on the host machine, so when usage increases, the free memory remaining may be insufficient for all the containers, and containers may crash.
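For context: exit code 137 does not, by itself, distinguish an OOM kill from a probe-triggered restart; it only encodes the signal that killed the process. A minimal sketch of the decoding (the commentary here is general Unix/Kubernetes behavior, not something stated elsewhere in this thread):

```shell
# 137 = 128 + signal number, so exit code 137 means the process was
# killed by signal 9 (SIGKILL). The kernel OOM killer sends SIGKILL,
# but so does the container runtime when the kubelet restarts a
# container (e.g. after failed liveness probes), so 137 alone does
# not prove an out-of-memory condition.
exit_code=137
signal=$((exit_code - 128))
echo "killed by signal ${signal}"
```

To tell the two apart, `kubectl describe pod` shows `Reason: OOMKilled` in the last terminated state for a true OOM, versus liveness-probe failure events for a kubelet-initiated restart.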

@jayanthvn
Contributor

Hi @daganida88

Can you please confirm if any other process is consuming high memory?

@jayanthvn
Contributor

Hi @daganida88

If you could share kubelet logs when the issue happens, it would help us debug.

Thanks!

@trucnguyenlam

Please address this issue. We are experiencing the same error (it is sporadic), and there is no way we can reproduce it.

@jayanthvn
Contributor

I understand this issue happens sporadically, but if you could share the CNI version, the node and pod scale, and whether there is any churn in pod scale, we can try to reproduce the issue. Also, next time you see the issue, if you could run sudo bash /opt/cni/bin/aws-cni-support.sh on the node, we can review the kubelet logs and memory consumption on the node. And if you have any tool to monitor the aws-node pod's memory consumption, verifying that it is not growing over time would rule out an aws-node issue.

Thanks!

@trucnguyenlam

@jayanthvn The CNI version is 1.7.5, and the nodes are controlled by the autoscaler as recommended by the EKS documentation (node type is m5.large).
Unfortunately, our nodes don't have SSM installed by default, so we can't log in to run the command.

I'll have a look at monitoring aws-node memory consumption to check if that is the case.

Cheers!

@brettplarson

brettplarson commented Jun 15, 2021 via email

@trucnguyenlam

I'm still searching for an effective automated way to handle this issue; currently the only fix is to manually terminate the node that is stuck.

@Legion2

Legion2 commented Jul 22, 2021

We upgraded our cluster from 1.19 to 1.20 and upgraded the unmanaged vpc-cni addon to the vpc-cni 1.8.0 managed EKS add-on. Now we also experience this issue. Sometimes the container also gets killed because its liveness and readiness probes fail.

  Normal   Killing    85m                    kubelet  Container aws-node failed liveness probe, will be restarted
  Normal   Pulling    85m (x2 over 106m)     kubelet  Pulling image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.8.0-eksbuild.1"
  Normal   Created    85m (x2 over 106m)     kubelet  Created container aws-node
  Normal   Pulled     85m                    kubelet  Successfully pulled image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.8.0-eksbuild.1" in 158.21763ms
  Normal   Started    85m (x2 over 106m)     kubelet  Started container aws-node
  Warning  Unhealthy  11m (x31 over 106m)    kubelet  Readiness probe failed:
  Warning  Unhealthy  4m57s (x28 over 100m)  kubelet  Liveness probe failed:

The node is a m5.large and it hosts 29 (of 29 possible) Pods. At the time of the last crash the node had a memory usage of 63.3% (332 MiB free memory). The pod has a constant memory usage of around 46.3 MiB as shown in the graph below:

(graph: aws-node pod memory usage, roughly constant at ~46 MiB)

The restart at 10 AM was triggered by a failed health probe and the restart at 11 AM was triggered by 137 exit code.

In our cluster we use resource requests and limits, and the aws-node pods do not define a memory request, so they will be killed first if the node runs out of memory.

Note: The node was initially started with a different cni-plugin version, because of the Kubernetes version upgrade

@Legion2

Legion2 commented Jul 23, 2021

The strange thing is that the health probes fail without any reason message after ... probe failed:

@mickael-ange

mickael-ange commented Jul 26, 2021

For a month, after upgrading 4 of our EKS clusters from 1.17 to 1.20 and moving to AWS-managed EKS node groups, we have been seeing readiness and liveness failures for aws-node DaemonSet pods after we deploy/update our application deployments running in another namespace.

The error does not occur at each deployment update, only sporadically.

In the new clusters I use the default add-ons; I have not upgraded them.

kubectl -n kube-system describe pod aws-node-cc26m

Normal   Scheduled  49m                default-scheduler                                         Successfully assigned kube-system/aws-node-cc26m to ip-172-26-9-103.ap-southeast-1.compute.internal
  Normal   Pulling    49m                kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Pulling image "602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon-k8s-cni-init:v1.7.5-eksbuild.1"
  Normal   Pulled     49m                kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Successfully pulled image "602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon-k8s-cni-init:v1.7.5-eksbuild.1" in 6.284508527s
  Normal   Created    49m                kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Created container aws-vpc-cni-init
  Normal   Started    49m                kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Started container aws-vpc-cni-init
  Normal   Pulling    49m                kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Pulling image "602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1"
  Normal   Pulled     48m                kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Successfully pulled image "602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1" in 2.246459067s
  Normal   Created    48m                kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Created container aws-node
  Normal   Started    48m                kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Started container aws-node
  Warning  Unhealthy  17m (x2 over 17m)  kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Readiness probe failed:
  Warning  Unhealthy  7m52s              kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Liveness probe failed:

I also think it is strange that there is no reason message after Liveness probe failed: and Readiness probe failed:

Regards

@Legion2

Legion2 commented Jul 26, 2021

I changed the timeoutSeconds of the aws-node livenessProbe to 5s. Since then (2 days) the pods have not restarted, although the probes still fail occasionally.

I also found that all previous container restarts can be correlated with very high CPU utilization on the node (100%) lasting 30 seconds or longer. That is enough time for three health probe timeouts, which triggers a container restart.

I think that because the pod only has a CPU resource request of 10m, it does not get enough CPU time to respond to the health probe in under 1 second. To guarantee such low response times, the pod would need more guaranteed CPU time. However, such a low response time is not required all the time, and restarting the pod does not resolve slow response times when the node is under very high CPU utilization. So increasing the health probe timeout is the only solution for this problem.
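For reference, the tuning described above can be expressed in the aws-node container spec. This is an illustrative fragment, not the shipped manifest: the probe command, delays, and periods are taken from the container status quoted later in this thread, and only timeoutSeconds is changed from the default of 1.

```yaml
# aws-node container spec fragment (values illustrative):
livenessProbe:
  exec:
    command: ["/app/grpc-health-probe", "-addr=:50051"]
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 5      # default is 1s; too short under CPU pressure
readinessProbe:
  exec:
    command: ["/app/grpc-health-probe", "-addr=:50051"]
  initialDelaySeconds: 1
  periodSeconds: 10
  timeoutSeconds: 5
```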

@sarbajitdutta

@jayanthvn We have the same issue. We updated from 1.19 to 1.20 and are also using the CNI add-on v1.8.0. aws-node keeps restarting when cluster-autoscaler scales up. I have tried increasing the liveness probe timeout, but it does not help.

@jayanthvn
Contributor

Sorry for the delayed response.

@Legion2 - Thanks for letting us know, I will look into it.

@sarbajitdutta - You mean aws-node keeps restarting on the new nodes? Can you please share the log lines from /var/log/aws-routed-eni/ipamd.log before a restart? Each restart will start with the log line "Starting L-..."

@mikkotimoharju

mikkotimoharju commented Aug 19, 2021

We are also encountering these random livenessProbe failures after upgrading to EKS 1.21. Increasing the timeout helps with the restarts. The reason this has started happening now is a bug fix in Kubernetes 1.20: Fixing Kubelet Exec Probe Timeouts and KEP-1972: kubelet exec probe timeouts. Previously, liveness and readiness exec probes did not respect the timeout and ran indefinitely. Now the default timeout of 1s is respected, but it is sometimes too short, causing probes to fail and pods to restart.

@jayanthvn
Contributor

Thanks @mikkotimoharju for sharing this. We will look into it on adjusting the timeouts.

@laupow

laupow commented Aug 25, 2021

Another version of this story:

  • EKS 1.19 (nodes still at 1.17)
  • CNI is self-managed for now
  • running v1.8.0-eksbuild.1

When aws-cluster-autoscaler triggers a scale-up that results in 10-12 new nodes, we see the CNI plugin crash (exit 137) at about the same time.

CNI restarts (a 1d rate graph)

CNI pods are crashing at the beginning of the scale event (not at scale down)

HPA scale (showing pods)

These scale events resulted in 10-12 new nodes coming online together.

Containers:
  aws-node:
    Container ID:   docker://5f77d4ef41eb097df676032c48e76745662a0c679c9018b34cf75d9ed67ee446
    Image:          602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.8.0-eksbuild.1
    Image ID:       docker-pullable://602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni@sha256:bbc7bd93b5c71ef6525d6d91883442a89e3b1d0e82781750f2cecc8bddc9e1a1
    Port:           61678/TCP
    Host Port:      61678/TCP
    State:          Running
      Started:      Tue, 24 Aug 2021 13:00:55 -0500
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Tue, 24 Aug 2021 12:59:18 -0500
      Finished:     Tue, 24 Aug 2021 13:00:54 -0500
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:      10m
    Liveness:   exec [/app/grpc-health-probe -addr=:50051] delay=60s timeout=1s period=10s #success=1 #failure=3
    Readiness:  exec [/app/grpc-health-probe -addr=:50051] delay=1s timeout=1s period=10s #success=1 #failure=3

We aren't seeing stuck pods in ContainerCreating, though.

@jayanthvn
Contributor

jayanthvn commented Aug 31, 2021

Hi,

There are two issues mentioned here:

  1. With EKS 1.21, the exec probe timeout is now honored because of the bug fix in 1.20, as mentioned in #1425 (comment). The timeout can be configured based on the load on the system; we can document this, similar to Configure Probes [https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/]. /cc @haouc
  2. During an autoscaling event, aws-node crashed with exit code 137, which is OOM. If you could open a support case and share the CNI/kubelet logs, it would help us confirm the behavior. /cc @sarbajitdutta and @laupow

Thanks.

@achevuru
Contributor

In an ideal scenario, the default timeoutSeconds value (1) should be good enough as long as the aws-node pod is not starved of CPU. So even with the exec probe timeout being enforced from 1.20+, we should be OK, and these values can be adjusted based on the use case; i.e., if you expect higher CPU load on the system, the CPU resource request and timeout values can be adjusted accordingly. We can document this.

@laupow

laupow commented Sep 1, 2021

@jayanthvn thanks, case opened (8817394681)

From the logs, it looks like the container can't communicate with the cluster at boot and therefore fails health checks:

{"level":"info","ts":"2021-09-01T18:06:51.513Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.8.0-eksbuild.1  ..."}
{"level":"info","ts":"2021-09-01T18:06:51.514Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2021-09-01T18:08:24.001Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.8.0-eksbuild.1  ..."}
{"level":"info","ts":"2021-09-01T18:08:24.002Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2021-09-01T18:08:24.010Z","caller":"aws-k8s-agent/main.go:42","msg":"Successful communication with the Cluster! Cluster Version is: v1.19+. git version: v1.19.13-eks-8df270. git tree state: clean. commit: 8df2700a72a2598fa3a67c05126fa158fd839620. platform: linux/amd64"}
{"level":"warn","ts":"2021-09-01T18:08:27.030Z","caller":"awssession/session.go:64","msg":"HTTP_TIMEOUT env is not set or set to less than 10 seconds, defaulting to httpTimeout to 10sec"}
{"level":"debug","ts":"2021-09-01T18:08:27.031Z","caller":"ipamd/ipamd.go:298","msg":"Discovered region: us-east-1"}
{"level":"info","ts":"2021-09-01T18:08:27.031Z","caller":"ipamd/ipamd.go:298","msg":"Custom networking false"}
{"level":"debug","ts":"2021-09-01T18:08:27.032Z","caller":"awsutils/awsutils.go:357","msg":"Found availability zone: us-east-1f "}

@jayanthvn
Contributor

Thanks @laupow

Can you please check kube-proxy pod logs and confirm if you are seeing logs similar to below -

E0905 23:12:40.388938       7 node.go:124] Failed to retrieve node info: nodes "ip-*" not found
E0905 23:12:41.516857       7 node.go:124] Failed to retrieve node info: nodes "ip-*" not found
E0905 23:12:43.567271       7 node.go:124] Failed to retrieve node info: nodes "ip-*" not found
E0905 23:12:48.167166       7 node.go:124] Failed to retrieve node info: nodes "ip-*" not found
E0905 23:12:56.325941       7 node.go:124] Failed to retrieve node info: nodes "ip-*" not found
E0905 23:13:14.684106       7 node.go:124] Failed to retrieve node info: nodes "ip-*" not found

@laupow

laupow commented Sep 1, 2021

Can you please check kube-proxy pod logs and confirm if you are seeing logs similar to below -

Unfortunately I don't see those logs on any nodes with crashed CNI pods. I only see Successfully retrieved node IP. I attached full logs to the ticket.

@deva1987m

Hi,
Do we see a similar issue in AWS EKS 1.21 as well?

@vioan

vioan commented Sep 2, 2021

Any update on this issue? We see the same problem. Will updating to 1.21 fix it?

@sstoyanovucsd

@achevuru, the default liveness probe timeout of 1s is not enough on any cluster that runs burstable workloads allowed to scale beyond the resources of a node. Maybe the default timeout value should be adjusted? It is also possible that bumping resources.requests.cpu above 10m would be sufficient to keep the probe from being starved of CPU and let it always finish within the default 1 second.

@jayanthvn
Contributor

@laupow - Sure, we will check in the ticket whether kube-proxy is taking a long time to start, leading aws-node to restart.

@deva1987m and @vioan - Sorry, this thread has diverged into two issues, as mentioned above. Are you referring to OOM kills (137) of aws-node pods? If so, can you please check CPU utilization on the instance? Also, I feel we need to set CPU and memory limits for aws-node.

@sarbajitdutta

sarbajitdutta commented Sep 7, 2021

@jayanthvn When cluster-autoscaler scales up, the readiness probe fails for aws-node. It's stuck here -

I0907 16:40:46.644131      10 request.go:621] Throttling request took 1.045715446s, request: GET:https://172.20.0.1:443/apis/apiextensions.k8s.io/v1beta1?timeout=32s

The successful ones have this log messages -

{"level":"info","ts":"2021-09-07T16:40:49.507Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2021-09-07T16:40:49.511Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2021-09-07T16:40:49.512Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}

Error message - Readiness probe failed: {"level":"info","ts":"2021-09-07T16:46:36.839Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
This causes the node to stay in NotReady status forever. I am using the latest version of the CNI, and the Kubernetes version is 1.20.

@AlKapkone

AlKapkone commented Sep 8, 2021

Hello, any updates on this issue?
Please, fix it ASAP! Thanks!

@crechandan

crechandan commented Mar 30, 2022

@crechandan for us upgrading the CNI worked like a charm. The AWS-recommended CNI version for K8s v1.20 is 1.10.1 (see the table here). We do not use the add-on, so we basically apply the YAML from the official CNI repo: current release aws-k8s-cni.yaml

@dkrzyszczyk
Thanks for the response.

Yes, when we deploy aws-cni with kubectl apply as described at https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html, it works for me as well, and the correct AWS CNI version is picked up.

But the issue occurs only when we deploy the aws-cni EKS add-on using the CloudFormation EKS add-on feature:

EksClusterAddon:
  Type: AWS::EKS::Addon
  Properties:
    AddonName: !Ref EksAddonName
    ClusterName: !Ref EksClusterName
    ResolveConflicts: OVERWRITE
    ServiceAccountRoleArn: !Ref EksServiceRoleArn

Installing aws-cni via CloudFormation seems to pick an old version (which appears to be incompatible).

We are just trying to simplify our EKS add-on deployment by using the CloudFormation EKS add-on feature. When I deploy the kube-proxy and core-dns add-ons via CloudFormation, they work perfectly fine. Any idea when this issue with aws-cni will be resolved?

@jayanthvn
Contributor

@crechandan - To clarify the issue you are seeing: when you use the EKS managed add-on, the CNI version installed is 1.7.5. With 1.7.5, is aws-node never starting, or are you seeing random liveness/readiness probe failures?

I see you have the same question here - #1930 (comment)

To install the 1.10.1 EKS managed add-on, did you try specifying the AddonVersion? If the add-on version is not specified, the default 1.7.5 add-on is installed.

{
  "Type" : "AWS::EKS::Addon",
  "Properties" : {
      "AddonName" : String,
      "AddonVersion" : String,
      "ClusterName" : String,
      "ResolveConflicts" : String,
      "ServiceAccountRoleArn" : String,
      "Tags" : [ Tag, ... ]
    }
}

@christophermetz

We are facing the same issue with EKS 1.21 and vpc-cni 1.10.1.eksbuild and 1.10.2.eksbuild. The issue appears on nodes with high CPU and/or Memory utilization.

We solved it as follows:

  • Uninstall the AWS-managed vpc-cni add-on, since it controls the aws-node DaemonSet and any patches made to the DaemonSet will be reverted. Deletion can be done from the AWS console; make sure to check the "Preserve on cluster" option to keep the pods running.
  • Then add applying aws-k8s-cni.yaml to your automation (Terraform in our case). Increase the livenessProbe timeout as per your needs, and make the QoS class Guaranteed by setting limits and requests equal.
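A sketch of the resources stanza that yields the Guaranteed QoS class mentioned above. The specific values are illustrative, not a recommendation; what matters is that requests equal limits for every resource listed.

```yaml
# aws-node container: requests == limits for all resources puts the
# pod in the Guaranteed QoS class, so it is among the last candidates
# for eviction or OOM-killing under node memory pressure.
resources:
  requests:
    cpu: 25m
    memory: 64Mi
  limits:
    cpu: 25m
    memory: 64Mi
```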

@crechandan

crechandan commented Mar 31, 2022

@crechandan - The issues you are seeing is when you are using EKS Managed Add-on the CNI version installed is 1.7.5 and with 1.7.5 is aws-node never starting or randomly you are seeing liveness/readiness probes failures?

I see you have the same question here - #1930 (comment)

For installing 1.10.1 EKS Managed Add-on did you try giving the AddonVersion? If the Add-on version is not mentioned then default 1.7.5 add-on will be installed.


Yes, I can see the AWS CNI failing its liveness/readiness probes and never starting.

This time I tried the CloudFormation below, with a version specified for the aws-cni EKS add-on:

EksClusterAddon:
  Type: AWS::EKS::Addon
  Properties:
    AddonName: vpc-cni
    AddonVersion: v1.10.1-eksbuild.1
    ClusterName: test-cluster
    ResolveConflicts: OVERWRITE
    ServiceAccountRoleArn: *********

It is still not working. Logs for the aws-cni pod are below:

kubectl logs aws-node-ppft6 -n kube-system
{"level":"info","ts":"2022-03-31T05:04:05.116Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-03-31T05:04:05.122Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-03-31T05:04:05.160Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-03-31T05:04:05.169Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-03-31T05:04:07.263Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
I0331 05:04:07.373968      11 request.go:621] Throttling request took 1.046796523s, request: GET:https://172.20.0.1:443/apis/extensions/v1beta1?timeout=32s
{"level":"info","ts":"2022-03-31T05:04:09.279Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-31T05:04:11.297Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-31T05:04:13.313Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-31T05:04:15.329Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-31T05:04:17.342Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-31T05:04:19.355Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-31T05:04:21.367Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-31T05:04:23.416Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-31T05:04:25.446Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-31T05:04:27.466Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-31T05:04:29.480Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-31T05:04:31.507Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-31T05:04:33.539Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-31T05:04:35.567Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}

All pods on the EKS cluster are waiting for the VPC CNI to be up and running (all are stuck in Pending state).

Kube Proxy Log:

kubectl logs kube-proxy-j5g4r  -n kube-system
I0331 04:57:06.966039       1 flags.go:59] FLAG: --add-dir-header="false"
I0331 04:57:06.966311       1 flags.go:59] FLAG: --alsologtostderr="false"
I0331 04:57:06.966320       1 flags.go:59] FLAG: --bind-address="0.0.0.0"
I0331 04:57:06.966331       1 flags.go:59] FLAG: --bind-address-hard-fail="false"
I0331 04:57:06.966339       1 flags.go:59] FLAG: --cleanup="false"
I0331 04:57:06.966345       1 flags.go:59] FLAG: --cleanup-ipvs="true"
I0331 04:57:06.966350       1 flags.go:59] FLAG: --cluster-cidr=""
I0331 04:57:06.966359       1 flags.go:59] FLAG: --config="/var/lib/kube-proxy-config/config"
I0331 04:57:06.966365       1 flags.go:59] FLAG: --config-sync-period="15m0s"
I0331 04:57:06.966373       1 flags.go:59] FLAG: --conntrack-max-per-core="32768"
I0331 04:57:06.966385       1 flags.go:59] FLAG: --conntrack-min="131072"
I0331 04:57:06.966391       1 flags.go:59] FLAG: --conntrack-tcp-timeout-close-wait="1h0m0s"
I0331 04:57:06.966397       1 flags.go:59] FLAG: --conntrack-tcp-timeout-established="24h0m0s"
I0331 04:57:06.966404       1 flags.go:59] FLAG: --detect-local-mode=""
I0331 04:57:06.966416       1 flags.go:59] FLAG: --feature-gates=""
I0331 04:57:06.966426       1 flags.go:59] FLAG: --healthz-bind-address="0.0.0.0:10256"
I0331 04:57:06.966434       1 flags.go:59] FLAG: --healthz-port="10256"
I0331 04:57:06.966439       1 flags.go:59] FLAG: --help="false"
I0331 04:57:06.966444       1 flags.go:59] FLAG: --hostname-override=""
I0331 04:57:06.966449       1 flags.go:59] FLAG: --iptables-masquerade-bit="14"
I0331 04:57:06.966454       1 flags.go:59] FLAG: --iptables-min-sync-period="1s"
I0331 04:57:06.966458       1 flags.go:59] FLAG: --iptables-sync-period="30s"
I0331 04:57:06.966463       1 flags.go:59] FLAG: --ipvs-exclude-cidrs="[]"
I0331 04:57:06.966476       1 flags.go:59] FLAG: --ipvs-min-sync-period="0s"
I0331 04:57:06.966481       1 flags.go:59] FLAG: --ipvs-scheduler=""
I0331 04:57:06.966487       1 flags.go:59] FLAG: --ipvs-strict-arp="false"
I0331 04:57:06.966492       1 flags.go:59] FLAG: --ipvs-sync-period="30s"
I0331 04:57:06.966497       1 flags.go:59] FLAG: --ipvs-tcp-timeout="0s"
I0331 04:57:06.966502       1 flags.go:59] FLAG: --ipvs-tcpfin-timeout="0s"
I0331 04:57:06.966506       1 flags.go:59] FLAG: --ipvs-udp-timeout="0s"
I0331 04:57:06.966511       1 flags.go:59] FLAG: --kube-api-burst="10"
I0331 04:57:06.966516       1 flags.go:59] FLAG: --kube-api-content-type="application/vnd.kubernetes.protobuf"
I0331 04:57:06.966522       1 flags.go:59] FLAG: --kube-api-qps="5"
I0331 04:57:06.966533       1 flags.go:59] FLAG: --kubeconfig=""
I0331 04:57:06.966539       1 flags.go:59] FLAG: --log-backtrace-at=":0"
I0331 04:57:06.966551       1 flags.go:59] FLAG: --log-dir=""
I0331 04:57:06.966557       1 flags.go:59] FLAG: --log-file=""
I0331 04:57:06.966562       1 flags.go:59] FLAG: --log-file-max-size="1800"
I0331 04:57:06.966568       1 flags.go:59] FLAG: --log-flush-frequency="5s"
I0331 04:57:06.966573       1 flags.go:59] FLAG: --logtostderr="true"
I0331 04:57:06.966578       1 flags.go:59] FLAG: --masquerade-all="false"
I0331 04:57:06.966584       1 flags.go:59] FLAG: --master=""
I0331 04:57:06.966588       1 flags.go:59] FLAG: --metrics-bind-address="127.0.0.1:10249"
I0331 04:57:06.966595       1 flags.go:59] FLAG: --metrics-port="10249"
I0331 04:57:06.966601       1 flags.go:59] FLAG: --nodeport-addresses="[]"
I0331 04:57:06.966617       1 flags.go:59] FLAG: --one-output="false"
I0331 04:57:06.966624       1 flags.go:59] FLAG: --oom-score-adj="-999"
I0331 04:57:06.966630       1 flags.go:59] FLAG: --profiling="false"
I0331 04:57:06.966636       1 flags.go:59] FLAG: --proxy-mode=""
I0331 04:57:06.966643       1 flags.go:59] FLAG: --proxy-port-range=""
I0331 04:57:06.966650       1 flags.go:59] FLAG: --show-hidden-metrics-for-version=""
I0331 04:57:06.966656       1 flags.go:59] FLAG: --skip-headers="false"
I0331 04:57:06.966661       1 flags.go:59] FLAG: --skip-log-headers="false"
I0331 04:57:06.966666       1 flags.go:59] FLAG: --stderrthreshold="2"
I0331 04:57:06.966673       1 flags.go:59] FLAG: --udp-timeout="250ms"
I0331 04:57:06.966679       1 flags.go:59] FLAG: --v="2"
I0331 04:57:06.966685       1 flags.go:59] FLAG: --version="false"
I0331 04:57:06.966695       1 flags.go:59] FLAG: --vmodule=""
I0331 04:57:06.966705       1 flags.go:59] FLAG: --write-config-to=""
I0331 04:57:06.969232       1 feature_gate.go:243] feature gates: &{map[]}
I0331 04:57:06.969398       1 feature_gate.go:243] feature gates: &{map[]}
E0331 04:57:07.080991       1 node.go:161] Failed to retrieve node info: nodes "ip*" not found
E0331 04:57:08.230062       1 node.go:161] Failed to retrieve node info: nodes "ip-*" not found
E0331 04:57:10.610745       1 node.go:161] Failed to retrieve node info: nodes "ip-*" not found
E0331 04:57:15.115012       1 node.go:161] Failed to retrieve node info: nodes "ip-*" not found
E0331 04:57:24.624415       1 node.go:161] Failed to retrieve node info: nodes "ip-*" not found
E0331 04:57:43.468204       1 node.go:161] Failed to retrieve node info: nodes "ip-*" not found_
I0331 04:57:43.468254       1 server_others.go:437] can't determine this node's IP, assuming 127.0.0.1; if this is incorrect, please set the --bind-address flag
I0331 04:57:43.468284       1 server_others.go:142] kube-proxy node IP is an IPv4 address (127.0.0.1), assume IPv4 operation
I0331 04:57:43.488108       1 server_others.go:182] DetectLocalMode: 'ClusterCIDR'
I0331 04:57:43.488342       1 server_others.go:185] Using iptables Proxier.
W0331 04:57:43.488463       1 server_others.go:461] detect-local-mode set to ClusterCIDR, but no cluster CIDR defined
I0331 04:57:43.488484       1 server_others.go:472] detect-local-mode: ClusterCIDR , defaulting to no-op detect-local
I0331 04:57:43.488666       1 utils.go:321] Changed sysctl "net/ipv4/conf/all/route_localnet": 0 -> 1
I0331 04:57:43.488944       1 proxier.go:287] iptables(IPv4) masquerade mark: 0x00004000
I0331 04:57:43.489110       1 proxier.go:334] iptables(IPv4) sync params: minSyncPeriod=1s, syncPeriod=30s, burstSyncs=2
I0331 04:57:43.489287       1 proxier.go:346] iptables(IPv4) supports --random-fully
I0331 04:57:43.489770       1 server.go:650] Version: v1.20.7-eks-0da02c
I0331 04:57:43.490403       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0331 04:57:43.490527       1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0331 04:57:43.491445       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I0331 04:57:43.491576       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I0331 04:57:43.491858       1 config.go:315] Starting service config controller
I0331 04:57:43.491889       1 shared_informer.go:240] Waiting for caches to sync for service config
I0331 04:57:43.491923       1 config.go:224] Starting endpoint slice config controller
I0331 04:57:43.491943       1 shared_informer.go:240] Waiting for caches to sync for endpoint slice config
I0331 04:57:43.492178       1 reflector.go:219] Starting reflector *v1.Service (15m0s) from k8s.io/client-go/informers/factory.go:134
I0331 04:57:43.492799       1 reflector.go:219] Starting reflector *v1beta1.EndpointSlice (15m0s) from k8s.io/client-go/informers/factory.go:134
I0331 04:57:43.501265       1 service.go:275] Service kube-system/kube-dns updated: 2 ports
I0331 04:57:43.501536       1 service.go:275] Service kube-system/tiller-deploy updated: 1 ports
I0331 04:57:43.501709       1 service.go:275] Service kube-system/metrics-server updated: 1 ports
I0331 04:57:43.501826       1 service.go:275] Service velero/velero updated: 1 ports
I0331 04:57:43.501910       1 service.go:275] Service kube-system/calico-typha updated: 1 ports
I0331 04:57:43.502003       1 service.go:275] Service default/kubernetes updated: 1 ports
I0331 04:57:43.592109       1 shared_informer.go:247] Caches are synced for service config
I0331 04:57:43.592154       1 shared_informer.go:247] Caches are synced for endpoint slice config
I0331 04:57:43.592463       1 service.go:390] Adding new service port "default/kubernetes:https" at 172.20.0.1:443/TCP
I0331 04:57:43.592499       1 service.go:390] Adding new service port "kube-system/kube-dns:dns" at 172.20.0.10:53/UDP
I0331 04:57:43.592518       1 service.go:390] Adding new service port "kube-system/kube-dns:dns-tcp" at 172.20.0.10:53/TCP
I0331 04:57:43.592539       1 service.go:390] Adding new service port "kube-system/tiller-deploy:tiller" at 172.20.83.48:44134/TCP
I0331 04:57:43.592558       1 service.go:390] Adding new service port "kube-system/metrics-server:https" at 172.20.118.43:443/TCP
I0331 04:57:43.592573       1 service.go:390] Adding new service port "velero/velero:monitoring" at 172.20.80.9:8085/TCP
I0331 04:57:43.592588       1 service.go:390] Adding new service port "kube-system/calico-typha:calico-typha" at 172.20.41.132:5473/TCP
I0331 04:57:43.592677       1 proxier.go:871] Syncing iptables rules
I0331 04:57:43.699700       1 proxier.go:826] syncProxyRules took 107.496815ms
I0331 04:57:43.699992       1 proxier.go:871] Syncing iptables rules
I0331 04:57:43.754491       1 proxier.go:826] syncProxyRules took 54.727153ms

Even after I patched kube-proxy as mentioned in the thread above, the error "1 node.go:161] Failed to retrieve node info: nodes "ip-*" not found" no longer appears, but it has no effect on vpc-cni: the probes are still failing.

  Warning  Unhealthy  3m24s                 kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-31T05:31:55.448Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  3m14s                 kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-31T05:32:05.481Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  3m4s                  kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-31T05:32:15.486Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  2m54s                 kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-31T05:32:25.483Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  2m44s                 kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-31T05:32:35.517Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  2m34s                 kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-31T05:32:45.461Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  2m26s                 kubelet            Liveness probe failed: {"level":"info","ts":"2022-03-31T05:32:53.239Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  2m24s                 kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-31T05:32:55.489Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  2m16s                 kubelet            Liveness probe failed: {"level":"info","ts":"2022-03-31T05:33:03.260Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Normal   Killing    2m6s                  kubelet            Container aws-node failed liveness probe, will be restarted
  Normal   Created    116s (x2 over 3m37s)  kubelet            Created container aws-node

@rpf3
Copy link

rpf3 commented Mar 31, 2022

I'm seeing the same issue, but I've noticed that it only occurs if I explicitly set the service-account-role-arn flag. If I let the add-on inherit the role from the node as described in the documentation, the pods start without issue. My cluster also has an OIDC provider associated with it, as required by the docs for using an existing IAM role.

@crechandan
Copy link

@rpf3 Thanks a lot. When I used the same CloudFormation template without the service-account-role-arn flag, it worked perfectly. I deployed kube-proxy, CoreDNS, and aws-cni the same way, and all of them work with the CloudFormation EKS add-on feature.
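
For reference, a minimal CloudFormation sketch of an add-on resource that inherits the node role (cluster reference and version are placeholders, not values from this thread):

```yaml
# Hypothetical template fragment: AWS::EKS::Addon with no
# ServiceAccountRoleArn, so the add-on falls back to the node's IAM role.
Resources:
  VpcCniAddon:
    Type: AWS::EKS::Addon
    Properties:
      AddonName: vpc-cni
      ClusterName: !Ref MyCluster        # placeholder cluster reference
      AddonVersion: v1.11.4-eksbuild.1   # placeholder version
      ResolveConflicts: OVERWRITE
      # ServiceAccountRoleArn deliberately omitted
```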

@github-actions
Copy link

github-actions bot commented Jun 7, 2022

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Jun 7, 2022
@jayanthvn
Copy link
Contributor

/not stale

@github-actions github-actions bot removed the stale Issue or PR is stale label Jun 8, 2022
@ido-vcita
Copy link

Just upgraded to EKS 1.21 and updated the aws-node VPC CNI to 1.11.2 and this started to happen...
What is the status here?

@achevuru
Copy link
Contributor

@ido-vcita Please check resource utilization (on the nodes) and if required adjust the timeout values for your liveness/readiness probes. With 1.21 exec probe timeouts are honored - kubernetes/enhancements#1972
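
Concretely, that means raising `timeoutSeconds` on the aws-node probes. A hypothetical excerpt of the DaemonSet container spec (values illustrative; the exact probe command varies by CNI version):

```yaml
# Illustrative aws-node probe settings; starting with Kubernetes 1.21,
# the exec probe timeout (default 1s) is actually enforced.
livenessProbe:
  exec:
    command: ["/app/grpc-health-probe", "-addr=:50051"]
  initialDelaySeconds: 60
  timeoutSeconds: 10    # raised from the 1s default
readinessProbe:
  exec:
    command: ["/app/grpc-health-probe", "-addr=:50051"]
  initialDelaySeconds: 1
  timeoutSeconds: 10
```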

@sidewinder12s
Copy link

I've had all recommended settings applied as well and was still seeing this behavior when a node was running hot with workloads. Bumping/setting CPU/memory requests helped somewhat, but errors still seem to crop up (we have thousands of nodes we scale in and out daily).

@KIVagant
Copy link

KIVagant commented Jun 22, 2022

Our setup:

  • EKS 1.22 based on the official AMI
  • CNI is installed using the Helm chart
  • A separate role is created with two policies: arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy and arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
  • ENABLE_IPv6: "false", ENABLE_IPv4: "true" set from the Helm chart values
  • OIDC is installed and confirmed
  • kubectl exec -n kube-system aws-node-f4t8s -c aws-node -- env | grep AWS_ROLE_ARN shows the custom Role ARN.
  • the underlying EC2 instance has an attached Role with the arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy attached to it among the others.

When the custom Role is provided, aws-node is killed by K8s due to probe timeouts (Container aws-node failed liveness probe, will be restarted), and the logs contain many of these messages: Retrying waiting for IPAM-D.

When the custom Role is disabled, all pods start almost immediately (2 retries in the logs).

The Trust Relationship for that Role looks essentially identical to the one generated by eksctl:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::YYYYYY-provider/oidc.eks.us-east-1.amazonaws.com/id/XXXXXXXX"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "oidc.eks.us-east-1.amazonaws.com/id/XXXXXXXX:aud": "sts.amazonaws.com",
                    "oidc.eks.us-east-1.amazonaws.com/id/XXXXXXXX:sub": "system:serviceaccount:kube-system:aws-node"
                }
            }
        }
    ]
}
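
A quick sanity check on such a trust policy can be scripted. A hedged sketch (the account id and OIDC id below are placeholders, and the checker itself is hypothetical, not part of any AWS tooling):

```python
import json

# Placeholder OIDC issuer, matching the redacted value in the policy above.
OIDC = "oidc.eks.us-east-1.amazonaws.com/id/XXXXXXXX"

def check_irsa_trust(policy: dict, oidc: str = OIDC) -> list:
    """Return a list of problems found in an IRSA trust policy that should
    let kube-system/aws-node assume the role. Empty list = looks OK."""
    problems = []
    for stmt in policy.get("Statement", []):
        if stmt.get("Action") != "sts:AssumeRoleWithWebIdentity":
            problems.append("Action is not sts:AssumeRoleWithWebIdentity")
        cond = stmt.get("Condition", {}).get("StringEquals", {})
        if cond.get(f"{oidc}:aud") != "sts.amazonaws.com":
            problems.append("aud condition is not sts.amazonaws.com")
        if cond.get(f"{oidc}:sub") != "system:serviceaccount:kube-system:aws-node":
            problems.append("sub condition does not match the aws-node service account")
    return problems

# A well-formed policy (placeholder account id) produces no findings.
policy = json.loads("""{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Federated": "arn:aws:iam::111122223333:oidc-provider/%s"},
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {"StringEquals": {
      "%s:aud": "sts.amazonaws.com",
      "%s:sub": "system:serviceaccount:kube-system:aws-node"
    }}
  }]
}""" % (OIDC, OIDC, OIDC))

print(check_irsa_trust(policy))  # → []
```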

@achevuru
Copy link
Contributor

@KIVagant I believe CNI is running in to permission issues - maybe not able to assume the custom role. Can you check the CNI/IPAMD logs? You should see 403s in the logs..

@KIVagant
Copy link

Thank you, @achevuru . Today I learned how to troubleshoot the CNI. It turned out the Role ARN was incorrectly configured. 🤦

The example line in the Helm chart values has a mistake. Instead of this:

    # To set annotations - serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::<AWS_ACCOUNT_ID>:<IAM_ROLE_NAME>

it must be this (added role/):

    # To set annotations - serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::<AWS_ACCOUNT_ID>:role/<IAM_ROLE_NAME>
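
The broken variant is easy to miss because the chart accepts either string silently. A small hypothetical validator for the expected role-ARN shape (the regex is a simplified approximation of IAM's allowed charset):

```python
import re

# IAM role ARNs look like arn:aws:iam::<12-digit-account>:role/<path/name>.
ROLE_ARN_RE = re.compile(r"^arn:aws:iam::\d{12}:role/[\w+=,.@/-]+$")

def is_valid_role_arn(arn: str) -> bool:
    """True only if the ARN includes the 'role/' resource prefix."""
    return bool(ROLE_ARN_RE.match(arn))

print(is_valid_role_arn("arn:aws:iam::123456789012:role/my-cni-role"))  # True
print(is_valid_role_arn("arn:aws:iam::123456789012:my-cni-role"))       # False: missing "role/"
```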

@github-actions
Copy link

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Sep 21, 2022
@zhengyongtao
Copy link

I've had the same problem these two weeks, has someone found a solution?

@github-actions github-actions bot removed the stale Issue or PR is stale label Oct 3, 2022
@zhengyongtao
Copy link

ipamd log:

{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.43.0"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.43.0/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.60.1"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.60.1/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.47.2"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.47.2/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.46.131"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.46.131/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.61.196"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.61.196/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.49.6"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.49.6/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.41.135"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.41.135/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.38.218"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.38.218/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.39.157"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.39.157/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.59.213"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.59.213/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:653","msg":"Reconcile existing ENI eni-00023922abf62516c IP prefixes"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1351","msg":"Found prefix pool count 0 for eni eni-00023922abf62516c\n"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:653","msg":"Successfully Reconciled ENI/IP pool"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1396","msg":"IP pool stats: Total IPs/Prefixes = 87/0, AssignedIPs/CooldownIPs: 31/0, c.maxIPsPerENI = 29"}
command terminated with exit code 137

aws-node log:

# kubectl logs -f aws-node-zdp6x --tail 30 -n kube-system  
{"level":"info","ts":"2022-10-02T14:56:07.820Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-10-02T14:56:07.821Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-10-02T14:56:07.833Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-10-02T14:56:07.834Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-10-02T14:56:09.841Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:11.847Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:13.853Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:15.860Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:17.866Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:19.872Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:21.878Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:23.884Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:25.890Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:27.897Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:29.903Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:31.909Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:33.916Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:35.922Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:37.928Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:39.934Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:41.940Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:43.947Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:45.953Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:47.959Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:49.966Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}

Event screenshots: (screenshot not reproduced here)

I use cluster-autoscaler for auto-scaling; the k8s version is 1.22. I also followed the troubleshooting guide https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#known-issues and applied the suggestion:
kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.11.4/config/master/aws-k8s-cni.yaml

Interestingly, this failure usually occurs only on a certain node; when I terminate that node's instance and let the autoscaler bring up a replacement, it starts working.

But after running for a while, it restarts again.
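
The `Retrying waiting for IPAM-D` lines in the aws-node log above come from the entrypoint's connectivity loop, which simply retries a connection to ipamd's gRPC port until it answers. A rough Python sketch of the idea (the parameters are illustrative, not taken from entrypoint.sh):

```python
import socket
import time

def wait_for_ipamd(host="127.0.0.1", port=50051, retries=30, delay=2.0, timeout=5.0):
    """Retry a plain TCP connect to ipamd's gRPC port; return True once it
    accepts, False after exhausting retries. Each failed attempt roughly
    corresponds to one 'Retrying waiting for IPAM-D' log line."""
    for _ in range(retries):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            time.sleep(delay)
    return False
```

If ipamd never binds the port (e.g. it is stuck on credential errors), the loop exhausts its retries and the probe-driven restart cycle seen in this thread follows.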

@github-actions
Copy link

github-actions bot commented Dec 4, 2022

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Dec 4, 2022
@github-actions
Copy link

Issue closed due to inactivity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 19, 2022
Bugs automation moved this from Needs triage to Closed Dec 19, 2022