EKS AMI v20210329 Issues #648

Closed
niroowns opened this issue Apr 6, 2021 · 23 comments

niroowns commented Apr 6, 2021

What happened: EC2 instances took more than 30 minutes to join a cluster. Shortly after an instance is launched, the instances in question appear to be CPU starved and don't even respond to SSH. Certain things appear to be running before the preBootstrap commands are even invoked.

What you expected to happen: New instances using AMI v20210329 to join a cluster in a reasonable amount of time

How to reproduce it (as minimally and precisely as possible): Creating an "unmanaged" node group using eksctl with the latest AMI (v20210329)

Anything else we need to know?: It appears that, due to some new additions, the AMI now takes approximately 30+ minutes to come up with a properly configured, working kubelet. This previously took approximately 5 minutes.

Environment:

  • AWS Region: at minimum us-east-1/us-east-2
  • Instance Type(s): All EKS Supported
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.4
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.18
  • AMI Version: amazon-eks-node-1.18-v20210329
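
For reference, a quick way to check what each node actually reports (plain kubectl, nothing AMI-specific; the custom-columns paths are standard node status fields):

# Kubelet version, OS image, kernel version and container runtime per node
kubectl get nodes -o wide

# Or just the fields of interest
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion,RUNTIME:.status.nodeInfo.containerRuntimeVersion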

jangrewe commented Apr 8, 2021

Hi, we're seeing a similar behavior with the current (v20210329) AMI versions for 1.18 and 1.19 in eu-west-1.

We were running the 1.18 AMI (v20210310) on our nodes, then upgraded the cluster to 1.19 and the AMI to the latest 1.19 build, which was v20210329. A couple of minutes after being provisioned, the new nodes randomly became NotReady as soon as workloads were deployed to them, complaining that the PLEG was not healthy.

Reverting to the latest 1.18 AMI (v20210329) caused the same behaviour.
We had to explicitly go back to the previous 1.18 version (v20210310) to keep this from happening, which tipped us off that all current (v20210329) AMI builds, regardless of the EKS version, seem to cause this.

We then updated the AMI to 1.19 again, but to the previous version (v20210322), and that allowed the nodes to become Ready again.

For reference, we've opened support case #8190279821 on Thursday about this, because there was no GitHub issue about this back then.

TL;DR: Revert to the following AMI versions to get stable nodes again (a rough eksctl sketch for pinning the AMI follows the list):

  • EKS 1.18: amazon-eks-node-1.18-v20210310
  • EKS 1.19: amazon-eks-node-1.19-v20210322
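
If you run unmanaged nodegroups with eksctl, pinning a specific AMI looks roughly like the command below. This is a sketch, not our exact setup: the cluster name and image ID are placeholders, and you should check the --node-ami flag against the eksctl version you run.

# Create an unmanaged nodegroup pinned to a specific image ID
# (substitute the amazon-eks-node-1.18-v20210310 ID for your region)
eksctl create nodegroup \
  --cluster my-cluster \
  --name ng-pinned-ami \
  --node-type m5.xlarge \
  --nodes 3 \
  --node-ami ami-0123456789abcdef0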


hikhvar commented Apr 8, 2021

We experience this as well. We moved to the 1.19 v20210329 AMI yesterday. Since then, some nodes behave normally at first, but after some time the PLEG becomes unhealthy and the nodes flap between Ready/NotReady. We cannot terminate pods on those nodes, nor spawn new pods. Existing workloads do not seem to be affected.
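
In case it helps, this is roughly how we watched the flapping from the API side (plain kubectl, nothing AMI-specific; the node name is a placeholder):

# Stream node status changes as they flap
kubectl get nodes -w

# Print the kubelet's Ready condition message, which carries the "PLEG is not healthy" text
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'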


cablespaghetti commented Apr 8, 2021

I have also opened a support case and they referred me over here. Rolling back to the 20210310 AMI which we were on previously seems to have resolved the issue.

The only thing I have to add is that I wonder if the kernel patch intended to fix an I/O regression has actually made something worse in Docker performance. On clusters where we run a comparatively small number of pods per node, things are still fine with the new AMI; it is only when we have in the region of 50+ pods per node that we start seeing this behaviour.
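
For anyone trying to correlate with pod density, a rough way to count pods per node (standard kubectl; the node name is a placeholder):

# Count pods scheduled on a given node, across all namespaces
NODE=ip-10-0-0-1.eu-west-1.compute.internal
kubectl get pods --all-namespaces --field-selector spec.nodeName="$NODE" -o name | wc -l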


hgranillo commented Apr 8, 2021

Hi, I've also been observing the same behavior with amazon-eks-node-1.19-v20210329 in us-east-1.

Containers seem to get stuck in "ContainerCreating" status, and as soon as this happens the node starts reporting "NotReady"; the following gets logged to /var/log/messages:

Apr  8 07:44:28 ip-172-27-16-222 kubelet: {"level":"info","ts":"2021-04-08T07:44:28.615Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.5 ..."}
Apr  8 07:44:29 ip-172-27-16-222 kubelet: E0408 07:44:29.444932    6105 pod_workers.go:191] Error syncing pod 612add33-9269-4fdb-8971-54d974e01275 ("priips-api-management-b5664c45f-z8jjk_uat-priips(612add33-9269-4fdb-8971-54d974e01275)"), skipping: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Apr  8 07:44:33 ip-172-27-16-222 kubelet: {"level":"info","ts":"2021-04-08T07:44:33.627Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.5 ..."}
Apr  8 07:44:35 ip-172-27-16-222 kubelet: I0408 07:44:35.080669    6105 setters.go:77] Using node IP: "172.27.16.222"
Apr  8 07:44:35 ip-172-27-16-222 kubelet: E0408 07:44:35.444942    6105 pod_workers.go:191] Error syncing pod 369b1564-cc39-46d7-9fc6-1fd47757dc6b ("wsdstub-69ccbc5888-2cqfv_uat-priips(369b1564-cc39-46d7-9fc6-1fd47757dc6b)"), skipping: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Apr  8 07:44:38 ip-172-27-16-222 kubelet: {"level":"info","ts":"2021-04-08T07:44:38.639Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.5 ..."}
Apr  8 07:44:42 ip-172-27-16-222 kubelet: E0408 07:44:42.445107    6105 pod_workers.go:191] Error syncing pod 612add33-9269-4fdb-8971-54d974e01275 ("priips-api-management-b5664c45f-z8jjk_uat-priips(612add33-9269-4fdb-8971-54d974e01275)"), skipping: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Apr  8 07:44:43 ip-172-27-16-222 kubelet: {"level":"info","ts":"2021-04-08T07:44:43.651Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.5 ..."}
Apr  8 07:44:45 ip-172-27-16-222 kubelet: I0408 07:44:45.123933    6105 setters.go:77] Using node IP: "172.27.16.222"
Apr  8 07:44:45 ip-172-27-16-222 kubelet: I0408 07:44:45.155364    6105 kubelet_node_status.go:554] Recording NodeNotReady event message for node ip-172-27-16-222.ec2.internal
Apr  8 07:44:45 ip-172-27-16-222 kubelet: I0408 07:44:45.155407    6105 setters.go:555] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2021-04-08 07:44:45.155341572 +0000 UTC m=+175327.489210555 LastTransitionTime:2021-04-08 07:44:45.155341572 +0000 UTC m=+175327.489210555 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m0.092835925s ago; threshold is 3m0s}
Apr  8 07:44:45 ip-172-27-16-222 kubelet: E0408 07:44:45.444951    6105 kubelet.go:1770] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m0.38241197s ago; threshold is 3m0s
Apr  8 07:44:45 ip-172-27-16-222 kubelet: E0408 07:44:45.545164    6105 kubelet.go:1770] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m0.482636014s ago; threshold is 3m0s
Apr  8 07:44:45 ip-172-27-16-222 kubelet: E0408 07:44:45.745391    6105 kubelet.go:1770] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m0.682853111s ago; threshold is 3m0s

Something else I noticed: this seems to happen more frequently on nodes that have 30 or more pods.

I mention this because we only observed the issue in our integration environment, where nodes are more "packed" with pods, as opposed to our production cluster, where pods are more spread out and nodes are not as packed.

Switching to v20210322 did the trick and nodes seem to be healthy again.
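
For reference, this is roughly how I spotted it on an affected node before switching (generic commands; on this AMI the kubelet logs to /var/log/messages, as in the snippet above):

# Look for PLEG and runtime timeouts in the kubelet log
sudo grep -E 'PLEG is not healthy|DeadlineExceeded' /var/log/messages | tail -n 20

# Check whether the Docker daemon itself is still responsive
sudo timeout 10 docker ps > /dev/null && echo "docker responsive" || echo "docker hung or slow"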


hikhvar commented Apr 8, 2021

We also encounter this on densely packed nodes with more than 40 pods.


rtripat commented Apr 8, 2021

We are investigating this issue. I'm glad that rolling back to the previous AMI has mitigated the impact in your clusters.


vishalkg commented Apr 8, 2021

Per the details mentioned in the issue, we tried to reproduce it, but unfortunately we couldn't. Here are the details:

  • Launched unmanaged nodegroups on a new EKS cluster with the AMI in question (the one released on 20210329) using eksctl. We didn't see any noticeable difference in nodes joining the cluster; all the nodes joined within 5 minutes.
~ $ eksctl version
0.44.0
~ $ eksctl create cluster prod-118-20210407
...
2021-04-07 18:29:24 [ℹ]  deploying stack "eksctl-prod-118-20210407-nodegroup-ng-a8440f77"
2021-04-07 18:29:24 [ℹ]  waiting for CloudFormation stack "eksctl-prod-118-20210407-nodegroup-ng-a8440f77"
2021-04-07 18:29:57 [ℹ]  waiting for CloudFormation stack "eksctl-prod-118-20210407-nodegroup-ng-a8440f77"
2021-04-07 18:30:34 [ℹ]  waiting for CloudFormation stack "eksctl-prod-118-20210407-nodegroup-ng-a8440f77"
2021-04-07 18:31:13 [ℹ]  waiting for CloudFormation stack "eksctl-prod-118-20210407-nodegroup-ng-a8440f77"
2021-04-07 18:31:49 [ℹ]  waiting for CloudFormation stack "eksctl-prod-118-20210407-nodegroup-ng-a8440f77"
2021-04-07 18:32:24 [ℹ]  waiting for CloudFormation stack "eksctl-prod-118-20210407-nodegroup-ng-a8440f77"
2021-04-07 18:32:58 [ℹ]  waiting for CloudFormation stack "eksctl-prod-118-20210407-nodegroup-ng-a8440f77"
2021-04-07 18:33:14 [ℹ]  waiting for the control plane availability...
2021-04-07 18:33:15 [✔]  saved kubeconfig as "/Users/xxxx/.kube/config"
2021-04-07 18:33:15 [ℹ]  no tasks
2021-04-07 18:33:15 [✔]  all EKS cluster resources for "prod-118-20210407" have been created
2021-04-07 18:33:15 [ℹ]  adding identity "arn:aws:iam::123456789012:role/eksctl-prod-118-20210407-nodegrou-NodeInstanceRole-XXXXXXXXXX" to auth ConfigMap
2021-04-07 18:33:15 [ℹ]  nodegroup "ng-a8440f77" has 0 node(s)
2021-04-07 18:33:15 [ℹ]  waiting for at least 2 node(s) to become ready in "ng-a8440f77"
2021-04-07 18:33:51 [ℹ]  nodegroup "ng-a8440f77" has 2 node(s)
2021-04-07 18:33:51 [ℹ]  node "ip-192-168-36-74.us-west-2.compute.internal" is ready
2021-04-07 18:33:51 [ℹ]  node "ip-192-168-89-133.us-west-2.compute.internal" is ready
2021-04-07 18:33:53 [ℹ]  kubectl command should work with "/Users/xxxx/.kube/config", try 'kubectl get nodes'
2021-04-07 18:33:53 [✔]  EKS cluster "prod-118-20210407" in "us-west-2" region is ready
~ $
  • Launched a managed nodegroup on the latest AMI. Again, we didn't observe any noticeable difference in how the nodes joined the cluster.
  • Launched a node on the old AMI (20210322), deployed some 220 pods on the node, and upgraded it to the latest AMI (20210329). The upgrade was successful. We monitored the node for some time and didn't see it go into the NotReady state.

Here's a comparison between the recently released AMI (20210329) and the one from 20210322:

           | 20210322         | 20210329  
-----------+------------------+------------
docker     | 19.03.13-ce      | 19.03.13-ce
containerd | 1.4.1            | 1.4.1
runc       | 1.0.0-rc92       | 1.0.0-rc93
kernel     | 4.14.225-168.357 | 4.14.225-169.362

As per the above table, the only significant changes are the runc version and the Amazon Linux kernel (the kernel update was to fix an I/O regression, as explained in the release notes).

Here are a couple of things that would help us analyze the issue:

  • Specific steps to reproduce the issue that we can run on our end
  • Downgrading runc to 1.0.0-rc92 on a node experiencing the issue and checking whether the issue still happens (a rough sketch of how to do that is below)
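
For the second point, a rough sketch of the downgrade on an affected node follows. It assumes runc ships as the standalone "runc" RPM on the EKS-optimized Amazon Linux 2 AMI; verify what yum actually offers before running it.

# Check which runc is currently installed
rpm -q runc
runc --version

# List the versions yum knows about, then downgrade to the rc92 build if available
sudo yum --showduplicates list runc
sudo yum downgrade -y runc-1.0.0-rc92*
sudo systemctl restart docker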


hgranillo commented Apr 9, 2021

Managed to reproduce the issue again in one of my existing test clusters. It's not a fresh cluster and already has some idle workloads running on it.

  1. Patched my nodegroup to make sure I'm running 20210329.
  2. Drained and removed all nodes that were using some other AMI version. All nodes joined in a reasonable time, less than 5 minutes.
kubectl get nodes
NAME                          STATUS   ROLES    AGE   VERSION
ip-10-1-50-5.ec2.internal     Ready    <none>   11m   v1.19.6-eks-49a6c0
ip-10-1-54-172.ec2.internal   Ready    <none>   21m   v1.19.6-eks-49a6c0
ip-10-1-82-228.ec2.internal   Ready    <none>   18m   v1.19.6-eks-49a6c0
ip-10-1-97-64.ec2.internal    Ready    <none>   14m   v1.19.6-eks-49a6c0

The 4 nodes above are running amazon-eks-node-1.19-v20210329 ami-0b06ad6ce5341a208 on us-east-1

  3. On the default namespace, ran kubectl create deployment nginx --image=nginx
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   1/1     1            1           42s

All good, running and working

  4. Scaled the deployment to 10 replicas: kubectl scale deployment/nginx --replicas=10
    After some time I checked how my pods were faring:
NAME                     READY   STATUS              RESTARTS   AGE     IP            NODE                          NOMINATED NODE   READINESS GATES
nginx-6799fc88d8-8m2k4   1/1     Running             0          2m34s   10.1.37.219   ip-10-1-50-5.ec2.internal     <none>           <none>
nginx-6799fc88d8-gpm6g   0/1     ContainerCreating   0          104s    <none>        ip-10-1-50-5.ec2.internal     <none>           <none>
nginx-6799fc88d8-j7l79   1/1     Running             0          104s    10.1.62.104   ip-10-1-54-172.ec2.internal   <none>           <none>
nginx-6799fc88d8-jb5wv   0/1     ContainerCreating   0          104s    <none>        ip-10-1-50-5.ec2.internal     <none>           <none>
nginx-6799fc88d8-jcttm   0/1     ContainerCreating   0          104s    <none>        ip-10-1-50-5.ec2.internal     <none>           <none>
nginx-6799fc88d8-jtsgs   0/1     ContainerCreating   0          104s    <none>        ip-10-1-50-5.ec2.internal     <none>           <none>
nginx-6799fc88d8-lshxg   0/1     ContainerCreating   0          104s    <none>        ip-10-1-50-5.ec2.internal     <none>           <none>
nginx-6799fc88d8-smzth   0/1     ContainerCreating   0          104s    <none>        ip-10-1-50-5.ec2.internal     <none>           <none>
nginx-6799fc88d8-svhcd   0/1     ContainerCreating   0          104s    <none>        ip-10-1-50-5.ec2.internal     <none>           <none>
nginx-6799fc88d8-xj6dm   0/1     ContainerCreating   0          104s    <none>        ip-10-1-50-5.ec2.internal     <none>           <none>

Not so good. So I decided to increase the pressure and scheduled even more pods, and waited.

NAME                     READY   STATUS              RESTARTS   AGE     IP            NODE                          NOMINATED NODE   READINESS GATES
nginx-6799fc88d8-7wzhd   1/1     Running             0          3m56s   10.1.33.66    ip-10-1-54-172.ec2.internal   <none>           <none>
nginx-6799fc88d8-8m2k4   1/1     Running             0          6m40s   10.1.37.219   ip-10-1-50-5.ec2.internal     <none>           <none>
nginx-6799fc88d8-gpm6g   0/1     ContainerCreating   0          5m50s   <none>        ip-10-1-50-5.ec2.internal     <none>           <none>
nginx-6799fc88d8-j7l79   1/1     Running             0          5m50s   10.1.62.104   ip-10-1-54-172.ec2.internal   <none>           <none>
nginx-6799fc88d8-jb5wv   0/1     ContainerCreating   0          5m50s   <none>        ip-10-1-50-5.ec2.internal     <none>           <none>
nginx-6799fc88d8-jcttm   0/1     ContainerCreating   0          5m50s   <none>        ip-10-1-50-5.ec2.internal     <none>           <none>
nginx-6799fc88d8-jtsgs   0/1     ContainerCreating   0          5m50s   <none>        ip-10-1-50-5.ec2.internal     <none>           <none>
nginx-6799fc88d8-klpsx   1/1     Running             0          3m56s   10.1.80.238   ip-10-1-82-228.ec2.internal   <none>           <none>
nginx-6799fc88d8-lshxg   0/1     ContainerCreating   0          5m50s   <none>        ip-10-1-50-5.ec2.internal     <none>           <none>
nginx-6799fc88d8-nmvmt   1/1     Running             0          3m56s   10.1.63.40    ip-10-1-54-172.ec2.internal   <none>           <none>
nginx-6799fc88d8-pw5dx   1/1     Running             0          3m56s   10.1.45.60    ip-10-1-54-172.ec2.internal   <none>           <none>
nginx-6799fc88d8-smzth   0/1     ContainerCreating   0          5m50s   <none>        ip-10-1-50-5.ec2.internal     <none>           <none>
nginx-6799fc88d8-svhcd   0/1     ContainerCreating   0          5m50s   <none>        ip-10-1-50-5.ec2.internal     <none>           <none>
nginx-6799fc88d8-vbgb2   1/1     Running             0          3m56s   10.1.36.106   ip-10-1-54-172.ec2.internal   <none>           <none>
nginx-6799fc88d8-xj6dm   0/1     ContainerCreating   0          5m50s   <none>        ip-10-1-50-5.ec2.internal     <none>           <none>

Checked nodes:

NAME                          STATUS     ROLES    AGE   VERSION
ip-10-1-50-5.ec2.internal     NotReady   <none>   15m   v1.19.6-eks-49a6c0
ip-10-1-54-172.ec2.internal   Ready      <none>   26m   v1.19.6-eks-49a6c0
ip-10-1-82-228.ec2.internal   Ready      <none>   22m   v1.19.6-eks-49a6c0
ip-10-1-97-64.ec2.internal    Ready      <none>   19m   v1.19.6-eks-49a6c0

Node is showing as NotReady

Describing the node shows that it already had some workloads on it.
(link to gist as the output is too large to paste it here) https://gist.github.com/hgranillo/749976453f2fedf32780d4858093ecf3
Some pods have PVCs attached. Might this be a factor?

In this particular case something happened that didn't happen before: after ~10 minutes, all the stuck pods got terminated and scheduled onto the Ready nodes.
Usually the node was so unresponsive that it took forever to terminate pods, and workloads would never get rescheduled.


For me at least, when I had this issue in my normal cluster, all nodes would join in a reasonable time (within 5 minutes tops), but with each pod scheduled the node became more unresponsive, and it took longer and longer for pods to get past the ContainerCreating status, eventually reaching a point where every pod would become stuck in ContainerCreating.

After looking at what kind of workload my affected nodes had, I noticed that most of them had 2 or 3 PVCs (EBS) attached. Most of them were databases (Kafka, ZooKeeper, and replicated PostgreSQL) with I/O and network usage.
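
For what it's worth, this is roughly how I checked which volumes were attached to a node (VolumeAttachment is a standard storage.k8s.io resource and may be empty if you only use the in-tree EBS driver; the node names are the ones from my cluster above):

# CSI volume attachments, filtered by node
kubectl get volumeattachments | grep ip-10-1-50-5

# The node object also lists attached EBS volumes in its status
kubectl get node ip-10-1-50-5.ec2.internal -o jsonpath='{.status.volumesAttached}'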

@mfreebairn-r7

Interestingly, I only seem to be seeing this issue on the k8s v1.19 version of the AMI.

When I upgraded to k8s 1.19 (amazon-eks-node-1.19-v20210329, ami-0b06ad6ce5341a208) I started seeing the mentioned issues on our densely packed nodes (PLEG errors, nodes flipping continually between Ready/NotReady, pods stuck in ContainerCreating or Terminating state).

I downgraded our EKS managed workers back to v1.18, but kept the v20210329 version of the AMI, and everything went back to being healthy again. I am currently running amazon-eks-node-1.18-v20210329 (ami-0b8e294a936b2bcce) without any issues.

I have tried upgrading to 1.19 again since and saw the same problems again.


hikhvar commented Apr 9, 2021

We see those issues without any PVC attached.

@vishalkg

Thanks @mfreebairn-r7 for the specifics. We were able to reproduce the issue on the 1.19 AMI. Also, with the same setup, the issue didn't occur on the 1.18 AMI. We are working on a fix and will release a new set of AMIs as soon as possible.


josephprem commented Apr 12, 2021

I see chrony processes going zombie on the flapping nodes:

 top -b1 -n1 | grep Z
 8866 chrony    20   0       0      0      0 Z   0.0  0.0   0:00.00 sh
27213 chrony    20   0       0      0      0 Z   0.0  0.0   0:00.00 deploy.sh
27214 chrony    20   0       0      0      0 Z   0.0  0.0   0:00.01 deploy.sh

so I guess runc could be the culprit. @vishalkg, I'm interested to learn about the fix.
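
Zombies themselves are just unreaped exit records; the interesting bit is the parent that isn't reaping them (often a hung runtime process). A generic way to list them with their parents:

# Zombie processes with their parent PID
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'

# Then inspect a parent (PID is a placeholder)
ps -o pid,cmd -p <parent-pid>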


rtripat commented Apr 13, 2021

We are actively working on an AMI release with downgraded runC. I will provide an update by end of day PST.


rtripat commented Apr 14, 2021

We will be releasing the AMI by afternoon PST tomorrow. Appreciate the patience.

@artificial-aidan

Ran into this problem...sucked.

My AMI with the issue was ami-0a93391193b512e5d. If all the IDs were listed here I might have found this issue earlier.
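
For anyone else searching by ID, a generic way to map an AMI ID back to its name so it can be matched against the release tags mentioned here (the region below is an assumption; use whichever region the ID came from):

aws ec2 describe-images --image-ids ami-0a93391193b512e5d --region us-west-2 \
  --query 'Images[0].Name' --output text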

@treyhyde

I got hit hard with the node flapping on our trial upgrade to 1.19. I gave up trying to figure out how to a) get a list of old AMIs and b) get eksctl to use one. A switch to unmanaged node groups and Bottlerocket was rewarded with a functional cluster.


billinghamj commented Apr 15, 2021

It doesn't look like that new release made it? Maybe it hasn't reached that PST afternoon though 🤔


artificial-aidan commented Apr 15, 2021

In us-west-2, ami-034ccbf2f030b333a was released yesterday at 9pm. I haven't tried it. Someone else should let us know if it works 😄

@vishalkg

We have released a new set of AMIs with the fix. The tag for the release is v20210414. See details here.
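
If you need the image ID for your region, the usual lookups are roughly as follows (602401143452 is the EKS AMI account for most commercial regions, and the SSM parameter path is the documented one for the EKS-optimized AL2 AMI; adjust the Kubernetes version and region to yours):

# Look up the release by name
aws ec2 describe-images --owners 602401143452 --region us-east-1 \
  --filters "Name=name,Values=amazon-eks-node-1.19-v20210414" \
  --query 'Images[0].ImageId' --output text

# Or take whatever EKS currently recommends for your Kubernetes version
aws ssm get-parameter --region us-east-1 \
  --name /aws/service/eks/optimized-ami/1.19/amazon-linux-2/recommended/image_id \
  --query 'Parameter.Value' --output text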


jackdpeterson commented Apr 15, 2021

I've done two upgrade attempts in our eight-node integration test cluster with fully managed nodes. Ultimately the nodegroup upgrade has failed both times.
What I'm observing at the watch kubectl get nodes -o wide level is that instances come up Ready and start receiving workloads. Around 10 minutes in they transition to Ready,SchedulingDisabled and then terminate.

Apr 15 16:46:27 ip-192-168-49-245 kubelet: E0415 16:46:27.170416    4754 remote_runtime.go:140] StopPodSandbox "6621d65e3f12385238f49af249a278f2bdcd06cbfb8b3dc82e5e102eb4d0bf85" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Apr 15 16:46:27 ip-192-168-49-245 kubelet: E0415 16:46:27.170470    4754 kuberuntime_manager.go:909] Failed to stop sandbox {"docker" "6621d65e3f12385238f49af249a278f2bdcd06cbfb8b3dc82e5e102eb4d0bf85"}
Apr 15 16:46:27 ip-192-168-49-245 kubelet: E0415 16:46:27.190912    4754 kubelet.go:1493] error killing pod: [failed to "KillContainer" for "echoserver" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded", failed to "KillPodSandbox" for "d4515ca6-126d-48cf-8fe5-7fa4403445ee" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
Apr 15 16:46:27 ip-192-168-49-245 kubelet: E0415 16:46:27.190960    4754 pod_workers.go:191] Error syncing pod d4515ca6-126d-48cf-8fe5-7fa4403445ee ("echoserver-7c48fd4b7c-wflhv_echoserver(d4515ca6-126d-48cf-8fe5-7fa4403445ee)"), skipping: error killing pod: [failed to "KillContainer" for "echoserver" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded", failed to "KillPodSandbox" for "d4515ca6-126d-48cf-8fe5-7fa4403445ee" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
Apr 15 16:46:27 ip-192-168-49-245 kubelet: E0415 16:46:27.537614    4754 docker_sandbox.go:275] Failed to stop sandbox "6621d65e3f12385238f49af249a278f2bdcd06cbfb8b3dc82e5e102eb4d0bf85": operation timeout: context deadline exceeded
Apr 15 16:46:27 ip-192-168-49-245 kubelet: I0415 16:46:27.894848    4754 setters.go:77] Using node IP: "192.168.49.245"
Apr 15 16:46:28 ip-192-168-49-245 kubelet: E0415 16:46:28.230197    4754 remote_runtime.go:140] StopPodSandbox "fde5157817c8836253ee16f92583a53c3ac64820f119f672c0a6357ad0df6f42" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Apr 15 16:46:28 ip-192-168-49-245 kubelet: E0415 16:46:28.230263    4754 kuberuntime_manager.go:909] Failed to stop sandbox {"docker" "fde5157817c8836253ee16f92583a53c3ac64820f119f672c0a6357ad0df6f42"}
Apr 15 16:46:28 ip-192-168-49-245 kubelet: E0415 16:46:28.230325    4754 kubelet.go:1493] error killing pod: [failed to "KillContainer" for "queue" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded", failed to "KillPodSandbox" for "cfc1ad45-f7da-4295-8187-815e6aaf9c1e" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
Apr 15 16:46:28 ip-192-168-49-245 kubelet: E0415 16:46:28.230341    4754 pod_workers.go:191] Error syncing pod cfc1ad45-f7da-4295-8187-815e6aaf9c1e ("queue-844bcdc9d7-dfjq7_pr-15340(cfc1ad45-f7da-4295-8187-815e6aaf9c1e)"), skipping: error killing pod: [failed to "KillContainer" for "queue" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded", failed to "KillPodSandbox" for "cfc1ad45-f7da-4295-8187-815e6aaf9c1e" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
Apr 15 16:46:28 ip-192-168-49-245 kubelet: E0415 16:46:28.472390    4754 docker_sandbox.go:275] Failed to stop sandbox "fde5157817c8836253ee16f92583a53c3ac64820f119f672c0a6357ad0df6f42": operation timeout: context deadline exceeded
Apr 15 16:46:29 ip-192-168-49-245 kubelet: {"level":"info","ts":"2021-04-15T16:46:29.254Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.5 ..."}

Currently working with AWS Support; however, sharing this since others are also impacted.

Below is the summary of the failed version update to the just released AMI.

[screenshot of the failed nodegroup version update]

Just to close the loop on this: running an in-place upgrade was regularly failing for my cluster, but creating a new nodegroup worked as expected. The timing is a bit unfortunate since, at this very moment, the Docker Hub registry is down (https://status.docker.com), so FYI: you might want to wait until that's stable if you rely on public images hosted there.
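
For reference, the CLI equivalent of the managed-nodegroup upgrade I was attempting is roughly the following (cluster and nodegroup names are placeholders; the release-version string pairs the Kubernetes patch version with the AMI release date, so confirm the exact value for your cluster in the EKS console or changelog):

aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --release-version 1.19.6-20210414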

@JonathanLachapelle

I did the managed nodegroup upgrade for all of our clusters and it was flawless.


mmerkes commented Apr 16, 2021

Upgrading to the latest AMI, v20210414, should resolve this issue. Let us know if that's not the case for you!

mmerkes closed this as completed Apr 16, 2021

dcernag commented Apr 20, 2021

> Upgrading to the latest AMI, v20210414, should resolve this issue. Let us know if that's not the case for you!

It seems like the packages are not actually pinned, so if anyone just uses v20210414 as a base for their AMIs, runc and containerd may get upgraded during patch installs.
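
A rough way to guard against that when building on top of the AMI is to lock the runtime packages; this assumes the yum versionlock plugin, which is available on Amazon Linux 2:

# Pin the container runtime stack so yum update / patch baselines don't move it
sudo yum install -y yum-plugin-versionlock
sudo yum versionlock add runc containerd docker
sudo yum versionlock list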
