EKS AMI v20210329 Issues #648
Hi, we're seeing similar behavior with the current (v20210329) AMI versions for 1.18 and 1.19 in eu-west-1. We were running a 1.18 AMI (v20210310) on our nodes, upgraded the cluster to 1.19 and the AMI to the latest 1.19 release (v20210329), and a couple of minutes after being provisioned the new nodes randomly became NotReady. Reverting to the latest 1.18 AMI (v20210329) caused the same behaviour. We then moved back to the 1.19 AMI, but to the previous version (v20210322), and that allowed the nodes to become Ready. For reference, we opened support case #8190279821 about this on Thursday, because there was no GitHub issue about it back then. TL;DR: revert to the previous AMI release (v20210322) to get stable nodes again.
We experience this as well. We moved to the 1.19 v20210329 AMI yesterday. Since then, some nodes behave normally at first, but after some time the PLEG becomes unhealthy and the nodes flap between Ready/NotReady. We cannot terminate pods on those nodes, nor spawn new ones. Existing workloads do not seem to be affected.
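For anyone triaging this, a minimal sketch of how one might confirm the PLEG symptom, assuming the node runs the EKS optimized AMI (where kubelet is a systemd unit); the node name is a placeholder:

```sh
# On the affected node: look for PLEG health-check failures in the kubelet logs
journalctl -u kubelet --since "1 hour ago" | grep -i "PLEG is not healthy"

# From a workstation: watch node conditions flap between Ready and NotReady
kubectl get nodes -w
kubectl describe node <node-name> | grep -A3 "Conditions:"
```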
I have also opened a support case and they referred me over here. Rolling back to the 20210310 AMI, which we were on previously, seems to have resolved the issue. The only thing I have to add is a question: could the kernel patch intended to fix an I/O regression have actually made Docker performance worse? On clusters where we run a comparatively small number of pods per node, things are still fine with the new AMI. It is only when we have in the region of 50+ pods per node that we start seeing this behaviour with the new AMI.
Hi, I've also been observing the same behavior with amazon-eks-node-1.19-v20210329 in us-east-1. Containers seem to get stuck in ContainerCreating status, and as soon as this happens the node starts reporting NotReady, and this gets logged to /var/log/messages.
Another thing I noticed: this seems to happen more frequently on nodes that have 30 or more pods. I mention this because we only observed the issue in our integration environment, where nodes are more "packed" with pods, as opposed to our production cluster, where pods are more spread out and nodes are not as packed. Switching to v20210322 did the trick and nodes seem to be healthy again.
We also encounter this on densely packed nodes with more than 40 pods.
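Since several reports correlate the failure with pod density, here is one way to count running pods per node and spot the densely packed ones (a sketch; `$8` is the NODE column in this output format):

```sh
# Count pods per node; nodes with ~30-50+ pods match the reports above
kubectl get pods --all-namespaces -o wide | awk 'NR>1 {print $8}' | sort | uniq -c | sort -rn
```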
We are investigating this issue. I'm glad that rolling back to the previous AMI has mitigated the impact in your clusters.
As per the details mentioned in the issue, we tried to reproduce it, but unfortunately we could not. Here are the details:
Here's a comparison between the recently released AMI (v20210329) and the previous release (v20210322):
As per the above table, the only significant change is the runC version. Here are a couple of things that could help us analyze the issue:
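One way to make that comparison concrete, if you have a node booted from each AMI, is to diff their installed package sets; a minimal sketch (the file names are placeholders):

```sh
# Run on a node booted from each AMI, then copy both files to one machine
rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}\n' | sort > pkgs-v20210322.txt
rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}\n' | sort > pkgs-v20210329.txt

# Package-level differences between the two AMIs
diff pkgs-v20210322.txt pkgs-v20210329.txt
```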
Managed to reproduce the issue again in one of my existing test clusters. It's not a fresh cluster and already has some idle workloads running on it.
The 4 nodes above are running amazon-eks-node-1.19-v20210329
All good, running and working
Not so good. So I decided to increase the pressure and scheduled even more pods, and waited.
Checked nodes:
The node is showing as NotReady. Describing the node shows that it already had some workloads on it. In this particular case something happened that didn't happen before: after ~10 minutes all the stuck pods got terminated and scheduled onto other nodes. For me at least, when I had this issue in my normal cluster, all nodes would join in a reasonable time, within 5 minutes tops, but with each pod scheduled the node became more unresponsive, as it took longer and longer for pods to get past the ContainerCreating state. After looking at what kind of workload my affected nodes had, I noticed that most of them had 2 or 3 PVCs (EBS volumes) attached. Most of them were databases (Kafka, ZooKeeper, and replicated PostgreSQL) with I/O and network usage.
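A sketch of one way to reproduce this kind of pressure without real workloads: pack the nodes with lightweight pause pods and watch for the flapping. The deployment name, image tag, and replica count below are illustrative, not what the reporter used:

```sh
# Pack the nodes with cheap pods to push past ~30-50 pods per node
kubectl create deployment pack-test --image=k8s.gcr.io/pause:3.2
kubectl scale deployment pack-test --replicas=200

# Watch for nodes flapping and pods stuck in ContainerCreating
kubectl get nodes -w
kubectl get pods -l app=pack-test --field-selector=status.phase=Pending
```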
Interestingly, I only seem to be seeing this issue on the k8s 1.19 version of the AMI. When I upgraded to k8s 1.19 (amazon-eks-node-1.19-v20210329, ami-0b06ad6ce5341a208) I started seeing the mentioned issues on our densely packed nodes (PLEG errors, nodes flipping continually between Ready/NotReady, pods stuck in ContainerCreating or Terminating state). I downgraded our EKS managed workers back to 1.18, but kept the v20210329 version of the AMI, and everything went back to being healthy. I am currently running amazon-eks-node-1.18-v20210329 (ami-0b8e294a936b2bcce) without any issues. I have tried upgrading to 1.19 again since and saw the same problems.
We see those issues without any PVC attached.
Thanks @mfreebairn-r7 for the specifics. We were able to reproduce the issue on the 1.19 AMI. Also, with the same setup, the issue didn't occur on the 1.18 AMI. We will be working on the fix and will release a new set of AMIs at the earliest.
I see chrony processes turning into zombies on the flapping nodes:

```
$ top -b1 -n1 | grep Z
 8866 chrony    20   0    0    0    0 Z  0.0  0.0   0:00.00 sh
27213 chrony    20   0    0    0    0 Z  0.0  0.0   0:00.00 deploy.sh
27214 chrony    20   0    0    0    0 Z  0.0  0.0   0:00.01 deploy.sh
```

so I guess runc could be the culprit. @vishalkg I am interested to learn about the fix.
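To check whether a node carries the suspect runC build and how many zombies it has accumulated, something like this should work (a sketch, assuming runc ships as its own rpm on the AMI, as it does on Amazon Linux 2):

```sh
# Installed runC version on the node
rpm -q runc
runc --version

# List zombie processes with their parents, to see what isn't reaping them
ps -eo stat,ppid,pid,user,comm | awk '$1 ~ /^Z/'
```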
We are actively working on an AMI release with downgraded runC. I will provide an update by end of day PST.
We will be releasing the AMI by afternoon PST tomorrow. Appreciate the patience.
Ran into this problem... it sucked. My AMI with the issue was
I got hit hard by the node flapping on our trial upgrade to 1.19. I gave up trying to figure out how to a) get a list of old AMIs and b) get eksctl to use one. A switch to unmanaged node groups and Bottlerocket was rewarded with a functional cluster.
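For the record, older releases can be listed with the EC2 API; a sketch (602401143452 is the Amazon EKS AMI account ID in most commercial regions, and the region and version filters are examples):

```sh
# List the EKS optimized 1.19 AMIs available in a region, oldest first
aws ec2 describe-images \
  --owners 602401143452 \
  --region eu-west-1 \
  --filters "Name=name,Values=amazon-eks-node-1.19-*" \
  --query 'sort_by(Images, &CreationDate)[].[Name,ImageId]' \
  --output table
```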
It doesn't look like that new release made it? Maybe it hasn't reached that PST afternoon though 🤔
We have released a new set of AMIs with the fix. The tag for the release is
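After rolling a nodegroup, one way to confirm every instance actually came up on the new image is to check launch AMIs via the EC2 API (a sketch; it relies on the eks:nodegroup-name tag that managed nodegroups apply, and the nodegroup name is a placeholder):

```sh
# Show which AMI each instance in the nodegroup was launched from
aws ec2 describe-instances \
  --filters "Name=tag:eks:nodegroup-name,Values=<nodegroup-name>" \
  --query 'Reservations[].Instances[].[InstanceId,ImageId]' \
  --output table
```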
I've done two upgrade attempts in our eight-node integration test cluster with fully managed nodes. Ultimately, the nodegroup upgrade failed both times.
Currently working with AWS Support; however, sharing this since others are also impacted. Below is a summary of the failed version update to the just-released AMI. Just to close the loop on this: running an upgrade was regularly failing for my cluster, but creating a new nodegroup worked as expected. The timing is a bit unfortunate, since at this very moment the Docker Hub registry is down (https://status.docker.com), so FYI: you might want to wait until that's stable if you rely on public images hosted there.
I did the managed node upgrade for all of our clusters and it was flawless.
Upgrading to the latest AMI,
It seems like the package is not really pinned, so if anyone just runs a yum update on a node they could pull the regressed version back in.
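If that's a concern, one stopgap on the node side could be yum's versionlock plugin; a sketch (package names assume the Amazon Linux 2 naming):

```sh
# Hold runc and docker at their currently installed versions so a
# routine 'yum update' cannot pull in a regressed build
sudo yum install -y yum-plugin-versionlock
sudo yum versionlock add runc docker
sudo yum versionlock list
```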
What happened: EC2 instances took more than 30 minutes to join a cluster. Shortly after an instance is launched, the instances in question appear to be CPU-starved and don't even respond to SSH. Certain things appear to be happening before the preBootstrap commands are even invoked.
What you expected to happen: New instances using AMI v20210329 to join a cluster in a reasonable amount of time
How to reproduce it (as minimally and precisely as possible): Create an "unmanaged" node group using eksctl with the latest AMI (v20210329).
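To pin an unmanaged nodegroup to a known-good image instead, eksctl's --node-ami flag can take an explicit AMI ID; a sketch (the cluster name, nodegroup name, and AMI ID are placeholders):

```sh
# Create an unmanaged nodegroup pinned to a specific AMI rather than the latest
eksctl create nodegroup \
  --cluster <cluster-name> \
  --name ng-pinned \
  --node-ami ami-0123456789abcdef0
```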
Anything else we need to know?: It appears that, due to some new additions, the AMI now takes 30+ minutes to start with a properly working/configured kubelet. This used to take approximately 5 minutes.
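To see where a slow-joining node is spending its time during boot, something like the following can help (a sketch; the log paths are the Amazon Linux 2 defaults):

```sh
# How long user data / bootstrap took
grep -i 'finished at' /var/log/cloud-init-output.log

# Which units dominated boot time
systemd-analyze blame | head -20

# Kubelet startup messages for the current boot
journalctl -u kubelet -b --no-pager | head -50
```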
Environment:
- EKS platform version (`aws eks describe-cluster --name <name> --query cluster.platformVersion`): eks.4
- Kubernetes version (`aws eks describe-cluster --name <name> --query cluster.version`): 1.18