
AWS: Pod got deleted unexpectedly #6303

Closed
ameukam opened this issue Jan 22, 2024 · 13 comments
Assignees: xmudrii
Labels:
  • area/infra/aws: Issues or PRs related to Kubernetes AWS infrastructure
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/k8s-infra: Categorizes an issue or PR as relevant to SIG K8s Infra.
Milestone: v1.30

Comments

ameukam (Member) commented Jan 22, 2024

We have reports of prowjobs being deleted on the EKS cluster eks-prow-build-cluster with:

Job execution failed: Pod got deleted unexpectedly

/kind bug
/area infra/aws
/priority important-soon
/milestone v1.30

ameukam added the sig/k8s-infra label on Jan 22, 2024
k8s-ci-robot added this to the v1.30 milestone on Jan 22, 2024
k8s-ci-robot added the kind/bug, area/infra/aws, and priority/important-soon labels on Jan 22, 2024
ameukam (Member, Author) commented Jan 22, 2024

/assign @xmudrii
cc @upodroid @dims

xmudrii (Member) commented Jan 22, 2024

A support ticket has been created with AWS to investigate this issue

xmudrii (Member) commented Jan 22, 2024

There is a discussion on Slack about the potential root cause: https://kubernetes.slack.com/archives/CCK68P2Q2/p1705919163947889

xmudrii (Member) commented Jan 23, 2024

We received a response from AWS support; this is the most important bit:

With regards to the behaviour you have observed, I have investigated into the nodes provided and can confirm that all except one of these nodes were terminated due to the AZRebalance feature as you have suspected. Additionally, there are about 1700 rebalancing activities for this Auto Scaling group over the past 6 weeks (6 weeks is the maximum history limit for Auto Scaling activity history) which matches the behaviour you are seeing in your environment.

They recommended the following, which matches what we discussed with @tzneal yesterday:

As this can result in an imbalance spread of instances between AZs, a possible long-term approach would be to utilize multiple node groups, where each node group is scoped to a single availability zone. This would allow cluster autoscaler to scale and balance nodes between multiple node groups, effectively balancing nodes between AZs. Note that "--balance-similar-node-groups" must be enabled on the cluster autoscaler for this feature

The above approach is also recommended should you be utilizing EBS volumes for PVCs as EBS volumes are scoped to a single availability zone.

Out of the 7 instances that I provided to AWS support, 6 were removed by the AZRebalance feature and one was removed by cluster-autoscaler. We'll:

  • switch to a dedicated node group per AZ (see the sketch below)
  • continue monitoring whether cluster-autoscaler is mistakenly removing nodes
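
For illustration only, here's a minimal eksctl-style sketch of what a dedicated node group per AZ could look like. The cluster name, region, zones, and sizes are placeholders; this is not the actual eks-prow-build-cluster configuration, which is managed separately:

```yaml
# Hypothetical sketch: one managed node group per availability zone, so node
# placement is decided by cluster-autoscaler rather than the ASG's AZRebalance process.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-prow-build-cluster   # placeholder name
  region: us-east-2                  # placeholder region
managedNodeGroups:
  - name: build-2a
    availabilityZones: ["us-east-2a"]  # each group is scoped to a single AZ
    minSize: 0
    maxSize: 30
  - name: build-2b
    availabilityZones: ["us-east-2b"]
    minSize: 0
    maxSize: 30
  - name: build-2c
    availabilityZones: ["us-east-2c"]
    minSize: 0
    maxSize: 30
```

On the cluster-autoscaler side, this pairs with `--balance-similar-node-groups=true` in its container args, as the AWS response notes, so scale-ups are spread evenly across the per-AZ groups.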

xmudrii (Member) commented Jan 24, 2024

The proposed mitigation has been rolled out to the production cluster. I propose leaving this issue open for 7 days to monitor whether the issue is gone. We can use Prow's Deck for monitoring: https://prow.k8s.io/?state=error&cluster=eks-prow-build-cluster

dims (Member) commented Jan 24, 2024

thanks a ton @xmudrii

hakuna-matatah commented
@xmudrii I see a test that has been stuck since yesterday, running indefinitely, which was one of the failure modes observed earlier in this issue.

Link to the test, triggered yesterday: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kops/16296/presubmit-kops-aws-scale-amazonvpc-using-cl2/1752113814377074688

Do you have any idea?

xmudrii (Member) commented Jan 30, 2024

@hakuna-matatah The mentioned job got OOMKilled:

NAME                                   READY   STATUS      RESTARTS   AGE
c554cf25-37b0-4445-acb0-d09669adc3ea   1/2     OOMKilled   0          18h

(c554cf25-37b0-4445-acb0-d09669adc3ea is coming from https://prow.k8s.io/prowjob?prowjob=c554cf25-37b0-4445-acb0-d09669adc3ea)

I don't know why this didn't get reported back to Prow though. I recommend increasing memory requests and limits for this job.
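
Purely as a sketch, the change would look something like this in the job's Prow config; the container name and the numbers are placeholders, not tested values:

```yaml
# Illustrative excerpt of the ProwJob's pod spec: raise memory requests and limits.
spec:
  containers:
    - name: test                # hypothetical container name
      resources:
        requests:
          memory: "64Gi"        # placeholder; size it to the job's observed peak usage
        limits:
          memory: "64Gi"        # request == limit avoids overcommitting memory for this pod
```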

hakuna-matatah commented Jan 30, 2024

> @hakuna-matatah The mentioned job got OOMKilled:
>
> NAME                                   READY   STATUS      RESTARTS   AGE
> c554cf25-37b0-4445-acb0-d09669adc3ea   1/2     OOMKilled   0          18h
>
> (c554cf25-37b0-4445-acb0-d09669adc3ea is coming from https://prow.k8s.io/prowjob?prowjob=c554cf25-37b0-4445-acb0-d09669adc3ea)
>
> I don't know why this didn't get reported back to Prow though. I recommend increasing memory requests and limits for this job.

@xmudrii Thanks for getting back. I can increase the memory limits as a quick fix for now, but it's weird that it didn't get reported back to Prow in this case. Do you want me to open an issue for this particular case, or do we want to track it here?

Just as an FYI: the Prow pod in question had 48Gi of memory allocated to it, so it's surprising that it needs more than that to run a cl2 test. GCE scale test pods run with the same memory allocation (although they are not currently using kops), but 16Gi seems to be enough there. So something in this particular environment is making it consume more memory than it should; it would be interesting to find the differences at some point by profiling, etc. Also, the periodic jobs run with the same limits, but they don't seem to have OOM issues.

xmudrii (Member) commented Jan 30, 2024

> Do you want me to open an issue for this particular case, or do we want to track it here?

That's an issue with Prow itself, while this ticket tracks the build cluster's instability. I recommend raising an issue about it in the k/test-infra repo.

xmudrii (Member) commented Feb 6, 2024

This issue hasn't recurred since we applied the mitigation on 2024-01-24. Given that the cluster has been stable for almost two weeks, I think we can close this issue. Thank y'all for your patience while we figured this out! ❤️
/close

k8s-ci-robot (Contributor) commented Feb 6, 2024
@xmudrii: Closing this issue.

In response to this:

> This issue hasn't recurred since we applied the mitigation on 2024-01-24. Given that the cluster has been stable for almost two weeks, I think we can close this issue. Thank y'all for your patience while we figured this out! ❤️
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dims (Member) commented Feb 6, 2024

(Giphy reaction GIF)
