
AWS: Pod got deleted unexpectedly #6303

Closed
ameukam opened this issue Jan 22, 2024 · 13 comments
Assignees: xmudrii
Labels:
  • area/infra/aws: Issues or PRs related to Kubernetes AWS infrastructure
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/k8s-infra: Categorizes an issue or PR as relevant to SIG K8s Infra.
Milestone: v1.30

Comments

ameukam (Member) commented Jan 22, 2024

We have reports of prowjobs being deleted on the EKS cluster eks-prow-build-cluster with:

Job execution failed: Pod got deleted unexpectedly

/kind bug
/area infra/aws
/priority important-soon
/milestone v1.30

ameukam added the sig/k8s-infra label on Jan 22, 2024
k8s-ci-robot added this to the v1.30 milestone on Jan 22, 2024
k8s-ci-robot added the kind/bug, area/infra/aws, and priority/important-soon labels on Jan 22, 2024
ameukam (Member, Author) commented Jan 22, 2024

/assign @xmudrii
cc @upodroid @dims

xmudrii (Member) commented Jan 22, 2024

A support ticket has been created with AWS to investigate this issue

xmudrii (Member) commented Jan 22, 2024

There is a discussion on Slack about the potential root cause: https://kubernetes.slack.com/archives/CCK68P2Q2/p1705919163947889

xmudrii (Member) commented Jan 23, 2024

We received a response from AWS support; this is the most important bit:

With regards to the behaviour you have observed, I have investigated into the nodes provided and can confirm that all except one of these nodes were terminated due to the AZRebalance feature as you have suspected. Additionally, there are about 1700 rebalancing activities for this Auto Scaling group over the past 6 weeks (6 weeks is the maximum history limit for Auto Scaling activity history) which matches the behaviour you are seeing in your environment.

They recommended the following, which matches what we discussed with @tzneal yesterday:

As this can result in an imbalance spread of instances between AZs, a possible long-term approach would be to utilize multiple node groups, where each node group is scoped to a single availability zone. This would allow cluster autoscaler to scale and balance nodes between multiple node groups, effectively balancing nodes between AZs. Note that "--balance-similar-node-groups" must be enabled on the cluster autoscaler for this feature

The above approach is also recommended should you be utilizing EBS volumes for PVCs as EBS volumes are scoped to a single availability zone.

Out of the 7 instances that I provided to AWS support, 6 were removed by the AZRebalance feature and one was removed by cluster-autoscaler. We'll:

  • switch to a dedicated node group per AZ (see the sketch below)
  • continue monitoring whether cluster-autoscaler is mistakenly removing nodes
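
For illustration only, here's a minimal eksctl-style sketch of what a dedicated node group per AZ could look like. The cluster name, region, zones, and sizes are placeholders; this is not the actual eks-prow-build-cluster configuration, which is managed separately:

```yaml
# Hypothetical sketch: one managed node group per availability zone, so node
# placement is decided by cluster-autoscaler rather than the ASG's AZRebalance process.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-prow-build-cluster   # placeholder name
  region: us-east-2                  # placeholder region
managedNodeGroups:
  - name: build-2a
    availabilityZones: ["us-east-2a"]  # each group is scoped to a single AZ
    minSize: 0
    maxSize: 30
  - name: build-2b
    availabilityZones: ["us-east-2b"]
    minSize: 0
    maxSize: 30
  - name: build-2c
    availabilityZones: ["us-east-2c"]
    minSize: 0
    maxSize: 30
```

On the cluster-autoscaler side, this pairs with `--balance-similar-node-groups=true` in its container args, as the AWS response notes, so scale-ups are spread evenly across the per-AZ groups.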

xmudrii (Member) commented Jan 24, 2024

The proposed mitigation has been rolled out to the production cluster. I propose leaving this issue open for 7 days to monitor whether the issue is gone. We can use Prow's Deck for monitoring: https://prow.k8s.io/?state=error&cluster=eks-prow-build-cluster

dims (Member) commented Jan 24, 2024

thanks a ton @xmudrii

hakuna-matatah commented
@xmudrii I see a test that has been stuck since yesterday, running indefinitely, which was one of the failure modes observed earlier in this issue.

Link to the test, triggered yesterday: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kops/16296/presubmit-kops-aws-scale-amazonvpc-using-cl2/1752113814377074688

Do you have any idea?

xmudrii (Member) commented Jan 30, 2024

@hakuna-matatah The mentioned job got OOMKilled:

NAME                                   READY   STATUS      RESTARTS   AGE
c554cf25-37b0-4445-acb0-d09669adc3ea   1/2     OOMKilled   0          18h

(c554cf25-37b0-4445-acb0-d09669adc3ea is coming from https://prow.k8s.io/prowjob?prowjob=c554cf25-37b0-4445-acb0-d09669adc3ea)

I don't know why this didn't get reported back to Prow though. I recommend increasing memory requests and limits for this job.
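
Purely as a sketch, the change would look something like this in the job's Prow config; the container name and the numbers are placeholders, not tested values:

```yaml
# Illustrative excerpt of the ProwJob's pod spec: raise memory requests and limits.
spec:
  containers:
    - name: test                # hypothetical container name
      resources:
        requests:
          memory: "64Gi"        # placeholder; size it to the job's observed peak usage
        limits:
          memory: "64Gi"        # request == limit avoids overcommitting memory for this pod
```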

hakuna-matatah commented Jan 30, 2024

> @hakuna-matatah The mentioned job got OOMKilled:
>
> NAME                                   READY   STATUS      RESTARTS   AGE
> c554cf25-37b0-4445-acb0-d09669adc3ea   1/2     OOMKilled   0          18h
>
> (c554cf25-37b0-4445-acb0-d09669adc3ea is coming from https://prow.k8s.io/prowjob?prowjob=c554cf25-37b0-4445-acb0-d09669adc3ea)
>
> I don't know why this didn't get reported back to Prow though. I recommend increasing memory requests and limits for this job.

@xmudrii Thanks for getting back. I can increase the memory limits as a quick fix for now, but it's weird that it didn't get reported back to Prow in this case. Do you want me to open an issue for this particular case, or do we want to track it here?

Just as an FYI: the Prow pod in question had 48Gi of memory allocated to it, so it's surprising that it needs more than that to run a cl2 test. GCE scale test pods run with the same memory allocation (although they are not currently using kops), but 16Gi seems to be enough there. So something in this particular environment is making it consume more memory than it should; it would be interesting to find the differences at some point by profiling, etc. Also, the periodic jobs run with the same limits, but they don't seem to have OOM issues.

xmudrii (Member) commented Jan 30, 2024

> Do you want me to open an issue for this particular case, or do we want to track it here?

That's an issue with Prow itself, while this ticket tracks the build cluster's instability. I recommend raising an issue about it in the k/test-infra repo.

xmudrii (Member) commented Feb 6, 2024

This issue hasn't recurred since we applied the mitigation on 2024-01-24. Given that the cluster has been stable for almost two weeks, I think we can close this issue. Thank y'all for your patience while we figured this out! ❤️
/close

k8s-ci-robot (Contributor) commented Feb 6, 2024
@xmudrii: Closing this issue.

In response to this:

> This issue hasn't recurred since we applied the mitigation on 2024-01-24. Given that the cluster has been stable for almost two weeks, I think we can close this issue. Thank y'all for your patience while we figured this out! ❤️
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dims (Member) commented Feb 6, 2024

(Giphy reaction GIF)
