
Time and again Host cluster test pods where the test scripts are run to spin up scalability tests are getting deleted abruptly. #31459

Closed
hakuna-matatah opened this issue Dec 15, 2023 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one.

Comments

@hakuna-matatah
Contributor

hakuna-matatah commented Dec 15, 2023

What happened:

Some of them also show as running indefinitely (not sure whether they are actually still running or this is a UI bug).

What you expected to happen:

These pods should not be deleted and should remain viewable so the logs and artifacts of the test run can be examined. If the pod dies in the middle of the test run, the tests fail, which is a problem in itself.
Given these are scale tests, they are already expensive in terms of time and cost, and the pod-deletion issue masks any other issues the test may have.

How to reproduce it (as minimally and precisely as possible):
We encounter this randomly.

Please provide links to example occurrences, if any:
Links are provided above.

Anything else we need to know?:

  • I think this may be happening due to resource constraints on the Host clusters where these test pods run, but since we don't have access to the Host cluster, I have no way to determine why these pods are getting deleted abruptly.
@hakuna-matatah hakuna-matatah added the kind/bug Categorizes issue or PR as related to a bug. label Dec 15, 2023
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 15, 2023
@k8s-ci-robot
Contributor

There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

  • /sig <group-name>
  • /wg <group-name>
  • /committee <group-name>

Please see the group list for a listing of the SIGs, working groups, and committees available.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hakuna-matatah hakuna-matatah changed the title Time and again Test host Pods where the test scripts are run are getting deleted Time and again Host cluster test pods where the test scripts are run to spin up scalability tests are getting deleted abruptly. Dec 15, 2023
@hakuna-matatah
Contributor Author

@dims @ameukam - PTAL

@dims
Member

dims commented Dec 15, 2023

cc @xmudrii

@upodroid
Member

In the interim, what do you think of running the test pod on the community GKE cluster? We can tweak it to assume the correct AWS role to run in the AWS scalability account.

@hakuna-matatah
Contributor Author

In the interim, what do you think of running the test pod on the community GKE cluster? We can tweak it to assume the correct AWS role to run in the AWS scalability account.

I'm open to it, but how do we manage the creds if we take that path?
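For what it's worth, one way to handle the creds without long-lived keys would be IAM role federation with a projected service account token: the pod mounts a short-lived OIDC token and the AWS SDK/CLI exchanges it for temporary role credentials. A minimal sketch, assuming the GKE cluster's OIDC issuer could be registered as an identity provider in the AWS scalability account, and with placeholder names throughout (role ARN, service account, namespace, image):

```yaml
# Sketch only: role ARN, service account, namespace, and image are placeholders,
# and the cluster's OIDC issuer must already be registered as an IAM identity
# provider with a trust policy covering this service account.
apiVersion: v1
kind: Pod
metadata:
  name: scale-test-runner
  namespace: test-pods
spec:
  serviceAccountName: scale-test-runner
  containers:
    - name: test
      image: example.com/scale-test-runner:latest   # placeholder image
      env:
        # AWS SDKs and the CLI pick these up automatically and call
        # sts:AssumeRoleWithWebIdentity to obtain temporary credentials.
        - name: AWS_ROLE_ARN
          value: arn:aws:iam::111111111111:role/scalability-test-runner
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/aws/token
      volumeMounts:
        - name: aws-token
          mountPath: /var/run/secrets/aws
          readOnly: true
  volumes:
    - name: aws-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              audience: sts.amazonaws.com
              expirationSeconds: 3600
```

Whether the community cluster's issuer can be registered in that AWS account is the open question here; this is just one possible shape for the answer.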

@xmudrii
Member

xmudrii commented Dec 18, 2023

This is a known issue, unfortunately, and we're working on resolving it. I hope we'll have a fix for this issue as soon as possible.

@BenTheElder
Member

This is a problem with the cluster hosting the test pod; on the prior clusters we set suitable (extra-high) resource requests/limits to all but guarantee the scalability pods were not interrupted.

This problem impacts a lot of CI jobs on the cluster, but it's especially noticeable with lengthy $$$ scale tests.

We discussed this at the last K8s Infra meeting; to me it sounds like the cluster's autoscaling behavior needs tuning so that it does not aggressively remove nodes without draining them first.
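For reference, the pattern described above can be written directly on the test pod: requests large enough that the pod effectively owns its node, plus an annotation telling cluster-autoscaler not to scale down the node it runs on. A minimal sketch; the name, namespace, image, and sizes below are illustrative, not the actual job config:

```yaml
# Sketch only: names, image, and resource sizes are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: scalability-test-runner
  namespace: test-pods
  annotations:
    # cluster-autoscaler will not remove a node running a pod
    # carrying this annotation during scale-down.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: test
      image: example.com/scale-test-runner:latest   # placeholder image
      resources:
        requests:
          cpu: "6"
          memory: 16Gi
        limits:
          cpu: "6"
          memory: 16Gi
```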

@hakuna-matatah
Contributor Author

This is a problem with the cluster hosting the test pod; on the prior clusters we set suitable (extra-high) resource requests/limits to all but guarantee the scalability pods were not interrupted.

This problem impacts a lot of CI jobs on the cluster, but it's especially noticeable with lengthy $$$ scale tests.

We discussed this at the last K8s Infra meeting; to me it sounds like the cluster's autoscaling behavior needs tuning so that it does not aggressively remove nodes without draining them first.

Given this is happening way more frequently now, is it possible to run the scale tests on their own cluster? Or could we leverage Karpenter? Internally, we leverage Karpenter to manage the nodes on the host cluster.
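If Karpenter did manage the host cluster's nodes, the equivalent guard would be a pod-level annotation that blocks voluntary node disruption while the test pod is running. A minimal sketch, assuming a recent Karpenter release (older releases used `karpenter.sh/do-not-evict` instead); names and image are placeholders:

```yaml
# Sketch only: with this annotation Karpenter will not voluntarily disrupt
# (consolidate or expire) the node while the pod is running.
apiVersion: v1
kind: Pod
metadata:
  name: scalability-test-runner
  namespace: test-pods
  annotations:
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
    - name: test
      image: example.com/scale-test-runner:latest   # placeholder image
```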

@xmudrii
Member

xmudrii commented Dec 30, 2023

Given this is happening way more frequently now, is it possible to run the scale tests on their own cluster? Or could we leverage Karpenter? Internally, we leverage Karpenter to manage the nodes on the host cluster.

We're planning to look into this and we have an issue to track this: kubernetes/k8s.io#5168

We discussed this at the last K8s Infra meeting; to me it sounds like the cluster's autoscaling behavior needs tuning so that it does not aggressively remove nodes without draining them first.

@BenTheElder It's not really that we need to tune the cluster-autoscaler's behavior; it's more about making sure that it respects ProwJob pods. Let's look into this together after the holiday break.
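A related knob, regardless of which autoscaler manages the nodes, is giving the ProwJob test pods a high, non-preempting priority so they at least cannot be preempted in favor of other workloads. A minimal sketch; the class name and value are placeholders, and this only addresses the preemption path, not node scale-down:

```yaml
# Sketch only: class name and value are placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prow-scale-tests
value: 1000000
preemptionPolicy: Never        # this class never preempts other pods itself
globalDefault: false
description: "High priority for long-running scalability ProwJob pods."
---
apiVersion: v1
kind: Pod
metadata:
  name: scalability-test-runner
  namespace: test-pods
spec:
  priorityClassName: prow-scale-tests
  containers:
    - name: test
      image: example.com/scale-test-runner:latest   # placeholder image
```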

@hakuna-matatah
Contributor Author

hakuna-matatah commented Jan 4, 2024

@BenTheElder It's not really that we need to tune the cluster-autoscaler's behavior, it's more about making sure that it respects ProwJob pods. Let's look into this together after the holidays break.

@xmudrii Just wondering if you got a chance to follow up? Given we are actively impacted, are we thinking in terms of short-term and long-term solutions if the level of effort for a full fix is too high? Is there a rough ETA for resolving this issue?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 3, 2024
@xmudrii
Member

xmudrii commented Apr 3, 2024

This shouldn't be an issue any longer
/close

@k8s-ci-robot
Contributor

@xmudrii: Closing this issue.

In response to this:

This shouldn't be an issue any longer
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
