
Time and again Host cluster test pods where the test scripts are run to spin up scalability tests are getting deleted abruptly. #31459

Closed
hakuna-matatah opened this issue Dec 15, 2023 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one.

Comments

@hakuna-matatah
Contributor

hakuna-matatah commented Dec 15, 2023

What happened:

Some of them also show as running indefinitely (not sure whether they are actually still running or this is a UI bug).

What you expected to happen:

These pods should not be deleted and should remain viewable so the logs and artifacts of the test run can be examined. If the pod dies in the middle of the test run, the tests fail, which is a problem in itself.
Given these are scale tests, they are already expensive in terms of time and cost, and the pod-deletion issue masks any other issues the test may have.

How to reproduce it (as minimally and precisely as possible):
We encounter this randomly.

Please provide links to example occurrences, if any:
Links are provided above.

Anything else we need to know?:

  • I think this may be happening due to resource constraints on the Host clusters where these test pods run, but since we don't have access to the Host cluster, I have no way to determine why these pods are getting deleted abruptly.
@hakuna-matatah hakuna-matatah added the kind/bug Categorizes issue or PR as related to a bug. label Dec 15, 2023
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 15, 2023
@k8s-ci-robot
Contributor

There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

  • /sig <group-name>
  • /wg <group-name>
  • /committee <group-name>

Please see the group list for a listing of the SIGs, working groups, and committees available.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hakuna-matatah hakuna-matatah changed the title Time and again Test host Pods where the test scripts are run are getting deleted Time and again Host cluster test pods where the test scripts are run to spin up scalability tests are getting deleted abruptly. Dec 15, 2023
@hakuna-matatah
Contributor Author

@dims @ameukam - PTAL

@dims
Member

dims commented Dec 15, 2023

cc @xmudrii

@upodroid
Member

In the interim, what do you think of running the test pod on the community GKE cluster? We can tweak it to assume the correct AWS role to run in the AWS scalability account.

@hakuna-matatah
Contributor Author

In the interim, what do you think of running the test pod on the community GKE cluster? We can tweak it to assume the correct AWS role to run in the AWS scalability account.

I'm open to it, but how do we manage the creds if we take that path?
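For what it's worth, one way to handle the creds without long-lived keys would be IAM role federation with a projected service account token: the pod mounts a short-lived OIDC token and the AWS SDK/CLI exchanges it for temporary role credentials. A minimal sketch, assuming the GKE cluster's OIDC issuer could be registered as an identity provider in the AWS scalability account, and with placeholder names throughout (role ARN, service account, namespace, image):

```yaml
# Sketch only: role ARN, service account, namespace, and image are placeholders,
# and the cluster's OIDC issuer must already be registered as an IAM identity
# provider with a trust policy covering this service account.
apiVersion: v1
kind: Pod
metadata:
  name: scale-test-runner
  namespace: test-pods
spec:
  serviceAccountName: scale-test-runner
  containers:
    - name: test
      image: example.com/scale-test-runner:latest   # placeholder image
      env:
        # AWS SDKs and the CLI pick these up automatically and call
        # sts:AssumeRoleWithWebIdentity to obtain temporary credentials.
        - name: AWS_ROLE_ARN
          value: arn:aws:iam::111111111111:role/scalability-test-runner
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/aws/token
      volumeMounts:
        - name: aws-token
          mountPath: /var/run/secrets/aws
          readOnly: true
  volumes:
    - name: aws-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              audience: sts.amazonaws.com
              expirationSeconds: 3600
```

Whether the community cluster's issuer can be registered in that AWS account is the open question here; this is just one possible shape for the answer.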

@xmudrii
Member

xmudrii commented Dec 18, 2023

This is a known issue, unfortunately, and we're working on resolving it. I hope we'll have a fix for this issue as soon as possible.

@BenTheElder
Member

This is a problem with the cluster hosting the test pod; on the prior clusters we set suitable (extra-high) resource requests/limits to all but guarantee the scalability pods were not interrupted.

This problem impacts a lot of CI jobs on the cluster, but it's especially noticeable with lengthy $$$ scale tests.

We discussed this at the last K8s Infra meeting; to me it sounds like the cluster's autoscaling behavior needs tuning so that it does not aggressively remove nodes without draining them first.
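For reference, the pattern described above can be written directly on the test pod: requests large enough that the pod effectively owns its node, plus an annotation telling cluster-autoscaler not to scale down the node it runs on. A minimal sketch; the name, namespace, image, and sizes below are illustrative, not the actual job config:

```yaml
# Sketch only: names, image, and resource sizes are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: scalability-test-runner
  namespace: test-pods
  annotations:
    # cluster-autoscaler will not remove a node running a pod
    # carrying this annotation during scale-down.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: test
      image: example.com/scale-test-runner:latest   # placeholder image
      resources:
        requests:
          cpu: "6"
          memory: 16Gi
        limits:
          cpu: "6"
          memory: 16Gi
```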

@hakuna-matatah
Contributor Author

This is a problem with the cluster hosting the test pod; on the prior clusters we set suitable (extra-high) resource requests/limits to all but guarantee the scalability pods were not interrupted.

This problem impacts a lot of CI jobs on the cluster, but it's especially noticeable with lengthy $$$ scale tests.

We discussed this at the last K8s Infra meeting; to me it sounds like the cluster's autoscaling behavior needs tuning so that it does not aggressively remove nodes without draining them first.

Given this is happening way more frequently now, is it possible to run the scale tests on their own cluster? Or could we leverage Karpenter? Internally, we leverage Karpenter to manage the nodes on the host cluster.
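If Karpenter did manage the host cluster's nodes, the equivalent guard would be a pod-level annotation that blocks voluntary node disruption while the test pod is running. A minimal sketch, assuming a recent Karpenter release (older releases used `karpenter.sh/do-not-evict` instead); names and image are placeholders:

```yaml
# Sketch only: with this annotation Karpenter will not voluntarily disrupt
# (consolidate or expire) the node while the pod is running.
apiVersion: v1
kind: Pod
metadata:
  name: scalability-test-runner
  namespace: test-pods
  annotations:
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
    - name: test
      image: example.com/scale-test-runner:latest   # placeholder image
```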

@xmudrii
Member

xmudrii commented Dec 30, 2023

Given this is happening way more frequently now, is it possible to run the scale tests on their own cluster? Or could we leverage Karpenter? Internally, we leverage Karpenter to manage the nodes on the host cluster.

We're planning to look into this and we have an issue to track this: kubernetes/k8s.io#5168

We discussed this at the last K8s Infra meeting; to me it sounds like the cluster's autoscaling behavior needs tuning so that it does not aggressively remove nodes without draining them first.

@BenTheElder It's not really that we need to tune the cluster-autoscaler's behavior; it's more about making sure that it respects ProwJob pods. Let's look into this together after the holiday break.
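A related knob, regardless of which autoscaler manages the nodes, is giving the ProwJob test pods a high, non-preempting priority so they at least cannot be preempted in favor of other workloads. A minimal sketch; the class name and value are placeholders, and this only addresses the preemption path, not node scale-down:

```yaml
# Sketch only: class name and value are placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prow-scale-tests
value: 1000000
preemptionPolicy: Never        # this class never preempts other pods itself
globalDefault: false
description: "High priority for long-running scalability ProwJob pods."
---
apiVersion: v1
kind: Pod
metadata:
  name: scalability-test-runner
  namespace: test-pods
spec:
  priorityClassName: prow-scale-tests
  containers:
    - name: test
      image: example.com/scale-test-runner:latest   # placeholder image
```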

@hakuna-matatah
Contributor Author

hakuna-matatah commented Jan 4, 2024

@BenTheElder It's not really that we need to tune the cluster-autoscaler's behavior, it's more about making sure that it respects ProwJob pods. Let's look into this together after the holidays break.

@xmudrii Just wondering if you got a chance to follow up? Given we are actively impacted, are we thinking in terms of short-term and long-term solutions if the level of effort for a full fix is too high? Is there a rough ETA for resolving this issue?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 3, 2024
@xmudrii
Member

xmudrii commented Apr 3, 2024

This shouldn't be an issue any longer
/close

@k8s-ci-robot
Contributor

@xmudrii: Closing this issue.

In response to this:

This shouldn't be an issue any longer
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
