kubernetes-sigs/karpenter needs infrastructure to run scale testing #7711

Open
jonathan-innis opened this issue Jan 22, 2025 · 5 comments

Labels: sig/k8s-infra (Categorizes an issue or PR as relevant to SIG K8s Infra.)

Comments

@jonathan-innis

kubernetes-sigs/karpenter is a sub-project under sig-autoscaling. Given that we are an autoscaling project, performance is an extremely high priority for us. We've wanted a comprehensive scale-testing suite for Karpenter's core scheduling and consolidation logic for a while now, and we are finally ready to prioritize it on our side.

The primary question at this point is making sure that the compute infrastructure we use to execute the scale tests is large enough to give accurate results, without being throttled on CPU or heavily constrained on memory. We tried running with the GitHub Actions containers our repository is given, and with a kind cluster, but both caused a lot of throttling and slowed down our scale tests, skewing the results.

I'd love to hear recommendations from the community on what we should be doing here -- ideally, we could get something like an EKS or GCP cluster and run the Karpenter installation directly on it with as much memory and CPU as it needs to avoid throttling. We'd run the kwok cloudprovider version of Karpenter, so all node scale-ups and scale-downs would be "fake" and we wouldn't consume compute from actual instances.

TLDR: We need a managed Kubernetes cluster environment where we can deploy Karpenter to launch fake nodes for our scale testing without getting throttled
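For illustration, here is a minimal sketch of what the "fake nodes" setup could look like: a NodePool whose nodeClassRef points at the kwok cloudprovider. The KWOKNodeClass group/kind and all values below are assumptions for illustration, not the project's actual test manifests.

```yaml
# Illustrative only: a NodePool pointed at the kwok provider, so Karpenter
# "launches" fake nodes instead of real cloud instances during scale tests.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: scale-test
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.kwok.sh   # assumption: the kwok provider's API group
        kind: KWOKNodeClass        # assumption: the kwok provider's NodeClass kind
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  limits:
    cpu: "100000"                  # generous limit so scale-ups aren't capped mid-test
```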

@jonathan-innis jonathan-innis added the sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. label Jan 22, 2025
@ameukam
Member

ameukam commented Jan 22, 2025

cc @kubernetes/sig-k8s-infra-leads

@BenTheElder
Member

I'd love to hear recommendations from the community on what we should be doing here -- ideally, we could get something like an EKS or GCP cluster and run the Karpenter installation directly on it with as much memory and CPU as it needs to avoid throttling. We'd run the kwok cloudprovider version of Karpenter, so all node scale-ups and scale-downs would be "fake" and we wouldn't consume compute from actual instances.

Can you run kwok + karpenter within CI without having additional persistent infrastructure ..? (IE prow.k8s.io, we have that already available to all projects)
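For a sense of what that could look like, a rough sketch of a prow job definition with large resource requests follows; the job name, cluster, image, and sizes are placeholders, not an existing job.

```yaml
# Hypothetical prow periodic; all names, images, and sizes are illustrative.
periodics:
- name: ci-karpenter-kwok-scale-test        # placeholder job name
  interval: 24h
  cluster: eks-prow-build-cluster           # assumption: one of the community build clusters
  decorate: true
  spec:
    containers:
    - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest   # placeholder image/tag
      command: ["runner.sh"]
      args: ["make", "test-scale"]          # placeholder entrypoint for the scale suite
      securityContext:
        privileged: true                    # only needed if the job runs kind / docker-in-docker
      resources:
        requests:
          cpu: "15"                         # sized to avoid CPU throttling; adjust to the build cluster's nodes
          memory: 60Gi
        limits:
          cpu: "15"
          memory: 60Gi
```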

@jonathan-innis
Author

Can you run kwok + karpenter within CI without having additional persistent infrastructure

If we run on prow, what are your thoughts on how we'd spin up another cluster where we actually test launching nodes? Prow could definitely help with getting a container of arbitrary size where we can run the testing, but what's your recommendation there? Would running a kind cluster inside of Prow be the recommended path here, so that the process itself has access to a cluster without using managed infra?

I'm also curious about K8s infra's take on using Prow vs. using GitHub Actions directly -- I don't think I can really scale up the container for GHA, so I'm imagining I'll end up being fairly limited there.

@BenTheElder
Member

If we run on prow, what are your thoughts on how we'd spin up another cluster where we actually test launching nodes? Prow could definitely help with getting a container of arbitrary size where we can run the testing, but what's your recommendation there? Would running a kind cluster inside of Prow be the recommended path here, so that the process itself has access to a cluster without using managed infra?

Some jobs use kind clusters; in some cases that may be insufficient, and they spin up external temporary clusters in projects/accounts rented from https://github.com/kubernetes-sigs/boskos. Check out, for example, https://github.com/kubernetes/kops/tree/master/tests/e2e/kubetest2-kops / https://github.com/kubernetes-sigs/kubetest2

That's a complicated topic; most of those are set up to test Kubernetes itself and there's a lot of copy-pasting. But basically, with a boskos client they rent a GCP project or an AWS sub-account from a pool for the duration of the job, then spin up VMs as needed. Boskos will then delete everything when the job releases it or times out.
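To make "a pool" concrete, here is a sketch of the kind of static resource list boskos serves from; the resource type and project names are made up for illustration.

```yaml
# Illustrative boskos resource pool: a job acquires one free project for its run,
# and boskos cleans it up after the job releases it or its lease expires.
resources:
- type: gce-project
  state: free
  names:
  - k8s-infra-e2e-scale-boskos-001   # placeholder project names
  - k8s-infra-e2e-scale-boskos-002
```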

For the kind route, some performance hacks exist, in particular kubernetes-sigs/kind#845 (comment)
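The gist of that hack, as a minimal sketch (the exact mechanism in the linked comment may differ), is keeping etcd's data directory on a tmpfs so control-plane I/O stays in memory:

```yaml
# Illustrative kind config: point etcd's data dir at /tmp, which is a tmpfs
# in the kind node image, so etcd writes don't hit disk during scale tests.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  etcd:
    local:
      dataDir: /tmp/etcd
```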

@BenTheElder
Member

BenTheElder commented Jan 22, 2025

I'm also curious about K8s infra's take on using Prow vs. using GitHub Actions directly -- I don't think I can really scale up the container for GHA, so I'm imagining I'll end up being fairly limited there.

We don't currently have the resources for GHA runners larger than the default, and we already have to sustain prow.
SIG K8s Infra doesn't have strong opinions on these; our job is to make sure the resources are managed sustainably.

From a SIG Testing POV: we wouldn't move everything over to GitHub Actions even if we had the resources. Prow has some useful properties around ensuring the linearization of tested commits (i.e. prow knows what commit was tested in the PR and the branch it would merge to, and ensures that the tests passed at the latest of both before merge) and around being able to batch-test large numbers of PRs together safely (e.g. after kubernetes lifts code freeze we may test and merge up to 15 PRs at a time, where the CI merges all of them inside the test env, tests that, and then, if it passes across all required jobs, merges all of them at once). ... But we also don't actively discourage using it. kubernetes-sigs/kind is using both (prow, GHA) currently (and is a sig testing subproject); it was easier [vagrant in GHA] to get some small disposable VMs with particular kernel settings, whereas prow gives you something running inside a host cluster as a container/pod, and doing more than that requires effort ...
