kubernetes-sigs/karpenter
needs infrastructure to run scale testing
#7711
Comments
cc @kubernetes/sig-k8s-infra-leads
Can you run kwok + karpenter within CI without having additional persistent infrastructure? (i.e. prow.k8s.io, which we already have available to all projects)
If we run on Prow, what's your thinking on how we spin up another cluster where we actually test launching nodes? Prow could definitely help with getting a container of arbitrary size where we can run the testing, but what's your recommendation there? Would running a kind cluster inside of Prow be the recommended path, so that the process itself has access to a cluster without using managed infra? I'm also curious about K8s infra's take on using Prow vs. using CI actions in GH directly -- I don't think I can really scale up the container for the GHA runner, so I imagine I'd end up being fairly limited there.
Some jobs use kind clusters. In some cases that may be insufficient, and they spin up external temporary clusters in projects/accounts rented from https://github.com/kubernetes-sigs/boskos; check out, for example, https://github.com/kubernetes/kops/tree/master/tests/e2e/kubetest2-kops / https://github.com/kubernetes-sigs/kubetest2. That's a complicated topic -- most of those are set up to test Kubernetes itself and there's a lot of copy-pasting -- but basically, with a boskos client they rent a GCP project or an AWS sub-account from a pool for the duration of the job and then spin up VMs as needed. Boskos then deletes everything when the job releases it or times out. For the kind route, some performance hacks exist, in particular kubernetes-sigs/kind#845 (comment).
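For reference, one commonly cited performance hack along the lines of what that kind issue discusses is keeping etcd's data directory on a tmpfs, so the control plane isn't bottlenecked on disk I/O during heavy API churn. A minimal sketch of a kind cluster config for this, assuming /tmp is tmpfs-backed inside the kind node container (as it is with kind's default Docker provider):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  etcd:
    local:
      # Keep etcd's data under /tmp (a tmpfs inside the kind node container)
      # so large scheduling tests aren't bottlenecked on disk I/O.
      dataDir: /tmp/etcd
```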
We don't currently have the resources for GHA runners larger than default, and we already have to sustain Prow. From a SIG Testing POV: we wouldn't move everything over to GitHub Actions even if we had the resources. There are some useful properties around ensuring the linearization of tested commits (i.e. Prow knows what commit was tested in the PR and the branch it would merge to, and ensures the tests passed at the latest of both before merge) and being able to batch-test large numbers of PRs together safely (so e.g. after Kubernetes lifts code freeze we may test and merge up to 15 PRs at a time, where the CI merges all of them inside the test env, tests that, and then, if it passes across all required jobs, merges all of them at once). But we also don't actively discourage using it. kubernetes-sigs/kind is using both (Prow, GHA) currently (and is a SIG Testing subproject); it was easier [vagrant in GHA] to get some small disposable VMs with particular kernel settings, whereas Prow gives you something running inside a host cluster as a container/pod, and doing more requires effort.
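To make the Prow-plus-kind path concrete, here is a rough sketch of what a periodic Prow job for this could look like. The job name, make target, image tag, and resource sizes are all illustrative assumptions, not an existing job; the general shape (decorated pod spec, dind preset label, privileged container for kind) follows the conventions used in kubernetes/test-infra job configs:

```yaml
# Hypothetical periodic job; name, image tag, make target, and sizes are illustrative.
periodics:
- name: ci-karpenter-kwok-scale-test      # hypothetical job name
  interval: 24h
  decorate: true
  labels:
    preset-dind-enabled: "true"           # docker-in-docker, needed to run kind
  extra_refs:
  - org: kubernetes-sigs
    repo: karpenter
    base_ref: main
  spec:
    containers:
    - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest-master  # illustrative tag
      command:
      - runner.sh
      args:
      - make
      - scale-test                        # hypothetical target that brings up kind + kwok + Karpenter
      securityContext:
        privileged: true                  # required for nested containers / kind
      resources:
        requests:
          cpu: "8"
          memory: 32Gi
        limits:
          cpu: "8"
          memory: 32Gi
```

Whether the default Prow node sizes leave headroom for a request this large is exactly the sizing question raised above, so the numbers are placeholders rather than a recommendation.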
kubernetes-sigs/karpenter is a sub-project under sig-autoscaling. Given that we are an autoscaling project, performance is an extremely high priority for us. We've wanted a comprehensive scale testing suite for the core scheduling and consolidation logic that Karpenter uses for a while now, and we are finally ready to prioritize it on our side. The primary question at this point is making sure the underlying compute infrastructure we use to execute the scale tests is large enough that we get accurate results and aren't throttled on CPU or heavily constrained on memory. We tried running with the GitHub Actions containers we are given in our repository, using a kind cluster, but this caused a lot of throttling on our side and led to slow-downs in the scale testing, skewing our results.

I'd love to hear recommendations from the community on what we should be doing here -- ideally, we can get something like an EKS or GKE cluster and run the Karpenter installation directly on it with as much memory and CPU as it needs to avoid throttling. We'd run with our kwok cloudprovider version of Karpenter, so all node scale-ups and scale-downs would be "fake" and we wouldn't be consuming compute from actual instances.
TL;DR: We need a managed Kubernetes cluster environment where we can deploy Karpenter to launch fake nodes for our scale testing without getting throttled.
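For context on the "fake nodes" point: under kwok, a node is just a Node object carrying kwok's management annotation; the kwok controller keeps it Ready and simulates its pods without any backing VM, so the tests exercise Karpenter's scheduling and consolidation logic without consuming real instance capacity. A minimal sketch of such a node (the name, label, and capacity figures are illustrative; the annotation is the selector kwok's documentation uses for managed nodes):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: kwok-fake-node-0            # illustrative name
  annotations:
    kwok.x-k8s.io/node: fake        # marks the node as managed/simulated by kwok
  labels:
    type: kwok                      # illustrative label for steering test workloads
status:
  allocatable:
    cpu: "32"
    memory: 256Gi
    pods: "110"
  capacity:
    cpu: "32"
    memory: 256Gi
    pods: "110"
```

As described in the issue, with the kwok cloudprovider Karpenter would be creating nodes like this itself in response to pending pods, so the test cluster only needs enough real capacity to run Karpenter, kwok, and whatever generates the pending pods.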