Scale testing #8814

Open · 15 of 27 tasks
sbueringer opened this issue Jun 7, 2023 · 5 comments
Labels: area/e2e-testing, kind/cleanup, priority/important-longterm, triage/accepted

Comments

sbueringer (Member) commented on Jun 7, 2023

We recently merged the initial iteration of the in-memory provider (#8799). That was only the first step of the scale test implementation; this issue provides an overview of ongoing and upcoming tasks around scale testing.

In-Memory provider features:

e2e test and test framework:

  • Implement scale test automation:
    • Cluster topologies:
      • Small workload clusters: x * (1 control-plane node + 1 worker node)
      • Small-medium workload clusters: x * (3 control-plane nodes + 10 worker nodes)
      • Medium workload clusters: x * (3 control-plane nodes + 50 worker nodes)
      • Large workload clusters: x * (3 control-plane nodes + 500 worker nodes)
        • Dimensions: # of MachineDeployments
    • Scenarios:
      • P0 Create & delete (🌱 Add Scale e2e - development only #8833 @ykakarap)
      • Create, upgrade & delete @killianmuldoon
      • Long-lived clusters (~ a few hours or a day, to catch memory leaks etc.)
      • Chaos testing: e.g. injecting failures like cluster not reachable, machine failures
      • More complex scenarios: e.g. topology is actively changed (MD scale up etc.)
      • Add MachineHealthCheck to the scaling test (@ykakarap)
  • Automate scale testing in CI: (prior art KCP, k/k)
    • Metric collection and consumption after test completion
    • Tests should fail based on SLAs (e.g. machine creation slower than x minutes); see the sketch below
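
To make the SLA idea above a bit more concrete, here is a minimal, Gomega-based sketch (matching the style of the existing e2e framework) of failing a scale test on a timing budget. The `create` callback and the budget value are illustrative placeholders, not the real framework API.

```go
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
)

// verifyCreationSLA is a sketch of an SLA assertion: it runs the (hypothetical)
// create step and fails the test if it took longer than the given budget.
func verifyCreationSLA(ctx context.Context, create func(context.Context) error, budget time.Duration) {
	start := time.Now()
	Expect(create(ctx)).To(Succeed(), "creating the workload clusters failed")
	Expect(time.Since(start)).To(BeNumerically("<", budget),
		"creating the workload clusters took longer than the SLA of %s", budget)
}
```

The same pattern would work for the upgrade and delete phases, with a separate budget per phase.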

Metrics & observability:

Performance improvements

Follow-up

Anomalies found that we should further triage:

  • /convert gets called a lot (even though we never use old apiVersions)
  • When deploying > 1k clusters into a single namespace, "list machines" in KCP becomes pretty slow and apiserver CPU usage gets very high (8-14 CPUs). Debug ideas: CPU profile (see the sketch below), apiserver tracing
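
For the CPU profile debug idea, a minimal sketch of pulling a 30s profile from the apiserver's /debug/pprof endpoint via client-go. This assumes profiling is enabled on the apiserver and the caller has RBAC for the non-resource URL; `kubectl get --raw '/debug/pprof/profile?seconds=30'` does the same thing.

```go
package main

import (
	"context"
	"os"

	"k8s.io/client-go/kubernetes"
	ctrl "sigs.k8s.io/controller-runtime"
)

// dumpAPIServerCPUProfile fetches a 30s CPU profile from the kube-apiserver
// and writes it to a file that can be inspected with `go tool pprof`.
func dumpAPIServerCPUProfile(ctx context.Context, path string) error {
	cs, err := kubernetes.NewForConfig(ctrl.GetConfigOrDie())
	if err != nil {
		return err
	}
	data, err := cs.CoreV1().RESTClient().Get().
		AbsPath("/debug/pprof/profile").
		Param("seconds", "30").
		DoRaw(ctx)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600)
}
```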

Backlog improvement ideas:

  • KCP:
    • (breaking change) Create a follow-up issue requiring that all KCP Secrets carry the cluster-name label => then configure the KCP cache & client to only cache Secrets with that label (a sketch of the cache configuration follows this list)
    • EnsureResource: resources are currently cached in full. Consider caching only PartialObjectMetadata instead (see the metadata-only read sketch after this list).
    • Consider caching only the Pods we care about (at least control-plane Pods; check whether we access other Pods such as kube-proxy and CoreDNS)
    • GetMachinesForCluster: cached call + wait-for-cache safeguards
    • Optimize etcd client creation (cache clients instead of recreating them)
  • Others:
    • Change all CAPI controllers to cache unstructured objects by default and use the APIReader for uncached calls (as is already done for regular typed objects)
    • Audit all usages of the APIReader to check whether they are actually necessary
    • Run certain operations less frequently (e.g. apiVersion bump, reconcile labels)
    • Customize the controller work queue rate limiter
    • Buffered reconciling (avoid reconciling the same item repeatedly within a short period of time)
    • Resync items spread over time instead of all at once at resyncPeriod
      • Investigate whether a Reconciler re-reconciles all objects for every type it is watching (because resync is implemented on the informer level), e.g. whether the KCP controller reconciles after the KCP resync and again after the Cluster resync.
    • Priority queue
    • Use the controller-runtime cache transform option to strip parts of objects we don't use (fields which are not part of the contract); see the DefaultTransform in the sketch after this list
      • Trade-off: memory vs. processing time to strip fields; it's also unclear how to configure this up front, before the CRDs are known
      • => Based on the data we have, it's not clear this is worth it at the moment, so we won't do it for now.
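
A minimal sketch of two of the caching ideas above (label-filtered Secret caching and stripping fields before objects enter the cache), written against the controller-runtime >= v0.15 cache options. The exact fields differ between controller-runtime versions, and `cluster.x-k8s.io/cluster-name` is assumed as the cluster-name label key.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func newManager() (ctrl.Manager, error) {
	// Only cache Secrets that carry the cluster-name label, so the shared
	// informer does not hold every Secret in the management cluster in memory.
	hasClusterName, err := labels.NewRequirement(
		"cluster.x-k8s.io/cluster-name", selection.Exists, nil)
	if err != nil {
		return nil, err
	}

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&corev1.Secret{}: {Label: labels.NewSelector().Add(*hasClusterName)},
			},
			// Strip fields we never read before objects are stored in the cache
			// (the memory vs. processing-time trade-off mentioned above).
			DefaultTransform: func(obj interface{}) (interface{}, error) {
				if o, ok := obj.(client.Object); ok {
					o.SetManagedFields(nil)
				}
				return obj, nil
			},
		},
	})
}
```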
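And a sketch of the PartialObjectMetadata idea: requesting only object metadata when the full payload is not needed, so the delegating client/cache can back it with a metadata-only informer. The ConfigMap GVK here is just an illustrative target.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// hasResource checks for the existence of a ConfigMap without fetching or
// caching its full contents: only the object metadata is requested.
func hasResource(ctx context.Context, c client.Client, key client.ObjectKey) (bool, error) {
	meta := &metav1.PartialObjectMetadata{}
	meta.SetGroupVersionKind(corev1.SchemeGroupVersion.WithKind("ConfigMap"))
	if err := c.Get(ctx, key, meta); err != nil {
		if client.IgnoreNotFound(err) == nil {
			return false, nil
		}
		return false, err
	}
	return true, nil
}
```
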
sbueringer added the area/e2e-testing label on Jun 7, 2023
k8s-ci-robot added the needs-triage label on Jun 7, 2023
sbueringer (Member, Author) commented:

@fabriziopandini @ykakarap @killianmuldoon @lentzi90 Please let me know if I forgot something / you have further ideas.

sbueringer added the triage/accepted label on Jun 7, 2023
k8s-ci-robot removed the needs-triage label on Jun 7, 2023
sbueringer (Member, Author) commented:

@lentzi90 @richardcase It would be really great if you could rerun your tests with v1.5.0. I would expect huge improvements with the optimizations mentioned above in place.

richardcase (Member) commented:

Will do @sbueringer 👍

fabriziopandini (Member) commented:

/kind cleanup
/priority important-longterm

k8s-ci-robot added the kind/cleanup and priority/important-longterm labels on Apr 11, 2024
fabriziopandini (Member) commented:

/assign @sbueringer
To re-assess state
