Scale testing #8814

Open · 15 of 27 tasks
sbueringer opened this issue Jun 7, 2023 · 5 comments
Labels: area/e2e-testing, kind/cleanup, priority/important-longterm, triage/accepted

Comments

sbueringer (Member) commented on Jun 7, 2023

We recently merged the initial iteration of the in-memory provider (#8799). That was only the first step of the scale test implementation; this issue provides an overview of ongoing and upcoming tasks around scale testing.

In-Memory provider features:

e2e test and test framework:

  • Implement scale test automation:
    • Cluster topologies:
      • Small workload clusters: x * (1 control-plane node + 1 worker node)
      • Small-medium workload clusters: x * (3 control-plane nodes + 10 worker nodes)
      • Medium workload clusters: x * (3 control-plane nodes + 50 worker nodes)
      • Large workload clusters: x * (3 control-plane nodes + 500 worker nodes)
        • Dimensions: # of MachineDeployments
    • Scenarios:
      • P0 Create & delete (🌱 Add Scale e2e - development only #8833 @ykakarap)
      • Create, upgrade & delete @killianmuldoon
      • Long-lived clusters (~ a few hours or a day, to catch memory leaks etc.)
      • Chaos testing: e.g. injecting failures like cluster not reachable, machine failures
      • More complex scenarios: e.g. topology is actively changed (MD scale up etc.)
      • Add MachineHealthCheck to the scaling test (@ykakarap)
  • Automate scale testing in CI: (prior art KCP, k/k)
    • Metric collection and consumption after test completion
    • Tests should fail based on SLAs (e.g. machine creation slower than x minutes); see the sketch below
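
To make the SLA idea above a bit more concrete, here is a minimal, Gomega-based sketch (matching the style of the existing e2e framework) of failing a scale test on a timing budget. The `create` callback and the budget value are illustrative placeholders, not the real framework API.

```go
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
)

// verifyCreationSLA is a sketch of an SLA assertion: it runs the (hypothetical)
// create step and fails the test if it took longer than the given budget.
func verifyCreationSLA(ctx context.Context, create func(context.Context) error, budget time.Duration) {
	start := time.Now()
	Expect(create(ctx)).To(Succeed(), "creating the workload clusters failed")
	Expect(time.Since(start)).To(BeNumerically("<", budget),
		"creating the workload clusters took longer than the SLA of %s", budget)
}
```

The same pattern would work for the upgrade and delete phases, with a separate budget per phase.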

Metrics & observability:

Performance improvements

Follow-up

Anomalies found that we should further triage:

  • /convert gets called a lot (even though we never use old apiVersions)
  • When deploying > 1k clusters into a single namespace, "list machines" in KCP becomes pretty slow and apiserver CPU usage gets very high (8-14 CPUs). Debug ideas: CPU profile (see the sketch below), apiserver tracing
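
For the CPU profile debug idea, a minimal sketch of pulling a 30s profile from the apiserver's /debug/pprof endpoint via client-go. This assumes profiling is enabled on the apiserver and the caller has RBAC for the non-resource URL; `kubectl get --raw '/debug/pprof/profile?seconds=30'` does the same thing.

```go
package main

import (
	"context"
	"os"

	"k8s.io/client-go/kubernetes"
	ctrl "sigs.k8s.io/controller-runtime"
)

// dumpAPIServerCPUProfile fetches a 30s CPU profile from the kube-apiserver
// and writes it to a file that can be inspected with `go tool pprof`.
func dumpAPIServerCPUProfile(ctx context.Context, path string) error {
	cs, err := kubernetes.NewForConfig(ctrl.GetConfigOrDie())
	if err != nil {
		return err
	}
	data, err := cs.CoreV1().RESTClient().Get().
		AbsPath("/debug/pprof/profile").
		Param("seconds", "30").
		DoRaw(ctx)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600)
}
```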

Backlog improvement ideas:

  • KCP:
    • (breaking change) Create a follow-up issue requiring that all KCP Secrets carry the cluster-name label => then configure the KCP cache & client to only cache Secrets with that label (a sketch of the cache configuration follows this list)
    • EnsureResource: resources are currently cached in full. Consider caching only PartialObjectMetadata instead (see the metadata-only read sketch after this list).
    • Consider caching only the Pods we care about (at least control-plane Pods; check whether we access other Pods such as kube-proxy and CoreDNS)
    • GetMachinesForCluster: cached call + wait-for-cache safeguards
    • Optimize etcd client creation (cache clients instead of recreating them)
  • Others:
    • Change all CAPI controllers to cache unstructured objects by default and use the APIReader for uncached calls (as is already done for regular typed objects)
    • Audit all usages of the APIReader to check whether they are actually necessary
    • Run certain operations less frequently (e.g. apiVersion bump, reconcile labels)
    • Customize the controller work queue rate limiter
    • Buffered reconciling (avoid reconciling the same item repeatedly within a short period of time)
    • Resync items spread over time instead of all at once at resyncPeriod
      • Investigate whether a Reconciler re-reconciles all objects for every type it is watching (because resync is implemented on the informer level), e.g. whether the KCP controller reconciles after the KCP resync and again after the Cluster resync.
    • Priority queue
    • Use the controller-runtime cache transform option to strip parts of objects we don't use (fields which are not part of the contract); see the DefaultTransform in the sketch after this list
      • Trade-off: memory vs. processing time to strip fields; it's also unclear how to configure this up front, before the CRDs are known
      • => Based on the data we have, it's not clear this is worth it at the moment, so we won't do it for now.
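
A minimal sketch of two of the caching ideas above (label-filtered Secret caching and stripping fields before objects enter the cache), written against the controller-runtime >= v0.15 cache options. The exact fields differ between controller-runtime versions, and `cluster.x-k8s.io/cluster-name` is assumed as the cluster-name label key.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func newManager() (ctrl.Manager, error) {
	// Only cache Secrets that carry the cluster-name label, so the shared
	// informer does not hold every Secret in the management cluster in memory.
	hasClusterName, err := labels.NewRequirement(
		"cluster.x-k8s.io/cluster-name", selection.Exists, nil)
	if err != nil {
		return nil, err
	}

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&corev1.Secret{}: {Label: labels.NewSelector().Add(*hasClusterName)},
			},
			// Strip fields we never read before objects are stored in the cache
			// (the memory vs. processing-time trade-off mentioned above).
			DefaultTransform: func(obj interface{}) (interface{}, error) {
				if o, ok := obj.(client.Object); ok {
					o.SetManagedFields(nil)
				}
				return obj, nil
			},
		},
	})
}
```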
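And a sketch of the PartialObjectMetadata idea: requesting only object metadata when the full payload is not needed, so the delegating client/cache can back it with a metadata-only informer. The ConfigMap GVK here is just an illustrative target.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// hasResource checks for the existence of a ConfigMap without fetching or
// caching its full contents: only the object metadata is requested.
func hasResource(ctx context.Context, c client.Client, key client.ObjectKey) (bool, error) {
	meta := &metav1.PartialObjectMetadata{}
	meta.SetGroupVersionKind(corev1.SchemeGroupVersion.WithKind("ConfigMap"))
	if err := c.Get(ctx, key, meta); err != nil {
		if client.IgnoreNotFound(err) == nil {
			return false, nil
		}
		return false, err
	}
	return true, nil
}
```
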
sbueringer added the area/e2e-testing label on Jun 7, 2023
k8s-ci-robot added the needs-triage label on Jun 7, 2023
sbueringer (Member, Author) commented:

@fabriziopandini @ykakarap @killianmuldoon @lentzi90 Please let me know if I forgot something / you have further ideas.

sbueringer added the triage/accepted label on Jun 7, 2023
k8s-ci-robot removed the needs-triage label on Jun 7, 2023
sbueringer (Member, Author) commented:

@lentzi90 @richardcase It would be really great if you could rerun your tests with v1.5.0. I would expect huge improvements with the optimizations mentioned above in place.

richardcase (Member) commented:

Will do @sbueringer 👍

fabriziopandini (Member) commented:

/kind cleanup
/priority important-longterm

k8s-ci-robot added the kind/cleanup and priority/important-longterm labels on Apr 11, 2024
fabriziopandini (Member) commented:

/assign @sbueringer
To re-assess state
