# SRE Investigation on Crossplane: found mitigation actions #2061
## Comments
@gianfranco-l I changed the title, I think this is what you had in mind.
@piontec could you add a little bit of description about this?
We need to compare crossplane performance between different clusters (actually: api server performance when crossplane is installed with the AWS provider). Installations we should compare are:

- gauss (when the 1.11 upgrade is complete) and
- potato
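For the comparison itself, something like the following minimal sketch could pull the same signals from each installation's Prometheus. The endpoints, the p99 quantile, and the rate window are illustrative assumptions, not values from this thread; the query targets the API Server Request Duration signal the Grafana dashboard visualises, plus api-server memory.

```python
# Hedged sketch: compare api-server latency and memory across
# installations. Endpoints, quantile, and windows are illustrative
# assumptions, not values from this investigation.
import requests

PROMETHEUS_URLS = {
    "gauss": "http://prometheus.gauss.example:9090",    # placeholder
    "potato": "http://prometheus.potato.example:9090",  # placeholder
}

P99_LATENCY = (
    "histogram_quantile(0.99, sum(rate("
    'apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) by (le))'
)
APISERVER_MEMORY = (
    "sum(container_memory_working_set_bytes{"
    'namespace="kube-system",pod=~"kube-apiserver.*"})'
)

def instant_query(base_url: str, query: str) -> float:
    """Run a PromQL instant query and return the first sample value."""
    resp = requests.get(f"{base_url}/api/v1/query",
                        params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

for name, url in PROMETHEUS_URLS.items():
    p99 = instant_query(url, P99_LATENCY)
    mem_gib = instant_query(url, APISERVER_MEMORY) / 2**30
    print(f"{name}: p99 request duration {p99:.2f}s, "
          f"api-server memory {mem_gib:.1f} GiB")
```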
## Investigative Subjects

I was focusing on the impact crossplane has on our controlplane nodes. The scope did not include:
The generally perceived problem was the slowness/sluggishness of api-server requests (best visualised in grafana -> k8s api performance -> API Server Request Duration).

I had no logs from etcd, as it was running on already removed instances without loki.

## Summary Findings

The consistent finding in all installations was a lack of available memory on all controlplane nodes. In cases of unusable/crashed controlplane nodes, the persistent fix was doubling the available memory (cpu might not be that crucial after the initial install).

=> Running on mostly (aws)

## Suggestions from upstream

A sizing guide for the provider installation is currently missing.

The split of providers is considered:
- 🥇 The challenge of installing fewer CRDs (see the CRD-count sketch below)

## Useful links

## Details

All installations are running
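To put a rough number on the CRD load a provider brings in, a minimal sketch like the following could be used (not part of the original writeup; it assumes `pip install kubernetes` and a kubeconfig pointing at the cluster under investigation):

```python
# Count installed CRDs per API group, e.g. to see how many the
# upbound-aws provider alone contributes. Assumes the official
# kubernetes Python client and a reachable kubeconfig.
from collections import Counter

from kubernetes import client, config

config.load_kube_config()
crds = client.ApiextensionsV1Api().list_custom_resource_definition()

by_group = Counter(crd.spec.group for crd in crds.items)
for group, count in by_group.most_common(15):
    print(f"{count:4d}  {group}")
print(f"total: {len(crds.items)} CRDs")
```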
Timestamps are UTC.

### Gauss

#### Leadup

On 27th of Feb. the aws-upbound provider in v0.30 (or earlier) was installed at around 12:20 on an already memory-wise struggling controlplane.

Controlplane components:

#### Impact on basic resources

CPU and Memory exhaustion on all controlplane nodes:

#### Impact on k8s components
=> a lot of timeout.go & retry_interceptor.go

This cascades into pod restarts/evictions (some >200), putting more stress on the controlplane. As Dex was affected as well, no logins were possible.

#### Impact on api-server

The rate of API requests stayed almost the same; ETCD actions & writes roughly doubled, but only for a short timeframe and not worrisome. ETCD member keys increased from 32k -> 46k (my suspicion: a lot of

Overall etcd performance severely degraded:

avg_over_time indicates a sustained increase of overall memory consumption of the API pods from ~4GiB to ~8GiB.

ETCD disk-sync and traffic in/out didn't indicate abnormal behaviour.

#### Recovery

Setting the ASG to

#### Current State (14th March)
### potato

#### Leadup

On March 1st at around 13:00 a new version of the upbound-aws provider was installed. This involves creation of the new revision, teardown of the old one, install of new crds, and transferring ownership of the old crds (loads of errors with "cannot transfer ownership", though).

This incident is different from the one on gauss, as this was just an upgrade of the aws provider and almost no new crds were installed.

Controlplane components:

#### Impact on basic resources

CPU and Memory exhaustion on all controlplane nodes:

#### Impact on api-server
Overall etcd performance severely degraded here as well; this was fixed by moving to instances with more memory:

Interestingly, the etcd key entries spiked heavily:

This could be related to a bit of rescheduling/eviction on the controlplane:
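A hedged sketch for reproducing that key-count spike from Prometheus follows; the metric name is version-dependent, and the endpoint and time window are placeholders, not values from this thread:

```python
# Check etcd key counts around the March 1st upgrade window.
# etcd_debugging_mvcc_keys_total is an assumption (its name varies
# by etcd version); PROM and the window are placeholders.
import requests

PROM = "http://prometheus:9090"
resp = requests.get(
    f"{PROM}/api/v1/query_range",
    params={
        "query": "etcd_debugging_mvcc_keys_total",
        "start": "2023-03-01T12:00:00Z",
        "end": "2023-03-01T16:00:00Z",
        "step": "5m",
    },
    timeout=10,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    counts = [int(float(v)) for _, v in series["values"]]
    instance = series["metric"].get("instance", "?")
    print(f"{instance}: {min(counts)} -> {max(counts)} keys")
```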
#### Recovery

Moving the ASG to a new controlplane:

Both providers are currently running:
### anaconda

Is running

No metrics are available from the time of the install (Jan 24th). Currently, API servers consume about 7GiB of memory on rather full nodes without apparent impact. API Request Duration is (2d window) consistently <1s.

### grizzly

Is running

On the Machines:
I noticed you're mostly testing with old-ish versions of Kubernetes. If you can, it does help to run Kubernetes v1.26. Each recent API server release has had a few small fixes to improve resource utilization in the face of many CRDs. This won't be enough to provide a good experience (we still need to reduce the number of CRDs Crossplane installs), but it should at least shave a bit of CPU and memory usage (I want to say at least a GB) from what you're seeing.
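For reference, a quick way to check which server version a given installation is on (a sketch using the official Python client, not from this thread):

```python
# Print the api-server version, to check it against the suggested
# v1.26 baseline. Assumes a kubeconfig for the target installation.
from kubernetes import client, config

config.load_kube_config()
version = client.VersionApi().get_code()
print(f"server version: {version.git_version}")
```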