
SRE Investigation on Crossplane: found mitigation actions #2061

Closed · Tracked by #1971
gianfranco-l opened this issue Feb 22, 2023 · 6 comments

gianfranco-l commented Feb 22, 2023

We need to compare crossplane performance between different clusters (actually: api server performance when crossplane is installed with the AWS provider).

Installations we should compare are gauss (once the 1.11 upgrade is complete) and potato

gianfranco-l added the team/honeybadger Team Honey Badger label Feb 22, 2023
piontec changed the title from "SRE Investigation around Grizzly fall" to "SRE Investigation around Crossplane" Feb 23, 2023

piontec commented Feb 23, 2023

@gianfranco-l I changed the title, I think this is what you had in mind.

gianfranco-l (Author) commented:

@piontec could you add a little bit of description about this?

piontec commented Feb 23, 2023

We need to compare crossplane performance (actually: api server performance when crossplane is installed with the AWS provider).

Installations we should compare are gauss (once the 1.11 upgrade is complete) and potato

nce commented Mar 9, 2023

Investigative Subjects

I focused on the impact crossplane has on our controlplane nodes. Out of scope were:

  • sizing of the crossplane installation itself
  • client api throttling from various sources

The generally perceived problem was slow/sluggish api-server requests (best visualised in Grafana -> k8s api performance -> API Server Request Duration)

I had no logs from etcd, as it was running on already-removed instances without Loki.

Summary Findings

The consistent finding across all installations was a lack of available memory on the controlplane nodes. In cases of unusable/crashed controlplane nodes, the persistent fix was doubling the available memory (CPU might not be that crucial after the initial install).

=> We are mostly running on (aws) m5.xlarge; we should switch to r6i.xlarge (4 CPU; 32 GiB) for crossplane installations
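
For reference, the two instance types have the same vCPU count and differ mainly in memory; this can be confirmed with the AWS CLI (a quick sketch, not part of the original investigation):

  aws ec2 describe-instance-types \
    --instance-types m5.xlarge r6i.xlarge \
    --query 'InstanceTypes[].[InstanceType,VCpuInfo.DefaultVCpus,MemoryInfo.SizeInMiB]' \
    --output table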

Suggestions from upstream

  • A sizing guide for the provider installation is currently missing
  • The split of providers is being considered
  • The API server uses a lot of memory (around 3MB per CRD); see the rough estimate below
  • 🥇 The challenge of installing fewer CRDs
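
A rough back-of-the-envelope estimate, combining that ~3MB/CRD figure with the ~873 upbound CRDs counted on grizzly further below:

  echo $(( 873 * 3 ))   # ≈ 2619 MB, i.e. roughly 2.6 GiB of api-server memory for the upbound CRDs alone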

Useful links

Details

All installations are running v1.24.10 (unless mentioned otherwise) and have (among others) the same api-server flags:

      --max-requests-inflight=266
      --max-mutating-requests-inflight=133
      --enable-priority-and-fairness=true
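
One way to double-check the effective flags on a controlplane node (a sketch, assuming shell access to the node and pgrep being available):

  pgrep -af kube-apiserver | tr ' ' '\n' | grep -E 'requests-inflight|priority-and-fairness'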

Timestamps are UTC

Gauss

Leadup

On Feb 27th at around 12:20, the aws-upbound provider in v0.30 (or earlier) was installed on a controlplane that was already struggling memory-wise.

Controlplane components:
10-0-5-111 & 10-0-5-185 & 10-0-5-9 all running on m5.xlarge (4Cpu & 16 GiB)

Impact on basic resources

CPU and Memory exhaustion on all controlplane nodes:
[screenshots]

Impact on k8s components

Unable to contact k8s api-server" error="Get \"https://internal-g8s.gauss.eu-west-1.aws.gigantic.io:443/api/v1/namespaces/kube-system\": EOF" ipAddr="https://internal-g8s.gauss.eu-west-1.aws.gigantic.io:443" subsys=k8s 
Get "https://172.31.0.1:443/api?timeout=32s": dial tcp 172.31.0.1:443: connect: no route to host
cacher.go:425] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list infrastructure.cluster.x-k8s.io/v1beta1, Kind=AWSManagedMachinePool: conversion webhook for infrastructure.cluster.x-k8s.io/v1alpha3, Kind=AWSManagedMachinePool failed: Post "https://capa-webhook-service.giantswarm.svc:443/convert?timeout=30s": dial tcp 172.31.251.1:443: connect: connection refused; reinitializing...
{"level":"warn","ts":"2023-02-27T13:30:27.959Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0619cc1c0/etcd2.gauss.eu-west-1.aws.gigantic.io:2379","attempt":0,"error":"rpc error: code = Canceled desc = context canceled"}
status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}: etcdserver: request timed out

=> a lot of timeout.go & retry_interceptor.go timeouts & context_exceeded & post_timeout from 12:25
Failure to access api-server amongst others (api-server -> etcd timeout)

This cascaded into pod restarts/evictions (some >200), putting more stress on the controlplane.

As Dex was affected as well, no logins were possible.

Impact on api-server

[screenshot]

The rate of API requests stayed almost the same; etcd actions & writes roughly doubled, but only for a short timeframe and not worrisome. etcd member keys increased from 32k -> 46k (my suspicion: a lot of events).
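
To confirm the suspicion that events account for most of the key growth, the keyspace can be broken down by resource type (a sketch, assuming etcdctl v3 and the usual --endpoints/--cacert/--cert/--key flags, omitted here):

  ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only | awk -F/ 'NF>2 {print $3}' | sort | uniq -c | sort -rn | head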

Overall etcd performance severely degraded:
[screenshot]

avg_over_time indicates a sustained increase in overall memory consumption of the API server pods from ~4GiB to ~8GiB.

ETCD disk-sync or traffic in/out didn't indicate abnormal behaviour.

Recovery

Setting the ASG to m5.2xlarge (8Cpu & 32GiB) stabilized the cluster again at around 14:40

Current State (14th March)

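# controlplane nodes (number of pods; cpu; mem)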
ip-10-0-5-30.eu-west-1.compute.internal     33        1562        23239
ip-10-0-5-121.eu-west-1.compute.internal    32         857        22986
ip-10-0-5-135.eu-west-1.compute.internal    36        1178        19885

giantswarm@ip-10-0-5-30 ~ $ free -m
               total        used        free      shared  buff/cache   available
Mem:           31645       17337        3618         913       10688       12951

giantswarm@ip-10-0-5-30 ~ $ sudo systemd-cgtop -n1 -m
Control Group                                                                    Tasks   %CPU   Memory  Input/s Output/s
/                                                                                1529      -    27.2G        -        -
kubepods.slice                                                                   797      -    16.7G        -        -
kubepods.slice/kubepods-burstable.slice                                          679      -    16.0G        -        -
kubepods.slice/kubepods-burstable.slice/kubepods-burstable-kube-apiserver.slice  26      -    12.9G        -        -
system.slice                                                                     578      -     9.3G        -        -
system.slice/containerd.service                                                  468      -     5.0G        -        -
system.slice/docker-etcd.scope                                                   18      -     2.9G


potato

Leadup

On March 1st at around 13:00 a new version of the upbound-aws provider was installed. This involved creating the new revision, tearing down the old one, installing the new CRDs and transferring ownership of the existing CRDs (though with loads of "cannot transfer ownership" errors).

This incident is different from the one on gauss, as this was just an upgrade of the aws provider and almost no new CRDs were installed.

Controlplane components:
10-0-5-110 & 10-0-5-49 & 10-0-5-178 all running on m5.xlarge (4Cpu & 16 GiB)

Impact on basic resources

CPU and Memory exhaustion on all controlplane nodes:
[screenshots]

Impact on api-server

[screenshot]
After the scale-up of the controlplane, memory consumption went back to the "pre-upgrade" level of around 7GiB.

Overall etcd performance severely degraded here as well; this was fixed by switching to instances with more memory:
[screenshot]

Interestingly, the etcd key entries spiked heavily:
[screenshot]

Could be related to a bit of rescheduling/eviction on the controlplane:

  Normal   NodeHasInsufficientMemory  9m18s (x31 over 20d)   kubelet          Node ip-10-0-5-49.eu-central-1.compute.internal status is now: NodeHasInsufficientMemory
  Normal   NodeHasInsufficientMemory  6m14s (x21 over 8d)    kubelet          Node ip-10-0-5-110.eu-central-1.compute.internal status is now: NodeHasInsufficientMemory
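
These memory-pressure events can also be pulled up across all nodes with standard kubectl (a sketch, not from the original investigation):

  kubectl get events -A --field-selector reason=NodeHasInsufficientMemory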

Recovery

Moving the ASG to r6i.xlarge (4Cpu & 32GiB) stabilized the cluster again at around 15:00

new controlplane:
10-0-5-182 & 10-0-5-100 & 10-0-5-12

Both providers are currently running:

  • crossplane-provider-aws:v0.37.1
  • upbound-provider-aws:v0.30.0

anaconda

Is running upbound-provider-aws:v0.30.0 on k8s v1.24.10 m5.xlarge.

No metrics are available from the time of the install (jan 24th). Currently anaconda seems to run fine.

API servers consume about 7GiB of memory on rather full nodes without apparent impact. API Request Duration is consistently <1s (2d window).

grizzly

Is running upbound-provider-aws:v0.30.0 on k8s v1.23.16 on m5.xlarge and is perceived as slow in manual api requests handling CRDs (k get crds). As it is a pretty new setup, etcd is running as a static pod with currently around 1GiB memory usage; the api-server remains at around 8-9GiB. During my time using the cluster, k9s was responding fast.
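
The perceived slowness of k get crds can be put into numbers with a simple timing (a minimal sketch):

  time kubectl get crds > /dev/null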

[screenshot]

On the Machines:

# controlplane nodes (number of pods; cpu; mem)
ip-10-0-46-43.eu-west-2.compute.internal 17        1246         8515
ip-10-0-123-25.eu-west-2.compute.internal 16         993         8191
ip-10-0-17-82.eu-west-2.compute.internal 18        1107         7759

ip-10-0-46-43 $ free -m
              total        used        free      shared  buff/cache   available
Mem:          15537        9673         632          39        5230        5497

ip-10-0-46-43 $ sudo systemd-cgtop -n 1 -m # only the top 2; memory in the 3rd column
system.slice/containerd.service                                                497  368.6    13.0G
system.slice/containerd.service/kubepods-burstable-kube-apiserver              18  346.8     9.8G
system.slice/containerd.service/kubepods-burstable-etcd                        14   12.0     1.1G

nce commented Mar 15, 2023

Investigation around deleting and blocking CRDs

As we're impacted by the amount of (unnecessary) CRDs, we wanted to test the manual deletion of CRDs and evaluate its impact.

(gs-grizzly) ~ ❯❯❯ k get crd | wc -l
    1001
(gs-grizzly) ~ ❯❯❯ k get crds | grep upbound | awk '{print $1}' |  wc -l
     873
(gs-grizzly) ~ ❯❯❯ k delete crd $(k get crds | grep upbound | awk '{print $1}' | head -n 500)
  ...
  customresourcedefinition.apiextensions.k8s.io "originaccesscontrols.cloudfront.aws.upbound.io" deleted
(gs-grizzly) ~ ❯❯❯ k get crd | wc -l                                                                                                                                                                                     
     501
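
For reference, a quick way to see which API group domains dominate the CRD count (a sketch using standard kubectl/awk, grouping by the trailing domain of each CRD name):

  kubectl get crds | awk 'NR>1 {n=split($1,a,"."); print a[n-2]"."a[n-1]"."a[n]}' | sort | uniq -c | sort -rn | head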

Apart from a massive logging increase of the upbound provider, we noticed nothing unusual on the provider itself (like CPU/mem spikes):

E0315 08:00:42.279958       1 reflector.go:140] k8s.io/client-go@v0.25.0/tools/cache/reflector.go:169: Failed to watch *v1beta1.Cluster: failed to list *v1beta1.Cluster: the server could not find the requested resource (get clusters.dax.aws.upbound.io)
W0315 08:00:42.315298       1 reflector.go:424] k8s.io/client-go@v0.25.0/tools/cache/reflector.go:169: failed to list *v1beta1.Analyzer: the server could not find the requested resource (get analyzers.accessanalyzer.aws.upbound.io)
E0315 08:00:42.315323       1 reflector.go:140] k8s.io/client-go@v0.25.0/tools/cache/reflector.go:169: Failed to watch *v1beta1.Analyzer: failed to list *v1beta1.Analyzer: the server could not find the requested resource (get analyzers.accessanalyzer.aws.upbound.io)

As anticipated, about 20min later the provider added all missing CRDs again (without a pod restart).
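
The re-creation can be followed with a simple loop like this (a sketch):

  while true; do echo "$(date +%H:%M) CRDs: $(kubectl get crds | grep -c upbound)"; sleep 60; done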

Memory consumption of the api-server decreased as expected (then increased again):
[screenshot]
There was no real impact (besides a short spike) on CPU, though.

9:51 am CRDs: 1001
system.slice/containerd.service/kubepods-burstable-poddbd36bd9805a5837721c23d7bc568b50.slice:cri-containerd:92c0ded4283ac455e3169362db95f16be262eb68f33a78753eaccd138cd96888      18  357.2     9.6G

10:00am CRDs: 501
system.slice/containerd.service/kubepods-burstable-poddbd36bd9805a5837721c23d7bc568b50.slice:cri-containerd:92c0ded4283ac455e3169362db95f16be262eb68f33a78753eaccd138cd96888      18    7.7     7.7G        -        -

10:14am CRDs: 1001
system.slice/containerd.service/kubepods-burstable-poddbd36bd9805a5837721c23d7bc568b50.slice:cri-containerd:92c0ded4283ac455e3169362db95f16be262eb68f33a78753eaccd138cd96888      18    8.1     8.8G        -        -

On a side note, this was observed as well:
[screenshot]

Decision

As interesting as this was, we decided not to move forward with this approach:

  • unsure if our current state of kyverno is up to the task
  • very hacky solution
  • we decided to wait for upstream tackling this problem (see above post)

negz commented Mar 16, 2023

I noticed you're mostly testing with old-ish versions of Kubernetes. If you can, it does help to run Kubernetes v1.26. Each recent API server release has had a few small fixes to improve resource utilization in the face of many CRDs. This won't be enough to provide a good experience (we still need to reduce the number of CRDs Crossplane installs), but it should at least shave a bit of CPU and memory usage (I want to say at least a GB) from what you're seeing.

gianfranco-l changed the title from "SRE Investigation around Crossplane" to "SRE Investigation on Crossplane: found mitigation actions" Mar 16, 2023