
SRE Investigation on Crossplane: found mitigation actions #2061

Closed · Tracked by #1971
gianfranco-l opened this issue Feb 22, 2023 · 6 comments

gianfranco-l commented Feb 22, 2023

We need to compare crossplane performance between different clusters (actually: api server performance when crossplane is installed with the AWS provider).

Installations we should compare are gauss (once the 1.11 upgrade is complete) and potato

gianfranco-l added the team/honeybadger Team Honey Badger label Feb 22, 2023
piontec changed the title from "SRE Investigation around Grizzly fall" to "SRE Investigation around Crossplane" Feb 23, 2023

piontec commented Feb 23, 2023

@gianfranco-l I changed the title, I think this is what you had in mind.

gianfranco-l (Author) commented:

@piontec could you add a little bit of description about this?

piontec commented Feb 23, 2023

We need to compare crossplane performance (actually: api server performance when crossplane is installed with the AWS provider).

Installations we should compare are gauss (once the 1.11 upgrade is complete) and potato

nce commented Mar 9, 2023

Investigative Subjects

I focused on the impact crossplane has on our controlplane nodes. Out of scope were:

  • sizing of the crossplane installation itself
  • client api throttling from various sources

The generally perceived problem was slow/sluggish api-server requests (best visualised in Grafana -> k8s api performance -> API Server Request Duration)

I had no logs from etcd, as it was running on already-removed instances without Loki.

Summary Findings

The consistent finding across all installations was a lack of available memory on the controlplane nodes. In cases of unusable/crashed controlplane nodes, the persistent fix was doubling the available memory (CPU might not be that crucial after the initial install).

=> We are mostly running on (aws) m5.xlarge; we should switch to r6i.xlarge (4 CPU; 32 GiB) for crossplane installations
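
For reference, the two instance types have the same vCPU count and differ mainly in memory; this can be confirmed with the AWS CLI (a quick sketch, not part of the original investigation):

  aws ec2 describe-instance-types \
    --instance-types m5.xlarge r6i.xlarge \
    --query 'InstanceTypes[].[InstanceType,VCpuInfo.DefaultVCpus,MemoryInfo.SizeInMiB]' \
    --output table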

Suggestions from upstream

  • A sizing guide for the provider installation is currently missing
  • The split of providers is being considered
  • The API server uses a lot of memory (around 3MB per CRD); see the rough estimate below
  • 🥇 The challenge of installing fewer CRDs
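
A rough back-of-the-envelope estimate, combining that ~3MB/CRD figure with the ~873 upbound CRDs counted on grizzly further below:

  echo $(( 873 * 3 ))   # ≈ 2619 MB, i.e. roughly 2.6 GiB of api-server memory for the upbound CRDs alone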

Useful links

Details

All installations are running v1.24.10 (unless mentioned otherwise) and have (among others) the same api-server flags:

      --max-requests-inflight=266
      --max-mutating-requests-inflight=133
      --enable-priority-and-fairness=true
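
One way to double-check the effective flags on a controlplane node (a sketch, assuming shell access to the node and pgrep being available):

  pgrep -af kube-apiserver | tr ' ' '\n' | grep -E 'requests-inflight|priority-and-fairness'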

Timestamps are UTC

Gauss

Leadup

On Feb 27th at around 12:20, the aws-upbound provider in v0.30 (or earlier) was installed on a controlplane that was already struggling memory-wise.

Controlplane components:
10-0-5-111 & 10-0-5-185 & 10-0-5-9 all running on m5.xlarge (4Cpu & 16 GiB)

Impact on basic resources

CPU and Memory exhaustion on all controlplane nodes:
[screenshots]

Impact on k8s components

Unable to contact k8s api-server" error="Get \"https://internal-g8s.gauss.eu-west-1.aws.gigantic.io:443/api/v1/namespaces/kube-system\": EOF" ipAddr="https://internal-g8s.gauss.eu-west-1.aws.gigantic.io:443" subsys=k8s 
Get "https://172.31.0.1:443/api?timeout=32s": dial tcp 172.31.0.1:443: connect: no route to host
cacher.go:425] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list infrastructure.cluster.x-k8s.io/v1beta1, Kind=AWSManagedMachinePool: conversion webhook for infrastructure.cluster.x-k8s.io/v1alpha3, Kind=AWSManagedMachinePool failed: Post "https://capa-webhook-service.giantswarm.svc:443/convert?timeout=30s": dial tcp 172.31.251.1:443: connect: connection refused; reinitializing...
{"level":"warn","ts":"2023-02-27T13:30:27.959Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0619cc1c0/etcd2.gauss.eu-west-1.aws.gigantic.io:2379","attempt":0,"error":"rpc error: code = Canceled desc = context canceled"}
status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}: etcdserver: request timed out

=> a lot of timeout.go & retry_interceptor.go timeouts & context_exceeded & post_timeout from 12:25
Failure to access api-server amongst others (api-server -> etcd timeout)

This cascaded into pod restarts/evictions (some >200), putting more stress on the controlplane.

As Dex was affected as well, no logins were possible.

Impact on api-server

[screenshot]

The rate of API requests stayed almost the same; etcd actions & writes roughly doubled, but only for a short timeframe and not worrisome. etcd member keys increased from 32k -> 46k (my suspicion: a lot of events).
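
To confirm the suspicion that events account for most of the key growth, the keyspace can be broken down by resource type (a sketch, assuming etcdctl v3 and the usual --endpoints/--cacert/--cert/--key flags, omitted here):

  ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only | awk -F/ 'NF>2 {print $3}' | sort | uniq -c | sort -rn | head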

Overall etcd performance severely degraded:
[screenshot]

avg_over_time indicates a sustained increase in overall memory consumption of the API server pods from ~4GiB to ~8GiB.

ETCD disk-sync or traffic in/out didn't indicate abnormal behaviour.

Recovery

Setting the ASG to m5.2xlarge (8Cpu & 32GiB) stabilized the cluster again at around 14:40

Current State (14th March)

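# controlplane nodes (number of pods; cpu; mem)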
ip-10-0-5-30.eu-west-1.compute.internal     33        1562        23239
ip-10-0-5-121.eu-west-1.compute.internal    32         857        22986
ip-10-0-5-135.eu-west-1.compute.internal    36        1178        19885

giantswarm@ip-10-0-5-30 ~ $ free -m
               total        used        free      shared  buff/cache   available
Mem:           31645       17337        3618         913       10688       12951

giantswarm@ip-10-0-5-30 ~ $ sudo systemd-cgtop -n1 -m
Control Group                                                                    Tasks   %CPU   Memory  Input/s Output/s
/                                                                                1529      -    27.2G        -        -
kubepods.slice                                                                   797      -    16.7G        -        -
kubepods.slice/kubepods-burstable.slice                                          679      -    16.0G        -        -
kubepods.slice/kubepods-burstable.slice/kubepods-burstable-kube-apiserver.slice  26      -    12.9G        -        -
system.slice                                                                     578      -     9.3G        -        -
system.slice/containerd.service                                                  468      -     5.0G        -        -
system.slice/docker-etcd.scope                                                   18      -     2.9G


potato

Leadup

On March 1st at around 13:00 a new version of the upbound-aws provider was installed. This involved creating the new revision, tearing down the old one, installing the new CRDs and transferring ownership of the existing CRDs (though with loads of "cannot transfer ownership" errors).

This incident is different from the one on gauss, as this was just an upgrade of the aws provider and almost no new CRDs were installed.

Controlplane components:
10-0-5-110 & 10-0-5-49 & 10-0-5-178 all running on m5.xlarge (4Cpu & 16 GiB)

Impact on basic resources

CPU and Memory exhaustion on all controlplane nodes:
[screenshots]

Impact on api-server

[screenshot]
After the scale-up of the controlplane, memory consumption went back to the "pre-upgrade" level of around 7GiB.

Overall etcd performance severely degraded here as well; this was fixed by switching to instances with more memory:
[screenshot]

Interestingly, the etcd key entries spiked heavily:
[screenshot]

Could be related to a bit of rescheduling/eviction on the controlplane:

  Normal   NodeHasInsufficientMemory  9m18s (x31 over 20d)   kubelet          Node ip-10-0-5-49.eu-central-1.compute.internal status is now: NodeHasInsufficientMemory
  Normal   NodeHasInsufficientMemory  6m14s (x21 over 8d)    kubelet          Node ip-10-0-5-110.eu-central-1.compute.internal status is now: NodeHasInsufficientMemory
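
These memory-pressure events can also be pulled up across all nodes with standard kubectl (a sketch, not from the original investigation):

  kubectl get events -A --field-selector reason=NodeHasInsufficientMemory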

Recovery

Moving the ASG to r6i.xlarge (4Cpu & 32GiB) stabilized the cluster again at around 15:00

new controlplane:
10-0-5-182 & 10-0-5-100 & 10-0-5-12

Both providers are currently running:

  • crossplane-provider-aws:v0.37.1
  • upbound-provider-aws:v0.30.0

anaconda

Is running upbound-provider-aws:v0.30.0 on k8s v1.24.10 m5.xlarge.

No metrics are available from the time of the install (jan 24th). Currently anaconda seems to run fine.

API servers consume about 7GiB of memory on rather full nodes without apparent impact. API Request Duration is consistently <1s (2d window).

grizzly

Is running upbound-provider-aws:v0.30.0 on k8s v1.23.16 on m5.xlarge and is perceived as slow in manual api requests handling CRDs (k get crds). As it is a pretty new setup, etcd is running as a static pod with currently around 1GiB memory usage; the api-server remains at around 8-9GiB. During my time using the cluster, k9s was responding fast.
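
The perceived slowness of k get crds can be put into numbers with a simple timing (a minimal sketch):

  time kubectl get crds > /dev/null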

[screenshot]

On the Machines:

# controlplane nodes (number of pods; cpu; mem)
ip-10-0-46-43.eu-west-2.compute.internal 17        1246         8515
ip-10-0-123-25.eu-west-2.compute.internal 16         993         8191
ip-10-0-17-82.eu-west-2.compute.internal 18        1107         7759

ip-10-0-46-43 $ free -m
              total        used        free      shared  buff/cache   available
Mem:          15537        9673         632          39        5230        5497

ip-10-0-46-43 $ sudo systemd-cgtop -n 1 -m # only the top 2; memory in the 3rd column
system.slice/containerd.service                                                497  368.6    13.0G
system.slice/containerd.service/kubepods-burstable-kube-apiserver              18  346.8     9.8G
system.slice/containerd.service/kubepods-burstable-etcd                        14   12.0     1.1G

nce commented Mar 15, 2023

Investigation around deleting and blocking CRDs

As we're impacted by the amount of (unnecessary) CRDs, we wanted to test the manual deletion of CRDs and evaluate its impact.

(gs-grizzly) ~ ❯❯❯ k get crd | wc -l
    1001
(gs-grizzly) ~ ❯❯❯ k get crds | grep upbound | awk '{print $1}' |  wc -l
     873
(gs-grizzly) ~ ❯❯❯ k delete crd $(k get crds | grep upbound | awk '{print $1}' | head -n 500)
  ...
  customresourcedefinition.apiextensions.k8s.io "originaccesscontrols.cloudfront.aws.upbound.io" deleted
(gs-grizzly) ~ ❯❯❯ k get crd | wc -l                                                                                                                                                                                     
     501
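
For reference, a quick way to see which API group domains dominate the CRD count (a sketch using standard kubectl/awk, grouping by the trailing domain of each CRD name):

  kubectl get crds | awk 'NR>1 {n=split($1,a,"."); print a[n-2]"."a[n-1]"."a[n]}' | sort | uniq -c | sort -rn | head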

Apart from a massive logging increase of the upbound provider, we noticed nothing unusual on the provider itself (like CPU/mem spikes):

E0315 08:00:42.279958       1 reflector.go:140] k8s.io/client-go@v0.25.0/tools/cache/reflector.go:169: Failed to watch *v1beta1.Cluster: failed to list *v1beta1.Cluster: the server could not find the requested resource (get clusters.dax.aws.upbound.io)
W0315 08:00:42.315298       1 reflector.go:424] k8s.io/client-go@v0.25.0/tools/cache/reflector.go:169: failed to list *v1beta1.Analyzer: the server could not find the requested resource (get analyzers.accessanalyzer.aws.upbound.io)
E0315 08:00:42.315323       1 reflector.go:140] k8s.io/client-go@v0.25.0/tools/cache/reflector.go:169: Failed to watch *v1beta1.Analyzer: failed to list *v1beta1.Analyzer: the server could not find the requested resource (get analyzers.accessanalyzer.aws.upbound.io)

As anticipated, about 20min later the provider added all missing CRDs again (without a pod restart).
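
The re-creation can be followed with a simple loop like this (a sketch):

  while true; do echo "$(date +%H:%M) CRDs: $(kubectl get crds | grep -c upbound)"; sleep 60; done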

Memory consumption of the api-server decreased as expected (then increased again):
[screenshot]
There was no real impact (besides a short spike) on CPU, though.

9:51 am CRDs: 1001
system.slice/containerd.service/kubepods-burstable-poddbd36bd9805a5837721c23d7bc568b50.slice:cri-containerd:92c0ded4283ac455e3169362db95f16be262eb68f33a78753eaccd138cd96888      18  357.2     9.6G

10:00am CRDs: 501
system.slice/containerd.service/kubepods-burstable-poddbd36bd9805a5837721c23d7bc568b50.slice:cri-containerd:92c0ded4283ac455e3169362db95f16be262eb68f33a78753eaccd138cd96888      18    7.7     7.7G        -        -

10:14am CRDs: 1001
system.slice/containerd.service/kubepods-burstable-poddbd36bd9805a5837721c23d7bc568b50.slice:cri-containerd:92c0ded4283ac455e3169362db95f16be262eb68f33a78753eaccd138cd96888      18    8.1     8.8G        -        -

On a side note, this was observed as well:
[screenshot]

Decision

As interesting as this was, we decided not to move forward with this approach:

  • unsure if our current state of kyverno is up to the task
  • very hacky solution
  • we decided to wait for upstream tackling this problem (see above post)

negz commented Mar 16, 2023

I noticed you're mostly testing with old-ish versions of Kubernetes. If you can, it does help to run Kubernetes v1.26. Each recent API server release has had a few small fixes to improve resource utilization in the face of many CRDs. This won't be enough to provide a good experience (we still need to reduce the number of CRDs Crossplane installs), but it should at least shave a bit of CPU and memory usage (I want to say at least a GB) from what you're seeing.

gianfranco-l changed the title from "SRE Investigation around Crossplane" to "SRE Investigation on Crossplane: found mitigation actions" Mar 16, 2023