CRDs cause the whole K8s cluster not to work properly #906

Closed
colinlabs opened this issue Jun 2, 2022 · 6 comments

@colinlabs
Contributor

colinlabs commented Jun 2, 2022

k8s version: eks 1.22.9/k8s 1.22.9
harbor-operator crd version: 1.2.0

Creating a resource from any harbor-operator CRD (e.g. chartmuseums.goharbor.io) results in a large number of internal-error entries in the kube-controller-manager log, and resources across the entire cluster can no longer be deleted normally. For example: create a deployment, then delete it, and you will find that the associated ReplicaSet and Pod are not deleted.

like this:

$ kubectl create deployment myapp --image nginx:alpine
deployment.apps/myapp created
$ kubectl get po
NAME                     READY   STATUS    RESTARTS   AGE
myapp-7c5c94c888-wwz4k   1/1     Running   0          6s
$ kubectl delete deploy myapp
deployment.apps "myapp" deleted
$ kubectl get po
NAME                     READY   STATUS    RESTARTS   AGE
myapp-7c5c94c888-wwz4k   1/1     Running   0          14s
$ kubectl get rs
NAME               DESIRED   CURRENT   READY   AGE
myapp-7c5c94c888   1         1         1       14s

kube-controller-manager logs:

I0602 14:23:45.550212      11 resource_quota_controller.go:439] syncing resource quota controller with updated resources from discovery: added: [goharbor.io/v1beta1, Resource=chartmuseums], removed: []
I0602 14:23:45.550303      11 resource_quota_monitor.go:229] QuotaMonitor created object count evaluator for chartmuseums.goharbor.io
I0602 14:23:45.550333      11 shared_informer.go:240] Waiting for caches to sync for resource quota
E0602 14:23:45.598174      11 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: error resolving resource
I0602 14:23:45.910829      11 garbagecollector.go:213] syncing garbage collector with updated resources from discovery (attempt 1): added: [goharbor.io/v1beta1, Resource=chartmuseums], removed: []
I0602 14:23:45.915794      11 shared_informer.go:240] Waiting for caches to sync for garbage collector
E0602 14:23:47.040663      11 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: error resolving resource
E0602 14:23:48.874719      11 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: error resolving resource
E0602 14:23:54.341724      11 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: error resolving resource
E0602 14:24:03.826977      11 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: error resolving resource
I0602 14:24:15.550886      11 shared_informer.go:266] stop requested
E0602 14:24:15.550911      11 shared_informer.go:243] unable to sync caches for resource quota
E0602 14:24:15.550921      11 resource_quota_controller.go:452] timed out waiting for quota monitor sync
I0602 14:24:15.916805      11 shared_informer.go:266] stop requested
E0602 14:24:15.916826      11 shared_informer.go:243] unable to sync caches for garbage collector
E0602 14:24:15.916836      11 garbagecollector.go:242] timed out waiting for dependency graph builder sync during GC sync (attempt 1)
I0602 14:24:16.033309      11 garbagecollector.go:213] syncing garbage collector with updated resources from discovery (attempt 2): added: [goharbor.io/v1beta1, Resource=chartmuseums], removed: []
I0602 14:24:16.033381      11 shared_informer.go:240] Waiting for caches to sync for garbage collector
E0602 14:24:17.006428      11 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: error resolving resource
I0602 14:24:45.560876      11 resource_quota_controller.go:439] syncing resource quota controller with updated resources from discovery: added: [goharbor.io/v1beta1, Resource=chartmuseums], removed: []

Another thing: I found that there are a lot of caBundle: Cg== key-values in the CRD resources. After a CRD is created, this value is not replaced with something like caBundle: "Ci0tLS0tQk... < base64-encoded PEM bundle > .tLS0K". I don't know whether it has any effect on this.
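For reference, a quick way to see what the conversion webhook caBundle of a CRD currently holds (a sketch, assuming the chartmuseums CRD is configured with a conversion webhook as in the harbor-operator manifests):

$ kubectl get crd chartmuseums.goharbor.io \
    -o jsonpath='{.spec.conversion.webhook.clientConfig.caBundle}'
Cg==
$ echo 'Cg==' | base64 -d | wc -c
1

Cg== is just a base64-encoded newline, i.e. an effectively empty bundle.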

reproduce:

  1. an EKS 1.22 cluster environment
  2. create the CRD chartmuseums.goharbor.io
  3. kubectl create deployment myapp --image nginx:alpine
  4. kubectl delete deployment myapp
  5. kubectl get deploy,replicasets,pod | grep myapp
  6. result: the ReplicaSet and Pod cannot be deleted; kube-controller-manager logs many Internal error occurred messages (see the quick checks sketched after this list)
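Some quick checks that help confirm the symptom (a sketch; the last command assumes a self-managed or kind cluster, since on EKS the kube-controller-manager logs are only available through CloudWatch control-plane logging):

$ kubectl get --raw /apis/goharbor.io/v1beta1 | head
$ kubectl get chartmuseums.goharbor.io -A
$ kubectl -n kube-system logs -l component=kube-controller-manager --tail=100 | grep 'Internal error'
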
@cndoit18
Collaborator

cndoit18 commented Jun 2, 2022

Hi, did you install cert-manager?

@colinlabs
Contributor Author

Hi, did you install cert-manager?

Yeah, sure.

@colinlabs
Contributor Author

I used kind to create a k8s 1.22.9 cluster, and it reproduces the same issue.
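For anyone trying to reproduce it the same way, roughly (the node image tag is an assumption based on the published kind node images for 1.22.9):

$ kind create cluster --name harbor-repro --image kindest/node:v1.22.9
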

@bitsf
Collaborator

bitsf commented Jun 7, 2022

I couldn't reproduce this with kind 1.22.0 and EKS 1.22.6.
My steps were:

  1. kubectl apply -f "https://github.com/jetstack/cert-manager/releases/download/v1.6.1/cert-manager.yaml"
  2. kubectl apply -f "https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.0.5/deploy/static/provider/kind/deploy.yaml"
  3. kubectl apply -f manifests/cluster/deployment.yaml
  4. kubectl apply -f manifests/samples/full_stack.yaml
     change DNS
  5. kubectl create deployment myapp --image nginx:alpine
  6. kubectl delete deployment myapp

@colinlabs
Contributor Author

After careful investigation: during the installation of cert-manager or the harbor-operator CRDs, some resources were not installed properly, so the caBundle: Cg== placeholder in the CRD was never replaced by cert-manager with a valid certificate, and that in turn broke the kube-controller-manager. After reinstalling cert-manager and harbor-operator cleanly, everything works properly. The root cause may need a fix upstream in Kubernetes: a CRD left with caBundle: Cg== can take down the whole cluster.
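To verify that the reinstall actually took effect, one possible check (a sketch; cert-manager.io/inject-ca-from is cert-manager's standard CA-injection annotation, and whether harbor-operator sets it on this CRD is an assumption on my side):

# should print a <namespace>/<certificate> reference if cainjector is wired up
$ kubectl get crd chartmuseums.goharbor.io \
    -o jsonpath='{.metadata.annotations.cert-manager\.io/inject-ca-from}'
# and the conversion webhook caBundle should no longer be the empty Cg== value
$ kubectl get crd chartmuseums.goharbor.io \
    -o jsonpath='{.spec.conversion.webhook.clientConfig.caBundle}' | head -c 16
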

cndoit18 added the blocked-by-upstream and kind/question (Further information is requested) labels, then removed blocked-by-upstream, on Jun 10, 2022
bitsf closed this as completed on Jun 13, 2022
@cvsgm

cvsgm commented Mar 2, 2023

We encountered the same issue and resolved it with AWS support. I think it is not related to cert-manager.
As long as you have a CRD with caBundle: Cg==, it can trip up the garbage collector, which in turn pauses reconciliation of all EKS resources.
We fixed it by identifying which CRD had caBundle: Cg== and removing that field.
A simple way to check whether your cluster might have this issue on EKS 1.22 is to run kubectl get crd -o yaml | grep "caBundle: Cg==" -B 30
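For reference, one way such a field could be dropped (a sketch only; the JSON Patch path assumes the empty caBundle sits under the CRD's conversion webhook clientConfig, so inspect the CRD first and adjust the path, and <your-crd-name> is a placeholder):

$ kubectl patch crd <your-crd-name> --type=json \
    -p='[{"op":"remove","path":"/spec/conversion/webhook/clientConfig/caBundle"}]'
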
