CRDs cause the whole K8s cluster not to work properly #906

Closed
colinlabs opened this issue Jun 2, 2022 · 6 comments

@colinlabs
Contributor

colinlabs commented Jun 2, 2022

k8s version: eks 1.22.9/k8s 1.22.9
harbor-operator crd version: 1.2.0

Creating a resource from any harbor-operator CRD (e.g. chartmuseums.goharbor.io) results in a large number of internal-error entries in the kube-controller-manager log, and resources across the entire cluster can no longer be deleted normally. For example: create a deployment, then delete it, and you will find that the associated ReplicaSet and Pod are not deleted.

like this:

$ kubectl create deployment myapp --image nginx:alpine
deployment.apps/myapp created
$ kubectl get po
NAME                     READY   STATUS    RESTARTS   AGE
myapp-7c5c94c888-wwz4k   1/1     Running   0          6s
$ kubectl delete deploy myapp
deployment.apps "myapp" deleted
$ kubectl get po
NAME                     READY   STATUS    RESTARTS   AGE
myapp-7c5c94c888-wwz4k   1/1     Running   0          14s
$ kubectl get rs
NAME               DESIRED   CURRENT   READY   AGE
myapp-7c5c94c888   1         1         1       14s

kube-controller-manager logs:

I0602 14:23:45.550212      11 resource_quota_controller.go:439] syncing resource quota controller with updated resources from discovery: added: [goharbor.io/v1beta1, Resource=chartmuseums], removed: []
I0602 14:23:45.550303      11 resource_quota_monitor.go:229] QuotaMonitor created object count evaluator for chartmuseums.goharbor.io
I0602 14:23:45.550333      11 shared_informer.go:240] Waiting for caches to sync for resource quota
E0602 14:23:45.598174      11 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: error resolving resource
I0602 14:23:45.910829      11 garbagecollector.go:213] syncing garbage collector with updated resources from discovery (attempt 1): added: [goharbor.io/v1beta1, Resource=chartmuseums], removed: []
I0602 14:23:45.915794      11 shared_informer.go:240] Waiting for caches to sync for garbage collector
E0602 14:23:47.040663      11 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: error resolving resource
E0602 14:23:48.874719      11 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: error resolving resource
E0602 14:23:54.341724      11 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: error resolving resource
E0602 14:24:03.826977      11 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: error resolving resource
I0602 14:24:15.550886      11 shared_informer.go:266] stop requested
E0602 14:24:15.550911      11 shared_informer.go:243] unable to sync caches for resource quota
E0602 14:24:15.550921      11 resource_quota_controller.go:452] timed out waiting for quota monitor sync
I0602 14:24:15.916805      11 shared_informer.go:266] stop requested
E0602 14:24:15.916826      11 shared_informer.go:243] unable to sync caches for garbage collector
E0602 14:24:15.916836      11 garbagecollector.go:242] timed out waiting for dependency graph builder sync during GC sync (attempt 1)
I0602 14:24:16.033309      11 garbagecollector.go:213] syncing garbage collector with updated resources from discovery (attempt 2): added: [goharbor.io/v1beta1, Resource=chartmuseums], removed: []
I0602 14:24:16.033381      11 shared_informer.go:240] Waiting for caches to sync for garbage collector
E0602 14:24:17.006428      11 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: error resolving resource
I0602 14:24:45.560876      11 resource_quota_controller.go:439] syncing resource quota controller with updated resources from discovery: added: [goharbor.io/v1beta1, Resource=chartmuseums], removed: []

Another thing: I found that there are a lot of caBundle: Cg== key-values in the CRD resources. After a CRD is created, this value is not replaced with something like caBundle: "Ci0tLS0tQk... < base64-encoded PEM bundle > .tLS0K". I don't know whether it has any effect on this.
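For reference, a quick way to see what the conversion webhook caBundle of a CRD currently holds (a sketch, assuming the chartmuseums CRD is configured with a conversion webhook as in the harbor-operator manifests):

$ kubectl get crd chartmuseums.goharbor.io \
    -o jsonpath='{.spec.conversion.webhook.clientConfig.caBundle}'
Cg==
$ echo 'Cg==' | base64 -d | wc -c
1

Cg== is just a base64-encoded newline, i.e. an effectively empty bundle.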

reproduce:

  1. an EKS 1.22 cluster environment
  2. create the CRD chartmuseums.goharbor.io
  3. kubectl create deployment myapp --image nginx:alpine
  4. kubectl delete deployment myapp
  5. kubectl get deploy,replicasets,pod | grep myapp
  6. result: the ReplicaSet and Pod cannot be deleted; kube-controller-manager logs many Internal error occurred messages (see the quick checks sketched after this list)
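Some quick checks that help confirm the symptom (a sketch; the last command assumes a self-managed or kind cluster, since on EKS the kube-controller-manager logs are only available through CloudWatch control-plane logging):

$ kubectl get --raw /apis/goharbor.io/v1beta1 | head
$ kubectl get chartmuseums.goharbor.io -A
$ kubectl -n kube-system logs -l component=kube-controller-manager --tail=100 | grep 'Internal error'
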
@cndoit18
Collaborator

cndoit18 commented Jun 2, 2022

Hi, did you install cert-manager?

@colinlabs
Contributor Author

Hi, did you install cert-manager?

Yeah, sure.

@colinlabs
Contributor Author

I used kind to create a k8s 1.22.9 cluster, and it reproduces the same issue.
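For anyone trying to reproduce it the same way, roughly (the node image tag is an assumption based on the published kind node images for 1.22.9):

$ kind create cluster --name harbor-repro --image kindest/node:v1.22.9
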

@bitsf
Collaborator

bitsf commented Jun 7, 2022

I couldn't reproduce this with kind 1.22.0 and EKS 1.22.6.
My steps were:

  1. kubectl apply -f "https://github.com/jetstack/cert-manager/releases/download/v1.6.1/cert-manager.yaml"
  2. kubectl apply -f "https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.0.5/deploy/static/provider/kind/deploy.yaml"
  3. kubectl apply -f manifests/cluster/deployment.yaml
  4. kubectl apply -f manifests/samples/full_stack.yaml
     change DNS
  5. kubectl create deployment myapp --image nginx:alpine
  6. kubectl delete deployment myapp

@colinlabs
Contributor Author

After careful investigation: during the installation of cert-manager or the harbor-operator CRDs, some resources were not installed properly, so the caBundle: Cg== placeholder in the CRD was never replaced by cert-manager with a valid certificate, and that in turn broke the kube-controller-manager. After reinstalling cert-manager and harbor-operator cleanly, everything works properly. The root cause may need a fix upstream in Kubernetes: a CRD left with caBundle: Cg== can take down the whole cluster.
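To verify that the reinstall actually took effect, one possible check (a sketch; cert-manager.io/inject-ca-from is cert-manager's standard CA-injection annotation, and whether harbor-operator sets it on this CRD is an assumption on my side):

# should print a <namespace>/<certificate> reference if cainjector is wired up
$ kubectl get crd chartmuseums.goharbor.io \
    -o jsonpath='{.metadata.annotations.cert-manager\.io/inject-ca-from}'
# and the conversion webhook caBundle should no longer be the empty Cg== value
$ kubectl get crd chartmuseums.goharbor.io \
    -o jsonpath='{.spec.conversion.webhook.clientConfig.caBundle}' | head -c 16
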

cndoit18 added the blocked-by-upstream and kind/question (Further information is requested) labels, then removed blocked-by-upstream, on Jun 10, 2022
bitsf closed this as completed on Jun 13, 2022
@cvsgm

cvsgm commented Mar 2, 2023

We encountered the same issue and resolved it with AWS support. I think it is not related to cert-manager.
As long as you have a CRD with caBundle: Cg==, it can trip up the garbage collector, which in turn pauses reconciliation of all EKS resources.
We fixed it by identifying which CRD had caBundle: Cg== and removing that field.
A simple way to check whether your cluster might have this issue on EKS 1.22 is to run kubectl get crd -o yaml | grep "caBundle: Cg==" -B 30
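For reference, one way such a field could be dropped (a sketch only; the JSON Patch path assumes the empty caBundle sits under the CRD's conversion webhook clientConfig, so inspect the CRD first and adjust the path, and <your-crd-name> is a placeholder):

$ kubectl patch crd <your-crd-name> --type=json \
    -p='[{"op":"remove","path":"/spec/conversion/webhook/clientConfig/caBundle"}]'
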
