This repository has been archived by the owner on Jun 26, 2023. It is now read-only.

HNC: if (conversion?) webhooks get into a bad state, cert-rotator can no longer update their secrets #1275

Closed
adrianludwin opened this issue Nov 17, 2020 · 13 comments

@adrianludwin
Contributor

We've had external reports (#1255 - @Ledroid) that you can get TLS handshake errors (due to "signed by an unknown authority") if HNC gets into a sufficiently bad state, and @GinnyJI has seen this happen once. But we have no idea how to reproduce or fix it, short of fully uninstalling HNC. However, once we're in this bad state, we can't uninstall the CRDs either: their conversion webhooks don't work, so (I think) K8s can't list their resources and delete them.

The workaround I've found is to manually edit all the CRDs to change the conversion strategy from Webhook to None; this lets the CRDs be deleted. I have no idea what the None strategy means for a multi-version CRD, but it seems to be good enough to allow deletion.
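The manual edit above can be scripted. A sketch, assuming the CRD names from a typical HNC install (verify with `kubectl get crd | grep hnc`); note that the merge patch must also null out the webhook client config, since a CRD with strategy None may not carry one:

```shell
# Assumed HNC CRD names; adjust to whatever `kubectl get crd | grep hnc` shows.
for crd in hierarchyconfigurations.hnc.x-k8s.io \
           hncconfigurations.hnc.x-k8s.io \
           subnamespaceanchors.hnc.x-k8s.io; do
  # Switch conversion from Webhook to None and drop the webhook client config,
  # so the API server can read (and therefore delete) the stored objects.
  kubectl patch crd "$crd" --type=merge \
    -p '{"spec":{"conversion":{"strategy":"None","webhook":null}}}'
done
```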

We tried deleting and recreating the secret and reinstalling the pod, hoping that cert-rotator would create a brand new set of certificates and install them in all the (conversion?) webhooks. But this didn't happen and I have no idea why.
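Roughly what that attempt looks like; the secret name and pod label below are assumptions from a default HNC install and may differ in yours:

```shell
# Delete the webhook cert secret and bounce the manager pod, hoping
# cert-rotator regenerates the certificates and re-injects the CA bundle
# into the webhook configurations. Names assumed from a default install.
kubectl delete secret hnc-webhook-server-cert -n hnc-system
kubectl delete pod -n hnc-system -l control-plane=controller-manager
```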

Our next steps are to try to figure out how to induce this problem so we can fix it.

/assign @adrianludwin
/cc @GinnyJI

@GinnyJI
Contributor

GinnyJI commented Nov 24, 2020

I recently reproduced this problem with the following steps:

  1. deploy the config-sync operator
  2. enable Hierarchy Controller
  3. enable pod label trees

Verify the problem has occurred:

$ k hns set a -a
Allowing cascading deletion on 'a'

Could not update the hierarchical configuration of a.
Reason: the server could not find the requested resource (post hierarchyconfigurations.hnc.x-k8s.io hierarchy)

Note: I didn't have HNC deployed when reproducing this problem (but I don't think it matters, because the first time I hit this problem I did have HNC deployed)

@adrianludwin
Contributor Author

I just tried this several times on both GKE 1.17 and GKE 1.18 but I was never able to reproduce it - everything seems to work perfectly for me. I'm not sure why several other people can reproduce this but not me :/

@adrianludwin
Contributor Author

I don't think this can block v0.7 since we don't know how to reproduce it yet.

@ledroide
Copy link

Hello @adrianludwin. After issue #1255 I was able to upgrade HNC to v0.7.0. However, some time later, I started to notice weird behaviors in my cluster, mostly kube-controller-manager malfunctioning in its garbage collector.

Some symptoms :

  • when deleting Deployments, DaemonSets, or StatefulSets, the ReplicaSets under their control are not deleted as expected
  • when manually deleting ReplicaSets, many of the pods they control keep running (whatever their state, ready or not), which means pods are running without any ReplicaSet
  • resource quotas can't be updated, new resources can be refused because of old quotas that don't exist anymore, and quota statuses (hard/used) are not displayed at all
  • deleting and re-creating namespaces does not help
  • namespaces are not properly deleted and keep some old resources when re-created
  • some resources (like role bindings) are, however, actually taken into account when applied and created (or modified)

I opened issue #98071 about what I thought was a bug in kube-controller-manager. As you can guess, the problem was not actually there.
We have found these logs in kube-controller-manager :

I0121 11:15:54.467098       1 shared_informer.go:240] Waiting for caches to sync for garbage collector
I0121 11:16:39.867524       1 graph_builder.go:272] garbage controller monitor not yet synced: hnc.x-k8s.io/v1alpha2, Resource=hierarchyconfigurations

I had the same difficulties as in #1255 removing HNC, with resources that would not actually delete, so I had to edit them and delete their finalizers.
Finally, HNC is completely removed, and the garbage collector is back to life.
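For anyone hitting the same wall, clearing finalizers can be done with a patch; the object and namespace names here are purely illustrative:

```shell
# Remove the finalizers from a stuck HNC object so its deletion can complete.
# "hierarchy" / "my-namespace" are placeholders; substitute your own.
kubectl patch hierarchyconfigurations.hnc.x-k8s.io hierarchy -n my-namespace \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```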

I think the root cause is part of HNC validatingwebhookconfigurations or customresourcedefinitions resources.

@adrianludwin
Contributor Author

HNC doesn't put any finalizers on pods, replicasets, etc. So if kube-controller-manager is failing for all resource types just because something's gone wrong with HNC-specific types, that sounds like a bug in kube-controller-manager that was just triggered by HNC. (To be clear, there's also a bug in HNC, but that shouldn't take down unrelated functionality in the cluster).

/cc @liggitt - do you agree with this?

@liggitt

liggitt commented Jan 21, 2021

Some controllers (like GC) require successful list/watch for all known resources to safely proceed deleting things, because ownerReferences can span types.

Unavailability of a conversion webhook on a CRD prevents reads, which prevents list/watch from succeeding.

Additionally, before a CRD is deleted, all CR instances must be deleted. This requires reading them from etcd, and if they have finalizers, setting deletionTimestamp and persisting back to etcd to wait for finalizers to act. An unavailable conversion webhook can disrupt this as well.
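One way to spot CRDs whose conversion webhook could be blocking GC (a diagnostic sketch, assuming jq is available):

```shell
# List CRDs that still use a conversion webhook; if the pod serving that
# webhook is down, reads of those resources (and hence GC list/watch) fail.
kubectl get crd -o json | jq -r \
  '.items[] | select(.spec.conversion.strategy == "Webhook") | .metadata.name'
```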

@liggitt

liggitt commented Jan 21, 2021

cc @deads2k

@ledroide

@adrianludwin: You should have a look at what @pacoxu (a Kubernetes developer working on kube-controller-manager) answered today after testing HNC on his side:

After some investigation, HNC's webhook will block when the HNC controller has some problems.
My namespace cannot be created or deleted after the webhook was created.
The cluster resumed after I delete the HNC controller webhook.

@pacoxu

pacoxu commented Jan 22, 2021

In my case, HNC controller doesn't work because of Calico network issue.

  • My namespace cannot be created or deleted after the webhook was created.
[root@daocloud ~]# kubectl delete ns paco1 paco2 paco3
Error from server (Timeout): Timeout: request did not complete within requested timeout 34s
Error from server (InternalError): Internal error occurred: failed calling webhook "namespaces.hnc.x-k8s.io": Post "https://hnc-webhook-service.hnc-system.svc:443/validate-v1-namespace?timeout=19s": dial tcp 10.110.222.59:443: connect: no route to host

The cluster resumed after I delete the HNC controller webhook.

During the issue, other controllers like tigera-operator encounter list-watch blocking issues like

      message: 'Internal error occurred: failed calling webhook "namespaces.hnc.x-k8s.io":
        Post "https://hnc-webhook-service.hnc-system.svc:443/validate-v1-namespace?timeout=30s":
        dial tcp 10.110.222.59:443: connect: connection refused'

@ledroide

In my case, HNC controller doesn't work because of Calico network issue.

I also use Calico. However, I did not notice the "no route to host" log (although I was flooded by logs and could have missed it).

@adrianludwin
Copy link
Contributor Author

Sorry I haven't replied here for a while.

Yes, HNC puts validating admission webhooks on namespaces, so if it goes down, namespace creation and updates will break. Sadly, this is unavoidable given that one of HNC's goals is to control the labels on namespaces (and to prevent you from shooting yourself in the foot, e.g. by deleting a namespace with subnamespaces). You can delete the validating webhook, which will stop the problem, but it will also mean that the labels become untrusted again and foot-shooting becomes much easier.
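The escape hatch looks like this; the configuration name below is what a default HNC install uses, so verify it first:

```shell
# Check the actual name, then delete HNC's validating webhook configuration.
# This unblocks namespace operations at the cost of HNC's label guarantees.
kubectl get validatingwebhookconfigurations
kubectl delete validatingwebhookconfiguration hnc-validating-webhook-configuration
```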

All we can do to solve this problem is reduce the number of bugs in HNC. If the problem is caused by Calico, there's not much we can do about that, except try to document how to fix the problem when it's HNC-specific (although we don't do anything fancier than set up webhooks).

As for the original problem about conversion webhooks: this is a serious issue and sadly, it seems to be a) in core K8s and b) unfixable. The latest version of HNC (v0.7.0) does not contain any conversion webhooks so hopefully this particular issue should be solved, at least until the next time we need to do a conversion. Perhaps when that time comes, we can write a separate deployment that manages nothing except the conversion webhooks, so that even if the main pod goes down, it won't take down the conversion webhooks (and hence, apparently, the entire cluster) with it.

@ledroide, @pacoxu - does this sound reasonable? Thanks!

@adrianludwin
Contributor Author

I'm going to close this issue, since we no longer have conversion webhooks and the other problem is expected, but please feel free to reopen if you disagree! Thanks.

/close

@k8s-ci-robot
Contributor

@adrianludwin: Closing this issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
