HNC: if (conversion?) webhooks get into a bad state, cert-rotator can no longer update their secrets #1275
I recently reproduced this problem with the following steps. To verify the problem has occurred:

Note: I didn't have HNC deployed when reproducing this problem (but I don't think it matters, because the first time I hit this problem I did have HNC deployed).
I just tried this several times on both GKE 1.17 and GKE 1.18, but I was never able to reproduce it - everything seems to work perfectly for me. I'm not sure why several other people can reproduce this but not me :/
I don't think this can block v0.7, since we don't know how to reproduce it yet.
Hello @adrianludwin. After issue #1255 I was able to upgrade HNC to v0.7.0. However, some time later, I started to see weird behavior in my cluster, mostly kube-controller-manager malfunctioning in its garbage collector. Some symptoms:

I opened issue #98071 about a bug in kube-controller-manager. Actually, as you can guess, the issue was not there.

I had the same difficulties removing HNC (same as #1255), with resources that would not actually delete, so I had to edit them and remove their finalizers. I think the root cause lies in HNC's validatingwebhookconfigurations or customresourcedefinitions resources.
HNC doesn't put any finalizers on pods, replicasets, etc. So if kube-controller-manager is failing for all resource types just because something's gone wrong with HNC-specific types, that sounds like a bug in core Kubernetes. /cc @liggitt - do you agree with this?
Some controllers (like GC) require a successful list/watch of all known resources before they can safely proceed with deletions, because ownerReferences can span types. Unavailability of a conversion webhook on a CRD prevents reads, which prevents list/watch from succeeding.

Additionally, before a CRD is deleted, all CR instances must be deleted. This requires reading them from etcd and, if they have finalizers, setting deletionTimestamp and persisting that back to etcd to wait for the finalizers to act. An unavailable conversion webhook can disrupt this as well.
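To make the failure mode concrete, here is a minimal sketch of a multi-version CRD whose conversion goes through a webhook (all names here are hypothetical, not HNC's actual configuration):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com          # hypothetical CRD, for illustration only
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
  versions:
    - name: v1alpha1
      served: true
      storage: false
      schema:
        openAPIV3Schema:
          type: object
    - name: v1alpha2
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
  conversion:
    strategy: Webhook                # version conversion is delegated to a service
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          name: webhook-service      # hypothetical; if this service is unreachable,
          namespace: webhook-system  # reads of these CRs fail
          path: /convert
```

Whenever a request asks for a version other than the one an object is stored at, the API server must round-trip the object through that service, so a dead webhook pod stalls GC's list/watch and blocks the instance deletion that CRD removal requires.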
cc @deads2k
@adrianludwin: You should have a look at what @pacoxu (a Kubernetes developer working on kube-controller-manager) answered today after testing HNC on his side:
In my case, the HNC controller didn't work because of a Calico network issue. The cluster resumed after I deleted the HNC controller webhook. During the issue, other controllers like tigera-operator encountered list-watch blocking errors like the following:
I also use Calico. However, I did not notice the ...
Sorry I haven't replied here for a while. Yes, HNC puts validating admission controllers on namespaces, so if it goes down, this will break namespace creation and updates. Sadly, this is unavoidable given that one of HNC's goals is to control the labels on namespaces (and prevent you from shooting yourself in the foot, e.g. by deleting a namespace with subnamespaces). You can delete the validating webhook, which will stop the problem, but it will also mean that the labels become untrusted again and foot-shooting will become much easier. All we can do to solve this problem is reduce the number of bugs in HNC. If the problem is caused by Calico, there's not much we can do about that except try to document how to fix the problem if the problem's HNC-specific (although we don't do anything fancier than set up webhooks).

As for the original problem about conversion webhooks: this is a serious issue and sadly, it seems to be a) in core K8s and b) unfixable. The latest version of HNC (v0.7.0) does not contain any conversion webhooks, so hopefully this particular issue should be solved, at least until the next time we need to do a conversion. Perhaps when that time comes, we can write a separate deployment that manages nothing except the conversion webhooks, so that even if the main pod goes down, it won't take down the conversion webhooks (and hence, apparently, the entire cluster) with it.
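To make the first point above concrete, a namespace-guarding webhook registration looks roughly like this (a sketch with illustrative names; check `kubectl get validatingwebhookconfigurations` for the real ones in your install):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: hnc-validating-webhook-configuration   # name may differ per HNC release
webhooks:
  - name: namespaces.hnc.x-k8s.io              # illustrative webhook name
    failurePolicy: Fail     # while the serving pod is down, namespace writes are rejected
    sideEffects: None
    admissionReviewVersions: ["v1"]
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE", "DELETE"]
        resources: ["namespaces"]
    clientConfig:
      service:
        name: hnc-webhook-service              # illustrative service name
        namespace: hnc-system
        path: /validate-v1-namespace
```

Deleting that object (e.g. `kubectl delete validatingwebhookconfiguration hnc-validating-webhook-configuration`) is what unblocks namespace operations, at the cost of the label protection described above.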
I'm going to close this issue, since we no longer have conversion webhooks and the other problem is expected, but please feel free to reopen if you disagree! Thanks.

/close
@adrianludwin: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
We've had external reports (#1255, from @Ledroid) that you can get TLS handshake errors (due to "signed by an unknown authority") if HNC gets into a sufficiently bad state, and @GinnyJI has seen this happen once. But we have no idea how to reproduce or fix it, short of fully uninstalling HNC. However, when we're in a bad state, we can't uninstall the CRDs, because their conversion webhooks don't work, and so (I think) K8s can't list their resources and delete them.
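For anyone debugging this: the "unknown authority" error suggests the CA bundle registered on the webhook no longer matches the certificate the pod is serving. For a conversion webhook, that bundle lives in the CRD itself (a sketch; field names are per apiextensions.k8s.io/v1, the service names are hypothetical):

```yaml
spec:
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        caBundle: LS0tLS1CRUdJTi...   # base64 CA; must match the CA that signed
                                      # the serving cert in the webhook's secret
        service:
          name: webhook-service       # hypothetical service name
          namespace: webhook-system
          path: /convert
```

If cert-rotator regenerates the secret but never rewrites this `caBundle`, every conversion call will keep failing the TLS handshake.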
The workaround I've found is to manually edit all the CRDs to change the conversion strategy from `Webhook` to `None`; this lets the CRDs be deleted. I have no idea what the `None` strategy means in a multi-version CRD, but it seems to be good enough to allow deletion (a sketch of the edit is below).

We tried deleting and recreating the secret and reinstalling the pod, hoping that cert-rotator would create a brand new set of certificates and install them in all the (conversion?) webhooks. But this didn't happen, and I have no idea why.
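For reference, the manual edit amounts to replacing the conversion stanza in each affected CRD (e.g. via `kubectl edit crd <name>`) with something like:

```yaml
spec:
  conversion:
    strategy: None   # no webhook call; objects are served as stored,
                     # with only the apiVersion field rewritten
```

As far as I understand it, `None` just serves whatever representation is in etcd, rewriting only the apiVersion field to the requested version - which is evidently enough for the API server to list and delete the instances.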
Our next steps are to try to figure out how to induce this problem so we can fix it.
/assign @adrianludwin
/cc @GinnyJI