This repository has been archived by the owner on Jun 26, 2023. It is now read-only.

HNC: if (conversion?) webhooks get into a bad state, cert-rotator can no longer update their secrets #1275

Closed
adrianludwin opened this issue Nov 17, 2020 · 13 comments

@adrianludwin
Contributor

We've had external reports (#1255 - @Ledroid) that you can get TLS handshake errors (due to "signed by an unknown authority") if HNC gets into a sufficiently bad state, and @GinnyJI has seen this happen once. But we have no idea how to reproduce or fix it, short of fully uninstalling HNC. However, once we're in this bad state, we can't uninstall the CRDs either: their conversion webhooks don't work, so (I think) K8s can't list their resources and delete them.

The workaround I've found is to manually edit all the CRDs to change the conversion strategy from Webhook to None; this lets the CRDs be deleted. I have no idea what the None strategy means for a multi-version CRD, but it seems to be good enough to allow deletion.
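The manual edit above can be scripted. A sketch, assuming the CRD names from a typical HNC install (verify with `kubectl get crd | grep hnc`); note that the merge patch must also null out the webhook client config, since a CRD with strategy None may not carry one:

```shell
# Assumed HNC CRD names; adjust to whatever `kubectl get crd | grep hnc` shows.
for crd in hierarchyconfigurations.hnc.x-k8s.io \
           hncconfigurations.hnc.x-k8s.io \
           subnamespaceanchors.hnc.x-k8s.io; do
  # Switch conversion from Webhook to None and drop the webhook client config,
  # so the API server can read (and therefore delete) the stored objects.
  kubectl patch crd "$crd" --type=merge \
    -p '{"spec":{"conversion":{"strategy":"None","webhook":null}}}'
done
```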

We tried deleting and recreating the secret and reinstalling the pod, hoping that cert-rotator would create a brand new set of certificates and install them in all the (conversion?) webhooks. But this didn't happen and I have no idea why.
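Roughly what that attempt looks like; the secret name and pod label below are assumptions from a default HNC install and may differ in yours:

```shell
# Delete the webhook cert secret and bounce the manager pod, hoping
# cert-rotator regenerates the certificates and re-injects the CA bundle
# into the webhook configurations. Names assumed from a default install.
kubectl delete secret hnc-webhook-server-cert -n hnc-system
kubectl delete pod -n hnc-system -l control-plane=controller-manager
```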

Our next steps are to try to figure out how to induce this problem so we can fix it.

/assign @adrianludwin
/cc @GinnyJI

@GinnyJI
Contributor

GinnyJI commented Nov 24, 2020

I recently reproduced this problem with the following steps:

  1. deploy the config-sync operator
  2. enable Hierarchy Controller
  3. enable pod label trees

Verify the problem has occurred:

$ k hns set a -a
Allowing cascading deletion on 'a'

Could not update the hierarchical configuration of a.
Reason: the server could not find the requested resource (post hierarchyconfigurations.hnc.x-k8s.io hierarchy)

Note: I didn't have HNC deployed when reproducing this problem (but I don't think it matters, because the first time I hit this problem I did have HNC deployed)

@adrianludwin
Contributor Author

I just tried this several times on both GKE 1.17 and GKE 1.18 but I was never able to reproduce it - everything seems to work perfectly for me. I'm not sure why several other people can reproduce this but not me :/

@adrianludwin
Contributor Author

I don't think this can block v0.7 since we don't know how to reproduce it yet.

@ledroide
Copy link

Hello @adrianludwin. After issue #1255 I was able to upgrade HNC to v0.7.0. However, some time later, I started to notice weird behaviors in my cluster, mostly kube-controller-manager malfunctioning in its garbage collector.

Some symptoms :

  • when deleting Deployments, DaemonSets, or StatefulSets, the ReplicaSets under their control are not deleted as expected
  • when manually deleting ReplicaSets, many of the pods they control keep running (whatever their state, ready or not), which means pods are running without any ReplicaSet
  • resource quotas can't be updated, new resources can be refused because of old quotas that don't exist anymore, and quota statuses (hard/used) are not displayed at all
  • deleting and re-creating namespaces does not help
  • namespaces are not properly deleted and keep some old resources when re-created
  • some resources (like role bindings) are, however, actually taken into account when applied and created (or modified)

I opened issue #98071 about what I thought was a bug in kube-controller-manager. As you can guess, the problem was not actually there.
We have found these logs in kube-controller-manager :

I0121 11:15:54.467098       1 shared_informer.go:240] Waiting for caches to sync for garbage collector
I0121 11:16:39.867524       1 graph_builder.go:272] garbage controller monitor not yet synced: hnc.x-k8s.io/v1alpha2, Resource=hierarchyconfigurations

I had the same difficulties as in #1255 removing HNC, with resources that would not actually delete, so I had to edit them and delete their finalizers.
Finally, HNC is completely removed, and the garbage collector is back to life.
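For anyone hitting the same wall, clearing finalizers can be done with a patch; the object and namespace names here are purely illustrative:

```shell
# Remove the finalizers from a stuck HNC object so its deletion can complete.
# "hierarchy" / "my-namespace" are placeholders; substitute your own.
kubectl patch hierarchyconfigurations.hnc.x-k8s.io hierarchy -n my-namespace \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```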

I think the root cause is part of HNC validatingwebhookconfigurations or customresourcedefinitions resources.

@adrianludwin
Contributor Author

HNC doesn't put any finalizers on pods, replicasets, etc. So if kube-controller-manager is failing for all resource types just because something's gone wrong with HNC-specific types, that sounds like a bug in kube-controller-manager that was just triggered by HNC. (To be clear, there's also a bug in HNC, but that shouldn't take down unrelated functionality in the cluster).

/cc @liggitt - do you agree with this?

@liggitt

liggitt commented Jan 21, 2021

Some controllers (like GC) require successful list/watch for all known resources to safely proceed deleting things, because ownerReferences can span types.

Unavailability of a conversion webhook on a CRD prevents reads, which prevents list/watch from succeeding.

Additionally, before a CRD is deleted, all CR instances must be deleted. This requires reading them from etcd, and if they have finalizers, setting deletionTimestamp and persisting back to etcd to wait for finalizers to act. An unavailable conversion webhook can disrupt this as well.
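One way to spot CRDs whose conversion webhook could be blocking GC (a diagnostic sketch, assuming jq is available):

```shell
# List CRDs that still use a conversion webhook; if the pod serving that
# webhook is down, reads of those resources (and hence GC list/watch) fail.
kubectl get crd -o json | jq -r \
  '.items[] | select(.spec.conversion.strategy == "Webhook") | .metadata.name'
```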

@liggitt

liggitt commented Jan 21, 2021

cc @deads2k

@ledroide

@adrianludwin: You should have a look at what @pacoxu (a Kubernetes developer working on kube-controller-manager) answered today after testing HNC on his side:

After some investigation, HNC's webhook will block when the HNC controller has some problems.
My namespace cannot be created or deleted after the webhook was created.
The cluster resumed after I delete the HNC controller webhook.

@pacoxu

pacoxu commented Jan 22, 2021

In my case, HNC controller doesn't work because of Calico network issue.

  • My namespace cannot be created or deleted after the webhook was created.
[root@daocloud ~]# kubectl delete ns paco1 paco2 paco3
Error from server (Timeout): Timeout: request did not complete within requested timeout 34s
Error from server (InternalError): Internal error occurred: failed calling webhook "namespaces.hnc.x-k8s.io": Post "https://hnc-webhook-service.hnc-system.svc:443/validate-v1-namespace?timeout=19s": dial tcp 10.110.222.59:443: connect: no route to host

The cluster resumed after I delete the HNC controller webhook.

During the issue, other controllers like tigera-operator encounter list-watch blocking issues like

      message: 'Internal error occurred: failed calling webhook "namespaces.hnc.x-k8s.io":
        Post "https://hnc-webhook-service.hnc-system.svc:443/validate-v1-namespace?timeout=30s":
        dial tcp 10.110.222.59:443: connect: connection refused'

@ledroide

In my case, HNC controller doesn't work because of Calico network issue.

I also use Calico. However, I did not notice the "no route to host" log (although I was flooded by logs and could have missed it).

@adrianludwin
Copy link
Contributor Author

Sorry I haven't replied here for a while.

Yes, HNC puts validating admission webhooks on namespaces, so if it goes down, namespace creation and updates will break. Sadly, this is unavoidable given that one of HNC's goals is to control the labels on namespaces (and to prevent you from shooting yourself in the foot, e.g. by deleting a namespace with subnamespaces). You can delete the validating webhook, which will stop the problem, but it will also mean that the labels become untrusted again and foot-shooting becomes much easier.
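The escape hatch looks like this; the configuration name below is what a default HNC install uses, so verify it first:

```shell
# Check the actual name, then delete HNC's validating webhook configuration.
# This unblocks namespace operations at the cost of HNC's label guarantees.
kubectl get validatingwebhookconfigurations
kubectl delete validatingwebhookconfiguration hnc-validating-webhook-configuration
```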

All we can do to solve this problem is reduce the number of bugs in HNC. If the problem is caused by Calico, there's not much we can do about that, except try to document how to fix the problem when it's HNC-specific (although we don't do anything fancier than set up webhooks).

As for the original problem about conversion webhooks: this is a serious issue and sadly, it seems to be a) in core K8s and b) unfixable. The latest version of HNC (v0.7.0) does not contain any conversion webhooks so hopefully this particular issue should be solved, at least until the next time we need to do a conversion. Perhaps when that time comes, we can write a separate deployment that manages nothing except the conversion webhooks, so that even if the main pod goes down, it won't take down the conversion webhooks (and hence, apparently, the entire cluster) with it.

@ledroide, @pacoxu - does this sound reasonable? Thanks!

@adrianludwin
Contributor Author

I'm going to close this issue, since we no longer have conversion webhooks and the other problem is expected, but please feel free to reopen if you disagree! Thanks.

/close

@k8s-ci-robot
Contributor

@adrianludwin: Closing this issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
