HNC under Kubernetes 1.22 : "error resolving resource" #86

Closed
ledroide opened this issue Sep 29, 2021 · 11 comments · Fixed by #103

@ledroide

After upgrading a cluster from Kubernetes 1.21.5 to 1.22.2,

  • quotas can't be updated or created, namespaces can't be deleted, and the garbage collector fails to finalize resources.
  • HNC doesn't recognize its own resources
$ stern kube-controller-manager -n kube-system --since 15m
kube-controller-manager-kube-dev-master1 kube-controller-manager E0929 09:59:35.967260       1 garbagecollector.go:242] timed out waiting for dependency graph builder sync during GC sync (attempt 207)
kube-controller-manager-kube-dev-master1 kube-controller-manager I0929 09:59:36.068606       1 shared_informer.go:240] Waiting for caches to sync for garbage collector
kube-controller-manager-kube-dev-master1 kube-controller-manager I0929 09:59:46.891909       1 shared_informer.go:240] Waiting for caches to sync for resource quota
kube-controller-manager-kube-dev-master1 kube-controller-manager E0929 10:00:00.762819       1 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: error resolving resource

I have removed all HNC resources, including the CRDs, as explained in kubernetes-retired/multi-tenancy#1255 (comment), then checked that the removal was effective:

$ kubectl api-resources | grep hnc
$ kubectl get namespace hnc-system
Error from server (NotFound): namespaces "hnc-system" not found
$ kubectl get hncconfigurations
error: the server doesn't have a resource type "hncconfigurations"
$ kubectl get subns -A
error: the server doesn't have a resource type "subns"
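
For reference, the removal itself followed the linked comment; roughly like this (a sketch, assuming the original install used the standard hnc-manager.yaml release asset):

$ kubectl delete -f https://github.com/kubernetes-sigs/multi-tenancy/releases/download/hnc-v0.8.0/hnc-manager.yaml
$ kubectl delete crd hierarchyconfigurations.hnc.x-k8s.io hncconfigurations.hnc.x-k8s.io subnamespaceanchors.hnc.x-k8s.io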

Then I reinstalled the HNC controller v0.8.0 + HNCConfiguration as documented (standard manifests). Everything looks good.
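
The documented reinstall is essentially a single manifest apply (the URL below is an assumption based on the standard hnc-v0.8.0 release layout; the HNCConfiguration itself came from a local file):

$ kubectl apply -f https://github.com/kubernetes-sigs/multi-tenancy/releases/download/hnc-v0.8.0/hnc-manager.yaml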

$ kubectl get deploy -n hnc-system -o wide
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                                               SELECTOR
hnc-controller-manager   1/1     1            1           16m   manager      gcr.io/k8s-staging-multitenancy/hnc-manager:v0.8.0   control-plane=controller-manager

$ kubectl get pod -n hnc-system 
NAME                                      READY   STATUS    RESTARTS      AGE
hnc-controller-manager-6b96f74c79-p8xvl   1/1     Running   1 (16m ago)   17m

$ kubectl api-versions | grep hnc
hnc.x-k8s.io/v1alpha2

$ kubectl api-resources | grep hnc
hierarchyconfigurations                                                    hnc.x-k8s.io/v1alpha2                  true         HierarchyConfiguration
hncconfigurations                                                          hnc.x-k8s.io/v1alpha2                  false        HNCConfiguration
subnamespaceanchors               subns                                    hnc.x-k8s.io/v1alpha2                  true         SubnamespaceAnchor

However, the resources aren't recognized by the API server:

$ kubectl get subns --all-namespaces
error: the server doesn't have a resource type "subns"

$ kubectl apply -f ./tools/hnc/hnc-configuration.yaml
Error from server (InternalError): error when retrieving current configuration of:
Resource: "hnc.x-k8s.io/v1alpha2, Resource=hncconfigurations", GroupVersionKind: "hnc.x-k8s.io/v1alpha2, Kind=HNCConfiguration"
Name: "config", Namespace: ""
from server for: "./tools/hnc/hnc-configuration.yaml": Internal error occurred: error resolving resource

$ kubectl get hncconfigurations
Error from server (InternalError): Internal error occurred: error resolving resource

$ kubectl create namespace hnctest
namespace/hnctest created
$ kubectl hns create hncsubnstest -n hnctest
Could not create subnamespace anchor.
Reason: Internal error occurred: error resolving resource

Maybe a clue here:

$ kubectl describe customresourcedefinition.apiextensions.k8s.io/hierarchyconfigurations.hnc.x-k8s.io
[...]
    Message:               could not list instances: unable to find a custom resource client for hierarchyconfigurations.hnc.x-k8s.io: unable to load root certificates: unable to parse bytes as PEM block
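
That last part is telling: the conversion stanza quoted later in this thread carries caBundle: Cg==, and Cg== base64-decodes to a single newline rather than a certificate, so there is indeed no PEM block to parse:

$ echo 'Cg==' | base64 -d | od -c
0000000  \n
0000001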

Note: some apiVersions are deprecated or removed in Kubernetes 1.22, but it looks like the HNC code doesn't use any of them. Please have a look at the Kubernetes changelog.

Additional info:

Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:32:41Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

Manifests used from hnc-v0.8.0 tag.

@adrianludwin
Contributor

What flavour of K8s is this - e.g. managed (GKE, EKS, etc) or self-hosted (e.g. VMWare etc)?

This error message looks pretty suspicious to me:

$ kubectl get hncconfigurations
Error from server (InternalError): Internal error occurred: error resolving resource

That error's coming directly from the K8s apiserver, not from HNC. There shouldn't be any way for us to cause an internal error in the server. This sounds like a bug in K8s itself.

One question - if you fully uninstall HNC (again), do all the other problems go away? E.g. you can delete namespaces, create quotas, etc? (HNC doesn't do anything with quotas by default, did you configure it to manage them?)

@ledroide
Author

ledroide commented Sep 30, 2021

Hello @adrianludwin
Additional info:

  • on-premises installation, CentOS Stream, 4 clusters, a mix of VMs and bare metal
  • the problem with quotas and namespaces relates to the kube-controller-manager being disturbed by the HNC controller, as explained in issues #98071 and #12, and confirmed by @pacoxu (a Kubernetes core developer) in January.
  • HNC changes some finalizers, so the garbage collector is unable to terminate some resources, such as namespaces.
  • everything (including the errors in the kube-controller-manager logs) comes back to normal when I fully delete HNC and manually remove the finalizers from customresourcedefinition.apiextensions.k8s.io/hierarchyconfigurations.hnc.x-k8s.io, customresourcedefinition.apiextensions.k8s.io/hncconfigurations.hnc.x-k8s.io and customresourcedefinition.apiextensions.k8s.io/subnamespaceanchors.hnc.x-k8s.io (otherwise they remain stuck in Terminating status; see the patch sketch after the logs below).
  • I guess you can easily reproduce the issue on a Kubernetes 1.22.x cluster
  • logs from the HNC controller:
$ kubectl logs deploy/hnc-controller-manager -n hnc-system
[...]
E0930 05:41:41.179149       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:229: Failed to watch *v1alpha2.HNCConfiguration: failed to list *v1alpha2.HNCConfiguration: Internal error occurred: error resolving resource
E0930 05:41:50.585121       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:229: Failed to watch *v1alpha2.HierarchyConfiguration: failed to list *v1alpha2.HierarchyConfiguration: Internal error occurred: error resolving resource
E0930 05:42:14.175463       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:229: Failed to watch *v1alpha2.SubnamespaceAnchor: failed to list *v1alpha2.SubnamespaceAnchor: Internal error occurred: error resolving resource
E0930 05:42:37.856475       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:229: Failed to watch *v1alpha2.HNCConfiguration: failed to list *v1alpha2.HNCConfiguration: Internal error occurred: error resolving resource
E0930 05:42:49.808978       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:229: Failed to watch *v1alpha2.HierarchyConfiguration: failed to list *v1alpha2.HierarchyConfiguration: Internal error occurred: error resolving resource
E0930 05:42:50.424179       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:229: Failed to watch *v1alpha2.SubnamespaceAnchor: failed to list *v1alpha2.SubnamespaceAnchor: Internal error occurred: error resolving resource
[...]
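
The finalizer cleanup mentioned above was along these lines (a sketch; it clears metadata.finalizers on each stuck CRD with a JSON merge patch):

$ for crd in hierarchyconfigurations.hnc.x-k8s.io hncconfigurations.hnc.x-k8s.io subnamespaceanchors.hnc.x-k8s.io; do
    kubectl patch crd "$crd" --type=merge -p '{"metadata":{"finalizers":null}}'
  done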

@adrianludwin
Contributor

Hey, sorry for the delay @ledroide. Just another question - you say you can easily reproduce the problem with 1.22; what are the steps to do this? E.g. if you have some script you could run that would cause this problem, that would be great and I could try it out.

@ledroide
Author

ledroide commented Oct 15, 2021

you can easily reproduce the problem with 1.22, what are the steps to do this?

@adrianludwin: HNC was running fine with Kubernetes 1.21.5. We upgraded our clusters to Kubernetes 1.22.2 (using standard Kubespray, which relies on kubeadm and Ansible).

We immediately noticed weird behavior with resource finalizers, at first regarding namespace deletion and quota usage, as described above.

@adrianludwin
Contributor

Ah, I somehow missed that it was working fine earlier, sorry. I'll check this out on 1.22.

@adrianludwin
Contributor

Ok, I've reproduced this on 1.22. Somehow our CRDs have a conversion webhook specified even though there's nothing to convert - something must have changed in our build process. Changing the conversion strategy to None seems to fix this. I'll fix this in v0.9 ASAP.
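
For reference, a no-op conversion stanza in an apiextensions.k8s.io/v1 CRD is just:

  conversion:
    strategy: None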

@adrianludwin
Contributor

Not sure why we're only seeing this in 1.22 though.

@adrianludwin
Contributor

I'll also consider backporting this to v0.8.1.

@adrianludwin
Contributor

@ledroide Do you want me to patch v0.8 with this? It's fairly easy for you to apply locally: just remove the "conversion" stanza from all three CRDs:

  conversion:
    strategy: Webhook
    webhook:
      clientConfig:
        caBundle: Cg==
        service:
          name: webhook-service
          namespace: system
          path: /convert
      conversionReviewVersions:
      - v1
      - v1beta1
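
If you'd rather patch the live CRDs than edit manifests, something like this should also work (a sketch; kubectl edit on each CRD is equivalent):

$ kubectl patch crd subnamespaceanchors.hnc.x-k8s.io --type=json \
    -p='[{"op":"replace","path":"/spec/conversion","value":{"strategy":"None"}}]'

(repeat for hierarchyconfigurations.hnc.x-k8s.io and hncconfigurations.hnc.x-k8s.io)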

@ledroide
Author

ledroide commented Oct 19, 2021

@adrianludwin: I have removed the 3 occurrences of this conversion section from the HNC CRDs and applied the result.

  • subns creation and deletion works
  • all resources set in hnc configuration are propagated to subns
  • quotas can be patched in parent namespaces and propagated to their subns
  • full namespaces can be created and deleted

From my side, this patch fixes the issue for Kubernetes 1.22.

@adrianludwin
Contributor

Thanks @ledroide . Sorry this took so long to resolve, but I'm glad it's working now. I'll be releasing v0.9 soon (this week, hopefully) and this will be fixed in that version.

I'll keep this bug open as a reminder to file a bug against K8s (it's not good that a config this simple can crash the apiserver).
