HNC under Kubernetes 1.22 : "error resolving resource" #86

Closed
ledroide opened this issue Sep 29, 2021 · 11 comments · Fixed by #103

@ledroide

After upgrading a cluster from Kubernetes 1.21.5 to 1.22.2,

  • quotas can't be updated or created, namespaces can't be deleted, and the garbage collector fails to finalize resources.
  • HNC doesn't recognize its own resources
$ stern kube-controller-manager -n kube-system --since 15m
kube-controller-manager-kube-dev-master1 kube-controller-manager E0929 09:59:35.967260       1 garbagecollector.go:242] timed out waiting for dependency graph builder sync during GC sync (attempt 207)
kube-controller-manager-kube-dev-master1 kube-controller-manager I0929 09:59:36.068606       1 shared_informer.go:240] Waiting for caches to sync for garbage collector
kube-controller-manager-kube-dev-master1 kube-controller-manager I0929 09:59:46.891909       1 shared_informer.go:240] Waiting for caches to sync for resource quota
kube-controller-manager-kube-dev-master1 kube-controller-manager E0929 10:00:00.762819       1 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: error resolving resource

I have removed all HNC resources, including the CRDs, as explained in kubernetes-retired/multi-tenancy#1255 (comment), then checked that the removal was effective:

$ kubectl api-resources | grep hnc
$ kubectl get namespace hnc-system
Error from server (NotFound): namespaces "hnc-system" not found
$ kubectl get hncconfigurations
error: the server doesn't have a resource type "hncconfigurations"
$ kubectl get subns -A
error: the server doesn't have a resource type "subns"
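
For reference, the removal itself followed the linked comment; roughly like this (a sketch, assuming the original install used the standard hnc-manager.yaml release asset):

$ kubectl delete -f https://github.com/kubernetes-sigs/multi-tenancy/releases/download/hnc-v0.8.0/hnc-manager.yaml
$ kubectl delete crd hierarchyconfigurations.hnc.x-k8s.io hncconfigurations.hnc.x-k8s.io subnamespaceanchors.hnc.x-k8s.io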

Then I reinstalled the HNC controller v0.8.0 + HNCConfiguration as documented (standard manifests). Everything looks good.
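
The documented reinstall is essentially a single manifest apply (the URL below is an assumption based on the standard hnc-v0.8.0 release layout; the HNCConfiguration itself came from a local file):

$ kubectl apply -f https://github.com/kubernetes-sigs/multi-tenancy/releases/download/hnc-v0.8.0/hnc-manager.yaml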

$ kubectl get deploy -n hnc-system -o wide
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                                               SELECTOR
hnc-controller-manager   1/1     1            1           16m   manager      gcr.io/k8s-staging-multitenancy/hnc-manager:v0.8.0   control-plane=controller-manager

$ kubectl get pod -n hnc-system 
NAME                                      READY   STATUS    RESTARTS      AGE
hnc-controller-manager-6b96f74c79-p8xvl   1/1     Running   1 (16m ago)   17m

$ kubectl api-versions | grep hnc
hnc.x-k8s.io/v1alpha2

$ kubectl api-resources | grep hnc
hierarchyconfigurations                                                    hnc.x-k8s.io/v1alpha2                  true         HierarchyConfiguration
hncconfigurations                                                          hnc.x-k8s.io/v1alpha2                  false        HNCConfiguration
subnamespaceanchors               subns                                    hnc.x-k8s.io/v1alpha2                  true         SubnamespaceAnchor

However, the resources aren't recognized by the API server:

$ kubectl get subns --all-namespaces
error: the server doesn't have a resource type "subns"

$ kubectl apply -f ./tools/hnc/hnc-configuration.yaml
Error from server (InternalError): error when retrieving current configuration of:
Resource: "hnc.x-k8s.io/v1alpha2, Resource=hncconfigurations", GroupVersionKind: "hnc.x-k8s.io/v1alpha2, Kind=HNCConfiguration"
Name: "config", Namespace: ""
from server for: "./tools/hnc/hnc-configuration.yaml": Internal error occurred: error resolving resource

$ kubectl get hncconfigurations
Error from server (InternalError): Internal error occurred: error resolving resource

$ kubectl create namespace hnctest
namespace/hnctest created
$ kubectl hns create hncsubnstest -n hnctest
Could not create subnamespace anchor.
Reason: Internal error occurred: error resolving resource

Maybe a clue here:

$ kubectl describe customresourcedefinition.apiextensions.k8s.io/hierarchyconfigurations.hnc.x-k8s.io
[...]
    Message:               could not list instances: unable to find a custom resource client for hierarchyconfigurations.hnc.x-k8s.io: unable to load root certificates: unable to parse bytes as PEM block
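
That last part is telling: the conversion stanza quoted later in this thread carries caBundle: Cg==, and Cg== base64-decodes to a single newline rather than a certificate, so there is indeed no PEM block to parse:

$ echo 'Cg==' | base64 -d | od -c
0000000  \n
0000001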

Note: some apiVersions are deprecated or removed in Kubernetes 1.22, but it looks like the HNC code doesn't use any of them. Please have a look at the Kubernetes changelog.

Additional info:

Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:32:41Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

Manifests used from hnc-v0.8.0 tag.

@adrianludwin
Contributor

What flavour of K8s is this - e.g. managed (GKE, EKS, etc) or self-hosted (e.g. VMWare etc)?

This error message looks pretty suspicious to me:

$ kubectl get hncconfigurations
Error from server (InternalError): Internal error occurred: error resolving resource

That error's coming directly from the K8s apiserver, not from HNC. There shouldn't be any way for us to cause an internal error in the server. This sounds like a bug in K8s itself.

One question - if you fully uninstall HNC (again), do all the other problems go away? E.g. you can delete namespaces, create quotas, etc? (HNC doesn't do anything with quotas by default, did you configure it to manage them?)

@ledroide
Author

ledroide commented Sep 30, 2021

Hello @adrianludwin
Additional info:

  • on-premises installation, CentOS Stream, 4 clusters, a mix of VMs and bare metal
  • the problem with quotas and namespaces relates to the kube-controller-manager being disturbed by the HNC controller, as explained in issues #98071 and #12, and confirmed by @pacoxu (a Kubernetes core developer) in January.
  • HNC changes some finalizers, so the garbage collector is unable to terminate some resources, such as namespaces.
  • everything (including the errors in the kube-controller-manager logs) comes back to normal when I fully delete HNC and manually remove the finalizers from customresourcedefinition.apiextensions.k8s.io/hierarchyconfigurations.hnc.x-k8s.io, customresourcedefinition.apiextensions.k8s.io/hncconfigurations.hnc.x-k8s.io and customresourcedefinition.apiextensions.k8s.io/subnamespaceanchors.hnc.x-k8s.io (otherwise they remain stuck in Terminating status; see the patch sketch after the logs below).
  • I guess you can easily reproduce the issue on a Kubernetes 1.22.x cluster
  • logs from the HNC controller:
$ kubectl logs deploy/hnc-controller-manager -n hnc-system
[...]
E0930 05:41:41.179149       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:229: Failed to watch *v1alpha2.HNCConfiguration: failed to list *v1alpha2.HNCConfiguration: Internal error occurred: error resolving resource
E0930 05:41:50.585121       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:229: Failed to watch *v1alpha2.HierarchyConfiguration: failed to list *v1alpha2.HierarchyConfiguration: Internal error occurred: error resolving resource
E0930 05:42:14.175463       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:229: Failed to watch *v1alpha2.SubnamespaceAnchor: failed to list *v1alpha2.SubnamespaceAnchor: Internal error occurred: error resolving resource
E0930 05:42:37.856475       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:229: Failed to watch *v1alpha2.HNCConfiguration: failed to list *v1alpha2.HNCConfiguration: Internal error occurred: error resolving resource
E0930 05:42:49.808978       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:229: Failed to watch *v1alpha2.HierarchyConfiguration: failed to list *v1alpha2.HierarchyConfiguration: Internal error occurred: error resolving resource
E0930 05:42:50.424179       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:229: Failed to watch *v1alpha2.SubnamespaceAnchor: failed to list *v1alpha2.SubnamespaceAnchor: Internal error occurred: error resolving resource
[...]
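
The finalizer cleanup mentioned above was along these lines (a sketch; it clears metadata.finalizers on each stuck CRD with a JSON merge patch):

$ for crd in hierarchyconfigurations.hnc.x-k8s.io hncconfigurations.hnc.x-k8s.io subnamespaceanchors.hnc.x-k8s.io; do
    kubectl patch crd "$crd" --type=merge -p '{"metadata":{"finalizers":null}}'
  done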

@adrianludwin
Contributor

Hey, sorry for the delay @ledroide. Just another question - you say you can easily reproduce the problem with 1.22; what are the steps to do this? E.g. if you have some script you could run that would cause this problem, that would be great and I could try it out.

@ledroide
Author

ledroide commented Oct 15, 2021

you can easily reproduce the problem with 1.22, what are the steps to do this?

@adrianludwin: HNC was running fine with Kubernetes 1.21.5. We upgraded our clusters to Kubernetes 1.22.2 (using standard Kubespray, which relies on kubeadm and Ansible).

We immediately noticed weird behavior with resource finalizers, at first regarding namespace deletion and quota usage, as described above.

@adrianludwin
Contributor

Ah, I somehow missed that it was working fine earlier, sorry. I'll check this out on 1.22.

@adrianludwin
Contributor

Ok, I've reproduced this on 1.22. Somehow our CRDs have a conversion webhook specified even though there's nothing to convert - something must have changed in our build process. Changing the conversion strategy to None seems to fix this. I'll fix this in v0.9 ASAP.
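
For reference, a no-op conversion stanza in an apiextensions.k8s.io/v1 CRD is just:

  conversion:
    strategy: None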

@adrianludwin
Contributor

Not sure why we're only seeing this in 1.22 though.

@adrianludwin
Contributor

I'll also consider backporting this to v0.8.1.

@adrianludwin
Contributor

@ledroide Do you want me to patch v0.8 with this? It's fairly easy for you to apply locally: just remove the "conversion" stanza from all three CRDs:

  conversion:
    strategy: Webhook
    webhook:
      clientConfig:
        caBundle: Cg==
        service:
          name: webhook-service
          namespace: system
          path: /convert
      conversionReviewVersions:
      - v1
      - v1beta1
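
If you'd rather patch the live CRDs than edit manifests, something like this should also work (a sketch; kubectl edit on each CRD is equivalent):

$ kubectl patch crd subnamespaceanchors.hnc.x-k8s.io --type=json \
    -p='[{"op":"replace","path":"/spec/conversion","value":{"strategy":"None"}}]'

(repeat for hierarchyconfigurations.hnc.x-k8s.io and hncconfigurations.hnc.x-k8s.io)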

@ledroide
Author

ledroide commented Oct 19, 2021

@adrianludwin: I have removed the 3 occurrences of this conversion section from the HNC CRDs and applied the result.

  • subns creation and deletion works
  • all resources set in hnc configuration are propagated to subns
  • quotas can be patched in parent namespaces and propagated to their subns
  • full namespaces can be created and deleted

From my side, this patch fixes the issue for Kubernetes 1.22.

@adrianludwin
Contributor

Thanks @ledroide . Sorry this took so long to resolve, but I'm glad it's working now. I'll be releasing v0.9 soon (this week, hopefully) and this will be fixed in that version.

I'll keep this bug open as a reminder to file a bug against K8s (it's not good that a config this simple can crash the apiserver).
