
User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "nodes" #4800

Closed
Infinoid opened this issue Aug 25, 2024 · 4 comments


@Infinoid (Contributor)

Describe the bug

After upgrading from 3.11.0 to 3.12.1 (using helm), the csi-provisioner log shows constant permission errors. I updated both the rbd and cephfs charts, but the errors only appear in the rbd provisioner.

Environment details

  • Image/version of Ceph CSI driver : 3.12.1
  • Helm chart version : 3.12.1
  • Kernel version : 6.6.6
  • Mounter used for mounting PVC: krbd
  • Kubernetes cluster version : 1.30.3
  • Ceph cluster version : 18.2.4

Steps to reproduce

Steps to reproduce the behavior:

  1. Install and configure rbd 3.11.0 using helm.
  2. Set up a storage class called "rbd".
  3. Create and mount a PVC (e.g. the minimal sketch after this list), watch it get provisioned.
  4. Unmount and delete the PVC.
  5. Update helm chart to 3.12.1.
  6. Create and mount a PVC, watch it never get provisioned.
  7. Check the csi-provisioner log to see non-stop permission errors.
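For steps 3 and 6, any small PVC that targets the "rbd" class reproduces it; a minimal sketch (the name and namespace are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc          # illustrative name
  namespace: test
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rbd   # the class created in step 2
  resources:
    requests:
      storage: 1Gi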

Actual results

The rbd volume is never provisioned, and the pod never starts.

Expected behavior

Provisioner provisions without error, same as 3.11.0.

Logs

The csi-provisioner container logs repeat these messages at some interval:

W0825 13:10:32.334012       1 reflector.go:547] k8s.io/client-go/informers/factory.go:160: failed to list *v1.CSINode: csinodes.storage.k8s.io is forbidden: User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "csinodes" in API group "storage.k8s.io" at the cluster scope
E0825 13:10:32.334037       1 reflector.go:150] k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.CSINode: failed to list *v1.CSINode: csinodes.storage.k8s.io is forbidden: User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "csinodes" in API group "storage.k8s.io" at the cluster scope
W0825 13:10:32.334128       1 reflector.go:547] k8s.io/client-go/informers/factory.go:160: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "nodes" in API group "" at the cluster scope
E0825 13:10:32.334140       1 reflector.go:150] k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "nodes" in API group "" at the cluster scope
W0825 13:10:33.398178       1 reflector.go:547] k8s.io/client-go/informers/factory.go:160: failed to list *v1.CSINode: csinodes.storage.k8s.io is forbidden: User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "csinodes" in API group "storage.k8s.io" at the cluster scope
E0825 13:10:33.398197       1 reflector.go:150] k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.CSINode: failed to list *v1.CSINode: csinodes.storage.k8s.io is forbidden: User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "csinodes" in API group "storage.k8s.io" at the cluster scope
W0825 13:10:33.480147       1 reflector.go:547] k8s.io/client-go/informers/factory.go:160: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "nodes" in API group "" at the cluster scope
E0825 13:10:33.480164       1 reflector.go:150] k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:serviceaccount:ceph-csi-rbd:ceph-csi-rbd-provisioner" cannot list resource "nodes" in API group "" at the cluster scope

I see lots of errors in the csi-snapshotter logs too, but I think that's unrelated. (I don't have any VolumeSnapshotClasses defined.)

I see nothing of note in other ceph-csi-rbd logs.

In the application's namespace, I see events like:

0s          Normal    ExternalProvisioning   persistentvolumeclaim/data-mariadb-0   Waiting for a volume to be created either by the external provisioner 'rbd.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.

Additional context

Terraform installation & configuration

This is how I installed and configured it. To update, I only changed the "version" line.

resource "helm_release" "rbd" {
  name      = "ceph-csi-rbd"
  namespace = kubernetes_namespace.rbd.metadata.0.name

  repository = "https://ceph.github.io/csi-charts"
  chart      = "ceph-csi-rbd"
  version    = "v3.12.1"

  values = [
    yamlencode({
      cephconf  = local.ceph_conf
      csiConfig = local.csi_config
      storageClass = {
        create    = true
        name      = "rbd"
        pool      = "rbd"
        dataPool  = "rbd.ec124"
        clusterID = var.clusterID
      }
      }
      readAffinity = {
        enabled = true
      }
    })
  ]
}

What the previous (successful) version looks like

I tested provisioning and removal using the bitnami/mariadb helm chart. I am using the same chart, configured the same way, for old and new versions of ceph-csi-rbd. Here's what the v3.11.0 csi-provisioner log looks like:

I0825 12:36:20.610487       1 leaderelection.go:260] successfully acquired lease ceph-csi-rbd/rbd-csi-ceph-com
I0825 12:36:20.710870       1 controller.go:811] Starting provisioner controller rbd.csi.ceph.com_ceph-csi-rbd-provisioner-66bff55c47-9p97p_4ced4187-44dc-4ba1-9921-9f12d48b958e!
I0825 12:36:20.710892       1 clone_controller.go:66] Starting CloningProtection controller
I0825 12:36:20.710899       1 volume_store.go:97] Starting save volume queue
I0825 12:36:20.710911       1 clone_controller.go:82] Started CloningProtection controller
I0825 12:36:20.811803       1 controller.go:1366] provision "test/data-mariadb-0" class "rbd": started
I0825 12:36:20.811818       1 controller.go:860] Started provisioner controller rbd.csi.ceph.com_ceph-csi-rbd-provisioner-66bff55c47-9p97p_4ced4187-44dc-4ba1-9921-9f12d48b958e!
I0825 12:36:20.812329       1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"test", Name:"data-mariadb-0", UID:"2a45a0a1-d09c-495a-9846-76eed47a0b2c", APIVersion:"v1", ResourceVersion:"65771679", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "test/data-mariadb-0"
I0825 12:36:21.247153       1 controller.go:1449] provision "test/data-mariadb-0" class "rbd": volume "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c" provisioned
I0825 12:36:21.247168       1 controller.go:1462] provision "test/data-mariadb-0" class "rbd": succeeded
I0825 12:36:21.699252       1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"test", Name:"data-mariadb-0", UID:"2a45a0a1-d09c-495a-9846-76eed47a0b2c", APIVersion:"v1", ResourceVersion:"65771679", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c
I0825 13:07:54.161702       1 controller.go:1509] delete "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": started
I0825 13:07:54.586347       1 controller.go:1524] delete "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": volume deleted
I0825 13:07:54.926752       1 controller.go:1561] delete "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": failed to remove finalizer for persistentvolume: Operation cannot be fulfilled on persistentvolumes "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": the object has been modified; please apply your changes to the latest version and try again
W0825 13:07:54.926775       1 controller.go:989] Retrying syncing volume "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c", failure 0
E0825 13:07:54.926798       1 controller.go:1007] error syncing volume "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": Operation cannot be fulfilled on persistentvolumes "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": the object has been modified; please apply your changes to the latest version and try again
I0825 13:07:54.926811       1 controller.go:1509] delete "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": started
I0825 13:07:55.034306       1 controller.go:1524] delete "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": volume deleted
I0825 13:07:55.154234       1 controller.go:1569] delete "pvc-2a45a0a1-d09c-495a-9846-76eed47a0b2c": persistentvolume deleted succeeded

In comparison, v3.12.1 gets the lease, fails to create its watches, and keeps retrying forever; it doesn't even notice the PVC.

RBAC

In the installation manifest, permission to read nodes and csinodes is always granted.

In the helm chart, permission is only granted when the domainLabels list is non-empty, and that list is now empty by default (#4776). But the provisioner is still trying to read node/csinode objects, and apparently can't finish its setup phase without them. So that seems to be why it's failing now.
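For reference, the rules the informers are asking for look roughly like this; a minimal sketch, with the resources and verbs taken from the error messages above and the ClusterRole name assumed from the chart release:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ceph-csi-rbd-provisioner   # assumed name; check the release's actual ClusterRole
rules:
  # the reflector errors above show failed list/watch on exactly these two resources
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["csinodes"]
    verbs: ["get", "list", "watch"]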

At this point, I'm feeling a little lost. I feel like an enabled feature with an empty configuration should do nothing. But this is doing a little too much nothing 😁. Should the provisioner have permission to look at nodes, regardless of whether domain labels are defined in the helm chart?

I saw the discussion of command line arguments in #4777 and #4790. Was that intended to fix this issue? I checked and my provisioner is indeed being passed the --immediate-topology=false flag.

@Infinoid (Contributor, Author)

For now, I've worked around it by adding the missing node/csinode permissions to the provisioner's ClusterRole (per the sketch above). Once I did that, the provisioner finished its startup phase, saw the PVC, provisioned the PV, and life is good.

@Infinoid (Contributor, Author) commented Aug 25, 2024

Hmm. With the above workaround, the pod started and is apparently running. But I see this event on the pod that mounted the new PVC:

  Warning  FailedScheduling        2m37s              default-scheduler        0/8 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.

Other than that event, I don't see the word "immediate" anywhere in the pod, PVC, or PV. I don't really know what it means.
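As far as I can tell, the scheduler calls a PVC "immediate" when its StorageClass uses volumeBindingMode: Immediate, which is also the behavior when the field is unset, so the event just means the PVC had not bound yet when the pod was first scheduled. For comparison, a minimal sketch of a class that defers binding until a pod appears; the names are copied from my setup, and the secret parameters the chart also sets are omitted:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rbd-wffc               # illustrative name
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <clusterID>       # placeholder, as in the Terraform above
  pool: rbd
  dataPool: rbd.ec124
# WaitForFirstConsumer defers binding and provisioning until a pod uses the PVC,
# so the scheduler stops reporting "unbound immediate PersistentVolumeClaims".
volumeBindingMode: WaitForFirstConsumer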

@dragoangel (Contributor) commented Aug 26, 2024

@Infinoid hi, this will be fixed by #4798; the issue is covered in #4790 (comment).

@Infinoid (Contributor, Author)

Thanks. Your discussion on #4790 was after it was closed, so I missed it. I think you're right, this is a duplicate.
