
[RBAC] Helm Chart 3.10.0: csi-rbdplugin container enters CrashLoopBackoff with failed to get node error #4306

Closed
remisauvat opened this issue Dec 6, 2023 · 5 comments · Fixed by #4302

Comments

@remisauvat

Describe the bug

Hello,
After upgrading the Helm chart from 3.9.0 to 3.10.0, the csi-rbdplugin container crashes in a loop with a permissions error when fetching the nodes resource.

F1206 16:31:23.921143       1 driver.go:131] failed to get node "xxxxxxxx" information: nodes "xxxxxxxx" is forbidden: User "system:serviceaccount:ceph-csi-rbd:cph-cs-rbd-ceph-csi-rbd-provisioner" cannot get resource "nodes" in API group "" at the cluster scope

This is probably related to the changes from #4165, but the RBAC rules for the service account do not include the new permission required to fetch node labels.
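For reference, the permission the driver now tries to use corresponds to a standard Kubernetes RBAC rule like the one below. This is a sketch based on the error message in this issue; the exact verb list the driver needs is an assumption (the log only shows `get` failing):

```yaml
# Assumed RBAC rule for reading node objects (and their labels).
# Only "get" is confirmed by the error message; "list"/"watch" are guesses.
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
```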

Environment details

  • Image/version of Ceph CSI driver : 3.10.0
  • Helm chart version : 3.10.0
  • Kernel version : 5.15.0-89-generic
  • Mounter used for mounting PVC (for cephFS its fuse or kernel. for rbd its
    krbd or rbd-nbd) :
  • Kubernetes cluster version : 1.27.3
  • Ceph cluster version :

Steps to reproduce

Steps to reproduce the behavior:

  1. Deploy helm chart ceph-csi/ceph-csi-rbd with version 3.10.0.
  2. Make sure rbac.create: true is set in values.yaml
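For step 2, the relevant values.yaml fragment looks like this (a minimal sketch; all other chart values left at their defaults):

```yaml
# values.yaml for the ceph-csi/ceph-csi-rbd chart
rbac:
  create: true
```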

Actual results

Pod csi-rbd-provisioner is in CrashLoopBackOff state due to a failure in the csi-rbdplugin container.

Expected behavior

The container and pod should not crash

Logs

csi-rbdplugin:

I1206 16:31:23.907599       1 cephcsi.go:191] Driver version: v3.10.0 and Git version: 24ae2a7a062b3e58746bb9cc6d5737e37a7e771c
I1206 16:31:23.907720       1 cephcsi.go:223] Starting driver type: rbd with name: rbd.csi.ceph.com
I1206 16:31:23.907750       1 driver.go:94] Enabling controller service capability: CREATE_DELETE_VOLUME
I1206 16:31:23.907755       1 driver.go:94] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I1206 16:31:23.907761       1 driver.go:94] Enabling controller service capability: CLONE_VOLUME
I1206 16:31:23.907765       1 driver.go:94] Enabling controller service capability: EXPAND_VOLUME
I1206 16:31:23.907770       1 driver.go:107] Enabling volume access mode: SINGLE_NODE_WRITER
I1206 16:31:23.907774       1 driver.go:107] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I1206 16:31:23.907778       1 driver.go:107] Enabling volume access mode: SINGLE_NODE_SINGLE_WRITER
I1206 16:31:23.907918       1 driver.go:107] Enabling volume access mode: SINGLE_NODE_MULTI_WRITER
F1206 16:31:23.921143       1 driver.go:131] failed to get node "xxxxxxx" information: nodes "xxxxxxx" is forbidden: User "system:serviceaccount:ceph-csi-rbd:cph-cs-rbd-ceph-csi-rbd-provisioner" cannot get resource "nodes" in API group "" at the cluster scope

Additional context

I don't think it's the same issue as #4298. Setting readAffinity.enabled to true or false doesn't change the issue.

@Rakshith-R Rakshith-R linked a pull request Dec 7, 2023 that will close this issue
@Rakshith-R
Contributor

@remisauvat The linked PR should solve the issue. We'll include the fix in v3.10.1 soon.

As a workaround, could you please add the required ClusterRole to the provisioner RBAC as well?
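A minimal sketch of that workaround is below. The resource names are hypothetical, and the service-account name and namespace are taken from the error message in this issue; adjust them to match your release:

```yaml
# Hypothetical workaround manifest: grants the provisioner service
# account cluster-wide read access to nodes. All names here are
# assumptions based on the error message above, not chart output.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ceph-csi-rbd-provisioner-nodes
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ceph-csi-rbd-provisioner-nodes
subjects:
  - kind: ServiceAccount
    name: cph-cs-rbd-ceph-csi-rbd-provisioner   # from the error message
    namespace: ceph-csi-rbd
roleRef:
  kind: ClusterRole
  name: ceph-csi-rbd-provisioner-nodes
  apiGroup: rbac.authorization.k8s.io
```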

@remisauvat
Author

The linked PR will not solve the issue because it only adds get nodes permission to the nodeplugin clusterrole.

The provisioner service account is linked only to a Role which cannot get nodes. So I think there is a need to create a new clusterrole for the provisioner to allow get nodes.

I manually patched a clusterrole for the provisioner and also patched for #4297 and now it works. I am in a lab cluster so I will revert to v3.9.0 and wait for a chart version that can fix this. I am sorry I am not able to provide a PR for this.

@Rakshith-R
Contributor

> The linked PR will not solve the issue because it only adds get nodes permission to the nodeplugin clusterrole.
>
> The provisioner service account is linked only to a Role which cannot get nodes. So I think there is a need to create a new clusterrole for the provisioner to allow get nodes.
>
> I manually patched a clusterrole for the provisioner and also patched for #4297 and now it works. I am in a lab cluster so I will revert to v3.9.0 and wait for a chart version that can fix this. I am sorry I am not able to provide a PR for this.

The latest commit in that PR moves the code to run only in the node server, which will solve the issue.

@remisauvat
Author

Oh, I didn't get that. Then you are right, it should solve the issue.
I will wait for 3.10.1 to test it.

Thank you

@XtremeOwnageDotCom

XtremeOwnageDotCom commented Dec 8, 2023

Here is a single-line command to fix the issue for now:

```shell
kubectl patch ClusterRole ceph-csi-rbd-nodeplugin --type=json \
  -p='[{"op":"add","path":"/rules/-","value":{"apiGroups":[""],"resources":["nodes"],"verbs":["get","list","watch"]}}]'
```

#4302 will fix this.

@mergify mergify bot closed this as completed in #4302 Dec 11, 2023