
CSI NodePlugin with v.3.12.0 driver fails when missing domain labels #4775

Closed
grazvan96 opened this issue Aug 16, 2024 · 8 comments · Fixed by #4776

Comments

@grazvan96

Describe the bug

If you have normal worker nodes that are missing these labels and install the CSI chart with the latest driver v3.12.0, the nodeplugin pod fails with error 258440 driver.go:145] missing domain labels [failure-domain/region failure-domain/zone] on node "[redacted]"

Environment details

  • Image/version of Ceph CSI driver : v3.12.0
  • Helm chart version : ceph-csi-rbd-3.12.0
  • Kernel version : 6.6.0-14-generic
  • Mounter used for mounting PVC (for CephFS it's fuse or kernel; for RBD it's krbd or rbd-nbd) : rbd-nbd
  • Kubernetes cluster version : v1.30.2+rke2r1
  • Ceph cluster version : 18.2.2

Steps to reproduce

Steps to reproduce the behavior:

  1. Have default nodes, without these labels [failure-domain/region failure-domain/zone]
  2. Deploy the latest ceph csi / rbd version (for example via the helm chart, as sketched below)
  3. See error in nodeplugin
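
A minimal reproduction sketch via the helm chart (the chart repo URL is the upstream ceph-csi charts repo; the release and namespace names are placeholders, and Ceph cluster configuration is omitted for brevity):

  helm repo add ceph-csi https://ceph.github.io/csi-charts
  helm install ceph-csi-rbd ceph-csi/ceph-csi-rbd \
    --namespace ceph-csi-rbd --create-namespace \
    --version 3.12.0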

Actual results

All node plugins fail with error 258440 driver.go:145] missing domain labels [failure-domain/region failure-domain/zone] on node "[redacted]"

Expected behavior

Node plugins should be healthy.

Logs

error 258440 driver.go:145] missing domain labels [failure-domain/region failure-domain/zone] on node "[redacted]"

Additional context

The workaround is simply to add those labels to the nodes; after that it works as expected.
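
For example (the node name and the region/zone values below are placeholders):

  kubectl label node <node-name> failure-domain/region=region-1
  kubectl label node <node-name> failure-domain/zone=zone-1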

grazvan96 reopened this Aug 16, 2024
@jtackaberry

I'm seeing this on a microk8s cluster as well.

Are the expected labels literally failure-domain/region and failure-domain/zone? If anything is going to be mandatory, shouldn't it be topology.kubernetes.io/region and topology.kubernetes.io/zone?

@Infinoid
Contributor

Are the expected labels literally failure-domain/region and failure-domain/zone? If anything is going to be mandatory, shouldn't it be topology.kubernetes.io/region and topology.kubernetes.io/zone?

Standard labels do seem like a better choice, but they are not present on every cluster and there's no requirement that they should be.

The description of topology.kubernetes.io/zone says:

This will be set only if you are using a cloud provider. However, you can consider setting this on nodes if it makes sense in your topology.

And indeed, on an EKS cluster, I see these labels on the nodes:

                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1c

And I don't see these labels on a local cluster built with kubeadm.
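
A quick way to check which of these labels (if any) a node carries is to print them as extra columns:

  # -L adds a column per label; the column is empty if the label is not set
  kubectl get nodes -L topology.kubernetes.io/region -L topology.kubernetes.io/zone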

But digging into the code a bit, it looks to me like they aren't fixed requirements at all; they're configurable.

The rbd manifest has a command-line parameter you can uncomment to specify which labels to use:

            # If topology based provisioning is desired, configure required
            # node labels representing the nodes topology domain
            # and pass the label names below, for CSI to consume and advertise
            # its equivalent topology domain
            # - "--domainlabels=failure-domain/region,failure-domain/zone"

The rbd helm chart specifies those labels in the default values.yaml:

  # NOTE: the value here serves as an example and needs to be
  # updated with node labels that define domains of interest
  domainLabels:
    - failure-domain/region
    - failure-domain/zone

Is that a bad default? Does passing helm the --set topology.domainLabels=[] value solve it?

@Infinoid
Contributor

I ran into the same issue when I updated from 3.11.0 to 3.12.0. Adding this to my values fixed it.

topology:
  domainLabels: []

I submitted PR #4776 to comment out the bad default value. That said, these default label values are four years old, and previous versions (3.10.2 and 3.11.0) did not crash in this way. I don't know what caused 3.12.0 to behave differently, so I can't say whether this was the only problem or not.
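
To apply that override to an existing release, something along these lines should work (the release name, namespace, and values file name here are assumptions):

  # values-override.yaml contains the topology.domainLabels: [] snippet above
  helm upgrade ceph-csi-rbd ceph-csi/ceph-csi-rbd \
    --namespace ceph-csi-rbd --reuse-values \
    -f values-override.yaml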

@awigen

awigen commented Aug 18, 2024

I also hit this issue; adding this fixed the problem:

I ran into the same issue when I updated from 3.11.0 to 3.12.0. Adding this to my values fixed it.

topology:
  domainLabels: []

@ypyly

ypyly commented Aug 19, 2024

I also hit this issue; adding this fixed the problem:

I ran into the same issue when I updated from 3.11.0 to 3.12.0. Adding this to my values fixed it.

topology:
  domainLabels: []

This does indeed "fix" the mentioned problem; however, with a new PVC there's a new one :D

I0819 07:34:41.176426       1 clone_controller.go:66] Starting CloningProtection controller
I0819 07:34:41.176445       1 clone_controller.go:82] Started CloningProtection controller
I0819 07:34:41.176446       1 volume_store.go:98] "Starting save volume queue"
W0819 07:34:41.276968       1 topology.go:319] No topology keys found on any node
W0819 07:34:41.276990       1 topology.go:319] No topology keys found on any node
I0819 07:34:41.277011       1 controller.go:951] "Retrying syncing claim" key="redacted1" failures=0
I0819 07:34:41.277012       1 controller.go:951] "Retrying syncing claim" key="redacted2" failures=0
E0819 07:34:41.277028       1 controller.go:974] error syncing claim "redacted2": failed to provision volume with StorageClass "ceph-rbd-sc": error generating accessibility requirements: no available topology found
E0819 07:34:41.277031       1 controller.go:974] error syncing claim "redacted1": failed to provision volume with StorageClass "ceph-rbd-sc": error generating accessibility requirements: no available topology found

@Madhu-1
Collaborator

Madhu-1 commented Aug 19, 2024

@iPraveenParihar can you please check above?

@iPraveenParihar
Contributor

iPraveenParihar commented Aug 19, 2024

@iPraveenParihar can you please check above?

#4776 should fix "CSI NodePlugin with v.3.12.0 driver fails when missing domain labels" for deployments done via helm.

This does indeed "fix" the mentioned problem; however, with a new PVC there's a new one :D

We have yet to decide how to enable/disable topology-aware provisioning (#4726).
For now, the workaround is to add --feature-gates=Topology=false to the csi-provisioner container in the csi-rbdplugin-provisioner deployment.
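
Roughly, the csi-provisioner container args would then look like this (excerpt only; the image tag and the other args are illustrative, not copied from the actual deployment manifest):

  - name: csi-provisioner
    image: registry.k8s.io/sig-storage/csi-provisioner:v5.0.1
    args:
      - "--csi-address=unix:///csi/csi-provisioner.sock"
      - "--v=1"
      # disables topology-aware provisioning in the provisioner sidecar
      - "--feature-gates=Topology=false"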

@ypyly, can you please try the above workaround and see if that helps?

@ypyly

ypyly commented Aug 19, 2024

@iPraveenParihar yup, that works, thank you for the quick help

mergify bot closed this as completed in 2c0e65b (#4776) Aug 19, 2024
mergify bot pushed a commit that referenced this issue Aug 19, 2024
Remove the bad default, add commented-out standard labels as a suggestion.

Fixes: #4775
Signed-off-by: Mark Glines <mark@glines.org>
(cherry picked from commit 2c0e65b)