
CSI NodePlugin with v.3.12.0 driver fails when missing domain labels #4775

Closed
grazvan96 opened this issue Aug 16, 2024 · 8 comments · Fixed by #4776

Comments

@grazvan96

Describe the bug

If you have normal worker nodes that are missing these labels and install the CSI chart with the latest driver v3.12.0, the nodeplugin pod fails with error 258440 driver.go:145] missing domain labels [failure-domain/region failure-domain/zone] on node "[redacted]"

Environment details

  • Image/version of Ceph CSI driver : v3.12.0
  • Helm chart version : ceph-csi-rbd-3.12.0
  • Kernel version : 6.6.0-14-generic
  • Mounter used for mounting PVC (for CephFS it's fuse or kernel; for RBD it's krbd or rbd-nbd) : rbd-nbd
  • Kubernetes cluster version : v1.30.2+rke2r1
  • Ceph cluster version : 18.2.2

Steps to reproduce

Steps to reproduce the behavior:

  1. Have default nodes, without these labels [failure-domain/region failure-domain/zone]
  2. Deploy the latest ceph csi / rbd version (for example via the helm chart, as sketched below)
  3. See error in nodeplugin
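
A minimal reproduction sketch via the helm chart (the chart repo URL is the upstream ceph-csi charts repo; the release and namespace names are placeholders, and Ceph cluster configuration is omitted for brevity):

  helm repo add ceph-csi https://ceph.github.io/csi-charts
  helm install ceph-csi-rbd ceph-csi/ceph-csi-rbd \
    --namespace ceph-csi-rbd --create-namespace \
    --version 3.12.0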

Actual results

All node plugins fail with error 258440 driver.go:145] missing domain labels [failure-domain/region failure-domain/zone] on node "[redacted]"

Expected behavior

Node plugins should be healthy.

Logs

error 258440 driver.go:145] missing domain labels [failure-domain/region failure-domain/zone] on node "[redacted]"

Additional context

The workaround is simply to add those labels to the nodes; after that it works as expected.
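
For example (the node name and the region/zone values below are placeholders):

  kubectl label node <node-name> failure-domain/region=region-1
  kubectl label node <node-name> failure-domain/zone=zone-1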

grazvan96 reopened this Aug 16, 2024
@jtackaberry

I'm seeing this on a microk8s cluster as well.

Are the expected labels literally failure-domain/region and failure-domain/zone? If anything is going to be mandatory, shouldn't it be topology.kubernetes.io/region and topology.kubernetes.io/zone?

@Infinoid
Contributor

Are the expected labels literally failure-domain/region and failure-domain/zone? If anything is going to be mandatory, shouldn't it be topology.kubernetes.io/region and topology.kubernetes.io/zone?

Standard labels do seem like a better choice, but they are not present on every cluster and there's no requirement that they should be.

The description of topology.kubernetes.io/zone says:

This will be set only if you are using a cloud provider. However, you can consider setting this on nodes if it makes sense in your topology.

And indeed, on an EKS cluster, I see these labels on the nodes:

                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1c

And I don't see these labels on a local cluster built with kubeadm.
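
A quick way to check which of these labels (if any) a node carries is to print them as extra columns:

  # -L adds a column per label; the column is empty if the label is not set
  kubectl get nodes -L topology.kubernetes.io/region -L topology.kubernetes.io/zone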

But digging into the code a bit, it looks to me like they aren't fixed requirements at all; they're configurable.

The rbd manifest has a command-line parameter you can uncomment to specify which labels to use:

            # If topology based provisioning is desired, configure required
            # node labels representing the nodes topology domain
            # and pass the label names below, for CSI to consume and advertise
            # its equivalent topology domain
            # - "--domainlabels=failure-domain/region,failure-domain/zone"

The rbd helm chart specifies those labels in the default values.yaml:

  # NOTE: the value here serves as an example and needs to be
  # updated with node labels that define domains of interest
  domainLabels:
    - failure-domain/region
    - failure-domain/zone

Is that a bad default? Does passing helm the --set topology.domainLabels=[] value solve it?

@Infinoid
Contributor

I ran into the same issue when I updated from 3.11.0 to 3.12.0. Adding this to my values fixed it.

topology:
  domainLabels: []

I submitted PR #4776 to comment out the bad default value. That said, these default label values are four years old, and previous versions (3.10.2 and 3.11.0) did not crash in this way. I don't know what caused 3.12.0 to behave differently, so I can't say whether this was the only problem or not.
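
To apply that override to an existing release, something along these lines should work (the release name, namespace, and values file name here are assumptions):

  # values-override.yaml contains the topology.domainLabels: [] snippet above
  helm upgrade ceph-csi-rbd ceph-csi/ceph-csi-rbd \
    --namespace ceph-csi-rbd --reuse-values \
    -f values-override.yaml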

@awigen

awigen commented Aug 18, 2024

I also hit this issue; adding this fixed the problem:

I ran into the same issue when I updated from 3.11.0 to 3.12.0. Adding this to my values fixed it.

topology:
  domainLabels: []

@ypyly

ypyly commented Aug 19, 2024

I also hit this issue; adding this fixed the problem:

I ran into the same issue when I updated from 3.11.0 to 3.12.0. Adding this to my values fixed it.

topology:
  domainLabels: []

This does indeed "fix" the mentioned problem; however, with a new PVC there's a new one :D

I0819 07:34:41.176426       1 clone_controller.go:66] Starting CloningProtection controller
I0819 07:34:41.176445       1 clone_controller.go:82] Started CloningProtection controller
I0819 07:34:41.176446       1 volume_store.go:98] "Starting save volume queue"
W0819 07:34:41.276968       1 topology.go:319] No topology keys found on any node
W0819 07:34:41.276990       1 topology.go:319] No topology keys found on any node
I0819 07:34:41.277011       1 controller.go:951] "Retrying syncing claim" key="redacted1" failures=0
I0819 07:34:41.277012       1 controller.go:951] "Retrying syncing claim" key="redacted2" failures=0
E0819 07:34:41.277028       1 controller.go:974] error syncing claim "redacted2": failed to provision volume with StorageClass "ceph-rbd-sc": error generating accessibility requirements: no available topology found
E0819 07:34:41.277031       1 controller.go:974] error syncing claim "redacted1": failed to provision volume with StorageClass "ceph-rbd-sc": error generating accessibility requirements: no available topology found

@Madhu-1
Collaborator

Madhu-1 commented Aug 19, 2024

@iPraveenParihar can you please check above?

@iPraveenParihar
Contributor

iPraveenParihar commented Aug 19, 2024

@iPraveenParihar can you please check above?

#4776 should fix "CSI NodePlugin with v.3.12.0 driver fails when missing domain labels" for deployments done via helm.

This does indeed "fix" the mentioned problem; however, with a new PVC there's a new one :D

We have yet to decide how to enable/disable topology-aware provisioning (#4726).
For now, the workaround is to add --feature-gates=Topology=false to the csi-provisioner container in the csi-rbdplugin-provisioner deployment.
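
Roughly, the csi-provisioner container args would then look like this (excerpt only; the image tag and the other args are illustrative, not copied from the actual deployment manifest):

  - name: csi-provisioner
    image: registry.k8s.io/sig-storage/csi-provisioner:v5.0.1
    args:
      - "--csi-address=unix:///csi/csi-provisioner.sock"
      - "--v=1"
      # disables topology-aware provisioning in the provisioner sidecar
      - "--feature-gates=Topology=false"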

@ypyly, can you please try the above workaround and see if that helps?

@ypyly

ypyly commented Aug 19, 2024

@iPraveenParihar yup, that works, thank you for the quick help

mergify bot closed this as completed in 2c0e65b (#4776) Aug 19, 2024
mergify bot pushed a commit that referenced this issue Aug 19, 2024
Remove the bad default, add commented-out standard labels as a suggestion.

Fixes: #4775
Signed-off-by: Mark Glines <mark@glines.org>
(cherry picked from commit 2c0e65b)