
csi-rbdplugin on the node crashes on start with nil pointer #4807

Closed
xcompass opened this issue Aug 27, 2024 · 11 comments · Fixed by #4808

Comments

xcompass commented Aug 27, 2024

csi-rbdplugin is crashing on one of the k8s nodes. Here is the log of the csi-rbdplugin container:

Defaulted container "csi-rbdplugin" out of: csi-rbdplugin, driver-registrar, liveness-prometheus
I0826 09:34:01.412332   57346 cephcsi.go:191] Driver version: v3.11.0 and Git version: bc24b5eca87626d690a29effa9d7420cc0154a7a
I0826 09:34:01.413253   57346 cephcsi.go:268] Initial PID limit is set to 256123
I0826 09:34:01.413510   57346 cephcsi.go:274] Reconfigured PID limit to -1 (max)
I0826 09:34:01.414051   57346 cephcsi.go:223] Starting driver type: rbd with name: rbd.csi.ceph.com
I0826 09:34:01.438534   57346 mount_linux.go:282] Detected umount with safe 'not mounted' behavior
I0826 09:34:01.453157   57346 rbd_attach.go:242] nbd module loaded
I0826 09:34:01.453253   57346 rbd_attach.go:256] kernel version "6.6.43-flatcar" supports cookie feature
I0826 09:34:01.497897   57346 rbd_attach.go:272] rbd-nbd tool supports cookie feature
I0826 09:34:01.498969   57346 server.go:114] listening for CSI-Addons requests on address: &net.UnixAddr{Name:"/csi/csi-addons.sock", Net:"unix"}
I0826 09:34:01.499266   57346 server.go:117] Listening for connections on address: &net.UnixAddr{Name:"//csi/csi.sock", Net:"unix"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x1bedd56]

goroutine 75 [running]:
github.com/ceph/ceph-csi/internal/rbd.RunVolumeHealer(0xc0007e7ea0, 0x3b27aa0)
        /go/src/github.com/ceph/ceph-csi/internal/rbd/rbd_healer.go:199 +0x3d6
github.com/ceph/ceph-csi/internal/rbd/driver.(*Driver).Run.func1()
        /go/src/github.com/ceph/ceph-csi/internal/rbd/driver/driver.go:191 +0x1f
created by github.com/ceph/ceph-csi/internal/rbd/driver.(*Driver).Run in goroutine 1
        /go/src/github.com/ceph/ceph-csi/internal/rbd/driver/driver.go:189 +0x749
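For context, the SIGSEGV above is the standard Go failure mode of reading a field through a nil struct pointer. A minimal stand-alone reproduction (hypothetical simplified types for illustration only, not the actual ceph-csi or k8s.io code):

```go
package main

import "fmt"

// Stand-in for v1.CSIPersistentVolumeSource (illustration only).
type CSISource struct{ VolumeHandle string }

// Stand-in for the relevant part of a PV spec: CSI is a pointer
// and is nil on an in-tree RBD PV.
type Spec struct{ CSI *CSISource }

// readHandle dereferences the CSI field; with a nil CSI pointer
// this panics with the same runtime error seen in the log above.
func readHandle(s Spec) (handle string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic: %v", r)
		}
	}()
	return s.CSI.VolumeHandle, nil
}

func main() {
	var migrated Spec // CSI is nil, as on the migrated PV below
	if _, err := readHandle(migrated); err != nil {
		fmt.Println(err)
	}
}
```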

I ran a debugger in the container and traced the crash to this PV:

Name:            pvc-4a532f6e-35c3-11e7-870a-00505601176d
Labels:          <none>
Annotations:     pv.kubernetes.io/bound-by-controller: yes
                 pv.kubernetes.io/migrated-to: rbd.csi.ceph.com
                 pv.kubernetes.io/provisioned-by: kubernetes.io/rbd
Finalizers:      [kubernetes.io/pv-protection external-provisioner.volume.kubernetes.io/finalizer]
StorageClass:    fast
Status:          Bound
Claim:           default/devops-compair-deploy-staging-redis-pvc
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        2Gi
Node Affinity:   <none>
Message:
Source:
    Type:          RBD (a Rados Block Device mount on the host that shares a pod's lifetime)
    CephMonitors:  [10.93.1.100:6789]
    RBDImage:      kubernetes-dynamic-pvc-4a5f348c-35c3-11e7-a683-005056011766
    FSType:
    RBDPool:       rbd
    RadosUser:     kube
    Keyring:       /etc/ceph/keyring
    SecretRef:     &SecretReference{Name:ceph-secret-user,Namespace:default,}
    ReadOnly:      false
Events:            <none>

I also traced that pv.Spec.PersistentVolumeSource.CSI is nil on the line where the healer dereferences it (rbd_healer.go:199 in the stack trace above).

Here is the value of pv.Spec.PersistentVolumeSource before crash:

(dlv) print pv.Spec.PersistentVolumeSource
k8s.io/api/core/v1.PersistentVolumeSource {
        GCEPersistentDisk: *k8s.io/api/core/v1.GCEPersistentDiskVolumeSource nil,
        AWSElasticBlockStore: *k8s.io/api/core/v1.AWSElasticBlockStoreVolumeSource nil,
        HostPath: *k8s.io/api/core/v1.HostPathVolumeSource nil,
        Glusterfs: *k8s.io/api/core/v1.GlusterfsPersistentVolumeSource nil,
        NFS: *k8s.io/api/core/v1.NFSVolumeSource nil,
        RBD: *k8s.io/api/core/v1.RBDPersistentVolumeSource {
                CephMonitors: []string len: 1, cap: 4, [
                        "10.93.1.100:6789",
                ],
                RBDImage: "kubernetes-dynamic-pvc-4a5f348c-35c3-11e7-a683-005056011766",
                FSType: "",
                RBDPool: "rbd",
                RadosUser: "kube",
                Keyring: "/etc/ceph/keyring",
                SecretRef: *(*"k8s.io/api/core/v1.SecretReference")(0xc00061c9a0),
                ReadOnly: false,},
        ISCSI: *k8s.io/api/core/v1.ISCSIPersistentVolumeSource nil,
        Cinder: *k8s.io/api/core/v1.CinderPersistentVolumeSource nil,
        CephFS: *k8s.io/api/core/v1.CephFSPersistentVolumeSource nil,
        FC: *k8s.io/api/core/v1.FCVolumeSource nil,
        Flocker: *k8s.io/api/core/v1.FlockerVolumeSource nil,
        FlexVolume: *k8s.io/api/core/v1.FlexPersistentVolumeSource nil,
        AzureFile: *k8s.io/api/core/v1.AzureFilePersistentVolumeSource nil,
        VsphereVolume: *k8s.io/api/core/v1.VsphereVirtualDiskVolumeSource nil,
        Quobyte: *k8s.io/api/core/v1.QuobyteVolumeSource nil,
        AzureDisk: *k8s.io/api/core/v1.AzureDiskVolumeSource nil,
        PhotonPersistentDisk: *k8s.io/api/core/v1.PhotonPersistentDiskVolumeSource nil,
        PortworxVolume: *k8s.io/api/core/v1.PortworxVolumeSource nil,
        ScaleIO: *k8s.io/api/core/v1.ScaleIOPersistentVolumeSource nil,
        Local: *k8s.io/api/core/v1.LocalVolumeSource nil,
        StorageOS: *k8s.io/api/core/v1.StorageOSPersistentVolumeSource nil,
        CSI: *k8s.io/api/core/v1.CSIPersistentVolumeSource nil,}

It looks like the healer should reference RBD instead of CSI here, or VolumeHealer should skip in-tree RBD volumes.

NOTE: the PV was provisioned by in-tree kubernetes.io/rbd provisioner and migrated to CSI.

I don't have enough knowledge of either the ceph-csi code base or the volume healer to continue debugging. Any suggestions? Thanks!
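One way the suggested skip could look is sketched below (a minimal illustration with simplified stand-in types and a hypothetical helper name, not the real k8s.io/api/core/v1 structs or the actual ceph-csi patch): a PV whose CSI source is nil must be skipped rather than dereferenced.

```go
package main

import "fmt"

// Simplified stand-ins for the k8s.io/api/core/v1 types involved
// (illustration only, not the real API structs).
type RBDSource struct{ RBDImage string }
type CSISource struct {
	Driver       string
	VolumeHandle string
}
type PersistentVolumeSource struct {
	RBD *RBDSource
	CSI *CSISource
}
type PersistentVolume struct {
	Name string
	Spec PersistentVolumeSource
}

// shouldSkipPV is a hypothetical helper: the healer only knows how
// to handle CSI volumes, so any PV whose CSI source is nil (such as
// an in-tree RBD PV carrying only the migrated-to annotation) is
// skipped instead of dereferenced.
func shouldSkipPV(pv *PersistentVolume) bool {
	return pv.Spec.CSI == nil
}

func main() {
	migrated := &PersistentVolume{
		Name: "pvc-4a532f6e-35c3-11e7-870a-00505601176d",
		Spec: PersistentVolumeSource{RBD: &RBDSource{
			RBDImage: "kubernetes-dynamic-pvc-4a5f348c-35c3-11e7-a683-005056011766",
		}},
	}
	if shouldSkipPV(migrated) {
		fmt.Println("skipping non-CSI PV:", migrated.Name)
		return
	}
	fmt.Println("healing:", migrated.Spec.CSI.VolumeHandle)
}
```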

Environment details

  • Image/version of Ceph CSI driver : v3.11.0 and v3.12.1
  • Helm chart version : v3.11.0 and v3.12.1
  • Kernel version : 6.1.90-flatcar
  • Mounter used for mounting PVC (for cephFS its fuse or kernel. for rbd its
    krbd or rbd-nbd) :
  • Kubernetes cluster version : v1.27.16
  • Ceph cluster version : v16.2.15
Madhu-1 added a commit to Madhu-1/ceph-csi that referenced this issue Aug 27, 2024
add a check for CSI as it can be
nil for non-csi PV.

fixes: ceph#4807

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
@mergify mergify bot closed this as completed in #4808 Aug 27, 2024
mergify bot pushed a commit that referenced this issue Aug 27, 2024
add a check for CSI as it can be
nil for non-csi PV.

fixes: #4807

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
mergify bot pushed a commit that referenced this issue Aug 27, 2024
add a check for CSI as it can be
nil for non-csi PV.

fixes: #4807

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit 3ac5968)