Verify network connectivity between the CSI pods and the Ceph cluster
This can be done by running curl/ping or by executing ceph commands from the cephfs/rbd plugin container of the provisioner and daemonset pods.
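For illustration, a minimal sketch of what such a check could run, assuming Rook's default namespace and CSI object names (rook-ceph, the csi-rbdplugin-provisioner deployment, the csi-rbdplugin daemonset) and that curl is present in the plugin image:

```bash
# Mon endpoints as recorded by Rook in the rook-ceph-mon-endpoints ConfigMap.
kubectl -n rook-ceph get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}'; echo

# Check that a mon is reachable from a provisioner pod (replace MON_IP with a value from above).
kubectl -n rook-ceph exec deploy/csi-rbdplugin-provisioner -c csi-rbdplugin -- \
  curl -v -m 5 telnet://MON_IP:6789

# Repeat from a nodeplugin (daemonset) pod to cover the per-node network path.
kubectl -n rook-ceph exec ds/csi-rbdplugin -c csi-rbdplugin -- \
  curl -v -m 5 telnet://MON_IP:6789
```

A "Connected to MON_IP" line in the curl output is all we are looking for here; the command itself will time out since the mon does not speak HTTP/telnet.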
Command to get dmesg logs from the node, based on the pod name or PVC name
If a PVC is not attaching to a given pod, we can identify which node the pod is scheduled on, run dmesg from the rbdplugin pod on that node, and print the logs, which helps with debugging.
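A sketch of what that could look like (pod and namespace names are placeholders; assumes the app=csi-rbdplugin label Rook sets on the daemonset pods and that the privileged plugin container can read the kernel log):

```bash
# Which node is the application pod (the one using the PVC) scheduled on?
NODE=$(kubectl -n <app-namespace> get pod <app-pod> -o jsonpath='{.spec.nodeName}')

# Find the csi-rbdplugin pod running on that node.
PLUGIN=$(kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
  --field-selector spec.nodeName="$NODE" -o name)

# Dump the tail of the node's kernel log from that plugin pod.
kubectl -n rook-ceph exec "$PLUGIN" -c csi-rbdplugin -- dmesg | tail -n 100
```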
Command to check for any stale map or mount commands in the CSI plugin pods
More details about which commands to run and where to run them are documented here and here.
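Roughly the kind of check such a command would wrap (a sketch; assumes ps and the rbd CLI are available in the cephcsi image, which they normally are):

```bash
# Look for stuck rbd map / mount / umount processes inside a nodeplugin pod.
kubectl -n rook-ceph exec <csi-rbdplugin-pod> -c csi-rbdplugin -- ps -ef \
  | grep -E 'rbd|mount' | grep -v grep

# List RBD devices still mapped on that node, to compare against the PVs that should be attached.
kubectl -n rook-ceph exec <csi-rbdplugin-pod> -c csi-rbdplugin -- rbd device list
```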
Command to pull the required logs from the leader provisioner and sidecar containers, based on the pod name or the PVC name
We might have two provisioner pods running, and for newcomers it is sometimes hard to find the leader pod and which container to pull the logs from. A simple helper command could pull the required logs (see the sketch after this list):
For PVC create/delete issues, pull logs from the csi-provisioner and csi-rbdplugin/csi-cephfsplugin containers
For snapshot create/delete issues, pull logs from the csi-snapshotter and csi-rbdplugin/csi-cephfsplugin containers
For resize issues, pull logs from the csi-resizer and csi-rbdplugin/csi-cephfsplugin containers
etc.
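A rough sketch of how the helper could find the leader and pull the right logs today, using the leader-election Lease objects the CSI sidecars create (lease names vary by driver and version, so list them first):

```bash
# The CSI sidecars record their leader in coordination Leases; the holderIdentity is the leader pod.
kubectl -n rook-ceph get lease
kubectl -n rook-ceph get lease <provisioner-lease-name> -o jsonpath='{.spec.holderIdentity}'; echo

# Then pull the matching sidecar + plugin logs, e.g. for a PVC create/delete problem:
kubectl -n rook-ceph logs <leader-provisioner-pod> -c csi-provisioner --since=1h
kubectl -n rook-ceph logs <leader-provisioner-pod> -c csi-rbdplugin --since=1h
```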
Command to run ceph commands from the provisioner or nodeplugin containers
This helps in cases where CSI is not returning the expected results but the admin still wants to manually mount or unmount the RBD image, or run rados or ceph fs commands.
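The awkward part today is that the plugin containers do not carry a ceph.conf or keyring, so any helper would need to feed in the mon endpoints and a key. A sketch, with the caveat that the secret and key names below are assumptions that can differ between Rook versions:

```bash
# Mon endpoints and the admin key from the Rook cluster (handle the key with care).
MON=$(kubectl -n rook-ceph get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}' | sed 's/[a-z]*=//g')
KEY=$(kubectl -n rook-ceph get secret rook-ceph-mon -o jsonpath='{.data.ceph-secret}' | base64 -d)

# Run an arbitrary ceph/rbd/rados command from a nodeplugin container.
kubectl -n rook-ceph exec <csi-rbdplugin-pod> -c csi-rbdplugin -- \
  ceph -m "$MON" --id admin --key "$KEY" status
```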
Command to identify and clean up stale resources (RBD images/subvolumes/omap entries)
This is a big topic and needs a lot of automation, but it would be really helpful; at the same time it is a dangerous command. I will provide more details when we start working on this one.
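Even before the cleanup automation exists, the read-only side of it looks roughly like the following (pool and filesystem names are placeholders, and the csi.volumes.default omap object is an assumption based on ceph-csi's journal naming; nothing here deletes anything):

```bash
# RBD images in the pool, to compare against the images referenced by existing PVs.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- rbd ls -p <pool>

# CephFS subvolumes in the csi subvolume group.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph fs subvolume ls <fsname> csi

# ceph-csi's volume journal entries for the pool.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- rados -p <pool> listomapkeys csi.volumes.default
```

(The examples go through the rook-ceph-tools toolbox pod, if it is deployed.)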
Print the kernel version of all the plugin pod nodes
Printing the kernel version of the nodes where the cephfs/rbd plugin pods run helps in some debugging cases.
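This one is nearly a one-liner already; a sketch assuming the app=csi-rbdplugin label on the daemonset pods:

```bash
# Kernel version straight from the node status...
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion

# ...or uname -r from each rbd plugin pod, printed next to the node it runs on.
for p in $(kubectl -n rook-ceph get pod -l app=csi-rbdplugin -o name); do
  node=$(kubectl -n rook-ceph get "$p" -o jsonpath='{.spec.nodeName}')
  echo -n "$node: "
  kubectl -n rook-ceph exec "$p" -c csi-rbdplugin -- uname -r
done
```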
Recover from node-lost cases
Some details here. We might also need a command to remove the watcher.
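For reference, the underlying Ceph side of node-lost recovery is roughly the following (via the toolbox; the node IP, pool, and image names are placeholders, and blocklisting should only be done once the node is confirmed down):

```bash
# Fence the lost node so its RBD/CephFS clients are blocked and the volumes can move.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd blocklist add <node-ip>

# Inspect the watchers still held on a specific image.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- rbd status <pool>/<image-name>
```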
Before we implement these commands, we need a design for what will be most helpful for troubleshooting csi issues. For example, the tool could help with questions such as:
Why is my PVC unbound?
Why is my volume not mounting?
Is my cluster health affecting csi provisioning?
And in the output of the tool, do we really want to retrieve container logs or get long dmesg output? I wonder if we should do something more basic like print suggestions for where to run dmesg, print which provisioners are the leaders (with the pod names) so they can look at those logs, or print that there are ceph health issues that would prevent volumes from working. Or if we're going to get full logs, do we dump them in some directory, or where do we put them?
For ceph health, we could print status such as whether mons are in quorum, whether any OSDs are down, whether any PGs are unhealthy, whether all the expected csi pods are running, and so on.
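For instance, that kind of summary could be assembled from checks along these lines (a sketch; assumes the existing ceph passthrough of this plugin and Rook's default csi pod labels):

```bash
kubectl rook-ceph ceph status    # overall health, mon quorum, OSD up/in counts
kubectl rook-ceph ceph pg stat   # unhealthy PGs
kubectl -n rook-ceph get pod -l 'app in (csi-rbdplugin,csi-rbdplugin-provisioner,csi-cephfsplugin,csi-cephfsplugin-provisioner)'
```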
What about some overall functions like this?
kubectl rook-ceph csi health: Overall csi health
kubectl rook-ceph csi health <pvc>: Info about why a specific pvc might not be mounting and which logs might help troubleshoot
kubectl rook-ceph csi blocklist <node>: Block a node that is down to allow the PVs on that node to move to another node.