Verify network connectivity between the CSI pods and the Ceph cluster
This can be done by running curl/ping or by executing ceph commands from the cephfs/rbd plugin container of the provisioner and daemonset pods.
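For illustration, a minimal sketch of what such a check could run, assuming Rook's default namespace and CSI object names (rook-ceph, the csi-rbdplugin-provisioner deployment, the csi-rbdplugin daemonset) and that curl is present in the plugin image:

```bash
# Mon endpoints as recorded by Rook in the rook-ceph-mon-endpoints ConfigMap.
kubectl -n rook-ceph get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}'; echo

# Check that a mon is reachable from a provisioner pod (replace MON_IP with a value from above).
kubectl -n rook-ceph exec deploy/csi-rbdplugin-provisioner -c csi-rbdplugin -- \
  curl -v -m 5 telnet://MON_IP:6789

# Repeat from a nodeplugin (daemonset) pod to cover the per-node network path.
kubectl -n rook-ceph exec ds/csi-rbdplugin -c csi-rbdplugin -- \
  curl -v -m 5 telnet://MON_IP:6789
```

A "Connected to MON_IP" line in the curl output is all we are looking for here; the command itself will time out since the mon does not speak HTTP/telnet.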
Command to get dmesg logs from the node, based on the pod name or PVC name
If a PVC is not attaching to a given pod, we can identify which node the pod is scheduled on, run dmesg from the rbdplugin pod on that node, and print the logs, which helps with debugging.
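A sketch of what that could look like (pod and namespace names are placeholders; assumes the app=csi-rbdplugin label Rook sets on the daemonset pods and that the privileged plugin container can read the kernel log):

```bash
# Which node is the application pod (the one using the PVC) scheduled on?
NODE=$(kubectl -n <app-namespace> get pod <app-pod> -o jsonpath='{.spec.nodeName}')

# Find the csi-rbdplugin pod running on that node.
PLUGIN=$(kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
  --field-selector spec.nodeName="$NODE" -o name)

# Dump the tail of the node's kernel log from that plugin pod.
kubectl -n rook-ceph exec "$PLUGIN" -c csi-rbdplugin -- dmesg | tail -n 100
```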
Command to check for any stale map or mount commands in the CSI plugin pods
More details about which commands to run and where to run them are documented here and here.
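Roughly the kind of check such a command would wrap (a sketch; assumes ps and the rbd CLI are available in the cephcsi image, which they normally are):

```bash
# Look for stuck rbd map / mount / umount processes inside a nodeplugin pod.
kubectl -n rook-ceph exec <csi-rbdplugin-pod> -c csi-rbdplugin -- ps -ef \
  | grep -E 'rbd|mount' | grep -v grep

# List RBD devices still mapped on that node, to compare against the PVs that should be attached.
kubectl -n rook-ceph exec <csi-rbdplugin-pod> -c csi-rbdplugin -- rbd device list
```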
Command to pull the required logs from the leader provisioner and sidecar containers, based on the pod name or the PVC name
We might have two provisioner pods running, and for newcomers it is sometimes hard to find the leader pod and which container to pull the logs from. A simple helper command could pull the required logs (see the sketch after this list):
For PVC create/delete issues, pull logs from the csi-provisioner and csi-rbdplugin/csi-cephfsplugin containers
For snapshot create/delete issues, pull logs from the csi-snapshotter and csi-rbdplugin/csi-cephfsplugin containers
For resize issues, pull logs from the csi-resizer and csi-rbdplugin/csi-cephfsplugin containers
etc.
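A rough sketch of how the helper could find the leader and pull the right logs today, using the leader-election Lease objects the CSI sidecars create (lease names vary by driver and version, so list them first):

```bash
# The CSI sidecars record their leader in coordination Leases; the holderIdentity is the leader pod.
kubectl -n rook-ceph get lease
kubectl -n rook-ceph get lease <provisioner-lease-name> -o jsonpath='{.spec.holderIdentity}'; echo

# Then pull the matching sidecar + plugin logs, e.g. for a PVC create/delete problem:
kubectl -n rook-ceph logs <leader-provisioner-pod> -c csi-provisioner --since=1h
kubectl -n rook-ceph logs <leader-provisioner-pod> -c csi-rbdplugin --since=1h
```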
Command to run ceph commands from the provisioner or nodeplugin containers
This helps in cases where CSI is not returning the expected results but the admin still wants to manually mount or unmount the RBD image, or run rados or ceph fs commands.
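The awkward part today is that the plugin containers do not carry a ceph.conf or keyring, so any helper would need to feed in the mon endpoints and a key. A sketch, with the caveat that the secret and key names below are assumptions that can differ between Rook versions:

```bash
# Mon endpoints and the admin key from the Rook cluster (handle the key with care).
MON=$(kubectl -n rook-ceph get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}' | sed 's/[a-z]*=//g')
KEY=$(kubectl -n rook-ceph get secret rook-ceph-mon -o jsonpath='{.data.ceph-secret}' | base64 -d)

# Run an arbitrary ceph/rbd/rados command from a nodeplugin container.
kubectl -n rook-ceph exec <csi-rbdplugin-pod> -c csi-rbdplugin -- \
  ceph -m "$MON" --id admin --key "$KEY" status
```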
Command to identify and clean up stale resources (RBD images/subvolumes/omap entries)
This is a big topic and needs a lot of automation, but it would be really helpful; at the same time it is a dangerous command. I will provide more details when we start working on this one.
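Even before the cleanup automation exists, the read-only side of it looks roughly like the following (pool and filesystem names are placeholders, and the csi.volumes.default omap object is an assumption based on ceph-csi's journal naming; nothing here deletes anything):

```bash
# RBD images in the pool, to compare against the images referenced by existing PVs.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- rbd ls -p <pool>

# CephFS subvolumes in the csi subvolume group.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph fs subvolume ls <fsname> csi

# ceph-csi's volume journal entries for the pool.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- rados -p <pool> listomapkeys csi.volumes.default
```

(The examples go through the rook-ceph-tools toolbox pod, if it is deployed.)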
Print the kernel version of all the plugin pod nodes
Printing the kernel version of the nodes where the cephfs/rbd plugin pods run helps in some debugging cases.
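This one is nearly a one-liner already; a sketch assuming the app=csi-rbdplugin label on the daemonset pods:

```bash
# Kernel version straight from the node status...
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion

# ...or uname -r from each rbd plugin pod, printed next to the node it runs on.
for p in $(kubectl -n rook-ceph get pod -l app=csi-rbdplugin -o name); do
  node=$(kubectl -n rook-ceph get "$p" -o jsonpath='{.spec.nodeName}')
  echo -n "$node: "
  kubectl -n rook-ceph exec "$p" -c csi-rbdplugin -- uname -r
done
```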
Recover from node-lost cases
Some details here. We might also need a command to remove the watcher.
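For reference, the underlying Ceph side of node-lost recovery is roughly the following (via the toolbox; the node IP, pool, and image names are placeholders, and blocklisting should only be done once the node is confirmed down):

```bash
# Fence the lost node so its RBD/CephFS clients are blocked and the volumes can move.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd blocklist add <node-ip>

# Inspect the watchers still held on a specific image.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- rbd status <pool>/<image-name>
```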
Before we implement these commands, we need a design for what will be most helpful for troubleshooting csi issues. For example, the tool could help with questions such as:
Why is my PVC unbound?
Why is my volume not mounting?
Is my cluster health affecting csi provisioning?
And in the output of the tool, do we really want to retrieve container logs or get long dmesg output? I wonder if we should do something more basic like print suggestions for where to run dmesg, print which provisioners are the leaders (with the pod names) so they can look at those logs, or print that there are ceph health issues that would prevent volumes from working. Or if we're going to get full logs, do we dump them in some directory, or where do we put them?
For ceph health, we could print status such as whether mons are in quorum, whether any OSDs are down, whether any PGs are unhealthy, whether all the expected csi pods are running, and so on.
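For instance, that kind of summary could be assembled from checks along these lines (a sketch; assumes the existing ceph passthrough of this plugin and Rook's default csi pod labels):

```bash
kubectl rook-ceph ceph status    # overall health, mon quorum, OSD up/in counts
kubectl rook-ceph ceph pg stat   # unhealthy PGs
kubectl -n rook-ceph get pod -l 'app in (csi-rbdplugin,csi-rbdplugin-provisioner,csi-cephfsplugin,csi-cephfsplugin-provisioner)'
```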
What about some overall functions like this?
kubectl rook-ceph csi health: Overall csi health
kubectl rook-ceph csi health <pvc>: Info about why a specific pvc might not be mounting and which logs might help troubleshoot
kubectl rook-ceph csi blocklist <node>: Block a node that is down to allow the PVs on that node to move to another node.