rbd-nbd: Add restart support #31687
Conversation
Sync our copy of the nbd-netlink.h header to the current kernel. Signed-off-by: Mike Christie <mchristi@redhat.com>
Move the nl code that sets up the socket part of the msg, so it can be used in the next reconfig reconnect patch. Signed-off-by: Mike Christie <mchristi@redhat.com>
Move find_mapped_dev_by_spec to before the nl functions, so in the next patch it can be used by them. Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Mike Christie <mchristi@redhat.com>
This patch allows us to start a new daemon on a running device. This can be done on a device where the daemon has crashed or if there is another daemon already running and we want to replace the old one with the new one. Signed-off-by: Mike Christie <mchristi@redhat.com>
      goto free_msg;
  }

  if (index >= 0)
This check is not needed. Will fix on next repush.
Am I correct in presuming that …

The end state that we need to reach is that the k8s ceph-csi container should be able to be destroyed / recreated w/o killing or erroring IO to the clients.

Right now in the CSI, the per-node RBD CSI node plugin will kick off a "rbd-nbd" daemon that runs within the CSI pod. However, upon upgrade, k8s will just destroy that pod (and its rbd-nbd daemons) and recreate a new one. Therefore, the CSI driver won't have any persistent data available to it to attempt to restart the "rbd-nbd" daemons (not even the keys to talk to the Ceph cluster). That means the …

If we could use the k8s "hostPath" volume w/ the RBD CSI node plugin to offer a persistent data scratch area, we could record some JSON to allow rbd-nbd to restart all known mounts. I believe the CephFS CSI node plugin uses this strategy (it looks like it currently uses a k8s "emptyDir" volume, which the docs state will get wiped when a pod is destroyed, so I don't know how that survives upgrades). However, that doesn't make it safe from random rbd-nbd daemon crashes since the CSI won't check to see if the daemon is still alive. Therefore, that was one of the design rationales for using a unified daemon for rbd-nbd, so k8s could manage its lifecycle (and restart it if it crashes).

If we therefore end up w/ a rbd-nbd pod per node, we need some way for the RBD CSI node plugin to talk to that other pod. This is where we can use the Ceph admin socket to send commands from the CSI node plugin pod to the rbd-nbd pod on the same node if we just "hostMount" a common directory for the ASOK socket file between the two pods. The CSI could then run "ceph --admin-daemon /path/to/the-rbd-nbd.asok map/unmap …" and the associated command handler in the rbd-nbd daemon could do the required work (it would also need to pass the cluster connection details as command arguments).

Pinging @Madhu-1 and @ShyamsundarR since they can probably poke holes in any k8s / CSI data persistence issues better than I can.
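As a strawman, the per-mapping record could be as small as the sketch below; the state directory, file naming, and field names are invented purely for illustration and are not anything ceph-csi or rbd-nbd define today.

```cpp
#include <fstream>
#include <string>

// Strawman only: drop one small JSON record per mapping into a directory
// that a k8s hostPath volume could back, so a restarted daemon can find
// everything it needs to re-map. All names below are made up.
void record_mapping(const std::string &state_dir,   // hostPath-backed dir
                    const std::string &image_spec,  // e.g. pool/image
                    const std::string &nbd_dev,     // e.g. /dev/nbd0
                    const std::string &keyring) {   // path to creds, not the key
  std::ofstream out(state_dir + "/" + nbd_dev.substr(5) + ".json");
  out << "{\n"
      << "  \"image_spec\": \"" << image_spec << "\",\n"
      << "  \"device\": \"" << nbd_dev << "\",\n"
      << "  \"keyring\": \"" << keyring << "\"\n"
      << "}\n";
}
```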
Yeah, I was waiting for spinics/marc to update so I could add a link. Here it is: https://www.spinics.net/lists/linux-block/msg47054.html

You use NBD_ATTR_SOCKETS to pass down sockets. For a device that has not been configured yet, we will do the initial binding of the passed-in sockets to the device. For reconfigure, we only reconnect the passed-in sockets if we find one that is marked dead. My patch modifies the reconfigure case so we use the socket being passed in instead of the current one even if the current socket is not marked dead, and we restart running commands immediately instead of waiting for them to time out.
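For reference, the userspace side of that reconfigure call looks roughly like the libnl sketch below. Treat it as a simplified illustration only (error handling and the extra timeout/flags attributes are omitted, and it is not the exact rbd-nbd code); the attribute names come from the nbd-netlink.h header this PR syncs.

```cpp
#include <linux/nbd-netlink.h>
#include <netlink/netlink.h>
#include <netlink/genl/genl.h>
#include <netlink/genl/ctrl.h>

// Push a fresh socket fd down to an already configured /dev/nbd<index>.
// Simplified: single socket, no error handling, no timeout attributes.
static int nbd_reconfigure(int index, int sock_fd) {
  struct nl_sock *sock = nl_socket_alloc();
  genl_connect(sock);
  int family = genl_ctrl_resolve(sock, "nbd");

  struct nl_msg *msg = nlmsg_alloc();
  genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family, 0, 0,
              NBD_CMD_RECONFIGURE, 0);
  nla_put_u32(msg, NBD_ATTR_INDEX, index);

  // NBD_ATTR_SOCKETS is a nested list of NBD_SOCK_ITEMs, each carrying a fd.
  struct nlattr *socks = nla_nest_start(msg, NBD_ATTR_SOCKETS);
  struct nlattr *item = nla_nest_start(msg, NBD_SOCK_ITEM);
  nla_put_u32(msg, NBD_SOCK_FD, sock_fd);
  nla_nest_end(msg, item);
  nla_nest_end(msg, socks);

  int ret = nl_send_sync(sock, msg);  // nl_send_sync() consumes msg
  nl_socket_free(sock);
  return ret;
}
```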
When you say it will destroy the pod, will it have done the equivalent of a clean unmap, i.e. stop rbd-nbd and delete the kernel device so /dev/nbd accesses return failure? Or do you mean we would do something like leave the kernel device intact and running (it basically just queues IO until a timeout or rbd-nbd restarts), and then just stop rbd-nbd?
I thought at some point we said we would be able to start a new rbd-nbd instance while the old one was still running. So that is why I switched from adding a new sysfs file to using the proc mapping used by find_mapped_dev_by_spec.
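Roughly, that proc-based lookup amounts to walking the nbd devices in sysfs, reading which pid holds each one, and matching that pid's /proc cmdline against the image spec. A simplified sketch (not the actual find_mapped_dev_by_spec code):

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <optional>
#include <string>

// Simplified illustration of matching an image spec to an existing
// /dev/nbdX via sysfs + /proc; not the real rbd-nbd implementation.
std::optional<std::string> find_nbd_dev_for_spec(const std::string &spec) {
  namespace fs = std::filesystem;
  for (const auto &ent : fs::directory_iterator("/sys/block")) {
    const std::string name = ent.path().filename();
    if (name.rfind("nbd", 0) != 0)
      continue;
    std::ifstream pid_file(ent.path() / "pid");  // only present while configured
    std::string pid;
    if (!(pid_file >> pid))
      continue;                                  // device not in use
    std::ifstream cmd("/proc/" + pid + "/cmdline");
    std::string cmdline((std::istreambuf_iterator<char>(cmd)),
                        std::istreambuf_iterator<char>());
    // cmdline is NUL-separated; a substring match is enough for a sketch.
    if (cmdline.find(spec) != std::string::npos)
      return "/dev/" + name;
  }
  return std::nullopt;
}
```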
I don't think it does it cleanly. It should send a SIGTERM to the CSI plugin process since that's all it knows about. However, the …
I think that should be our end-goal. Worst case, IO would be hung for X seconds while k8s restarts the pod. In the best case, we would run at least 2 concurrent rbd-nbd pods to support the built-in failover of the nbd driver. That would allow us to upgrade one pod at a time and avoid IO hangs.
We wouldn't get the coordination at that level to say "reconnect NBD devices A, B, and C". You would have different pods that cannot see each other's processes (due to the isolated "/proc" namespace), so the only way they could sync up would be through files on a shared node-hosted file system or via those ASOK commands. I.e., the CSI could concurrently call "map" ASOK commands on all visible ASOK sockets, and upon startup a new rbd-nbd pod could ask the other pod for all block devices it currently has mapped so that it too could attach to them.
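Transport-wise that is nothing more exotic than a Unix domain socket in a hostPath directory both pods mount. A sketch of the CSI-side call is below; it ignores Ceph's real admin socket framing, and the socket path and command string would be placeholders:

```cpp
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>
#include <string>

// Placeholder sketch: send a command string over a Unix domain socket that
// both pods see via a shared hostPath directory, then read one reply. This
// only shows the plumbing between the pods, not Ceph's admin socket protocol.
std::string send_asok_command(const std::string &sock_path,
                              const std::string &command) {
  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, sock_path.c_str(), sizeof(addr.sun_path) - 1);
  if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) < 0) {
    close(fd);
    return {};
  }
  write(fd, command.c_str(), command.size());
  char buf[4096];
  ssize_t n = read(fd, buf, sizeof(buf));  // a single read is enough for a sketch
  close(fd);
  return n > 0 ? std::string(buf, n) : std::string();
}
```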
Yes, ceph-csi won't have any persistent data.
We have not tested upgrade for the CephFS FUSE client; that PR was sent by the user and it's not tested, as we default to using the kernel client for CephFS. Are we planning to store credentials to talk to the backend Ceph cluster here, or just some plain data?

Even if we have multiple CSI driver nodeplugins running on a node, is running one rbd-nbd enough, or do we need to run rbd-nbd per nodeplugin DaemonSet pod on the node?
We would have to store credentials, either in a plain file, or we would need a way to pass credentials between rbd-nbd daemons via the domain socket they can export.

Do we potentially run multiple RBD CSI node plugins on the same node? If we did, I don't think it would impact the fact that we would, end-state, want at least 2 rbd-nbd pods per node, since each CSI pod would have the domain socket hostPath mapped into itself and could therefore talk to each rbd-nbd daemon.
The last time I tested this feature in CephFS, it did not work as intended. The mounts are reestablished (by the CSI plugin), but the application pod does not see the data. I suspect some level of multiple bind mounts and mount propagation is causing this not to work. I would need to experiment with this a little, i.e. reestablish a mount point backed by an rbd device and see if the application pod regains access, before commenting on whether this scheme works for CSI as such.
The CephFS feature to re-mount CephFS FUSE mounts does not work as intended; details are updated here. NOTE: This does not mean CSI cannot stash information as such and reuse it to map images for RBD. It is just that, in the CephFS case, the FUSE process backing the mount for the application pod is stale, and creating a new CephFS FUSE instance does not change what the application pod is using, hence it is not recoverable.
Played around with krbd and rbd-nbd on CentOS 7, attempting to bring down an rbd-nbd device and then restart it (while attempting a touch of a file from the application pod in the interim). The above does not work as such, but I guess that is what we intend to fix with this patch. IOW, my understanding is that if I used this patch in conjunction with the nbd kernel changes, and executed the map again, I would have been able to continue IO.

The useful facts gathered from my experiment were: …

Here is some (hopefully) useful data: …
Ack! On terminate (gracefully or otherwise) the CSI nodeplugin does not ensure all mapped devices are unmapped, etc. As a matter of fact it cannot unmap, as it retains no such state, and is not expected to as well. So the device would still be present, orphaned of sorts, as the user space rbd-nbd daemon would have terminated.
Ack! We would leave the kernel device intact. The commands and comments seem to allude to the same, noting the requirement that we need to map the same image to the same nbd device, as pods using the image are already bound to the specific nbd device. Further, from the CSI perspective we can provide with accuracy the RBD image name, but most probably not the nbd device name that it was mapped to initially.
Just wanted to sync us up on some details, because how nbd works by default is a little odd.

If you send SIGTERM/SIGINT to rbd-nbd, it unmaps the image. It's the same as running "rbd-nbd unmap someimage". As far as the nbd kernel module is concerned the device is deleted. The rbd-nbd unmapping results in the kernel performing its nbd disconnection process. By default, the nbd kernel module does not delete the /dev/nbd device node and related kernel structs when a device is unmapped. That's a result of how the old ioctl interface worked. The device, though, is considered removed / cleaned up, and will return IO errors if accessed. If you later run "rbd-nbd map someimage", the nbd kernel module just loops over these cleaned-up devices, selects the first unused one and reuses it for the newly mapped device. There are some flags that can be set so that when the unmap operation is run and the kernel performs an nbd disconnect operation, the /dev/nbd node and all the related kernel structs are deleted, similar to what you see with modern block devices.

If you were to send SIGKILL to rbd-nbd, then only rbd-nbd would be killed. The daemon would just exit and it would leave the devices in the kernel running. If you did a R/W to /dev/nbd0 it would accept the IO, and either hang since rbd-nbd is not up, or report failure after N seconds if the nbd cmd timeout was set.

So if we wanted to leave the /dev/nbd$N device node intact in a way that IO is not going to immediately fail while rbd-nbd is restarting, then right now you would need to use SIGKILL. I was going to add a new signal handler to tell rbd-nbd to stop but leave devices running, so you do not have to use SIGKILL.
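The handler I have in mind would look roughly like the sketch below; SIGUSR1 is just a stand-in for whichever signal we end up picking, and the shutdown-path wiring is only sketched in comments.

```cpp
#include <signal.h>
#include <atomic>

// Sketch of a "detach without disconnect" handler: exit the daemon but skip
// the nbd disconnect, so the kernel device stays configured and queues IO
// until a replacement rbd-nbd attaches. SIGUSR1 is a placeholder choice.
static std::atomic<bool> detach_requested{false};

static void handle_detach(int) {
  detach_requested = true;  // checked by the main loop's shutdown path
}

static void install_detach_handler() {
  struct sigaction sa = {};
  sa.sa_handler = handle_detach;
  sigaction(SIGUSR1, &sa, nullptr);
}

// In the shutdown path (pseudo-flow):
//   if (detach_requested)  -> tear down the local server threads only
//   else                   -> also issue the nbd disconnect as today
```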
Does the rbd-nbd container have access to sysfs? If so, and it makes it easier, I have a patch where for each nbd device we have a sysfs file that contains the settings it was mapped/created with, like the proc cmdline file. If you did a SIGKILL on rbd-nbd so the kernel device is left intact, then you restart rbd-nbd with the --replace option but no image/dev argument, it just loops over /sys/block/nbd$N, checks if there is a rbd-nbd process already running for it, and if not, starts one. The nasty part is something doing the: …
When we have the single daemon done, it would just restart, loop over sysfs, do the check, then start a thread per device. So the approach is less crazy then :) We were not sure if the patch would be accepted upstream, though. I can post it up if we want to go this route.
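For what it's worth, that no-argument --replace pass would be little more than the walk sketched below. The settings attribute only exists in that unposted kernel patch, so its name here, and the respawn command line, are placeholders.

```cpp
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <string>

// Placeholder sketch of the restart pass: walk /sys/block/nbd*, read the
// (hypothetical) per-device settings attribute from the proposed kernel
// patch, and respawn rbd-nbd for any configured device whose daemon died.
void restart_orphaned_devices() {
  namespace fs = std::filesystem;
  for (const auto &ent : fs::directory_iterator("/sys/block")) {
    const std::string name = ent.path().filename();
    if (name.rfind("nbd", 0) != 0)
      continue;
    std::ifstream settings(ent.path() / "backing_settings");  // hypothetical attr
    std::string args;
    if (!std::getline(settings, args) || args.empty())
      continue;                                  // never configured by rbd-nbd
    std::ifstream pid_file(ent.path() / "pid");
    std::string pid;
    if (pid_file >> pid && fs::exists("/proc/" + pid))
      continue;                                  // a daemon is still attached
    // Respawn using the recorded settings (which would include the image or
    // device argument); the exact command line here is illustrative only.
    std::string cmd = "rbd-nbd map " + args + " --replace";
    std::system(cmd.c_str());
  }
}
```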
Yes, it will need to be a privileged container so it will have full access to the host.
Does this exist now or is it a patch suggestion for the upstream kernel?
We definitely don't want to wait for a year+ if we don't need to. If we can get something (more or less) functional, let's start there and improve it as time goes on.
Yeah, this would be something we need to push upstream then port downstream so it would take a while.
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
@dillaman This is not resolved yet, is it?
Correct -- still in-progress.
The following patches allow us to restart rbd-nbd on a running /dev/nbd device while IO is running. The new map option --replace can be used if there is a running rbd-nbd instance or if it has crashed. Example:
rbd-nbd map --try-netlink image_name --replace
or
rbd-nbd map --try-netlink /dev/nbd0 --replace