
rbd-nbd: Add restart support #31687

Closed

wants to merge 5 commits into from

Conversation

mikechristie

The following patches allow us to restart rbd-nbd on a running /dev/nbd device while IO is running. The new map option --replace can be used if there is a running rbd-nbd instance or if it has crashed. Example:

rbd-nbd map --try-netlink image_name --replace
or
rbd-nbd map --try-netlink /dev/nbd0 --replace

Mike Christie added 5 commits November 15, 2019 23:36
Sync our copy of the nbd-netlink.h header to the current kernel.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Move the nl code that sets up the socket part of the msg, so it can be
used in the next reconfig reconnect patch.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Move find_mapped_dev_by_spec to before the nl functions, so in the next
patch it can be used by them.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Mike Christie <mchristi@redhat.com>
This patch allows us to start a new daemon on a running device. This can
be done on a device where the daemon has crashed or if there is another
daemon already running and we want to replace the old one with the new
one.

Signed-off-by: Mike Christie <mchristi@redhat.com>
goto free_msg;
}

if (index >= 0)
Author

This check is not needed. Will fix on next repush.

@dillaman

dillaman commented Nov 18, 2019

Am I correct in presuming that NBD_ATTR_SWAP_SOCKETS is some new command that you have proposed for upstream nbd? I couldn't find a matching patch via Google. What was wrong w/ NBD_ATTR_SOCKETS?

The end state that we need to reach is that the k8s ceph-csi container should be able to be destroyed / recreated w/o killing or erroring IO to the clients. Right now in the CSI, the per-node RBD CSI node plugin will kick off a "rbd-nbd" daemon that runs within the CSI pod. However, upon upgrade, k8s will just destroy that pod (and its rbd-nbd daemons) and recreate a new one. Therefore, the CSI driver won't have any persistent data available to it to attempt to restart the "rbd-nbd" daemons (not even the keys to talk to the Ceph cluster). That means the find_mapped_dev_by_spec method will fail to find a match since there won't be any running daemons.

If we could use the k8s "hostPath" volume w/ the RBD CSI node plugin to offer a persistent data scratch area, we could record some json to allow rbd-nbd to restart all known mounts. I believe the CephFS CSI node plugin uses this strategy (looks like it currently uses a k8s "emptyDir" volume which the docs state will get wiped when a pod is destroyed so I don't know how that survives upgrades). However, that doesn't make it safe from random rbd-nbd daemon crashes since the CSI won't check to see if the daemon is still alive. Therefore, that was one of the design rationales for using a unified daemon for rbd-nbd so k8s could manage its lifecycle (and restart it if it crashes).

If we therefore end up w/ a rbd-nbd pod per node, we need some way for the RBD CSI node plugin to talk to that other pod. This is where we can use the Ceph admin socket to send commands from the CSI node plugin pod to the rbd-nbd pod on the same node if we just "hostMount" a common directory for the ASOK socket file between the two pods. The CSI could then run "ceph --admin-daemon /path/to/the-rbd-nbd.asok map/unmap " and the associated command handler in the rbd-nbd daemon could do the required work (it would also need to pass the cluster connection details as command arguments).
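
For illustration only (rbd-nbd registers no such ASOK commands today, and the socket path and command names here are hypothetical), the flow being described would be roughly:

ceph --admin-daemon /shared/rbd-nbd-asok/rbd-nbd.asok map ...
ceph --admin-daemon /shared/rbd-nbd-asok/rbd-nbd.asok unmap ...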

Pinging @Madhu-1 and @ShyamsundarR since they can probably poke holes in any k8s / CSI data persistence issues better than I can.

@mikechristie
Author

Am I correct in presuming that NBD_ATTR_SWAP_SOCKETS is some new command that you have proposed for upstream nbd? I couldn't find a matching patch via Google. What was wrong w/ NBD_ATTR_SOCKETS?

Yeah, I was waiting for spinics/marc to update so I could add a link. Here it is:

https://www.spinics.net/lists/linux-block/msg47054.html

You use NBD_ATTR_SOCKETS to pass down sockets. For a device that has not been configured yet we will do the initial binding of the passed in sockets to the device. For reconfigure, we only reconnect the passed in sockets if we find one that is marked dead.

My patch modifies the reconfigure case so that we use the socket being passed in instead of the current one, even if the current socket is not marked dead, and we restart running commands immediately instead of waiting for them to time out.
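
In case it helps reviewers, here is a minimal sketch of what a reconfigure message carrying a replacement socket looks like when built with libnl-genl. This is illustrative only, not the exact code in this PR: error handling is omitted, nbd_index and new_sock_fd are placeholders, and NBD_ATTR_SWAP_SOCKETS is the attribute proposed in the kernel patch linked above, so it is not in mainline headers and is left commented out.

#include <linux/nbd-netlink.h>
#include <netlink/netlink.h>
#include <netlink/genl/genl.h>
#include <netlink/genl/ctrl.h>

// Ask the kernel to reconfigure /dev/nbd<nbd_index> with a new socket fd.
int nbd_reconfigure_socket(int nbd_index, int new_sock_fd) {
  struct nl_sock *sock = nl_socket_alloc();
  genl_connect(sock);
  int family = genl_ctrl_resolve(sock, NBD_GENL_FAMILY_NAME);

  struct nl_msg *msg = nlmsg_alloc();
  genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family, 0, 0,
              NBD_CMD_RECONFIGURE, NBD_GENL_VERSION);
  nla_put_u32(msg, NBD_ATTR_INDEX, nbd_index);

  // NBD_ATTR_SOCKETS carries the new connection; today the kernel only
  // swaps it in during reconfigure if the existing socket is marked dead.
  struct nlattr *socks = nla_nest_start(msg, NBD_ATTR_SOCKETS);
  struct nlattr *item = nla_nest_start(msg, NBD_SOCK_ITEM);
  nla_put_u32(msg, NBD_SOCK_FD, new_sock_fd);
  nla_nest_end(msg, item);
  nla_nest_end(msg, socks);

  // Proposed in the linked kernel patch (not upstream): tell the kernel to
  // swap in the new socket and requeue in-flight commands even if the old
  // socket is still alive.
  // nla_put_flag(msg, NBD_ATTR_SWAP_SOCKETS);

  int ret = nl_send_sync(sock, msg);  // sends the message and waits for the ACK
  nl_socket_free(sock);
  return ret;
}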

The end state that we need to reach is that the k8s ceph-csi container should be able to be destroyed / recreated w/o killing or erroring IO to the clients. Right now in the CSI, the per-node RBD CSI node plugin will kick off a "rbd-nbd" daemon that runs within the CSI pod. However, upon upgrade, k8s will just destroy that pod (and its rbd-nbd daemons) and recreate a new one.

When you say it will destroy the pod, will it have done the equivalent of

// stop rbd-nbd and delete the kernel device so /dev/nbd accesses return failure:
rbd-nbd unmap someimage

?

Or do you mean we would do something like leave the kernel device intact and running (it basically just queues IO until a timeout or rbd-nbd restarts), then just stop rbd-nbd?

Therefore, the CSI driver won't have any persistent data available to it to attempt to restart the "rbd-nbd" daemons (not even the keys to talk to the Ceph cluster). That means the find_mapped_dev_by_spec method will fail to find a match since there won't be any running daemons.

I thought at some point we said we would be able to start a new rbd-nbd instance while the old one was still running. So that is why I switched from adding a new sysfs file to using the proc mapping used by find_mapped_dev_by_spec.
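
For context, the "proc mapping" being referred to works roughly like the sketch below (illustrative only, not the actual find_mapped_dev_by_spec code; error handling is omitted): each configured /dev/nbdN exposes the owning daemon's pid via sysfs, and that pid's /proc/<pid>/cmdline identifies the mapped image.

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Returns the owning process cmdline for /dev/nbd<index>, or "" if the
// device has no daemon attached.
static std::string cmdline_for_nbd(int index) {
  std::ifstream pid_file("/sys/block/nbd" + std::to_string(index) + "/pid");
  int pid;
  if (!(pid_file >> pid))
    return "";  // device is not configured

  // Note: from inside a fresh pod with an isolated pid namespace, the old
  // daemon's /proc/<pid> is not visible, which is the concern raised above.
  std::ifstream cmd_file("/proc/" + std::to_string(pid) + "/cmdline");
  std::ostringstream cmd;
  cmd << cmd_file.rdbuf();
  std::string s = cmd.str();
  for (char &c : s)
    if (c == '\0') c = ' ';  // cmdline is NUL-separated
  return s;
}

int main() {
  for (int i = 0; i < 16; ++i) {
    std::string cmd = cmdline_for_nbd(i);
    if (!cmd.empty())
      std::cout << "/dev/nbd" << i << " -> " << cmd << "\n";
  }
}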

@dillaman

dillaman commented Nov 18, 2019

The end state that we need to reach is that the k8s ceph-csi container should be able to be destroyed / recreated w/o killing or erroring IO to the clients. Right now in the CSI, the per-node RBD CSI node plugin will kick off a "rbd-nbd" daemon that runs within the CSI pod. However, upon upgrade, k8s will just destroy that pod (and its rbd-nbd daemons) and recreate a new one.

When you say it will destroy the pod, will it have done the equivalent of

// stop rbd-nbd and delete the kernel device so /dev/nbd accesses return failure:
rbd-nbd unmap someimage

I don't think it does it cleanly. It should send a SIGTERM to the CSI plugin process since that's all it knows about. However, the rbd-nbd daemon's parent will be process 1 -- which just so happens to be the CSI plugin process (since it's all running in a namespace). That kills the rbd-nbd daemon (you see "block nbd0: shutting down sockets" in dmesg due to the sig handler), but that block device would still be bind-mounted (w/ or w/o a filesystem on top) into another user pod that is still running.

Or do you mean we would do something like leave the kernel device intact and running (it basically just queues IO until a timeout or rbd-nbd restarts), then just stop rbd-nbd?

I think that should be our end-goal. Worst case, IO would be hung for X seconds while k8s restarts the pod. In the best case, we would run at least 2 concurrent rbd-nbd pods to support the built-in failover of the nbd driver. That would allow us to upgrade one pod at a time and avoid IO hangs.

I thought at some point we said we would be able to start a new rbd-nbd instance while the old one was still running. So that is why I switched from adding a new sysfs file to using the proc mapping used by find_mapped_dev_by_spec.

We wouldn't get the coordination at that level to say "reconnect NBD devices A, B, and C". You would have different pods that cannot see each other's processes (due to the isolated "/proc" namespace), so the only way they could sync up would be through files on a shared node-hosted file system or via those ASOK commands. I.e. the CSI could concurrently call "map" ASOK commands on all visible ASOK sockets, and upon startup a new rbd-nbd pod could ask the other pod for all block devices it currently has mapped so that it too could attach to them.

@Madhu-1
Contributor

Madhu-1 commented Nov 19, 2019

Am I correct in presuming that NBD_ATTR_SWAP_SOCKETS is some new command that you have proposed for upstream nbd? I couldn't find a matching patch via Google. What was wrong w/ NBD_ATTR_SOCKETS?

The end state that we need to reach is that the k8s ceph-csi container should be able to be destroyed / recreated w/o killing or erroring IO to the clients. Right now in the CSI, the per-node RBD CSI node plugin will kick off a "rbd-nbd" daemon that runs within the CSI pod. However, upon upgrade, k8s will just destroy that pod (and its rbd-nbd daemons) and recreate a new one. Therefore, the CSI driver won't have any persistent data available to it to attempt to restart the "rbd-nbd" daemons (not even the keys to talk to the Ceph cluster). That means the find_mapped_dev_by_spec method will fail to find a match since there won't be any running daemons.

Yes, ceph-csi won't have any persistent data.

If we could use the k8s "hostPath" volume w/ the RBD CSI node plugin to offer a persistent data scratch area, we could record some json to allow rbd-nbd to restart all known mounts. I believe the CephFS CSI node plugin uses this strategy (looks like it currently uses a k8s "emptyDir" volume which the docs state will get wiped when a pod is destroyed so I don't know how that survives upgrades). However, that doesn't make it safe from random rbd-nbd daemon crashes since the CSI won't check to see if the daemon is still alive. Therefore, that was one of the design rationales for using a unified daemon for rbd-nbd so k8s could manage its lifecycle (and restart it if it crashes).

We have not tested upgrade for the CephFS FUSE client; the PR was sent by a user and it's not tested, as we default to the kernel client for CephFS.

Are we planning to store credentials to talk to the backend Ceph cluster here, or just some plain data?

If we therefore end up w/ a rbd-nbd pod per node, we need some way for the RBD CSI node plugin to talk to that other pod. This is where we can use the Ceph admin socket to send commands from the CSI node plugin pod to the rbd-nbd pod on the same node if we just "hostMount" a common directory for the ASOK socket file between the two pods. The CSI could then run "ceph --admin-daemon /path/to/the-rbd-nbd.asok map/unmap " and the associated command handler in the rbd-nbd daemon could do the required work (it would also need to pass the cluster connection details as command arguments).

Even if we have multiple CSI driver nodeplugins running on a node, is running one rbd-nbd enough, or do we need to run rbd-nbd per nodeplugin DS pod on the node?

Pinging @Madhu-1 and @ShyamsundarR since they can probably poke holes in any k8s / CSI data persistence issues better than I can.

@dillaman

Are we planning to store credentials to talk to the backend Ceph cluster here, or just some plain data?

We would have to store credentials, either in a plain file, or we would need a way to pass credentials between rbd-nbd daemons via the domain socket they can export.

Even if we have multiple CSI driver nodeplugins running on a node, is running one rbd-nbd enough, or do we need to run rbd-nbd per nodeplugin DS pod on the node?

Do we potentially run multiple RBD CSI node plugins on the same node? If we did, I don't think it would impact the fact that we would, end-state, want at least 2 rbd-nbd pods per node since each CSI pod would have the domain socket hostPath mapped into itself and could therefore talk to each rbd-nbd daemon.

@ShyamsundarR
Contributor

If we could use the k8s "hostPath" volume w/ the RBD CSI node plugin to offer a persistent data scratch area, we could record some json to allow rbd-nbd to restart all known mounts. I believe the CephFS CSI node plugin uses this strategy (looks like it currently uses a k8s "emptyDir" volume which the docs state will get wiped when a pod is destroyed so I don't know how that survives upgrades). However, that doesn't make it safe from random rbd-nbd daemon crashes since the CSI won't check to see if the daemon is still alive. Therefore, that was one of the design rationales for using a unified daemon for rbd-nbd so k8s could manage its lifecycle (and restart it if it crashes).

We have not tested upgrade for the CephFS FUSE client; the PR was sent by a user and it's not tested, as we default to the kernel client for CephFS.

The last time I tested this feature in CephFS, it did not work as intended. The mounts are re-established (by the CSI plugin) but the application pod does not see the data. I suspect some combination of multiple bind mounts and mount propagation is causing this not to work.

I would need to experiment with this a little, i.e. re-establish a mount point backed by an RBD device and see if the application pod regains access, before commenting on whether this scheme works for CSI as such.

@ShyamsundarR
Contributor

If we could use the k8s "hostPath" volume w/ the RBD CSI node plugin to offer a persistent data scratch area, we could record some json to allow rbd-nbd to restart all known mounts. I believe the CephFS CSI node plugin uses this strategy (looks like it currently uses a k8s "emptyDir" volume which the docs state will get wiped when a pod is destroyed so I don't know how that survives upgrades). However, that doesn't make it safe from random rbd-nbd daemon crashes since the CSI won't check to see if the daemon is still alive. Therefore, that was one of the design rationales for using a unified daemon for rbd-nbd so k8s could manage its lifecycle (and restart it if it crashes).

We have not tested upgrade for the CephFS FUSE client; the PR was sent by a user and it's not tested, as we default to the kernel client for CephFS.

The last time I tested this feature in CephFS, it did not work as intended. The mounts are re-established (by the CSI plugin) but the application pod does not see the data. I suspect some combination of multiple bind mounts and mount propagation is causing this not to work.

The CephFS feature to re-mount CephFS FUSE mounts does not work as intended; details are updated here.

NOTE: This does not mean CSI cannot stash such information and reuse it to map images for RBD; just that, in the CephFS case, the FUSE process backing the mount for the application pod is stale, and creating a new CephFS FUSE instance does not change what the application pod is using, hence it is not recoverable.

I would need to experiment with this a little, i.e. re-establish a mount point backed by an RBD device and see if the application pod regains access, before commenting on whether this scheme works for CSI as such.

Played around with krbd and rbd-nbd on CentOS 7, attempting to bring down an rbd-nbd device and then restart it (while attempting to touch a file from the application pod in the interim).

The above does not work as such, but I guess that is what we intend to fix with this patch. IOW, my understanding is that if I used this patch in conjunction with the nbd kernel changes and executed the map again, I would have been able to continue IO.

The useful facts gathered from my experiment were:

  • We leave the kernel nbd device intact
    • Because the userspace rbd-nbd process is killed and we are not really calling any cleanup (like unmap) when shutting down, gracefully or otherwise
  • From a CSI perspective the mounting of stage and publish paths is really moot if we are re-establishing the nbd mapping, as the pod is using the nbd device directly due to the bind mount (this is academic from this patch's perspective, as it is more CSI-specific).

Here is some (hopefully) useful data:

  • container runtime configuration passed to runc,
    {"destination":"/pvc-rbd-mnt","type":"bind","source":"/var/lib/kubelet/pods/8d968c28-0786-40f3-a21d-6dee489bf202/volumes/kubernetes.io~csi/pvc-23c26687-0caa-4e73-910b-20ff58790f82/mount","options":["rbind","rprivate"]}

    • IOW, we initially bind mount the CSI publish path to the requested path within the application pod
  • mount appears within the container as,
    /dev/nbd0 on /pvc-rbd-mnt type ext4 (rw,seclabel,relatime,data=ordered)

  • The nbd process was /usr/bin/rbd-nbd --id csi-rbd-node -m 10.103.87.84:6789,10.102.134.84:6789,10.101.116.78:6789 --keyfile=/tmp/csi/keys/keyfile-510533374 map replicapool/csi-vol-b0d3c161-0c92-11ea-81b0-cad321f16c9e; I killed it and attempted IO, which I think failed badly

  • When I restarted the rbd-nbd process, it strangely mapped it back to nbd0 (I was assuming it would pick a new nbdx device number). Just noting the observation here, for the time being.

@ShyamsundarR
Contributor

The end state that we need to reach is that the k8s ceph-csi container should be able to be destroyed / recreated w/o killing or erroring IO to the clients. Right now in the CSI, the per-node RBD CSI node plugin will kick off a "rbd-nbd" daemon that runs within the CSI pod. However, upon upgrade, k8s will just destroy that pod (and its rbd-nbd daemons) and recreate a new one.

When you say it will destroy the pod, will it have done the equivalent of
// stop rbd-nbd and delete the kernel device so /dev/nbd accesses return failure:
rbd-nbd unmap someimage

I don't think it does it cleanly. It should send a SIGTERM to the CSI plugin process since that's all it knows about. However, the rbd-nbd daemon's parent will be process 1 -- which just so happens to be the CSI plugin process (since it's all running in a namespace). That kills the rbd-nbd daemon (you see "block nbd0: shutting down sockets" in dmesg due to the sig handler), but that block device would still be bind-mounted (w/ or w/o a filesystem on top) into another user pod that is still running.

Ack! On terminate (graceful or otherwise) the CSI nodeplugin does not ensure all mapped devices are unmapped, etc. As a matter of fact it cannot unmap, as it retains no such state, and is not expected to either.

So the device would still be present, orphaned of sorts, as the userspace rbd-nbd daemon would have terminated.

Or do you mean we would do something like leave the kernel device intact and running (it basically just queues IO until a timeout or rbd-nbd restarts), then just stop rbd-nbd?

I think that should be our end-goal. Worst case, IO would be hung for X seconds while k8s restarts the pod. In the best case, we would run at least 2 concurrent rbd-nbd pods to support the built-in failover of the nbd driver. That would allow us to upgrade one pod at a time and avoid IO hangs.

Ack! we would leave the kernel device intact.

Although the commands and comments seem to allude to the same, noting the requirement that we need to map the same image back to the same nbd device, as pods using the image are already bound to that specific nbd device. Further, from the CSI perspective we can provide the RBD image name with accuracy, but most probably not the nbd device name it was initially mapped to.
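
(The two map forms in the PR description seem to line up with this: "rbd-nbd map --try-netlink image_name --replace" when we can name the image, and "rbd-nbd map --try-netlink /dev/nbd0 --replace" when we need to pin the mapping to the original nbd device.)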

@mikechristie
Author

The end state that we need to reach is that the k8s ceph-csi container should be able to be destroyed / recreated w/o killing or erroring IO to the clients. Right now in the CSI, the per-node RBD CSI node plugin will kick off a "rbd-nbd" daemon that runs within the CSI pod. However, upon upgrade, k8s will just destroy that pod (and its rbd-nbd daemons) and recreate a new one.

When you say it will destroy the pod, will it have done the equivalent of
// stop rbd-nbd and delete the kernel device so /dev/nbd accesses return failure:
rbd-nbd unmap someimage

I don't think it does it cleanly. It should send a SIGTERM to the CSI plugin process since that's all it knows about. However, the rbd-nbd daemon's parent will be process 1 -- which just so happens to be the CSI plugin process (since it's all running in a namespace). That kills the rbd-nbd daemon (you see "block nbd0: shutting down sockets" in dmesg due to the sig handler), but that block device would still be bind-mounted (w/ or w/o a filesystem on top) into another user pod that is still running.

Ack! On terminate (graceful or otherwise) the CSI nodeplugin does not ensure all mapped devices are unmapped, etc. As a matter of fact it cannot unmap, as it retains no such state, and is not expected to either.

So the device would still be present, orphaned of sorts, as the userspace rbd-nbd daemon would have terminated.

Just wanted to sync us up on some details, because how nbd works by default is a little odd.

If you send SIGTERM/SIGINT to rbd-nbd, it unmaps the image. It's the same as running

rbd-nbd unmap someimage

As far as the nbd kernel module is concerned the device is deleted. The rbd-nbd unmapping results in the kernel performing its nbd disconnection process. By default, the nbd kernel module does not delete the /dev/nbd device node and related kernel structs when a device is unmapped. It's a result of how the old ioctl interface worked. The device, though, is considered removed / cleaned up, and will return IO errors if accessed. If you later run "rbd-nbd map someimage" then the nbd kernel module just loops over these cleaned-up devices, selects the first unused one, and reuses it for the newly mapped device.

There are some flags that can be set so that, when the unmap operation is run and the kernel performs an nbd disconnect, the /dev/nbd node and all the related kernel structs are deleted, similar to what you see with modern block devices.

If you were to send SIGKILL to rbd-nbd, then only rbd-nbd would be killed. The daemon would just exit and it would leave the devices in the kernel running. If you did a R/W to /dev/nbd0 it would accept the IO and either hang, since rbd-nbd is not up, or report failure after N seconds if the nbd command timeout was set.

So if we wanted to leave the /dev/nbd$N device node intact, in a way that IO is not going to immediately fail while rbd-nbd is restarting, then right now you would need to SIGKILL it. I was going to add a new signal handler to tell rbd-nbd to stop but leave devices running, so you do not have to use SIGKILL.
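
To make that concrete, here is a minimal sketch of the handler split I have in mind (illustrative only, not the actual rbd-nbd code; using SIGUSR1 as the "detach" signal is just an assumption):

#include <atomic>
#include <csignal>

static std::atomic<bool> stop_requested{false};
static std::atomic<bool> disconnect_on_exit{true};

// SIGTERM/SIGINT today: tear the mapping down, equivalent to "rbd-nbd unmap".
static void handle_terminate(int) {
  disconnect_on_exit = true;
  stop_requested = true;
}

// Hypothetical new signal: exit the daemon but leave the kernel device
// configured, so a replacement daemon can take it over with --replace.
static void handle_detach(int) {
  disconnect_on_exit = false;
  stop_requested = true;
}

int main() {
  std::signal(SIGTERM, handle_terminate);
  std::signal(SIGINT, handle_terminate);
  std::signal(SIGUSR1, handle_detach);  // assumption: not implemented today
  // ... the daemon main loop would poll stop_requested and, on shutdown,
  // only send the nbd disconnect when disconnect_on_exit is still true ...
}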

@mikechristie
Author

Or do you mean we would do something like leave the kernel device intact and running (it basically just queues IO until a timeout or rbd-nbd restarts), then just stop rbd-nbd?

I think that should be our end-goal. Worst case, IO would be hung for X seconds while k8s restarts the pod. In the best case, we would run at least 2 concurrent rbd-nbd pods to support the built-in failover of the nbd driver. That would allow us to upgrade one pod at a time and avoid IO hangs.

Ack! we would leave the kernel device intact.

Although the commands and comments seem to allude to the same noting the requirement that, we need to map the same image to the same ndb device, as pods using the image are already bound to the specific nbd device. Further, from the CSI perspective we can provide with accuracy the RBD image name, but most possibly not the nbd device name that it was mapped to initially.

Does the rbd-nbd container have access to sysfs? If so and it makes it easier, I have a patch where for each nbd device we have a sysfs file that contains the settings it was mapped/created with, like the proc cmdline file. If you SIGKILL rbd-nbd so the kernel device is left intact, and then restart rbd-nbd with the --replace option but no image/dev argument, it just loops over /sys/block/nbd$N, checks if there is an rbd-nbd process already running for it, and if not, starts one. The nasty part is something doing the:

while ("rbd-nbd --replace" != "no devices need to be restarted)
    ;

When we have the single daemon done, it would just restart, loop over sysfs, do the check, then start a thread per device. So the approach is less crazy then :)

We were not sure if the patch would be accepted upstream though. I can post it up if we want to go this route.
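
To sketch what that scan could look like (illustrative only; it keys off the existing /sys/block/nbd$N/pid attribute rather than the proposed per-device settings file, and error handling is omitted):

#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

// A device is "orphaned" if the nbd driver still has it configured (the pid
// attribute exists) but the recorded daemon is no longer running.
static bool is_orphaned(const fs::path &nbd_dir) {
  std::ifstream pid_file(nbd_dir / "pid");
  int pid;
  if (!(pid_file >> pid))
    return false;  // device is not configured at all
  return !fs::exists("/proc/" + std::to_string(pid));
}

int main() {
  for (const auto &entry : fs::directory_iterator("/sys/block")) {
    const std::string name = entry.path().filename().string();
    if (name.rfind("nbd", 0) != 0)
      continue;
    if (is_orphaned(entry.path())) {
      // Here the real tool would read back the original map settings and
      // re-run something like "rbd-nbd map --try-netlink /dev/<name> --replace".
      std::cout << "/dev/" << name << " needs a new rbd-nbd daemon\n";
    }
  }
}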

@dillaman

Does the rbd-nbd container have access to sysfs?

Yes, it will need to be a privileged container so it will have full access to the host.

If so and it makes it easier, I have a patch where for each nbd device we have a sysfs file that contains the settings it was mapped/created with, like the proc cmdline file.

Does this exist now or is it a patch suggestion for the upstream kernel?

We were not sure if the patch would be accepted upstream though. I can post it up if we want to go this route.

We definitely don't want to wait for a year+ if we don't need to. If we can get something (more or less) functional, let's start there and improve it as time goes on.

@mikechristie
Author

If so and it makes it easier, I have a patch where for each nbd device we have a sysfs file that contains the settings it was mapped/created with, like the proc cmdline file.

Does this exist now or is it a patch suggestion for the upstream kernel?

We were not sure if the patch would be accepted upstream though. I can post it up if we want to go this route.

We definitely don't want to wait for a year+ if we don't need to. If we can get something (more or less) functional, let's start there and improve it as time goes on.

Yeah, this would be something we need to push upstream and then port downstream, so it would take a while.

@stale

stale bot commented Jan 22, 2020

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Jan 22, 2020
@stale

stale bot commented Apr 23, 2020

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@stale stale bot closed this Apr 23, 2020
@Madhu-1
Contributor

Madhu-1 commented Apr 23, 2020

@dillaman This is not resolved yet, is it?

@dillaman

Correct -- still in-progress.
