Document upgrade procedure for CSI nodeplugins #703
Upgrading the CSI driver will cause all the mounts to become stale? This seems like a blocker for upgrades. What's the workaround to keep a mount available during any upgrade? Do you have to fail over all pods on one node, then upgrade its CSI driver?
Yes (when using the referenced backend drivers). We are working on a way to preserve the mounts across nodeplugin restarts and upgrades.
Also, for the CephFS-FUSE driver we do have a feature to preserve mounts on the system post restart.
The stated feature and options do not work for CephFS. The reasoning is as follows.

The pod gets its own bind mount of the volume, and at that point it does not really matter if we lose the stage and publish mount points; what matters is that the pod also depends on the specific instance of cephfs-fuse backing the mount point. The feature to restore the bind mounts for CephFS is limited to the CSI stage and publish paths, hence on a restart, when a new cephfs-fuse instance is mounted to the stage path and subsequently bind mounted to the publish path, these 2 paths become healthy again. The pod's bind mount, however, is never refreshed, as that is out of the control of CSI, and the CO (Kubernetes in this case) has no reason to refresh that path due to nodeplugin restarts (also, runc is already running the container, so even if it were required it would not be possible). As a result of the above, when using CephFS with the fuse mounter, at present even with the stated feature the application pod's mount remains stale after a nodeplugin restart.
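To make that chain concrete, here is a minimal sketch of how these mounts typically look on a node; the PV name, pod UID, and kubelet paths are hypothetical illustrations, not taken from this issue:

```sh
# Hypothetical identifiers for illustration only.
PV=pvc-0000aaaa
POD_UID=1111bbbb

# 1. NodeStageVolume: cephfs-fuse is mounted at the per-volume staging path.
findmnt "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/${PV}/globalmount"

# 2. NodePublishVolume: the staging path is bind mounted into the per-pod
#    publish path.
findmnt "/var/lib/kubelet/pods/${POD_UID}/volumes/kubernetes.io~csi/${PV}/mount"

# 3. The container runtime then bind mounts the publish path into the
#    container's mount namespace. A nodeplugin restart can re-create mounts
#    1 and 2 with a new cephfs-fuse instance, but the bind mount inside the
#    already-running container still refers to the old, now-dead cephfs-fuse.
```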
@batrick @joscollin ^^
It is ironic that the container movement partially started out of a desire to avoid dependency hell with shared libraries and other system files, and yet here we are resolving dependencies between pods/infrastructure. @ShyamsundarR it's not clear to me if this is a result of dysfunction in the CSI "nodeplugin" interface or what. What is an example of a shared file system that is supposed to survive this upgrade process? The bind mount will surely become stale as soon as the file system proxy (FUSE in this case) is restarted?
@batrick The CephFS kernel driver will survive (as will krbd). The issue arises when we start mixing userspace daemons that are backing kernel file systems / block devices. This upgrade issue is only applicable to the ceph-fuse and rbd-nbd userspace tools, since those daemons run within the CSI node plugin pod, so when that pod gets upgraded, those daemons are killed. We are moving the rbd-nbd daemon to its own pod to address this.
Right.
I understand moving the rbd-nbd userspace agent to another pod. Is that so you can avoid upgrades for the running application pods? For ceph-fuse, we would never want to upgrade the client mount while an application pod is using it.
There's no way to recover the mount, no. That is unlikely to ever change.
You can adopt a similar approach. Move ceph-fuse to its own pod, have it run in the foreground (it becomes the container's pid 1) but spawn child processes or threads for each mount. Re-use the Ceph admin-daemon to communicate between this new ceph-fuse pod and the CephFS CSI node plugin (e.g. "ceph --admin-daemon /shared/path/to/the/ceph-fuse-pod.asok mount ...."), and boom goes the dynamite so long as ceph-fuse doesn't crash.
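For reference, the admin socket is already how one issues commands to a running ceph-fuse instance; a small sketch follows. The socket path is the hypothetical shared path from the comment above, and the proposed `mount` command would be a new admin socket command, not an existing one:

```sh
# List the admin socket commands the running ceph-fuse client exposes today;
# the proposed "mount" command would be added alongside these.
ceph --admin-daemon /shared/path/to/the/ceph-fuse-pod.asok help

# Example of an existing command: dump the client's MDS sessions.
ceph --admin-daemon /shared/path/to/the/ceph-fuse-pod.asok mds_sessions
```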
I'm not sure why we wouldn't have a
The alternative here is to drain a node of application pods using PVs backed by fuse-cephfs (and, by extension, rbd-nbd) before upgrading the CSI nodeplugin; IOW, move pods out of the node before an upgrade of the nodeplugin pod.

In pre-CSI cases, the mount proxy ran on the host/node, hence it was not tied to a pod and there was no such pod to upgrade in the first place (this may not be true for all storage providers). If the proxy on the node needed to be upgraded, an upgrade of the node (resulting in an application pod drain anyway) was performed. I am adding @phlogistonjohn and @raghavendra-talur for more commentary on pre-CSI cases where the proxy service, for example the gluster fuse client, needed to be upgraded.
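A rough sketch of that drain-first flow; the node name, namespace, and label selector are assumptions based on typical ceph-csi deployments, and exact flags vary with the Kubernetes version:

```sh
NODE=worker-1

# Stop new pods from landing on the node, then evict the application pods
# (including those using ceph-fuse / rbd-nbd backed PVs).
kubectl cordon "${NODE}"
kubectl drain "${NODE}" --ignore-daemonsets --delete-local-data

# Replace the CSI nodeplugin pod on this node, e.g. by deleting it so the
# DaemonSet recreates it from the new image.
kubectl -n ceph-csi delete pod -l app=csi-cephfsplugin \
  --field-selector "spec.nodeName=${NODE}"

# Let workloads schedule back once the new nodeplugin pod is Ready.
kubectl uncordon "${NODE}"
```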
Interesting. This may mean we should close this issue as a result, where I was toying with the thought of running a single ceph-fuse instance for all subvolumes on that node.
Also, one downside of this discussion is that we need to pull this PR out of the code, as it serves no purpose at present: #282
That is right. Even though we did not have pods for client operation, we had client rpms that needed upgrades. We followed the same rules for client rpms that are recommended for upgrading the kubelet. I was not able to find any docs specifically for the CSI pods.
I linked to the rules in the previous comment, but the summary is that the worker nodes are cordoned off and drained before upgrading the kubelet. Admins prefer to do this when cluster usage is low, and the upgrade of all nodes in the cluster might take days. Hence it is expected that some nodes have a lower-version client and others have a higher version at any given point.
CSI nodeplugins, specifically when using CephFS FUSE or rbd-nbd as the mounters, when upgraded, will cause existing mounts to become stale/not-reachable (usually connection timeout errors). This is due to losing the mount processes running within the CSI nodeplugin pods. This PR updates the DaemonSet update strategy based on an ENV variable to take care of the above issue, with some manual steps.
More info: ceph/ceph-csi#703
Resolves: rook#4248
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
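For context, the Rook change referenced above drives the DaemonSet update strategy through an environment variable; the effect is roughly equivalent to setting an OnDelete strategy by hand, sketched below. The namespace and DaemonSet name are assumptions based on typical ceph-csi deployments:

```sh
# With updateStrategy OnDelete, rolling out a new nodeplugin image does not
# restart the pods automatically; each pod is only replaced when the admin
# deletes it, keeping the upgrade under manual, per-node control.
kubectl -n ceph-csi patch daemonset csi-cephfsplugin --type merge \
  -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'
```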
The current strategy (as discussed with @Madhu-1) to address this combines the approaches above: switch the nodeplugin DaemonSet to a manual (OnDelete) update strategy, and drain application pods using the affected storage off a node before replacing the nodeplugin pod on that node.
The above at least ensures that there are no surprises for app pods using the storage on said nodes, and the upgrade can be admin controlled as well. As updated above, @Madhu-1 is working on this in Rook to begin with: rook/rook#4496
Here is another community discussion on the topic that is a useful read.
CSI nodeplugins, specifically when using CephFS FUSE or rbd-nbd as the mounters, when upgraded, will cause existing mounts to become stale/not-reachable (usually connection timeout errors).
This is due to losing the mount processes running within the CSI nodeplugin pods.
We need documented steps to ensure upgrades are smooth, even when upgrading to minor image versions for bug fixes.