Add multiNodeWritable option for RBD Volumes #239
Conversation
(force-pushed 9c51bbb → 6c473d2)
pkg/rbd/controllerserver.go
Outdated
```go
rbdVol, err := getRBDVolumeOptions(req.GetParameters())
// MultiNodeWriters are accepted but they're only for special cases, and we skip the watcher checks for them which isn't the greatest
// let's make sure we ONLY skip that if the user is requesting a MULTI node accessible mode
ignoreMultiWriterEnabled := true
```
nit: `disableMultiWriter := true`
sure, that works
```go
// let's make sure we ONLY skip that if the user is requesting a MULTI node accessible mode
ignoreMultiWriterEnabled := true
for _, am := range req.VolumeCapabilities {
	if am.GetAccessMode().GetMode() != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
```
What about explicitly checking `VolumeCapability_AccessMode_MULTI_NODE_SINGLE_WRITER` and `VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER`? There are two single-node capabilities in the CSI spec:
```go
const (
	VolumeCapability_AccessMode_UNKNOWN VolumeCapability_AccessMode_Mode = 0
	// Can only be published once as read/write on a single node, at
	// any given time.
	VolumeCapability_AccessMode_SINGLE_NODE_WRITER VolumeCapability_AccessMode_Mode = 1
	// Can only be published once as readonly on a single node, at
	// any given time.
	VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY VolumeCapability_AccessMode_Mode = 2
	// Can be published as readonly at multiple nodes simultaneously.
	VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY VolumeCapability_AccessMode_Mode = 3
	// Can be published at multiple nodes simultaneously. Only one of
	// the nodes can be used as read/write. The rest will be readonly.
	VolumeCapability_AccessMode_MULTI_NODE_SINGLE_WRITER VolumeCapability_AccessMode_Mode = 4
	// Can be published as read/write at multiple nodes
	// simultaneously.
	VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER VolumeCapability_AccessMode_Mode = 5
)
```
ack, I had that planned for a follow up: 6c473d2#diff-170d986ec944158cc00a94f7e60f989eR106
Let me take a look at handling that case now instead if you're good with it.
pkg/rbd/controllerserver.go
Outdated
```go
return &csi.ValidateVolumeCapabilitiesResponse{Message: ""}, nil
params := req.GetParameters()
multiWriter, _ := params["multiNodeWritable"]
if multiWriter == "enabled" {
```
nit: `strings.ToLower(multiWriter) == "enabled"`
```go
r.cd.AddVolumeCapabilityAccessModes(
	[]csi.VolumeCapability_AccessMode_Mode{
		csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
		csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER})
```
@rootfs I agree; the problem, though, is that I settled on using the StorageClass here instead of a driver flag, so at driver Run I don't have a handle to the SC parameters; unless I'm missing an access path?

I considered combining an SC parameter with a flag on driver start-up, but that seemed redundant.
pkg/rbd/rbd_attach.go
Outdated
```go
// In the case of multiattach we want to short circuit the retries when used (so `if used; return used`)
// otherwise we're setting this to false which translates to !ok, which means backoff and try again
// NOTE: we ONLY do this if a multi-node access mode is requested for this volume
if (volOptions.MultiNodeWritable == "enabled") && (used) {
```
`strings.ToLower(...)`
thanks @j-griffith, some comments and some lint errors. Would you please also provide a storageclass example and expand the README to include this multi-node-writer use case with a screenshot?
Ahh, yeah; I see the examples dir, sorry I intended to do that; will get it added.
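For reference, a hypothetical StorageClass snippet of the kind being requested. The monitor address, pool, and provisioner name here are placeholders; the only parameter specific to this PR is `multiNodeWritable`:

```yaml
# Hypothetical example; monitors, pool, and provisioner are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-multi-writer
provisioner: csi-rbdplugin
parameters:
  monitors: mon1.example.com:6789
  pool: rbd
  # the option added by this PR; any other value leaves the existing
  # single-node-writer behavior (and watcher checks) in place
  multiNodeWritable: "enabled"
reclaimPolicy: Delete
```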
(force-pushed fe7eea0 → 69d19cd)
cleaning up the import problem from the rebase, and configuring the md linter locally so I can get this squared away; sorry about that.
(force-pushed e57e315 → 0d87e47)
This change adds the ability to define a `multiNodeWritable` option in the Storage Class. This change does a number of things:

1. Allow multi-node-multi-writer access modes if the SC option is enabled
2. Bypass the watcher checks for MultiNodeMultiWriter volumes
3. Maintain existing watcher checks for SingleNodeWriter access modes regardless of the StorageClass option.

fix lint-errors
(force-pushed 0d87e47 → ba4ba5e)
```go
if err != nil {
	return nil, err
}
// Check access mode settings in the request, even if SC is RW-Many, if the request is a normal Single Writer volume, we ignore this setting and proceed as normal
```
this can be cleaned now
Sorry I didn't review this one earlier, but I think it would make more sense to illustrate the use of the multiNodeWritable option with raw block volumeMode rather than file. I don't see support there yet for real clustered filesystems (Lustre, GFS, etc.), which is what would be needed to make the nginx example work. If on the other hand the volume is accessed as raw block and shows up on multiple nodes as /dev/vdxx rather than via some regular filesystem path, it's more obvious that the writing applications on the multiple nodes need to coordinate among themselves. Maybe an example could be a parallel file system wiper: multiple workers on multiple nodes elect a leader/coordinator and divide up the volume's blocks to work on.
```go
volOptions, err := getRBDVolumeOptions(req.GetVolumeContext())

ignoreMultiWriterEnabled := true
if req.VolumeCapability.AccessMode.Mode != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
```
Wouldn't it make more sense to have `if req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER` here? As written, the test passes for five of the six enum values, including e.g. `VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY`.
The only valid modes that can even make it here from the CO are the ones we advertise in the capabilities (SINGLE_NODE_WRITER and MULTI_NODE_MULTI_WRITER); anything else will be rejected by the CO because it's not in the capabilities. The next step here was to add the additional capabilities and continue handling cases as needed (those that weren't sub/super sets). In this case you have one or the other, so the `!=` works.
Well the suggested comparison expresses your intent and the actual comparison only works accidentally, as long as we continue to only advertise those two capabilities. I can't see any reason why we wouldn't decide in the future to also advertise other capabilities -- e.g. SINGLE_NODE_MULTI_WRITER or the read-only capabilities.
This is all really rather irrelevant, you're reviewing a merged PR :) In addition, based on your objections, this PR is being reverted and replaced. Let's see how the new PR goes and just save the discussions for that.
Parameter | Required | Description
--------- | -------- | -----------
`csi.storage.k8s.io/provisioner-secret-name`, `csi.storage.k8s.io/node-publish-secret-name` | for Kubernetes | name of the Kubernetes Secret object containing Ceph client credentials. Both parameters should have the same value
`csi.storage.k8s.io/provisioner-secret-namespace`, `csi.storage.k8s.io/node-publish-secret-namespace` | for Kubernetes | namespaces of the above Secret objects
`mounter` | no | if set to `rbd-nbd`, use `rbd-nbd` on nodes that have `rbd-nbd` and `nbd` kernel modules to map rbd images
`fsType` | no | allows setting to `ext3 \| ext-4 \| xfs`, default is `ext-4`
Wouldn't we need to support a clustered file system for `fsType` if we allow multiNodeWritable on block storage with non-Block access type?
```yaml
csiNodePublishSecretNamespace: default
adminid: admin
userid: admin
fsType: xfs
```
xfs is a node-local, non-clustered filesystem so accessing via the filesystem rather than as raw block with multiNodeWritable will lead to data/filesystem corruption.
```yaml
name: pvc-1
spec:
  accessModes:
    - ReadWriteMany
```
Can we add 'volumeMode: Block' here? That way we get away from needing to have a clustered file system on the RBD block device. If we are doing raw block access then the volume can have no file system or it can have a regular, node-local filesystem as long as we aren't using the filesystem to access it ReadWriteMany.
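A sketch of the PVC the reviewer is describing (the storage class name and size are placeholders, not from the PR); `volumeMode: Block` hands the pod the raw device, so no node-local filesystem is mounted ReadWriteMany:

```yaml
# Hypothetical PVC; storageClassName and storage size are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-1
spec:
  accessModes:
    - ReadWriteMany
  # raw block access: the volume appears in each pod as a device,
  # and no fsck/mount of a node-local filesystem is performed
  volumeMode: Block
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-rbd
```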
```yaml
readOnly: false
```
If you access the pod you can check that your data is available at
When the node with the second pod does the NodePublishVolume, it will run fsck against an active filesystem mounted on another node. We'll get past this with XFS since (see the `fsck.xfs` man page) it's a journaling filesystem that performs recovery at mount time instead, so `fsck.xfs` just exits with zero status. But then the node with the second pod will attempt the mount and run `xfs_repair` while the node with the first pod is writing to disk and journal. We'll likely fail the mount and/or corrupt the filesystem at this point. If by some miracle we got past this to actually writing to /var/lib/www/html, we'd have two brains with their own journals trying to write to the same disk without coordinating with one another.
That's not completely accurate; but regardless, keep in mind there is the multi-reader case that's very useful here as well. Anyway, I get what you're saying and I'm not really arguing. I am, however, saying that IMO the job of the storage plugin is to provide an interface to the device, expose its capabilities, and let users do what they want with them. I'm working on a change to the multi-writer capability that I think will fit your view of the world better, so hopefully you'll be happier with that one. I'll be sure to ping you for reviews as soon as the PR is posted.
The important part of this is whether I'm wrong that nodePublishVolume with volumeMode File and a node-local filesystem like ext4 or xfs has potential for corrupting the filesystem when it does fsck and mount from the second node. If I'm mis-reading the code for that part then yeah, I'm being inaccurate and should have head adjusted with a 2x4.
Sure, although 90+ % of people are then going to turn around and do a mkfs/mount anyway. I get what you're saying and I know you're of the opinion that only ceph-fs should be used for this sort of thing but I don't think that's realistic. There are overlay applications like RAC etc that deal with this sort of thing quite well. There's also an important case to be made for single-node-multi-writer (although it's perhaps the most contentious access-mode in K8s right now). There's no risk to sharing an ext-4 across multiple pods on the same node (in most cases).
Yeah, but the thing is the CSI plugin IMO isn't responsible for implementing a FS, or the application; it's simply a provisioner, and its job should be to provide the ability for consumers to build things on the storage. Yes, users can do "bad things" to themselves; should we have a monitor that doesn't allow "rm -rf /*" because it's typically a mistake? I realize your preference for FS-based provisioning, and in a number of cases I completely agree; but I don't think we should force users to choose one vs the other. Instead I'm of the opinion that we should provide the tools and the options for users to consume storage in a manner that best suits their needs.
> "I know you're of the opinion that only ceph-fs should be used for this sort of thing but I don't think that's realistic."

And I totally agree that single-node-multi-writer is fine with a node-local filesystem like ext4 or XFS.

> "Yes, users can do 'bad things' to themselves, should we have a monitor that doesn't allow 'rm -rf /*' because it's typically a mistake?"

My concern is that users won't do it, k8s will when it runs NodePublishVolume - it will fsck and mount unless raw block volumeMode is specified in the PVC.
This change adds the ability to define a `multiNodeWritable` option in the Storage Class.

This change does a number of things:

1. Allow multi-node-multi-writer access modes if the SC option is `enabled`
2. Bypass the watcher checks for MultiNodeMultiWriter volumes
3. Maintain existing watcher checks for SingleNodeWriter access modes regardless of the StorageClass option.

fix #237