
Add multiNodeWritable option for RBD Volumes #239

Merged (1 commit into ceph:csi-v1.0 on Mar 1, 2019)

Conversation

@j-griffith (Contributor) commented Mar 1, 2019

This change adds the ability to define a multiNodeWritable option in
the Storage Class.

This change does a number of things:

1. Allow multi-node-multi-writer access modes if the SC option is enabled
2. Bypass the watcher checks for MultiNodeMultiWriter volumes
3. Maintain existing watcher checks for SingleNodeWriter access modes regardless of the StorageClass option

fix #237
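
As a rough illustration of the three points above, a minimal sketch of the decision they describe might look like this, assuming the option arrives via the StorageClass parameter map and the CSI Go bindings; the package, helper, and parameter names are illustrative, not taken from this PR:

```go
// Illustrative sketch only; names do not come from the PR.
package rbd

import (
	"strings"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// skipWatcherCheck is true only when the StorageClass enables multiNodeWritable
// AND the request actually asks for the multi-node multi-writer mode.
// Single-node-writer requests keep the existing watcher checks regardless of
// the StorageClass option.
func skipWatcherCheck(scParams map[string]string, caps []*csi.VolumeCapability) bool {
	if strings.ToLower(scParams["multiNodeWritable"]) != "enabled" {
		return false
	}
	for _, c := range caps {
		if c.GetAccessMode().GetMode() == csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER {
			return true
		}
	}
	return false
}
```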

rbdVol, err := getRBDVolumeOptions(req.GetParameters())
// MultiNodeWriters are accepted but they're only for special cases, and we skip the watcher checks for them which isn't the greatest
// let's make sure we ONLY skip that if the user is requesting a MULTI Node accessible mode
ignoreMultiWriterEnabled := true

Member:
nit: `disableMultiWriter := true`

j-griffith (Contributor, author):

sure, that works

// let's make sure we ONLY skip that if the user is requesting a MULTI Node accessible mode
ignoreMultiWriterEnabled := true
for _, am := range req.VolumeCapabilities {
if am.GetAccessMode().GetMode() != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {

Member:

what about explicitly checking VolumeCapability_AccessMode_MULTI_NODE_SINGLE_WRITER and VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER, because there are two single-node capabilities in the CSI spec:

const (
        VolumeCapability_AccessMode_UNKNOWN VolumeCapability_AccessMode_Mode = 0
        // Can only be published once as read/write on a single node, at
        // any given time.
        VolumeCapability_AccessMode_SINGLE_NODE_WRITER VolumeCapability_AccessMode_Mode = 1
        // Can only be published once as readonly on a single node, at
        // any given time.
        VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY VolumeCapability_AccessMode_Mode = 2
        // Can be published as readonly at multiple nodes simultaneously.
        VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY VolumeCapability_AccessMode_Mode = 3
        // Can be published at multiple nodes simultaneously. Only one of
        // the node can be used as read/write. The rest will be readonly.
        VolumeCapability_AccessMode_MULTI_NODE_SINGLE_WRITER VolumeCapability_AccessMode_Mode = 4
        // Can be published as read/write at multiple nodes
        // simultaneously.
        VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER VolumeCapability_AccessMode_Mode = 5
)

j-griffith (Contributor, author):

ack, I had that planned for a follow up: 6c473d2#diff-170d986ec944158cc00a94f7e60f989eR106

Let me take a look at handling that case now instead if you're good with it.
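
A sketch of the explicit check being suggested here, reusing the imports from the sketch above; the helper name is illustrative:

```go
// hasMultiNodeAccessMode treats only the two multi-node writer modes as
// multi-node, rather than treating "anything other than SINGLE_NODE_WRITER"
// as multi-node (which would also match the read-only modes).
func hasMultiNodeAccessMode(caps []*csi.VolumeCapability) bool {
	for _, c := range caps {
		switch c.GetAccessMode().GetMode() {
		case csi.VolumeCapability_AccessMode_MULTI_NODE_SINGLE_WRITER,
			csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER:
			return true
		}
	}
	return false
}
```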

return &csi.ValidateVolumeCapabilitiesResponse{Message: ""}, nil
params := req.GetParameters()
multiWriter, _ := params["multiNodeWritable"]
if multiWriter == "enabled" {

Member:

nit: `strings.ToLower(multiWriter) == "enabled"`
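
Applied to the snippet above, the nit could be folded into a small helper; `strings.TrimSpace` is an extra assumption beyond the reviewer's suggestion:

```go
// isMultiNodeWritable reads the StorageClass parameter case-insensitively.
func isMultiNodeWritable(params map[string]string) bool {
	return strings.ToLower(strings.TrimSpace(params["multiNodeWritable"])) == "enabled"
}
```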

r.cd.AddVolumeCapabilityAccessModes(
[]csi.VolumeCapability_AccessMode_Mode{
csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER})

j-griffith (Contributor, author):

@rootfs I agree; the problem, though, is that I settled on using the Storage Class here instead of a driver flag, so at driver Run I don't have a handle to the SC parameters, unless I'm missing an access path?

I considered doing a combination of SC Parameter and a Flag on driver start up, but that seemed redundant.

// In the case of multiattach we want to short circuit the retries when used (so `if used, return used`)
// otherwise we're setting this to false which translates to !ok, which means backoff and try again
// NOTE: we ONLY do this if a multi-node access mode is requested for this volume
if (volOptions.MultiNodeWritable == "enabled") && (used) {

Member:

strings.ToLower(...)
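
A sketch of the short-circuit the code comment above describes, using `wait.ExponentialBackoff` from k8s.io/apimachinery; the function name, the `imageHasWatcher` callback, and passing the option as a plain string are all assumptions for illustration, not the PR's actual signatures:

```go
import (
	"strings"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForImageAvailable retries while a single-writer image still has a
// watcher, but returns immediately for multiNodeWritable volumes that are
// already in use.
func waitForImageAvailable(backoff wait.Backoff, multiNodeWritable string,
	imageHasWatcher func() (bool, error)) error {
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		used, err := imageHasWatcher()
		if err != nil {
			return false, err
		}
		if strings.ToLower(multiNodeWritable) == "enabled" && used {
			// Multi-node volumes may legitimately be attached elsewhere:
			// stop retrying and proceed.
			return true, nil
		}
		// Returning false means "not done yet", so the caller backs off
		// and tries again; !used ends the wait once the watcher is gone.
		return !used, nil
	})
}
```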

@rootfs (Member) commented Mar 1, 2019

thanks @j-griffith, some comments and some lint errors.

Would you please also provide a storageclass example and expand the readme to include this multi-node writer use case with a screenshot?


@j-griffith (Contributor, author) commented Mar 1, 2019

> thanks @j-griffith, some comments and some lint errors.
>
> Would you please also provide a storageclass example and expand the readme to include this multi-node writer use case with a screenshot?

Ahh, yeah; I see the examples dir, sorry I intended to do that; will get it added.

@j-griffith force-pushed the add_multinode_write branch 2 times, most recently from fe7eea0 to 69d19cd on March 1, 2019 at 20:56

@j-griffith (Contributor, author):

cleaning up the import problem from the rebase, and configuring the md linter locally so I can get this squared away; sorry about that.

This change adds the ability to define a `multiNodeWritable` option in
the Storage Class.

This change does a number of things:
1. Allow multi-node-multi-writer access modes if the SC option is enabled
2. Bypass the watcher checks for MultiNodeMultiWriter volumes
3. Maintain existing watcher checks for SingleNodeWriter access modes regardless of the StorageClass option

fix lint-errors
if err != nil {
return nil, err
}
// Check access mode settings in the request; even if the SC is RW-Many, if the request is a normal Single Writer volume we ignore this setting and proceed as normal

Member:

this can be cleaned now

The mergify bot merged commit b5b8e46 into ceph:csi-v1.0 on Mar 1, 2019
@j-griffith deleted the add_multinode_write branch on March 1, 2019 at 22:08

@tombarron left a comment:

Sorry I didn't review this one earlier, but I think that it would make more sense to illustrate use of the multiNodeWritable option with raw block volumeMode rather than file. I don't see support there yet for real clustered filesystems (Lustre, GFS, etc.), which is what would be needed to make the nginx example work. If on the other hand the volume is accessed as raw block and shows up on multiple nodes as /dev/vdxx rather than via some regular filesystem path, it's more obvious that the writing applications on the multiple nodes need to coordinate among themselves. Maybe an example could be a parallel file system wiper: multiple workers on multiple nodes elect a leader/coordinator and divide up the volume's blocks to work on.

volOptions, err := getRBDVolumeOptions(req.GetVolumeContext())

ignoreMultiWriterEnabled := true
if req.VolumeCapability.AccessMode.Mode != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {

@tombarron:

Wouldn't it make more sense to have `if req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER` here? As written, the test passes for five of the six enum values, including e.g. VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY.

j-griffith (Contributor, author):

The only valid modes that can even make it here from the CO are the ones we advertise in the capabilities (SINGLE_NODE_WRITER and MULTI_NODE_MULTI_WRITER), anything else will be rejected by the CO because it's not in the capabilities. The next step here was to add the additional capabilities and continue handling cases as needed (those that weren't sub/super sets). In this case you have one or the other, so the != works.

@tombarron:

Well, the suggested comparison expresses your intent, and the actual comparison only works accidentally, as long as we continue to only advertise those two capabilities. I can't see any reason why we wouldn't decide in the future to also advertise other capabilities, e.g. SINGLE_NODE_MULTI_WRITER or the read-only capabilities.

j-griffith (Contributor, author):

This is all really rather irrelevant, you're reviewing a merged PR :) In addition, based on your objections, this PR is being reverted and replaced. Let's see how the new PR goes and just save the discussions for that.
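
For reference, the equality form being debated in this thread would look roughly like this on the NodePublishVolume side, again reusing the csi import from the earlier sketch; the helper name is illustrative:

```go
// isMultiWriterPublish reports whether this publish request asks for the
// multi-node multi-writer mode specifically; every other mode, including the
// read-only ones, would keep the single-writer path.
func isMultiWriterPublish(req *csi.NodePublishVolumeRequest) bool {
	return req.GetVolumeCapability().GetAccessMode().GetMode() ==
		csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER
}
```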

@@ -58,6 +58,21 @@ Parameter | Required | Description
`csi.storage.k8s.io/provisioner-secret-name`, `csi.storage.k8s.io/node-publish-secret-name` | for Kubernetes | name of the Kubernetes Secret object containing Ceph client credentials. Both parameters should have the same value
`csi.storage.k8s.io/provisioner-secret-namespace`, `csi.storage.k8s.io/node-publish-secret-namespace` | for Kubernetes | namespaces of the above Secret objects
`mounter`| no | if set to `rbd-nbd`, use `rbd-nbd` on nodes that have `rbd-nbd` and `nbd` kernel modules to map rbd images
`fsType` | no | allows setting to `ext3 | ext-4 | xfs`, default is `ext-4`

@tombarron:

Wouldn't we need to support a clustered file system for fsType if we allow multiNodeWritable on block storage with non-Block access type?

csiNodePublishSecretNamespace: default
adminid: admin
userid: admin
fsType: xfs

@tombarron:

xfs is a node-local, non-clustered filesystem so accessing via the filesystem rather than as raw block with multiNodeWritable will lead to data/filesystem corruption.

name: pvc-1
spec:
accessModes:
- ReadWriteMany

@tombarron:

Can we add 'volumeMode: Block' here? That way we get away from needing to have a clustered file system on the RBD block device. If we are doing raw block access then the volume can have no file system or it can have a regular, node-local filesystem as long as we aren't using the filesystem to access it ReadWriteMany.

readOnly: false
```

If you access the pod you can check that your data is available at

@tombarron:

When the node with the second pod does the NodePublishVolume it will run fsck against an active filesystem mounted on another node. We'll get past this with XFS since (see the fsck.xfs man page) it's a journaling filesystem that performs recovery at mount time instead, so fsck.xfs just exits with zero status. But then the node with the second pod will attempt the mount and run xfs_repair while the node with the first pod is writing to disk and journal. We'll likely fail the mount and/or corrupt the filesystem at this point. If by some miracle we got past this to actually writing to /var/lib/www/html, we'd have two brains with their own journals trying to write to the same disk without coordinating with one another.

j-griffith (Contributor, author):

That's not completely accurate; but regardless, keep in mind there is the multi-reader case that's very useful here as well. Anyway, I get what you're saying and I'm not arguing really. I am however saying that IMO the job of the storage plugin is to provide an interface to the device to expose its capabilities and let users do what they want with them. I'm working on a change to the multi-writer capability that I think will fit your view of the world better, so hopefully you'll be happier with that one. I'll be sure to ping you for reviews as soon as the PR is posted.

@tombarron:

The important part of this is whether I'm wrong that nodePublishVolume with volumeMode File and a node-local filesystem like ext4 or xfs has potential for corrupting the filesystem when it does fsck and mount from the second node. If I'm mis-reading the code for that part then yeah, I'm being inaccurate and should have head adjusted with a 2x4.

@j-griffith (Contributor, author):

> Sorry I didn't review this one earlier, but I think that it would make more sense to illustrate use of the multiNodeWritable option with raw block volumeMode rather than file. I don't see support there yet for real clustered filesystems (Lustre, GFS, etc.), which is what would be needed to make the nginx example work. If on the other hand the volume is accessed as raw block and shows up on multiple nodes as /dev/vdxx rather than via some regular filesystem path, it's more obvious that the writing applications on

Sure, although 90+% of people are then going to turn around and do a mkfs/mount anyway. I get what you're saying, and I know you're of the opinion that only ceph-fs should be used for this sort of thing, but I don't think that's realistic. There are overlay applications like RAC etc. that deal with this sort of thing quite well. There's also an important case to be made for single-node-multi-writer (although it's perhaps the most contentious access mode in K8s right now). There's no risk to sharing an ext-4 across multiple pods on the same node (in most cases).

> the multiple nodes need to coordinate among themselves. Maybe an example could be a parallel file system wiper: multiple workers on multiple nodes elect a leader/coordinator and divide up the volume's blocks to work on.

Yeah, but the thing is the CSI plugin IMO isn't responsible for implementing a FS, or the application; it's simply a provisioner, and its job should be to provide the ability for consumers to build things on the storage. Yes, users can do "bad things" to themselves; should we have a monitor that doesn't allow `rm -rf /*` because it's typically a mistake?

I realize your preference for FS based provisioning, and in a number of cases I completely agree; but I don't think we should force users to choose one vs the other. Instead I'm of the opinion that we should provide the tools and the options for users to consume storage in a manner that best suits their needs.

@tombarron:

"I know you're of the opinion that only ceph-fs should be used for this sort of thing but I don't think that's realistic. "
That's not my opinion at all. I think raw-block is suitable for MultiModeMultiWriter for apps that coordinate multiple writers themselves and that access the block device as /dev/vdxxx rather than by a node-specific filesystem.

And I totally agree that single-node-multi-writer is fine with a node-local filesystem like ext4 or XFS.

"Yes, users can do "bad things" to themselves, should we have a monitor that doesn't allow "rm -rf /*" because it's typically a mistake?"

My concern is that users won't do it, k8s will when it runs NodePublishVolume: it will fsck and mount unless raw block volumeMode is specified in the PVC.

nixpanic pushed a commit to nixpanic/ceph-csi that referenced this pull request on Mar 4, 2024: "Syncing latest changes from upstream devel for ceph-csi".

Successfully merging this pull request may close these issues: Multinode Access for RBD Plugin (#237).

3 participants