Add multiNodeWritable option for RBD Volumes #239
Conversation
(force-pushed 9c51bbb → 6c473d2)
pkg/rbd/controllerserver.go
Outdated
```go
rbdVol, err := getRBDVolumeOptions(req.GetParameters())
// MultiNodeWriters are accepted but they're only for special cases, and we skip the watcher checks for them which isn't the greatest
// let's make sure we ONLY skip that if the user is requesting a MULTI node accessible mode
ignoreMultiWriterEnabled := true
```
nit: `disableMultiWriter := true`
sure, that works
```go
// let's make sure we ONLY skip that if the user is requesting a MULTI node accessible mode
ignoreMultiWriterEnabled := true
for _, am := range req.VolumeCapabilities {
	if am.GetAccessMode().GetMode() != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
```
What about explicitly checking `VolumeCapability_AccessMode_MULTI_NODE_SINGLE_WRITER` and `VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER`? There are two single-node capabilities in the CSI spec:
```go
const (
	VolumeCapability_AccessMode_UNKNOWN VolumeCapability_AccessMode_Mode = 0
	// Can only be published once as read/write on a single node, at
	// any given time.
	VolumeCapability_AccessMode_SINGLE_NODE_WRITER VolumeCapability_AccessMode_Mode = 1
	// Can only be published once as readonly on a single node, at
	// any given time.
	VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY VolumeCapability_AccessMode_Mode = 2
	// Can be published as readonly at multiple nodes simultaneously.
	VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY VolumeCapability_AccessMode_Mode = 3
	// Can be published at multiple nodes simultaneously. Only one of
	// the nodes can be used as read/write. The rest will be readonly.
	VolumeCapability_AccessMode_MULTI_NODE_SINGLE_WRITER VolumeCapability_AccessMode_Mode = 4
	// Can be published as read/write at multiple nodes
	// simultaneously.
	VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER VolumeCapability_AccessMode_Mode = 5
)
```
ack, I had that planned for a follow up: 6c473d2#diff-170d986ec944158cc00a94f7e60f989eR106
Let me take a look at handling that case now instead if you're good with it.
pkg/rbd/controllerserver.go
Outdated
```go
return &csi.ValidateVolumeCapabilitiesResponse{Message: ""}, nil
params := req.GetParameters()
multiWriter, _ := params["multiNodeWritable"]
if multiWriter == "enabled" {
```
nit: `strings.ToLower(multiWriter) == "enabled"`
```go
r.cd.AddVolumeCapabilityAccessModes(
	[]csi.VolumeCapability_AccessMode_Mode{
		csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
		csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER})
```
@rootfs I agree; the problem, though, is that I settled on using the StorageClass here instead of a driver flag, so at driver Run I don't have a handle to the SC parameters; unless I'm missing an access path?

I considered combining an SC parameter with a flag on driver start-up, but that seemed redundant.
pkg/rbd/rbd_attach.go
Outdated
```go
// In the case of multiattach we want to short circuit the retries when used (so `if used; return used`)
// otherwise we're setting this to false which translates to !ok, which means backoff and try again
// NOTE: we ONLY do this if a multi-node access mode is requested for this volume
if (volOptions.MultiNodeWritable == "enabled") && (used) {
```
`strings.ToLower(...)`
thanks @j-griffith, some comments and some lint errors. Would you please also provide a storageclass example and expand the README to include this multi-node-writer use case with a screenshot?
Ahh, yeah; I see the examples dir, sorry I intended to do that; will get it added.
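For reference, a hypothetical StorageClass snippet of the kind being requested. The monitor address, pool, and provisioner name here are placeholders; the only parameter specific to this PR is `multiNodeWritable`:

```yaml
# Hypothetical example; monitors, pool, and provisioner are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-multi-writer
provisioner: csi-rbdplugin
parameters:
  monitors: mon1.example.com:6789
  pool: rbd
  # the option added by this PR; any other value leaves the existing
  # single-node-writer behavior (and watcher checks) in place
  multiNodeWritable: "enabled"
reclaimPolicy: Delete
```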
(force-pushed fe7eea0 → 69d19cd)
cleaning up the import problem from the rebase, and configuring the md linter locally so I can get this squared away; sorry about that.
(force-pushed e57e315 → 0d87e47)
This change adds the ability to define a `multiNodeWritable` option in the Storage Class. This change does a number of things:

1. Allow multi-node-multi-writer access modes if the SC option is enabled
2. Bypass the watcher checks for MultiNodeMultiWriter volumes
3. Maintain existing watcher checks for SingleNodeWriter access modes regardless of the StorageClass option.

fix lint-errors
(force-pushed 0d87e47 → ba4ba5e)
```go
if err != nil {
	return nil, err
}
// Check access mode settings in the request, even if SC is RW-Many, if the request is a normal Single Writer volume, we ignore this setting and proceed as normal
```
this can be cleaned now
Sorry I didn't review this one earlier, but I think it would make more sense to illustrate the use of the multiNodeWritable option with raw block volumeMode rather than file. I don't see support there yet for real clustered filesystems (Lustre, GFS, etc.), which is what would be needed to make the nginx example work. If on the other hand the volume is accessed as raw block and shows up on multiple nodes as /dev/vdxx rather than via some regular filesystem path, it's more obvious that the writing applications on the multiple nodes need to coordinate among themselves. Maybe an example could be a parallel file system wiper: multiple workers on multiple nodes elect a leader/coordinator and divide up the volume's blocks to work on.
```go
volOptions, err := getRBDVolumeOptions(req.GetVolumeContext())

ignoreMultiWriterEnabled := true
if req.VolumeCapability.AccessMode.Mode != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
```
Wouldn't it make more sense to have `if req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER` here? As written, the test passes for five of the six enum values, including e.g. `VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY`.
The only valid modes that can even make it here from the CO are the ones we advertise in the capabilities (SINGLE_NODE_WRITER and MULTI_NODE_MULTI_WRITER); anything else will be rejected by the CO because it's not in the capabilities. The next step here was to add the additional capabilities and continue handling cases as needed (those that weren't sub/super sets). In this case you have one or the other, so the `!=` works.
Well the suggested comparison expresses your intent and the actual comparison only works accidentally, as long as we continue to only advertise those two capabilities. I can't see any reason why we wouldn't decide in the future to also advertise other capabilities -- e.g. SINGLE_NODE_MULTI_WRITER or the read-only capabilities.
This is all really rather irrelevant, you're reviewing a merged PR :) In addition, based on your objections, this PR is being reverted and replaced. Let's see how the new PR goes and just save the discussions for that.
Parameter | Required | Description
--------- | -------- | -----------
`csi.storage.k8s.io/provisioner-secret-name`, `csi.storage.k8s.io/node-publish-secret-name` | for Kubernetes | name of the Kubernetes Secret object containing Ceph client credentials. Both parameters should have the same value
`csi.storage.k8s.io/provisioner-secret-namespace`, `csi.storage.k8s.io/node-publish-secret-namespace` | for Kubernetes | namespaces of the above Secret objects
`mounter` | no | if set to `rbd-nbd`, use `rbd-nbd` on nodes that have `rbd-nbd` and `nbd` kernel modules to map rbd images
`fsType` | no | allows setting to `ext3 \| ext-4 \| xfs`, default is `ext-4`
Wouldn't we need to support a clustered file system for `fsType` if we allow multiNodeWritable on block storage with non-Block access type?
```yaml
csiNodePublishSecretNamespace: default
adminid: admin
userid: admin
fsType: xfs
```
xfs is a node-local, non-clustered filesystem so accessing via the filesystem rather than as raw block with multiNodeWritable will lead to data/filesystem corruption.
```yaml
name: pvc-1
spec:
  accessModes:
    - ReadWriteMany
```
Can we add 'volumeMode: Block' here? That way we get away from needing to have a clustered file system on the RBD block device. If we are doing raw block access then the volume can have no file system or it can have a regular, node-local filesystem as long as we aren't using the filesystem to access it ReadWriteMany.
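A sketch of the PVC the reviewer is describing (the storage class name and size are placeholders, not from the PR); `volumeMode: Block` hands the pod the raw device, so no node-local filesystem is mounted ReadWriteMany:

```yaml
# Hypothetical PVC; storageClassName and storage size are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-1
spec:
  accessModes:
    - ReadWriteMany
  # raw block access: the volume appears in each pod as a device,
  # and no fsck/mount of a node-local filesystem is performed
  volumeMode: Block
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-rbd
```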
```yaml
readOnly: false
```
If you access the pod you can check that your data is available at
When the node with the second pod does the NodePublishVolume, it will run fsck against an active filesystem mounted on another node. We'll get past this with XFS since (see the `fsck.xfs` man page) it's a journaling filesystem that performs recovery at mount time instead, so `fsck.xfs` just exits with zero status. But then the node with the second pod will attempt the mount and run `xfs_repair` while the node with the first pod is writing to disk and journal. We'll likely fail the mount and/or corrupt the filesystem at this point. If by some miracle we got past this to actually writing to /var/lib/www/html, we'd have two brains with their own journals trying to write to the same disk without coordinating with one another.
That's not completely accurate; but regardless, keep in mind there is the multi-reader case that's very useful here as well. Anyway, I get what you're saying and I'm not really arguing. I am, however, saying that IMO the job of the storage plugin is to provide an interface to the device, expose its capabilities, and let users do what they want with them. I'm working on a change to the multi-writer capability that I think will fit your view of the world better, so hopefully you'll be happier with that one. I'll be sure to ping you for reviews as soon as the PR is posted.
The important part of this is whether I'm wrong that nodePublishVolume with volumeMode File and a node-local filesystem like ext4 or xfs has potential for corrupting the filesystem when it does fsck and mount from the second node. If I'm mis-reading the code for that part then yeah, I'm being inaccurate and should have head adjusted with a 2x4.
Sure, although 90+ % of people are then going to turn around and do a mkfs/mount anyway. I get what you're saying and I know you're of the opinion that only ceph-fs should be used for this sort of thing but I don't think that's realistic. There are overlay applications like RAC etc that deal with this sort of thing quite well. There's also an important case to be made for single-node-multi-writer (although it's perhaps the most contentious access-mode in K8s right now). There's no risk to sharing an ext-4 across multiple pods on the same node (in most cases).
Yeah, but the thing is the CSI plugin IMO isn't responsible for implementing a FS, or the application; it's simply a provisioner, and its job should be to provide the ability for consumers to build things on the storage. Yes, users can do "bad things" to themselves; should we have a monitor that doesn't allow "rm -rf /*" because it's typically a mistake? I realize your preference for FS-based provisioning, and in a number of cases I completely agree; but I don't think we should force users to choose one vs the other. Instead I'm of the opinion that we should provide the tools and the options for users to consume storage in a manner that best suits their needs.
> "I know you're of the opinion that only ceph-fs should be used for this sort of thing but I don't think that's realistic."

And I totally agree that single-node-multi-writer is fine with a node-local filesystem like ext4 or XFS.

> "Yes, users can do 'bad things' to themselves, should we have a monitor that doesn't allow 'rm -rf /*' because it's typically a mistake?"

My concern is that users won't do it, k8s will when it runs NodePublishVolume - it will fsck and mount unless raw block volumeMode is specified in the PVC.
This change adds the ability to define a `multiNodeWritable` option in the Storage Class.

This change does a number of things:

1. Allow multi-node-multi-writer access modes if the SC option is `enabled`
2. Bypass the watcher checks for MultiNodeMultiWriter volumes
3. Maintain existing watcher checks for SingleNodeWriter access modes regardless of the StorageClass option.

fix #237