Add multiNodeWritable option for RBD Volumes #239
@@ -114,3 +114,105 @@ To restore the snapshot to a new PVC, deploy
kubectl create -f pvc-restore.yaml
kubectl create -f pod-restore.yaml
```

## How to enable multi node attach support for RBD

*WARNING* This feature is strictly for workloads that know how to deal
with concurrent access to the Volume (e.g. Active/Passive applications).
Using RWX modes on non-clustered file systems with applications trying
to simultaneously access the Volume will likely result in data corruption!

### Example process to test the multiNodeWritable feature

Modify your current storage class, or create a new storage class specifically
for multi-node writers by adding the `multiNodeWritable: "enabled"` entry to
your parameters. Here's an example:

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd
provisioner: csi-rbdplugin
parameters:
  monitors: rook-ceph-mon-b.rook-ceph.svc.cluster.local:6789
  pool: rbd
  imageFormat: "2"
  imageFeatures: layering
  csiProvisionerSecretName: csi-rbd-secret
  csiProvisionerSecretNamespace: default
  csiNodePublishSecretName: csi-rbd-secret
  csiNodePublishSecretNamespace: default
  adminid: admin
  userid: admin
  fsType: xfs
  multiNodeWritable: "enabled"
reclaimPolicy: Delete
```

Review comment on `fsType: xfs`: xfs is a node-local, non-clustered filesystem, so accessing it via the filesystem rather than as raw block with multiNodeWritable will lead to data/filesystem corruption.
|
||
Now, you can request Claims from the configured storage class that include | ||
the `ReadWriteMany` access mode: | ||
|
||
``` | ||
apiVersion: v1 | ||
kind: PersistentVolumeClaim | ||
metadata: | ||
name: pvc-1 | ||
spec: | ||
accessModes: | ||
- ReadWriteMany | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we add 'volumeMode: Block' here? That way we get away from needing to have a clustered file system on the RBD block device. If we are doing raw block access then the volume can have no file system or it can have a regular, node-local filesystem as long as we aren't using the filesystem to access it ReadWriteMany. |
||
resources: | ||
requests: | ||
storage: 1Gi | ||
storageClassName: csi-rbd | ||
``` | ||
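
A rough sketch of the reviewer's suggestion (not part of this PR; the claim name `pvc-1-block` is made up for illustration): with `volumeMode: Block`, the RBD image is exposed to the pod as a raw device, so no node-local filesystem is shared between nodes.

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-1-block
spec:
  accessModes:
    - ReadWriteMany
  # Raw block mode: combined with multiNodeWritable this avoids putting a
  # non-clustered filesystem on a multi-attached image.
  volumeMode: Block
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-rbd
```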

Create a POD that uses this PVC:

```
apiVersion: v1
kind: Pod
metadata:
  name: test-1
spec:
  containers:
    - name: web-server
      image: nginx
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: pvc-1
        readOnly: false
```
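
If the claim were created with `volumeMode: Block` as sketched above, the pod would attach it through `volumeDevices` instead of `volumeMounts`. A hypothetical example (pod name, device path and claim name are illustrative only):

```
apiVersion: v1
kind: Pod
metadata:
  name: test-1-block
spec:
  containers:
    - name: web-server
      image: nginx
      volumeDevices:
        - name: mypvc
          devicePath: /dev/xvda   # the raw RBD device as seen inside the container
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: pvc-1-block
        readOnly: false
```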

Wait for the POD to enter the Running state, then write some data to
`/var/lib/www/html`.

Now, we can create a second POD (ensure the POD is scheduled on a different
node; multi-writer on a single node works without this feature) that also uses
this PVC at the same time:

```
apiVersion: v1
kind: Pod
metadata:
  name: test-2
spec:
  containers:
    - name: web-server
      image: nginx
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: pvc-1
        readOnly: false
```

If you access the pod you can check that your data is available at
`/var/lib/www/html`.

Review comment: When the node with the second pod does the NodePublishVolume it will run fsck against an active filesystem mounted on another node. We'll get past this with XFS since (see …

Reply: That's not completely accurate; but regardless, keep in mind there is the multi-reader case that's very useful here as well. Anyway, I get what you're saying and I'm not arguing really. I am however saying that IMO the job of the storage plugin is to provide an interface to the device to expose its capabilities and let users do what they want with them. I'm working on a change to the multi-writer capability that I think will fit your view of the world better, so hopefully you'll be happier with that one. I'll be sure to ping you for reviews as soon as the PR is posted.

Reply: The important part of this is whether I'm wrong that NodePublishVolume with volumeMode File and a node-local filesystem like ext4 or xfs has potential for corrupting the filesystem when it does fsck and mount from the second node. If I'm mis-reading the code for that part then yeah, I'm being inaccurate and should have my head adjusted with a 2x4.
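
For example (pod names match the manifests above; the file name is made up), you could write a file from the first pod and read it back from the second, keeping in mind the earlier warning about non-clustered filesystems:

```
kubectl exec test-1 -- sh -c 'echo hello > /var/lib/www/html/test.txt'
kubectl exec test-2 -- cat /var/lib/www/html/test.txt
```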
@@ -21,6 +21,7 @@ import (
 	"os/exec"
 	"sort"
 	"strconv"
+	"strings"
 	"syscall"

 	"github.com/ceph/ceph-csi/pkg/csi-common"
@@ -92,7 +93,16 @@ func (cs *ControllerServer) validateVolumeReq(req *csi.CreateVolumeRequest) erro
 func parseVolCreateRequest(req *csi.CreateVolumeRequest) (*rbdVolume, error) {
 	// TODO (sbezverk) Last check for not exceeding total storage capacity

-	rbdVol, err := getRBDVolumeOptions(req.GetParameters())
+	// MultiNodeWriters are accepted but they're only for special cases, and we skip the watcher checks for them which isn't the greatest
+	// let's make sure we ONLY skip that if the user is requesting a MULTI Node accessible mode
+	disableMultiWriter := true
+	for _, am := range req.VolumeCapabilities {
+		if am.GetAccessMode().GetMode() != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
+			disableMultiWriter = false
+		}
+	}
+
+	rbdVol, err := getRBDVolumeOptions(req.GetParameters(), disableMultiWriter)
 	if err != nil {
 		return nil, err
 	}

Review comment on the access-mode check: what about explicitly checking?

	const (
		VolumeCapability_AccessMode_UNKNOWN VolumeCapability_AccessMode_Mode = 0
		// Can only be published once as read/write on a single node, at
		// any given time.
		VolumeCapability_AccessMode_SINGLE_NODE_WRITER VolumeCapability_AccessMode_Mode = 1
		// Can only be published once as readonly on a single node, at
		// any given time.
		VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY VolumeCapability_AccessMode_Mode = 2
		// Can be published as readonly at multiple nodes simultaneously.
		VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY VolumeCapability_AccessMode_Mode = 3
		// Can be published at multiple nodes simultaneously. Only one of
		// the node can be used as read/write. The rest will be readonly.
		VolumeCapability_AccessMode_MULTI_NODE_SINGLE_WRITER VolumeCapability_AccessMode_Mode = 4
		// Can be published as read/write at multiple nodes
		// simultaneously.
		VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER VolumeCapability_AccessMode_Mode = 5
	)

Reply: ack, I had that planned for a follow up: 6c473d2#diff-170d986ec944158cc00a94f7e60f989eR106 — let me take a look at handling that case now instead if you're good with it.
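
A sketch of the explicit check suggested above (hypothetical; the package name and helper are assumptions, and this is not the code merged in this PR) would enumerate the multi-node modes instead of treating everything that is not SINGLE_NODE_WRITER as multi-writer:

```
package rbd // package name assumed from the surrounding diff

import "github.com/container-storage-interface/spec/lib/go/csi"

// multiNodeRequested reports whether any requested capability explicitly asks
// for a multi-node access mode (hypothetical helper, not part of this PR).
func multiNodeRequested(caps []*csi.VolumeCapability) bool {
	for _, c := range caps {
		switch c.GetAccessMode().GetMode() {
		case csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER,
			csi.VolumeCapability_AccessMode_MULTI_NODE_SINGLE_WRITER,
			csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY:
			return true
		}
	}
	return false
}
```

parseVolCreateRequest could then pass `!multiNodeRequested(req.VolumeCapabilities)` where it currently computes `disableMultiWriter`.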
@@ -330,11 +340,20 @@ func (cs *ControllerServer) ListVolumes(ctx context.Context, req *csi.ListVolume
 // ValidateVolumeCapabilities checks whether the volume capabilities requested
 // are supported.
 func (cs *ControllerServer) ValidateVolumeCapabilities(ctx context.Context, req *csi.ValidateVolumeCapabilitiesRequest) (*csi.ValidateVolumeCapabilitiesResponse, error) {
-	for _, cap := range req.VolumeCapabilities {
-		if cap.GetAccessMode().GetMode() != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
-			return &csi.ValidateVolumeCapabilitiesResponse{Message: ""}, nil
-		}
-	}
+	params := req.GetParameters()
+	multiWriter := params["multiNodeWritable"]
+	if strings.ToLower(multiWriter) == "enabled" {
+		klog.V(3).Info("detected multiNodeWritable parameter in Storage Class, allowing multi-node access modes")
+
+	} else {
+		for _, cap := range req.VolumeCapabilities {
+			if cap.GetAccessMode().GetMode() != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
+				return &csi.ValidateVolumeCapabilitiesResponse{Message: ""}, nil
+			}
+		}
+
+	}
+
 	return &csi.ValidateVolumeCapabilitiesResponse{
 		Confirmed: &csi.ValidateVolumeCapabilitiesResponse_Confirmed{
 			VolumeCapabilities: req.VolumeCapabilities,
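
For illustration only (the volume ID and values are made up, and this is not code from the PR), a request that takes the new path looks roughly like this; because the storage-class parameters carry `multiNodeWritable: "enabled"`, the per-capability access-mode check above is skipped:

```
package main

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

func main() {
	// Hypothetical request: a multi-node multi-writer capability plus the
	// storage-class parameter that the new validation path looks for.
	req := &csi.ValidateVolumeCapabilitiesRequest{
		VolumeId:   "example-volume-id",
		Parameters: map[string]string{"multiNodeWritable": "enabled"},
		VolumeCapabilities: []*csi.VolumeCapability{{
			AccessType: &csi.VolumeCapability_Block{Block: &csi.VolumeCapability_BlockVolume{}},
			AccessMode: &csi.VolumeCapability_AccessMode{
				Mode: csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER,
			},
		}},
	}
	fmt.Println(req.GetParameters()["multiNodeWritable"], req.VolumeCapabilities[0].GetAccessMode().GetMode())
}
```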
@@ -70,10 +70,18 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
 	if !notMnt {
 		return &csi.NodePublishVolumeResponse{}, nil
 	}
-	volOptions, err := getRBDVolumeOptions(req.GetVolumeContext())
+
+	ignoreMultiWriterEnabled := true
+	if req.VolumeCapability.AccessMode.Mode != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
+		ignoreMultiWriterEnabled = false
+	}
+
+	volOptions, err := getRBDVolumeOptions(req.GetVolumeContext(), ignoreMultiWriterEnabled)
 	if err != nil {
 		return nil, err
 	}
+	// Check access mode settings in the request, even if SC is RW-Many, if the request is a normal Single Writer volume, we ignore this setting and proceed as normal
+
 	volOptions.VolName = volName
 	// Mapping RBD image
 	devicePath, err := attachRBDImage(volOptions, volOptions.UserID, req.GetSecrets())

Review comment on the access-mode comparison: Wouldn't it make more sense to have: …

Reply: The only valid modes that can even make it here from the CO are the ones we advertise in the capabilities (SINGLE_NODE_WRITER and MULTI_NODE_MULTI_WRITER), anything else will be rejected by the CO because it's not in the capabilities. The next step here was to add the additional capabilities and continue handling cases as needed (those that weren't sub/super sets). In this case you have one or the other, so the …

Reply: Well, the suggested comparison expresses your intent and the actual comparison only works accidentally, as long as we continue to only advertise those two capabilities. I can't see any reason why we wouldn't decide in the future to also advertise other capabilities -- e.g. SINGLE_NODE_MULTI_WRITER or the read-only capabilities.

Reply: This is all really rather irrelevant, you're reviewing a merged PR :) In addition, based on your objections, this PR is being reverted and replaced. Let's see how the new PR goes and just save the discussions for that.

Review comment on the added `// Check access mode settings` comment: this can be cleaned now.
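
The comparison the reviewer appears to be suggesting (the suggestion itself was cut off above, so this is an assumption on my part) would test for the multi-node mode explicitly rather than for "anything but SINGLE_NODE_WRITER". A minimal sketch, with the package name assumed:

```
package rbd // package name assumed from the surrounding diff

import "github.com/container-storage-interface/spec/lib/go/csi"

// isMultiNodeMultiWriter reports whether a publish request explicitly asks for
// the multi-node, multi-writer access mode (hypothetical helper, not part of
// this PR).
func isMultiNodeMultiWriter(c *csi.VolumeCapability) bool {
	return c.GetAccessMode().GetMode() ==
		csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER
}
```

NodePublishVolume would then set `ignoreMultiWriterEnabled := !isMultiNodeMultiWriter(req.GetVolumeCapability())`.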
@@ -102,7 +102,12 @@ func (r *Driver) Run(driverName, nodeID, endpoint string, containerized bool, ca
 		csi.ControllerServiceCapability_RPC_LIST_SNAPSHOTS,
 		csi.ControllerServiceCapability_RPC_CLONE_VOLUME,
 	})
-	r.cd.AddVolumeCapabilityAccessModes([]csi.VolumeCapability_AccessMode_Mode{csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER})
+
+	// TODO: JDG Should also look at remaining modes like MULT_NODE_READER (SINGLE_READER)
+	r.cd.AddVolumeCapabilityAccessModes(
+		[]csi.VolumeCapability_AccessMode_Mode{
+			csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
+			csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER})

 	// Create GRPC servers
 	r.ids = NewIdentityServer(r.cd)

Review comment: @rootfs I agree, the problem though, is I settled on using the Storage Class here instead of a driver flag, so the problem was at driver Run I don't have a handle to the SC parameters; unless I'm missing an access path? I considered doing a combination of SC Parameter and a Flag on driver start up, but that seemed redundant.
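
For comparison, the driver-flag alternative mentioned in the comment might look roughly like this (hypothetical; the flag name, helper, and package name are made up, and the merged PR drives this from the StorageClass parameter instead):

```
package rbd // package name assumed

import (
	"flag"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// Hypothetical command-line switch gating the extra access mode.
var enableMultiNodeWritable = flag.Bool("multinodewritable", false,
	"advertise the MULTI_NODE_MULTI_WRITER access mode for RBD volumes")

// advertisedAccessModes returns the access modes the driver would register at
// start-up if the capability were flag-gated instead of per StorageClass.
func advertisedAccessModes() []csi.VolumeCapability_AccessMode_Mode {
	modes := []csi.VolumeCapability_AccessMode_Mode{
		csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
	}
	if *enableMultiNodeWritable {
		modes = append(modes, csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER)
	}
	return modes
}
```

Run would then call `r.cd.AddVolumeCapabilityAccessModes(advertisedAccessModes())`.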
Review comment: Wouldn't we need to support a clustered file system for `fsType` if we allow multiNodeWritable on block storage with non-Block access type?