Add multiNodeWritable option for RBD Volumes #239

Merged · 1 commit · Mar 1, 2019
2 changes: 1 addition & 1 deletion Makefile
@@ -31,7 +31,7 @@ go-test:
./scripts/test-go.sh

static-check:
./scripts/lint-go.sh
./scripts/lint-go.sh
./scripts/lint-text.sh

rbdplugin:
15 changes: 15 additions & 0 deletions docs/deploy-rbd.md
@@ -58,6 +58,21 @@ Parameter | Required | Description
`csi.storage.k8s.io/provisioner-secret-name`, `csi.storage.k8s.io/node-publish-secret-name` | for Kubernetes | name of the Kubernetes Secret object containing Ceph client credentials. Both parameters should have the same value
`csi.storage.k8s.io/provisioner-secret-namespace`, `csi.storage.k8s.io/node-publish-secret-namespace` | for Kubernetes | namespaces of the above Secret objects
`mounter`| no | if set to `rbd-nbd`, use `rbd-nbd` on nodes that have `rbd-nbd` and `nbd` kernel modules to map rbd images
`fsType` | no | allows setting to `ext3 | ext4 | xfs`, default is `ext4`

Wouldn't we need to support a clustered file system for fsType if we allow multiNodeWritable on block storage with non-Block access type?

`multiNodeWritable` | no | if set to `enabled`, allows RBD volumes with multi-node access modes to bypass the watcher checks. By default, multiple attachments of an RBD volume are NOT allowed. Even if this option is set in the StorageClass, it is ignored if a standard SingleNodeWriter access mode is requested

**Warning for multiNodeWritable:**

*NOTE* the `multiNodeWritable` setting is NOT safe for use by workloads
that are not designed to coordinate access. This does NOT add any sort
of clustered filesystem or write synchronization; it's specifically for
special workloads that handle access coordination on their own
(i.e. Active/Passive scenarios).

Using this mode for general purposes *WILL RESULT IN DATA CORRUPTION*.
We attempt to limit exposure to trouble here by ignoring the StorageClass
setting unless your Volume explicitly asks for multi-node access, and assume
you know what you're doing.

**Required secrets:**

102 changes: 102 additions & 0 deletions examples/README.md
@@ -114,3 +114,105 @@ To restore the snapshot to a new PVC, deploy
kubectl create -f pvc-restore.yaml
kubectl create -f pod-restore.yaml
```

## How to enable multi-node attach support for RBD

*WARNING* This feature is strictly for workloads that know how to deal
with concurrent access to the Volume (e.g. Active/Passive applications).
Using RWX modes on non-clustered file systems with applications trying
to simultaneously access the Volume will likely result in data corruption!

### Example process to test the multiNodeWritable feature

Modify your current storage class, or create a new storage class specifically
for multi-node writers, by adding the `multiNodeWritable: "enabled"` entry to
your parameters. Here's an example:

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd
provisioner: csi-rbdplugin
parameters:
  monitors: rook-ceph-mon-b.rook-ceph.svc.cluster.local:6789
  pool: rbd
  imageFormat: "2"
  imageFeatures: layering
  csiProvisionerSecretName: csi-rbd-secret
  csiProvisionerSecretNamespace: default
  csiNodePublishSecretName: csi-rbd-secret
  csiNodePublishSecretNamespace: default
  adminid: admin
  userid: admin
  fsType: xfs
  multiNodeWritable: "enabled"
reclaimPolicy: Delete
```

Review comment on `fsType: xfs`: xfs is a node-local, non-clustered filesystem, so accessing it via the filesystem rather than as raw block with multiNodeWritable will lead to data/filesystem corruption.

Now, you can request Claims from the configured storage class that include
the `ReadWriteMany` access mode:

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-1
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-rbd
```

Review comment on `ReadWriteMany`: Can we add `volumeMode: Block` here? That way we get away from needing to have a clustered file system on the RBD block device. If we are doing raw block access then the volume can have no file system, or it can have a regular, node-local filesystem as long as we aren't using the filesystem to access it ReadWriteMany.
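A minimal sketch of the raw-block variant described in that comment (not part of this example; `volumeMode: Block` is the standard Kubernetes PVC field, the claim name is made up, and the cluster must support block volumes):

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-block-1
spec:
  accessModes:
    - ReadWriteMany
  # Request a raw block device instead of a mounted, node-local filesystem
  volumeMode: Block
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-rbd
```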

Create a POD that uses this PVC:

```
apiVersion: v1
kind: Pod
metadata:
  name: test-1
spec:
  containers:
    - name: web-server
      image: nginx
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: pvc-1
        readOnly: false
```

Wait for the POD to enter the Running state, then write some data to
`/var/lib/www/html`.

Now we can create a second POD (ensure the POD is scheduled on a different
node; multi-writer on a single node works without this feature) that also uses
this PVC at the same time:

```
apiVersion: v1
kind: Pod
metadata:
  name: test-2
spec:
  containers:
    - name: web-server
      image: nginx
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: pvc-1
        readOnly: false
```

If you access the pod you can check that your data is available at
`/var/lib/www/html`.

When the node with the second pod does the NodePublishVolume it will run fsck against an active filesystem mounted on another node. We'll get past this with XFS since (see the fsck.xfs man page) it's a journaling filesystem that performs recovery at mount time instead, so fsck.xfs just exits with zero status. But then the node with the second pod will attempt the mount and run xfs_repair while the node with the first pod is writing to disk and journal. We'll likely fail the mount and/or corrupt the filesystem at this point. If by some miracle we got past this to actually writing to /var/lib/www/html we'd have two brains with their own journals trying to write to the same disk without coordinating with one another.

Contributor Author

That's not completely accurate; but regardless, keep in mind there is the multi-reader case that's very useful here as well. Anyway, I get what you're saying and I'm not really arguing. I am however saying that IMO the job of the storage plugin is to provide an interface to the device, expose its capabilities, and let users do what they want with them. I'm working on a change to the multi-writer capability that I think will fit your view of the world better, so hopefully you'll be happier with that one. I'll be sure to ping you for reviews as soon as the PR is posted.


The important part of this is whether I'm wrong that nodePublishVolume with volumeMode File and a node-local filesystem like ext4 or xfs has potential for corrupting the filesystem when it does fsck and mount from the second node. If I'm mis-reading the code for that part then yeah, I'm being inaccurate and should have head adjusted with a 2x4.

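For reference, a hedged sketch of what raw block access from a pod looks like, using the standard `volumeDevices`/`devicePath` fields instead of `volumeMounts` (the names and device path are illustrative only, and this assumes the hypothetical `pvc-block-1` claim sketched earlier):

```
apiVersion: v1
kind: Pod
metadata:
  name: test-block
spec:
  containers:
    - name: app
      image: nginx
      volumeDevices:
        # The volume appears as a device node, not a mounted filesystem,
        # so no fsck/mount happens on publish
        - name: mypvc
          devicePath: /dev/xvda
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: pvc-block-1
```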
3 changes: 3 additions & 0 deletions examples/rbd/storageclass.yaml
@@ -35,4 +35,7 @@ parameters:
userid: kubernetes
# uncomment the following to use rbd-nbd as mounter on supported nodes
# mounter: rbd-nbd
# fsType: xfs
# uncomment the following line to enable multi-attach on RBD volumes
# multiNodeWritable: enabled
reclaimPolicy: Delete
27 changes: 23 additions & 4 deletions pkg/rbd/controllerserver.go
@@ -21,6 +21,7 @@ import (
"os/exec"
"sort"
"strconv"
"strings"
"syscall"

"github.com/ceph/ceph-csi/pkg/csi-common"
@@ -92,7 +93,16 @@ func (cs *ControllerServer) validateVolumeReq(req *csi.CreateVolumeRequest) error {
func parseVolCreateRequest(req *csi.CreateVolumeRequest) (*rbdVolume, error) {
// TODO (sbezverk) Last check for not exceeding total storage capacity

rbdVol, err := getRBDVolumeOptions(req.GetParameters())
// MultiNodeWriters are accepted but they're only for special cases, and we skip the watcher checks for them which isn't the greatest
// let's make sure we ONLY skip that if the user is requesting a MULTI Node accessible mode
disableMultiWriter := true
for _, am := range req.VolumeCapabilities {
if am.GetAccessMode().GetMode() != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
Member

What about explicitly checking `VolumeCapability_AccessMode_MULTI_NODE_SINGLE_WRITER` and `VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER`? Because there are two single-node capabilities in the CSI spec:

const (
        VolumeCapability_AccessMode_UNKNOWN VolumeCapability_AccessMode_Mode = 0
        // Can only be published once as read/write on a single node, at
        // any given time.
        VolumeCapability_AccessMode_SINGLE_NODE_WRITER VolumeCapability_AccessMode_Mode = 1
        // Can only be published once as readonly on a single node, at
        // any given time.
        VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY VolumeCapability_AccessMode_Mode = 2
        // Can be published as readonly at multiple nodes simultaneously.
        VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY VolumeCapability_AccessMode_Mode = 3
        // Can be published at multiple nodes simultaneously. Only one of
        // the node can be used as read/write. The rest will be readonly.
        VolumeCapability_AccessMode_MULTI_NODE_SINGLE_WRITER VolumeCapability_AccessMode_Mode = 4
        // Can be published as read/write at multiple nodes
        // simultaneously.
        VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER VolumeCapability_AccessMode_Mode = 5
)

Contributor Author

ack, I had that planned for a follow up: 6c473d2#diff-170d986ec944158cc00a94f7e60f989eR106

Let me take a look at handling that case now instead if you're good with it.
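A minimal sketch of the explicit check being discussed (a hypothetical helper, not part of this PR; it assumes the `csi` import already used by controllerserver.go):

```
// hasMultiNodeWriterCapability reports whether any requested capability is one
// of the two multi-node writable modes from the CSI spec, rather than treating
// "anything other than SINGLE_NODE_WRITER" as multi-node.
func hasMultiNodeWriterCapability(caps []*csi.VolumeCapability) bool {
	for _, c := range caps {
		switch c.GetAccessMode().GetMode() {
		case csi.VolumeCapability_AccessMode_MULTI_NODE_SINGLE_WRITER,
			csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER:
			return true
		}
	}
	return false
}
```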

disableMultiWriter = false
}
}

rbdVol, err := getRBDVolumeOptions(req.GetParameters(), disableMultiWriter)
if err != nil {
return nil, err
}
@@ -330,11 +340,20 @@ func (cs *ControllerServer) ListVolumes(ctx context.Context, req *csi.ListVolumesRequest) (*csi.ListVolumesResponse, error) {
// ValidateVolumeCapabilities checks whether the volume capabilities requested
// are supported.
func (cs *ControllerServer) ValidateVolumeCapabilities(ctx context.Context, req *csi.ValidateVolumeCapabilitiesRequest) (*csi.ValidateVolumeCapabilitiesResponse, error) {
for _, cap := range req.VolumeCapabilities {
if cap.GetAccessMode().GetMode() != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
return &csi.ValidateVolumeCapabilitiesResponse{Message: ""}, nil
params := req.GetParameters()
multiWriter := params["multiNodeWritable"]
if strings.ToLower(multiWriter) == "enabled" {
klog.V(3).Info("detected multiNodeWritable parameter in Storage Class, allowing multi-node access modes")

} else {
for _, cap := range req.VolumeCapabilities {
if cap.GetAccessMode().GetMode() != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
return &csi.ValidateVolumeCapabilitiesResponse{Message: ""}, nil
}
}

}

return &csi.ValidateVolumeCapabilitiesResponse{
Confirmed: &csi.ValidateVolumeCapabilitiesResponse_Confirmed{
VolumeCapabilities: req.VolumeCapabilities,
10 changes: 9 additions & 1 deletion pkg/rbd/nodeserver.go
@@ -70,10 +70,18 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
if !notMnt {
return &csi.NodePublishVolumeResponse{}, nil
}
volOptions, err := getRBDVolumeOptions(req.GetVolumeContext())

ignoreMultiWriterEnabled := true
if req.VolumeCapability.AccessMode.Mode != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {


Wouldn't it make more sense to have
`if req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER`
here? As written the test passes for five of the six enum values, including e.g. `VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY`.
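In context, a minimal sketch of that suggested comparison inside NodePublishVolume (not the merged code; it reuses this PR's `getRBDVolumeOptions` signature and the `csi` import already present in the file):

```
// Only honour the StorageClass multiNodeWritable setting when the request
// explicitly asks for multi-node read/write access.
isMultiNodeWriter := req.GetVolumeCapability().GetAccessMode().GetMode() ==
	csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER

volOptions, err := getRBDVolumeOptions(req.GetVolumeContext(), !isMultiNodeWriter)
if err != nil {
	return nil, err
}
```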

Contributor Author

The only valid modes that can even make it here from the CO are the ones we advertise in the capabilities (SINGLE_NODE_WRITER and MULTI_NODE_MULTI_WRITER); anything else will be rejected by the CO because it's not in the capabilities. The next step here was to add the additional capabilities and continue handling cases as needed (those that weren't sub/super sets). In this case you have one or the other, so the != works.


Well the suggested comparison expresses your intent and the actual comparison only works accidentally, as long as we continue to only advertise those two capabilities. I can't see any reason why we wouldn't decide in the future to also advertise other capabilities -- e.g. SINGLE_NODE_MULTI_WRITER or the read-only capabilities.

Contributor Author

This is all really rather irrelevant, you're reviewing a merged PR :) In addition, based on your objections, this PR is being reverted and replaced. Let's see how the new PR goes and just save the discussions for that.

ignoreMultiWriterEnabled = false
}

volOptions, err := getRBDVolumeOptions(req.GetVolumeContext(), ignoreMultiWriterEnabled)
if err != nil {
return nil, err
}
// Check access mode settings in the request, even if SC is RW-Many, if the request is a normal Single Writer volume, we ignore this setting and proceed as normal
Member

this can be cleaned now


volOptions.VolName = volName
// Mapping RBD image
devicePath, err := attachRBDImage(volOptions, volOptions.UserID, req.GetSecrets())
7 changes: 6 additions & 1 deletion pkg/rbd/rbd.go
@@ -102,7 +102,12 @@ func (r *Driver) Run(driverName, nodeID, endpoint string, containerized bool, ca
csi.ControllerServiceCapability_RPC_LIST_SNAPSHOTS,
csi.ControllerServiceCapability_RPC_CLONE_VOLUME,
})
r.cd.AddVolumeCapabilityAccessModes([]csi.VolumeCapability_AccessMode_Mode{csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER})

// TODO: JDG Should also look at remaining modes like MULT_NODE_READER (SINGLE_READER)
r.cd.AddVolumeCapabilityAccessModes(
[]csi.VolumeCapability_AccessMode_Mode{
csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER})
Contributor Author

@rootfs I agree; the problem, though, is that I settled on using the StorageClass here instead of a driver flag, so at driver Run I don't have a handle to the SC parameters, unless I'm missing an access path?

I considered doing a combination of an SC parameter and a flag on driver startup, but that seemed redundant.


// Create GRPC servers
r.ids = NewIdentityServer(r.cd)
8 changes: 8 additions & 0 deletions pkg/rbd/rbd_attach.go
@@ -313,8 +313,16 @@ func waitForrbdImage(backoff wait.Backoff, volOptions *rbdVolume, userID string,
if err != nil {
return false, fmt.Errorf("fail to check rbd image status with: (%v), rbd output: (%s)", err, rbdOutput)
}
// In the case of multiattach we want to short-circuit the retries when the image is in use (so if used, return used)
// otherwise we're setting this to false, which translates to !ok, which means back off and try again
// NOTE: we ONLY do this if a multi-node access mode is requested for this volume
if (strings.ToLower(volOptions.MultiNodeWritable) == "enabled") && (used) {
klog.V(2).Info("detected MultiNodeWritable enabled, ignoring watcher in-use result")
return used, nil
}
return !used, nil
})

// return error if rbd image has not become available for the specified timeout
if err == wait.ErrWaitTimeout {
return fmt.Errorf("rbd image %s is still being used", imagePath)
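For context, a hedged sketch of the polling loop this check sits in; the enclosing code isn't shown in the hunk above, and `imageIsInUse` is a hypothetical stand-in for the package's actual `rbd status`/watcher lookup:

```
err := wait.ExponentialBackoff(backoff, func() (bool, error) {
	// Hypothetical helper: returns true while any watcher holds the image.
	used, err := imageIsInUse(volOptions, userID, credentials)
	if err != nil {
		return false, err
	}
	// Multi-node volumes: an in-use image is expected, so stop waiting immediately.
	if strings.ToLower(volOptions.MultiNodeWritable) == "enabled" && used {
		return true, nil
	}
	// Single-node volumes: only succeed once no watcher holds the image.
	return !used, nil
})
if err == wait.ErrWaitTimeout {
	return fmt.Errorf("rbd image %s is still being used", imagePath)
}
```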
9 changes: 8 additions & 1 deletion pkg/rbd/rbd_util.go
@@ -51,6 +51,7 @@ type rbdVolume struct {
AdminID string `json:"adminId"`
UserID string `json:"userId"`
Mounter string `json:"mounter"`
MultiNodeWritable string `json:"multiNodeWritable"`
}

type rbdSnapshot struct {
@@ -226,7 +227,7 @@ func execCommand(command string, args []string) ([]byte, error) {
return cmd.CombinedOutput()
}

func getRBDVolumeOptions(volOptions map[string]string) (*rbdVolume, error) {
func getRBDVolumeOptions(volOptions map[string]string, ignoreMultiNodeWritable bool) (*rbdVolume, error) {
var ok bool
rbdVol := &rbdVolume{}
rbdVol.Pool, ok = volOptions["pool"]
@@ -260,6 +261,12 @@ func getRBDVolumeOptions(volOptions map[string]string) (*rbdVolume, error) {

}
getCredsFromVol(rbdVol, volOptions)

klog.V(3).Infof("ignoreMultiNodeWritable flag in parse getRBDVolumeOptions is: %v", ignoreMultiNodeWritable)
// If the volume we're working with is NOT requesting multi-node attach then don't treat it special, ignore the setting in the SC and just keep our watcher checks
if !ignoreMultiNodeWritable {
rbdVol.MultiNodeWritable = volOptions["multiNodeWritable"]
}
return rbdVol, nil
}
