Default multiwrite blockmode #261

Merged
merged 3 commits into from
Mar 20, 2019
15 changes: 0 additions & 15 deletions docs/deploy-rbd.md
@@ -58,21 +58,6 @@ Parameter | Required | Description
`csi.storage.k8s.io/provisioner-secret-name`, `csi.storage.k8s.io/node-publish-secret-name` | for Kubernetes | name of the Kubernetes Secret object containing Ceph client credentials. Both parameters should have the same value
`csi.storage.k8s.io/provisioner-secret-namespace`, `csi.storage.k8s.io/node-publish-secret-namespace` | for Kubernetes | namespaces of the above Secret objects
`mounter`| no | if set to `rbd-nbd`, use `rbd-nbd` on nodes that have `rbd-nbd` and `nbd` kernel modules to map rbd images
`fsType` | no | allows setting to `ext3 | ext-4 | xfs`, default is `ext-4`
`multiNodeWritable` | no | if set to `enabled` allows RBD volumes with MultiNode Access Modes to bypass watcher checks. By default multiple attachments of an RBD volume are NOT allowed. Even if this option is set in the StorageClass, it's ignored if a standard SingleNodeWriter Access Mode is requested

**Warning for multiNodeWritable:**

*NOTE* the `multiNodeWritable` setting is NOT safe for use by workloads
that are not designed to coordinate access. This does NOT add any sort
of clustered filesystem or write synchronization; it's specifically for
special workloads that handle access coordination on their own
(i.e. Active/Passive scenarios).

Using this mode for general purposes *WILL RESULT IN DATA CORRUPTION*.
We attempt to limit exposure to trouble here by ignoring the Storage Class
setting unless your Volume explicitly asks for multi node access, and assume
you know what you're doing.

**Required secrets:**

167 changes: 51 additions & 116 deletions examples/README.md
@@ -115,52 +115,30 @@ kubectl create -f pvc-restore.yaml
kubectl create -f pod-restore.yaml
```

## How to enable multi node attach support for RBD
## How to test RBD MULTI_NODE_MULTI_WRITER BLOCK feature

Requires feature-gates: `BlockVolume=true` `CSIBlockVolume=true`

*NOTE* The MULTI_NODE_MULTI_WRITER capability is only available for
Volumes that are of access_type `block`

*WARNING* This feature is strictly for workloads that know how to deal
with concurrent acces to the Volume (eg Active/Passive applications).
with concurrent access to the Volume (eg Active/Passive applications).
Using RWX modes on non clustered file systems with applications trying
to simultaneously access the Volume will likely result in data corruption!

### Example process to test the multiNodeWritable feature

Modify your current storage class, or create a new storage class specifically
for multi node writers by adding the `multiNodeWritable: "enabled"` entry to
your parameters. Here's an example:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: csi-rbd
provisioner: rbd.csi.ceph.com
parameters:
monitors: rook-ceph-mon-b.rook-ceph.svc.cluster.local:6789
pool: rbd
imageFormat: "2"
imageFeatures: layering
csiProvisionerSecretName: csi-rbd-secret
csiProvisionerSecretNamespace: default
csiNodePublishSecretName: csi-rbd-secret
csiNodePublishSecretNamespace: default
adminid: admin
userid: admin
fsType: xfs
multiNodeWritable: "enabled"
reclaimPolicy: Delete
```

Now, you can request Claims from the configured storage class that include
the `ReadWriteMany` access mode:
The following examples request a `Block` `ReadWriteMany` Claim and
use the resulting Claim in a POD:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: pvc-1
name: block-pvc
spec:
accessModes:
- ReadWriteMany
volumeMode: Block
resources:
requests:
storage: 1Gi
@@ -173,108 +151,65 @@ Create a POD that uses this PVC:
apiVersion: v1
kind: Pod
metadata:
name: test-1
name: my-pod
spec:
containers:
- name: web-server
image: nginx
volumeMounts:
- name: mypvc
mountPath: /var/lib/www/html
- name: my-container
image: debian
command: ["/bin/bash", "-c"]
args: [ "tail -f /dev/null" ]
volumeDevices:
- devicePath: /dev/rbdblock
name: my-volume
imagePullPolicy: IfNotPresent
volumes:
- name: mypvc
persistentVolumeClaim:
claimName: pvc-1
readOnly: false
```
- name: my-volume
persistentVolumeClaim:
claimName: block-pvc

Wait for the POD to enter Running state, write some data to
`/var/lib/www/html`
```

Now, we can create a second POD (ensure the POD is scheduled on a different
node; multiwriter single node works without this feature) that also uses this

Just curious, is there a way to use some kind of anti-affinity to ensure that the second POD is scheduled on a different node?

Contributor Author

Yes, you could use anti-affinity or node-selector. See Kube docs here: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/

Thanks!

PVC at the same time

```yaml
apiVersion: v1
kind: Pod
metadata:
name: test-2
spec:
containers:
- name: web-server
image: nginx
volumeMounts:
- name: mypvc
mountPath: /var/lib/www/html
volumes:
- name: mypvc
persistentVolumeClaim:
claimName: pvc-1
readOnly: false
```

If you access the pod you can check that your data is available at
`/var/lib/www/html`

## Testing Raw Block feature in kubernetes with RBD volumes

CSI block volume support is feature-gated and turned off by default. To run CSI
with block volume support enabled, a cluster administrator must enable the
feature for each Kubernetes component using the following feature gate flags:

--feature-gates=BlockVolume=true,CSIBlockVolume=true

these feature-gates must be enabled on both api-server and kubelet

### create a raw-block PVC

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: raw-block-pvc
spec:
accessModes:
- ReadWriteOnce
volumeMode: Block
resources:
requests:
storage: 1Gi
storageClassName: csi-rbd
```

create raw block pvc

```console
kubectl create -f raw-block-pvc.yaml
```

### create a pod to mount raw-block PVC
PVC at the same time, again wait for the pod to enter running state, and verify
the block device is available.

```yaml
---
apiVersion: v1
kind: Pod
metadata:
name: pod-with-raw-block-volume
name: another-pod
spec:
containers:
- name: fc-container
image: fedora:26
command: ["/bin/sh", "-c"]
- name: my-container
image: debian
command: ["/bin/bash", "-c"]
args: [ "tail -f /dev/null" ]
volumeDevices:
- name: data
devicePath: /dev/xvda
- devicePath: /dev/rbdblock
name: my-volume
imagePullPolicy: IfNotPresent
volumes:
- name: data
- name: my-volume
persistentVolumeClaim:
claimName: raw-block-pvc
claimName: block-pvc
```

Create a POD that uses raw block PVC
Wait for the PODs to enter the Running state, then check that the block
device is available at `/dev/rbdblock` in both containers:

```console
kubectl create -f raw-block-pod.yaml
```bash
$ kubectl exec -it my-pod -- fdisk -l /dev/rbdblock
Disk /dev/rbdblock: 1 GiB, 1073741824 bytes, 2097152 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4194304 bytes / 4194304 bytes
```

```bash
$ kubectl exec -it another-pod -- fdisk -l /dev/rbdblock
Disk /dev/rbdblock: 1 GiB, 1073741824 bytes, 2097152 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4194304 bytes / 4194304 bytes
```
3 changes: 0 additions & 3 deletions examples/rbd/storageclass.yaml
@@ -35,7 +35,4 @@ parameters:
userid: kubernetes
# uncomment the following to use rbd-nbd as mounter on supported nodes
# mounter: rbd-nbd
# fsType: xfs
# uncomment the following line to enable multi-attach on RBD volumes
# multiNodeWritable: enabled
reclaimPolicy: Delete
39 changes: 19 additions & 20 deletions pkg/rbd/controllerserver.go
@@ -21,7 +21,6 @@ import (
"os/exec"
"sort"
"strconv"
"strings"
"syscall"

csicommon "github.com/ceph/ceph-csi/pkg/csi-common"
@@ -94,16 +93,25 @@ func (cs *ControllerServer) validateVolumeReq(req *csi.CreateVolumeRequest) erro
func parseVolCreateRequest(req *csi.CreateVolumeRequest) (*rbdVolume, error) {
// TODO (sbezverk) Last check for not exceeding total storage capacity

// MultiNodeWriters are accepted but they're only for special cases, and we skip the watcher checks for them which isn't the greatest
// let's make sure we ONLY skip that if the user is requesting a MULTI Node accessible mode
disableMultiWriter := true
for _, am := range req.VolumeCapabilities {
if am.GetAccessMode().GetMode() != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
disableMultiWriter = false
isMultiNode := false
isBlock := false
for _, cap := range req.VolumeCapabilities {
// RO modes need to be handled independently (i.e. right now even if the access mode is RO, they'll be RW upon attach)
if cap.GetAccessMode().GetMode() == csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER {
isMultiNode = true

So what about setting isMultiNode to true if the access mode is neither SINGLE_MODE_WRITER or SINGLE_MODE_READER_ONLY ?

Contributor Author

@j-griffith j-griffith Mar 18, 2019

It's "NODE", not "MODE". If the mode is "neither" SINGLE_NODE_xxx, then by elimination it's MULTI_NODE_xxx.

I'm not exactly sure what you're looking for here, but I think the safest answer is to just make this explicitly for MULTI_NODE_MULTI_WRITER option only, and then I can handle other cases involving RO and implementing the RO option separately. Trying to infer the fix or work around the issues with the single-node-multi cases doesn't belong in this PR I don't think, and I'd prefer to see how those things end up sorting out upstream. Does that work?

Note that the valid options are:

message AccessMode {
enum Mode {
UNKNOWN = 0;

  // Can only be published once as read/write on a single node, at
  // any given time.
  SINGLE_NODE_WRITER = 1;

  // Can only be published once as readonly on a single node, at
  // any given time.
  SINGLE_NODE_READER_ONLY = 2;

  // Can be published as readonly at multiple nodes simultaneously.
  MULTI_NODE_READER_ONLY = 3;

  // Can be published at multiple nodes simultaneously. Only one of
  // the node can be used as read/write. The rest will be readonly.
  MULTI_NODE_SINGLE_WRITER = 4;

  // Can be published as read/write at multiple nodes
  // simultaneously.
  MULTI_NODE_MULTI_WRITER = 5;
}

It's "NODE", not "MODE". If the mode is "neither" SINGLE_NODE_xxx, then by elimination it's MULTI_NODE_xxx.

I think the safest answer is to just make this explicitly for MULTI_NODE_MULTI_WRITER option only, and then I can handle other cases involving RO and implementing the RO option separately. Trying to infer the fix or work around the issues with the single-node-multi cases doesn't belong in this PR I don't think, and I'd prefer to see how those things end up sorting out upstream. Does that work?

Yes, and thanks.

yessir

oh, sorry to prolong this, but don't you want to change the sense of the comparison as well?

if cap.GetAccessMode().GetMode() == csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER {
      isMultiNode = true

Contributor Author

Gahhh, yes!

}
if cap.GetBlock() != nil {
isBlock = true
}
}

// We want to fail early if the user is trying to create a RWX on a non-block type device
if isMultiNode && !isBlock {
return nil, status.Error(codes.InvalidArgument, "multi node access modes are only supported on rbd `block` type volumes")
}

rbdVol, err := getRBDVolumeOptions(req.GetParameters(), disableMultiWriter)
// if it's MULTI_NODE_MULTI_WRITER and it's BLOCK we'll set the parameter to ignore the in-use checks
rbdVol, err := getRBDVolumeOptions(req.GetParameters(), (isMultiNode && isBlock))
if err != nil {
return nil, status.Error(codes.InvalidArgument, err.Error())
}
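
The capability check in this hunk can be exercised in isolation. The following is a minimal sketch, not the driver's code: `AccessMode`, `Capability`, and `disableInUseChecks` are hypothetical stand-ins for the CSI types and the inline logic above, with enum values taken from the `AccessMode` listing quoted in the review thread.

```go
package main

import (
	"errors"
	"fmt"
)

// AccessMode mirrors the CSI access-mode values quoted in the review thread.
type AccessMode int

const (
	Unknown AccessMode = iota
	SingleNodeWriter
	SingleNodeReaderOnly
	MultiNodeReaderOnly
	MultiNodeSingleWriter
	MultiNodeMultiWriter
)

// Capability is a hypothetical, simplified stand-in for csi.VolumeCapability.
type Capability struct {
	Mode  AccessMode
	Block bool // true when the access type is block rather than mount
}

var errMultiNodeNeedsBlock = errors.New(
	"multi node access modes are only supported on rbd `block` type volumes")

// disableInUseChecks reports whether the watcher (in-use) check may be
// skipped. It fails early when multi-writer access is requested without
// a block access type.
func disableInUseChecks(caps []Capability) (bool, error) {
	isMultiNode, isBlock := false, false
	for _, c := range caps {
		if c.Mode == MultiNodeMultiWriter {
			isMultiNode = true
		}
		if c.Block {
			isBlock = true
		}
	}
	if isMultiNode && !isBlock {
		return false, errMultiNodeNeedsBlock
	}
	return isMultiNode && isBlock, nil
}

func main() {
	requests := []struct {
		name string
		caps []Capability
	}{
		{"RWX block", []Capability{{Mode: MultiNodeMultiWriter, Block: true}}},
		{"RWX filesystem", []Capability{{Mode: MultiNodeMultiWriter, Block: false}}},
		{"RWO block", []Capability{{Mode: SingleNodeWriter, Block: true}}},
		{"ROX filesystem", []Capability{{Mode: MultiNodeReaderOnly, Block: false}}},
	}
	for _, r := range requests {
		skip, err := disableInUseChecks(r.caps)
		fmt.Printf("%-15s skip=%v err=%v\n", r.name, skip, err)
	}
}
```

Because the sketch tests for `MultiNodeMultiWriter` explicitly rather than for "anything other than `SingleNodeWriter`", the remaining multi-node modes pass through with in-use checks still enabled, which is the narrower behavior agreed on in the review thread.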
@@ -344,20 +352,11 @@ func (cs *ControllerServer) ListVolumes(ctx context.Context, req *csi.ListVolume
// ValidateVolumeCapabilities checks whether the volume capabilities requested
// are supported.
func (cs *ControllerServer) ValidateVolumeCapabilities(ctx context.Context, req *csi.ValidateVolumeCapabilitiesRequest) (*csi.ValidateVolumeCapabilitiesResponse, error) {
params := req.GetParameters()
multiWriter := params["multiNodeWritable"]
if strings.ToLower(multiWriter) == "enabled" {
klog.V(3).Info("detected multiNodeWritable parameter in Storage Class, allowing multi-node access modes")

} else {
for _, cap := range req.VolumeCapabilities {
if cap.GetAccessMode().GetMode() != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
return &csi.ValidateVolumeCapabilitiesResponse{Message: ""}, nil
}
for _, cap := range req.VolumeCapabilities {
if cap.GetAccessMode().GetMode() != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
return &csi.ValidateVolumeCapabilitiesResponse{Message: ""}, nil
}

}

return &csi.ValidateVolumeCapabilitiesResponse{
Confirmed: &csi.ValidateVolumeCapabilitiesResponse_Confirmed{
VolumeCapabilities: req.VolumeCapabilities,
17 changes: 10 additions & 7 deletions pkg/rbd/nodeserver.go
@@ -48,6 +48,7 @@ type NodeServer struct {
func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
targetPath := req.GetTargetPath()
targetPathMutex.LockKey(targetPath)
disableInUseChecks := false

defer func() {
if err := targetPathMutex.UnlockKey(targetPath); err != nil {
@@ -71,18 +72,20 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
return &csi.NodePublishVolumeResponse{}, nil
}

// if our access mode is a simple SINGLE_NODE_WRITER we're going to ignore the SC directive and use the
// watcher still
ignoreMultiWriterEnabled := true
if req.VolumeCapability.AccessMode.Mode != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
ignoreMultiWriterEnabled = false
// MULTI_NODE_MULTI_WRITER is supported by default for Block access type volumes
if req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER {
if isBlock {
disableInUseChecks = true
} else {
klog.Warningf("MULTI_NODE_MULTI_WRITER currently only supported with volumes of access type `block`, invalid AccessMode for volume: %v", req.GetVolumeId())
return nil, status.Error(codes.InvalidArgument, "rbd: RWX access mode request is only valid for volumes with access type `block`")
}
}

volOptions, err := getRBDVolumeOptions(req.GetVolumeContext(), ignoreMultiWriterEnabled)
volOptions, err := getRBDVolumeOptions(req.GetVolumeContext(), disableInUseChecks)
if err != nil {
return nil, err
}

volOptions.VolName = volName
// Mapping RBD image
devicePath, err := attachRBDImage(volOptions, volOptions.UserID, req.GetSecrets())
8 changes: 5 additions & 3 deletions pkg/rbd/rbd.go
@@ -105,10 +105,12 @@ func (r *Driver) Run(driverName, nodeID, endpoint string, containerized bool, ca
csi.ControllerServiceCapability_RPC_CLONE_VOLUME,
})

// TODO: JDG Should also look at remaining modes like MULT_NODE_READER (SINGLE_READER)
// We only support the multi-writer option when using block, but it's a supported capability for the plugin in general.
// In addition, we want to add the remaining modes like MULTI_NODE_READER_ONLY,
// MULTI_NODE_SINGLE_WRITER etc., but we need to do some verification of RO modes first;
// we will work those as follow-up features.
r.cd.AddVolumeCapabilityAccessModes(
[]csi.VolumeCapability_AccessMode_Mode{
csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
[]csi.VolumeCapability_AccessMode_Mode{csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER})

// Create GRPC servers
9 changes: 3 additions & 6 deletions pkg/rbd/rbd_attach.go
@@ -258,6 +258,7 @@ func attachRBDImage(volOptions *rbdVolume, userID string, credentials map[string
Factor: rbdImageWatcherFactor,
Steps: rbdImageWatcherSteps,
}

err = waitForrbdImage(backoff, volOptions, userID, credentials)

if err != nil {
@@ -313,16 +314,12 @@ func waitForrbdImage(backoff wait.Backoff, volOptions *rbdVolume, userID string,
if err != nil {
return false, fmt.Errorf("fail to check rbd image status with: (%v), rbd output: (%s)", err, rbdOutput)
}
// In the case of multiattach we want to short circuit the retries when used (so r`if used; return used`)
// otherwise we're setting this to false which translates to !ok, which means backoff and try again
// NOTE: we ONLY do this if an multi-node access mode is requested for this volume
if (strings.ToLower(volOptions.MultiNodeWritable) == "enabled") && (used) {
klog.V(2).Info("detected MultiNodeWritable enabled, ignoring watcher in-use result")
if (volOptions.DisableInUseChecks) && (used) {
klog.V(2).Info("valid multi-node attach requested, ignoring watcher in-use result")
return used, nil
}
return !used, nil
})

// return error if rbd image has not become available for the specified timeout
if err == wait.ErrWaitTimeout {
return fmt.Errorf("rbd image %s is still being used", imagePath)
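
The condition passed to the backoff in this hunk returns "done" either when the image has no watchers or when in-use checks are disabled for a valid multi-node attach. The sketch below illustrates that behavior with the standard library only. `exponentialBackoff` is a simplified, hypothetical stand-in for the `wait` package helper the driver uses, and `imageReady` and `waitForImage` are illustrative names, not functions from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errWaitTimeout = errors.New("timed out waiting for the condition")

// conditionFunc returns done=true to stop retrying, or an error to abort.
type conditionFunc func() (done bool, err error)

// exponentialBackoff is a simplified stand-in for the backoff helper:
// it retries cond up to steps times, multiplying the delay by factor.
func exponentialBackoff(initial time.Duration, factor float64, steps int, cond conditionFunc) error {
	delay := initial
	for i := 0; i < steps; i++ {
		done, err := cond()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if i < steps-1 {
			time.Sleep(delay)
			delay = time.Duration(float64(delay) * factor)
		}
	}
	return errWaitTimeout
}

// imageReady mirrors the condition in the diff: an in-use image normally
// means "retry", unless in-use checks are disabled for this volume.
func imageReady(used, disableInUseChecks bool) (bool, error) {
	if disableInUseChecks && used {
		return used, nil // short-circuit: treat the in-use image as ready
	}
	return !used, nil
}

// waitForImage retries until the image is ready or the backoff is exhausted.
// The used flag is fixed here for illustration; the driver re-queries it.
func waitForImage(used, disableInUseChecks bool) error {
	err := exponentialBackoff(time.Millisecond, 2.0, 3, func() (bool, error) {
		return imageReady(used, disableInUseChecks)
	})
	if errors.Is(err, errWaitTimeout) {
		return errors.New("rbd image is still being used")
	}
	return err
}

func main() {
	fmt.Println("in use, checks enabled: ", waitForImage(true, false))
	fmt.Println("in use, checks disabled:", waitForImage(true, true))
	fmt.Println("not in use:             ", waitForImage(false, false))
}
```

Only the first call exhausts its retries and reports the image as still in use; the other two return on the first attempt.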