
Default multiwrite blockmode #261

Merged
merged 3 commits
Mar 20, 2019

Conversation

j-griffith
Contributor

Rework multi-node-multi-writer feature

This commit reverts the initial implementation of the
multi-node-multi-writer feature:
commit: b5b8e46

It replaces that implementation with a more restrictive version that
only allows multi-node-multi-writer for volumes of type block

With this change there are no volume parameters required in the storage
class. We also fail any attempt to create a file-based device with
multi-node-multi-writer specified; this way a user doesn't have to
wait until they try the publish before realizing it doesn't work.

@j-griffith
Contributor Author

Need to implement rpc status codes in the return for the case of users trying to provision a file based rbd volume with multi-node-multi-writer set

@humblec
Collaborator

humblec commented Mar 14, 2019

@j-griffith Do we have a test case for this PR ?

Collaborator

@Madhu-1 Madhu-1 left a comment


@j-griffith as @humblec suggested, it would be great to have a test case for this one; unfortunately we don't have an e2e test framework. It would be great if you could give some pointers on how to test this feature (+ve and -ve scenarios) so that we can use them to write test cases later.

@Madhu-1 Madhu-1 added the DNM DO NOT MERGE label Mar 14, 2019
@j-griffith
Contributor Author

j-griffith commented Mar 14, 2019

@j-griffith as @humblec suggested, it would be great to have a test case for this one; unfortunately we don't have an e2e test framework. It would be great if you could give some pointers on how to test this feature (+ve and -ve scenarios) so that we can use them to write test cases later.

@Madhu-1 Yeah, e2e testing could be as simple as create a block mode pvc, and attach it to a pod, ensure that it's mounted at the provided device path (eg: 6a3af1f#diff-1c0f522a61adc59307209c8e0296db49R118). Feel free to ping me if you want to chat, if we get an infrastructure in place the test itself is fairly easy.
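As a sketch of that positive scenario, the pod side could look like this (pod and claim names are illustrative, not from this PR; the device path matches the doc):

```yaml
# Hypothetical pod consuming a block-mode PVC; with volumeMode: Block the
# kubelet hands the raw device to the container via volumeDevices instead
# of mkfs-ing and mounting it.
apiVersion: v1
kind: Pod
metadata:
  name: rbd-block-rwx-pod
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeDevices:
        - name: data
          devicePath: /dev/rdbblock
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: rbd-block-rwx-pvc
```

The negative scenario would be the same claim with `volumeMode: Filesystem`, which after this patch should be rejected at create time.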

@humblec
Collaborator

humblec commented Mar 14, 2019

@Madhu-1 Yeah, e2e testing could be as simple as create a block mode pvc, and attach it to a pod, ensure that it's mounted at the provided device path. Feel free to ping me if you want to chat, if we get an infrastructure in place the test itself is fairly easy.

IIUC, we already tried this and it was working even without this patch. I think that's the basis for the test case question :). @j-griffith it seems the main concern should be data consistency, isn't it?

@j-griffith
Contributor Author

IIUC, we already tried this and it was working even without this patch :). I think that's the basis for this question :). @j-griffith I think the main point is about data consistency, isn't it?

@humblec Sorry, not sure I'm following; this did work when we merged b5b8e46 and that was cool; but consistency was a concern for file based mode (for good reason). So this is basically a revert of the file based option, and only allows RWX with Block mode. Consistency checks would have to be done with something like dd, or direct IO calls.

@humblec
Collaborator

humblec commented Mar 14, 2019

but consistency was a concern for file based mode (for good reason). So this is basically a revert of the file based option, and only allows RWX with Block mode

@j-griffith IMO, file-based PVCs should also have support for RWX, shouldn't they? If there are consistency issues with file mode, I think we have to track them down without delay.

PS: looks like we have some disconnect about this PR.. let's discuss it first :P

@j-griffith
Contributor Author

but consistency was a concern for file based mode (for good reason). So this is basically a revert of the file based option, and only allows RWX with Block mode

@j-griffith IMO, file-based PVCs should also have support for RWX, shouldn't they? If there are consistency issues with file mode, I think we have to track them down without delay.

It was my opinion that RWX was reasonable for file mode as well (to be clear: rbd block with k8s file mode). The concern raised was that since Kubernetes and the CRI in use will mkfs and mount these on the node, typically with ext4 or xfs, the volume is not going to stay consistent if two nodes try to use it at the same time.

There are valid use-cases IMO, but they are limited, and of course they're dependent on very good application coordination.

PS: looks like we have some disconnect about this PR.. let's discuss it first :P

For sure, it's great to flesh things out and make sure we get what we want long term here :)

@j-griffith
Contributor Author

Need to implement rpc status codes in the return for the case of users trying to provision a file based rbd volume with multi-node-multi-writer set

Fixed the rpc response. As far as tests go: given we don't have a framework in place, perhaps that should be an independent item?

@j-griffith
Contributor Author

@Madhu-1 thoughts and the DNM label?

@Madhu-1 Madhu-1 removed the DNM DO NOT MERGE label Mar 18, 2019
Collaborator

@Madhu-1 Madhu-1 left a comment


changes look good, I have a few comments.

Makefile Outdated
@@ -31,7 +31,7 @@ go-test:
./scripts/test-go.sh

static-check:
./scripts/lint-go.sh
./scripts/lint-go.sh
Collaborator


need to remove extra space

Contributor Author


got it

```

Create a POD that uses this PVC:

```yaml
```
Collaborator


I think adding yaml results in a good rendering in markdown files

Contributor Author


Good point adding the tag to all the yaml items I added, thanks!

userid: admin
fsType: xfs
multiNodeWritable: "enabled"
reclaimPolicy: Delete
```
Collaborator


can we add yaml here?

Contributor Author


done

Wait for the POD to enter Running state, write some data to
`/var/lib/www/html`
Wait for the POD to enter Running state, check that our block device
is available in the container at `/dev/rdbblock`
Collaborator


can we add an example here how to check the block device and a sample output?

Contributor Author


Good idea, added!
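For reference, a hedged sketch of what such a check could look like (the pod name is hypothetical; `/dev/rdbblock` is the device path used in the doc):

```shell
# Needs a running cluster and pod; not runnable standalone.
# A leading 'b' in the mode string confirms a block device node.
kubectl exec -it csirbd-demo-pod -- ls -l /dev/rdbblock
```

Typical output would start with something like `brw-rw----`, marking the device node as block type rather than a regular file or mounted filesystem.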

resources:
requests:
storage: 1Gi
storageClassName: csi-rbd
```
Collaborator


don't we require storage class name in PVC template?


yeah, probably make this match https://github.com/ceph/ceph-csi/blob/csi-v1.0/examples/rbd/raw-block-pvc.yaml except s/ReadWriteOnce/ReadWriteMany/ ?

Contributor Author


Not required if you set it to be the default SC (I should've had a note in there pointing out the doc assumes that). I'll put the SC in there explicitly though so it's "safe" to copy/paste :)
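A sketch of that "safe to copy/paste" version, modeled on the referenced raw-block PVC example with the access mode swapped to ReadWriteMany (the class name matches the doc's `csi-rbd`, but is an assumption for any given cluster):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: raw-block-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Block
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-rbd
```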

disableInUseChecks = true
} else {
klog.Warningf("MULTI_NODE_MULTI_WRITER currently only supported with volumes of access type `block`, invalid AccessMode for volume: %v", req.GetVolumeId())
return nil, fmt.Errorf("rbd: MULTI_NODE_MULTI_WRITER access mode only allowed with BLOCK access type")
Collaborator


we need to return status.Error with codes.

Contributor Author


Ahh, good catch

Volumes that are of access_type `block`

*WARNING* This feature is strictly for workloads that know how to deal
with concurrent acces to the Volume (eg Active/Passive applications).


nit: access

is available in the container at `/dev/rdbblock`

Now, we can create a second POD (ensure the POD is scheduled on a different
node; multiwriter single node works without this feature) that also uses this


Just curious, is there a way to use some kind of anti-affinity to ensure that the second POD is scheduled on a different node?

Contributor Author


Yes, you could use anti-affinity or node-selector. See Kube docs here: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
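As a minimal sketch (the label and topology key are illustrative, not from this PR), pod anti-affinity for the second POD could look like:

```yaml
# Hypothetical fragment for the second pod's spec: keep it off any node
# that is already running a pod labeled app=rbd-writer.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: rbd-writer
        topologyKey: kubernetes.io/hostname
```

This assumes the first pod carries the `app: rbd-writer` label; a plain `nodeSelector` pinning each pod to a named node would work just as well for a manual test.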


Thanks!

isMultiNode := false
isBlock := false
for _, cap := range req.VolumeCapabilities {
// Only checking SINGLE_NODE_SINGLE_WRITER here because regardless of the other types (MULTI READER) we need to implement the same logic to ignore the in-use response


This will work with today's advertised capabilities but it is fragile if future capabilities are added in k8s. SINGLE_NODE_READER_ONLY is not MultiNode. MULTI_NODE_READER_ONLY & MULTI_NODE_SINGLE_WRITER may be safe because there is no concurrent writing going on.

Would you consider testing explicitly for MULTI_NODE_MULTI_WRITER doing the fail early at line 108ff if that is true ?

Contributor Author


The thing is it's not a check of whether it's "safe" or not, it's a check of whether K8s and the plugin will let you even request the mode. I'm not sure I follow the "fragile" bit? We're explicitly supporting a single capability here, if other capabilities are added to K8s or if we choose to support additional capabilities in the plugin we should do that explicitly.

By the way, one of the reasons I thought about this and changed it was because you raised concerns about not being explicit and I was using an "any multi-node" logic; so I'm sort of confused. Let me know if I'm missing something; I might just not be seeing something.


I realize I didn't express my concern well and that what bothers me here is the name of the flag is_multinode. One of the access_modes that isn't SINGLE_MODE_SINGLE_WRITER is SINGLE_NODE_SINGLE_READER so it seems wrong to say is_multinode would be true. In fact SINGLE_NODE_SINGLE_READER should be safe w/o block mode too, right?


@tombarron tombarron left a comment


Hey John -- a couple remarks inline for your consideration -- overall looks good.

This commit reverts the initial implementation of the
multi-node-multi-writer feature:
  commit: b5b8e46

It replaces that implementation with a more restrictive version that
only allows multi-node-multi-writer for volumes of type `block`

With this change there are no volume parameters required in the storage
class. We also fail any attempt to create a file-based device with
multi-node-multi-writer specified; this way a user doesn't have to
wait until they try the publish before realizing it doesn't work.
for _, cap := range req.VolumeCapabilities {
// Only checking SINGLE_NODE_SINGLE_WRITER here because regardless of the other types (MULTI READER) we need to implement the same logic to ignore the in-use response
if cap.GetAccessMode().GetMode() != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
isMultiNode = true


So what about setting isMultiNode to true if the access mode is neither SINGLE_MODE_WRITER or SINGLE_MODE_READER_ONLY ?

Contributor Author

@j-griffith j-griffith Mar 18, 2019


It's "NODE", not "MODE". If the mode is neither SINGLE_NODE_xxx then by elimination it's MULTI_NODE_xxx.

I'm not exactly sure what you're looking for here, but I think the safest answer is to just make this explicitly for MULTI_NODE_MULTI_WRITER option only, and then I can handle other cases involving RO and implementing the RO option separately. Trying to infer the fix or work around the issues with the single-node-multi cases doesn't belong in this PR I don't think, and I'd prefer to see how those things end up sorting out upstream. Does that work?

Note that the valid options are:

message AccessMode {
  enum Mode {
    UNKNOWN = 0;

    // Can only be published once as read/write on a single node, at
    // any given time.
    SINGLE_NODE_WRITER = 1;

    // Can only be published once as readonly on a single node, at
    // any given time.
    SINGLE_NODE_READER_ONLY = 2;

    // Can be published as readonly at multiple nodes simultaneously.
    MULTI_NODE_READER_ONLY = 3;

    // Can be published at multiple nodes simultaneously. Only one of
    // the node can be used as read/write. The rest will be readonly.
    MULTI_NODE_SINGLE_WRITER = 4;

    // Can be published as read/write at multiple nodes
    // simultaneously.
    MULTI_NODE_MULTI_WRITER = 5;
  }
}


It's "NODE", not "MODE". If the mode is neither SINGLE_NODE_xxx then by elimination it's MULTI_NODE_xxx.

I think the safest answer is to just make this explicitly for MULTI_NODE_MULTI_WRITER option only, and then I can handle other cases involving RO and implementing the RO option separately. Trying to infer the fix or work around the issues with the single-node-multi cases doesn't belong in this PR I don't think, and I'd prefer to see how those things end up sorting out upstream. Does that work?

Yes, and thanks.

Contributor Author


yessir


oh, sorry to prolong this, but don't you want to change the sense of the comparison as well?

if cap.GetAccessMode().GetMode() == csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER {
      isMultiNode = true

Contributor Author


Gahhh, yes!
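Putting the threads above together, a minimal self-contained sketch of the agreed check (local stand-in types mirror the CSI enum quoted earlier; the real driver would use the csi spec package and return `status.Error(codes.InvalidArgument, ...)` instead of a plain error):

```go
package main

import (
	"errors"
	"fmt"
)

// AccessMode stands in for the CSI VolumeCapability access-mode enum
// quoted earlier in this thread.
type AccessMode int

const (
	SINGLE_NODE_WRITER AccessMode = iota + 1
	SINGLE_NODE_READER_ONLY
	MULTI_NODE_READER_ONLY
	MULTI_NODE_SINGLE_WRITER
	MULTI_NODE_MULTI_WRITER
)

// validateAccess sketches the final shape of the check: only an explicit
// MULTI_NODE_MULTI_WRITER request sets the flag, and that combination is
// rejected at create time unless the volume is block mode.
func validateAccess(modes []AccessMode, isBlock bool) (bool, error) {
	isMultiWriter := false
	for _, m := range modes {
		if m == MULTI_NODE_MULTI_WRITER {
			isMultiWriter = true
		}
	}
	if isMultiWriter && !isBlock {
		// In the real driver: status.Error(codes.InvalidArgument, ...).
		return false, errors.New("rbd: MULTI_NODE_MULTI_WRITER access mode only allowed with BLOCK access type")
	}
	return isMultiWriter, nil
}

func main() {
	ok, err := validateAccess([]AccessMode{MULTI_NODE_MULTI_WRITER}, true)
	fmt.Println(ok, err)
	_, err = validateAccess([]AccessMode{MULTI_NODE_MULTI_WRITER}, false)
	fmt.Println(err)
}
```

This keeps single-node and read-only multi-node modes on the existing path, so adding support for them later stays an explicit, separate decision.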

} else {
klog.Warningf("MULTI_NODE_MULTI_WRITER currently only supported with volumes of access type `block`, invalid AccessMode for volume: %v", req.GetVolumeId())
e := fmt.Errorf("rbd: MULTI_NODE_MULTI_WRITER access mode only allowed with BLOCK access type")
return nil, status.Error(codes.InvalidArgument, e.Error())
Collaborator


we can just return status.Error(codes.InvalidArgument, "rbd: MULTI_NODE_MULTI_WRITER access mode only allowed with BLOCK access type") instead of creating a new variable e

Contributor Author


sure, changed

@Madhu-1
Collaborator

Madhu-1 commented Mar 19, 2019

@tombarron do you have any other comments on this one?

@tombarron

@tombarron do you have any other comments on this one?

No, I'm OK with this PR now and thanks John for working on this!

@rootfs
Member

rootfs commented Mar 19, 2019

thanks @j-griffith @tombarron

@rootfs rootfs merged commit fa54e5c into ceph:csi-v1.0 Mar 20, 2019
wilmardo pushed a commit to wilmardo/ceph-csi that referenced this pull request Jul 29, 2019
nixpanic pushed a commit to nixpanic/ceph-csi that referenced this pull request Mar 4, 2024
Syncing latest changes from upstream devel for ceph-csi