spec: Clarify calling multiple NodePublishVolume #150
Conversation
LGTM
I agree with the semantics of the proposed change. I'm a bit concerned that the description for this RPC is difficult to unpack for a plugin writer coming to the spec with fresh eyes: there's a lot here to digest. Furthermore, if a plugin doesn't support MULTI_NODE_xxx but a CO attempts to invoke this RPC in a way that implies MULTI_NODE_xxx, what error code should be returned?

> This RPC is called by the CO when a workload that wants to use the specified volume is placed (scheduled) on a node.
> ...
> If this RPC failed, or the CO does not know if it failed or not, it MAY choose to call `NodePublishVolume` again, or choose to call `NodeUnpublishVolume`.
Scenario 1: plugin supports MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ the same target_path's.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK (idempotent)
Scenario 2: plugin supports MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ different target_path's.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/2, credentials=nil
** OK
Scenario 3: plugin supports MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ the same target_path's and different credentials.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials={foo:bar}
** OK
Scenario 4: plugin supports MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ different target_path's and different credentials.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/2, credentials={foo:bar}
** OK
Scenario 5: plugin does not support MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ the same target_path's.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK (idempotent)
Scenario 6: plugin does not support MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ different target_path's.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/2, credentials=nil
** ERROR
Scenario 7: plugin does not support MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ the same target_path's and different credentials.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials={foo:bar}
** ERROR
Scenario 8: plugin does not support MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ different target_path's and different credentials.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/2, credentials={foo:bar}
** ERROR
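A minimal sketch (assuming a Go plugin and the gRPC `status`/`codes` packages; the `publishState` bookkeeping is hypothetical, not part of the spec) of how a node server might enforce the scenario table above. Note that a later comment from @jdef argues scenario 3 should probably be an ERROR as well:

```go
package nodeserver

import (
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// publishState is hypothetical bookkeeping: per volume, the target paths it
// is published at and the credentials supplied for each publish call.
type publishState struct {
	mu        sync.Mutex
	published map[string]map[string]string // volume_id -> target_path -> creds
}

// nodePublish applies the scenario table: identical repeat calls are
// idempotent; new target paths or changed credentials are accepted only
// when the volume has MULTI_NODE capability.
func (s *publishState) nodePublish(volumeID, targetPath, creds string, multiNode bool) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	if s.published == nil {
		s.published = map[string]map[string]string{}
	}
	targets := s.published[volumeID]
	if targets == nil {
		targets = map[string]string{}
		s.published[volumeID] = targets
	}

	if prev, ok := targets[targetPath]; ok {
		if prev == creds || multiNode {
			return nil // scenarios 1, 3 and 5: idempotent or MULTI_NODE OK
		}
		// scenario 7: same target_path, new credentials, no MULTI_NODE
		return status.Error(codes.Aborted, "already published with different credentials")
	}
	if len(targets) > 0 && !multiNode {
		// scenarios 6 and 8: a second target_path requires MULTI_NODE
		return status.Error(codes.Aborted, "volume does not have MULTI_NODE capability")
	}
	// first call for this target_path (scenarios 2 and 4): mount, then record
	targets[targetPath] = creds
	return nil
}
```

The `ABORTED` code here mirrors the error-table row proposed later in this PR.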
@jdef updated.
@jieyu - What if the user would like to mount the volume to two different paths on the same node? A storage system may support that scenario without supporting multiple nodes.
@jieyu you're talking about scenario 3 from my earlier comment? I think I was moving too fast: since it's the same target path, the second call should probably generate an ERROR.
Thanks, this is clearer. In light of these changes it probably makes sense to update the documentation for the `AccessMode.Mode` enumerated types: the SINGLE_xxx types are actually more restrictive than their documentation indicates.
spec.md
Outdated
@@ -1070,6 +1079,7 @@ Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Volume does not exist | 5 NOT_FOUND | Indicates that a volume corresponding to the specified `volume_id` does not exist. | Caller MUST verify that the `volume_id` is correct and that the volume is accessible and has not been deleted before retrying with exponential back off. |
| Operation pending for volume | 9 FAILED_PRECONDITION | Indicates that there is already an operation pending for the specified volume. In general the Cluster Orchestrator (CO) is responsible for ensuring that there is no more than one call "in-flight" per volume at a given time. However, in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The Plugin SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls. | Caller SHOULD ensure that there are no other calls pending for the specified volume, and then retry with exponential back off. |
+| Not supposed to be called | 10 ABORTED | Indicates that the CO is not supposed to call the RPC because the volume does not have MULTI_NODE capability. | Caller MAY retry at a higher level by calling `ValidateVolumeCapabilities` to validate the volume capabilities, or wait for the volume to be unpublished on the node. |
"Not supposed to be called" isn't quite right. Maybe something more like "Exceeds capabilities"?
s/Indicates that the CO is not supposed to call the RPC/Indicates that the CO has exceeded the volume's capabilities/
That makes me wonder whether `RESOURCE_EXHAUSTED` is more appropriate here?
I prefer ABORTED: #150 (comment)
OK
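A hedged CO-side sketch of the recovery behavior described in that table row: only an `ABORTED` publish failure triggers re-validation before a retry. `validateCaps` is a hypothetical wrapper around `ValidateVolumeCapabilities`, not a spec API:

```go
package orchestrator

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// retryAfterAborted reports whether a failed NodePublishVolume call may be
// retried: only ABORTED errors qualify, and only after the volume's
// capabilities have been re-validated by the caller-supplied validateCaps.
func retryAfterAborted(publishErr error, validateCaps func() (bool, error)) (bool, error) {
	if status.Code(publishErr) != codes.Aborted {
		return false, publishErr // some other failure; surface it unchanged
	}
	supported, err := validateCaps()
	if err != nil {
		return false, err
	}
	if !supported {
		return false, publishErr // volume really lacks MULTI_NODE; don't retry
	}
	return true, nil // OK to retry, with exponential back off
}
```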
spec.md
Outdated
If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability, the CO MUST guarantee that this RPC is called after `ControllerPublishVolume` is called for the given volume on the given node and returns a success.

This operation MUST be idempotent.
If this RPC failed, or the CO does not know if it failed or not, it MAY choose to call `NodePublishVolume` again, or choose to call `NodeUnpublishVolume`.

This RPC MAY be called by the CO multiple times on the same node for the same volume with possibly different `target_path` and/or auth credentials if the volume has MULTI_NODE capability (i.e., `volume_capability` is either `MULTI_NODE_READER_ONLY`, `MULTI_NODE_SINGLE_WRITER` or `MULTI_NODE_MULTI_WRITER`). The following table shows what the Plugin SHOULD return when receiving a second `NodePublishVolume` on the same volume on the same node:
nit: there are two sentences here --> should be two separate lines
Fixed
spec.md
Outdated
If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability, the CO MUST guarantee that this RPC is called after `ControllerPublishVolume` is called for the given volume on the given node and returns a success.

This operation MUST be idempotent.
If this RPC failed, or the CO does not know if it failed or not, it MAY choose to call `NodePublishVolume` again, or choose to call `NodeUnpublishVolume`.

This RPC MAY be called by the CO multiple times on the same node for the same volume with possibly different `target_path` and/or auth credentials if the volume has MULTI_NODE capability (i.e., `volume_capability` is either `MULTI_NODE_READER_ONLY`, `MULTI_NODE_SINGLE_WRITER` or `MULTI_NODE_MULTI_WRITER`). The following table shows what the Plugin SHOULD return when receiving a second `NodePublishVolume` on the same volume on the same node:
s/`volume_capability`/`access_mode`/
Fixed!
did you forget to push?
@amaliaAvraham As I mentioned in the description of this PR, if a volume cannot be published on multiple nodes, it's weird that we allow it to be published multiple times on a single node. In other words, it's weird that we allow two workloads to share a volume if they are on the same node, but have no way to do that if they are on different nodes. I think it's always good to start with a stricter requirement. If we later have a use case that requires lifting this requirement, we can always do that.
I'm pretty happy w/ ABORTED because the CO is violating capabilities that it should have been aware of, vs consuming some resource that the node just happened to run out of.
@jdef can you take a look again? I addressed your comments.
LGTM once `make` is re-run. Please squash prior to merge.
Clarifies that `NodePublishVolume` can only be called multiple times on a node for a volume if that volume can be published on multiple nodes (i.e., has MULTI_NODE capability). If a volume cannot be published on multiple nodes, it's weird that we allow it to be published multiple times on a single node. In other words, it's weird that we allow two workloads to share a volume if they are on the same node, but don't have a way to do that if they are on different nodes.
@jieyu Let me add some practical color to this. An AWS EBS volume cannot be shared among hosts, so with this change you're saying that multiple containers would never be able to share a volume with EBS. Why is this restrictive language useful?
Apologies for chiming in much too late, but I am very surprised by the logic of not having a volume be published to multiple target paths unless it has MULTI_NODE capability.

If you wanted to have two containers on the same node sharing a volume, wouldn't you want those published to two different target paths? One for each workload? This is the assumption I have been working under. Or even better: two containers sharing the volume, but one is read-only?
Hi @codenrhoden, I think the issue is really just one of naming. I believe things make more sense when I alter the existing enum names and descriptions:

```protobuf
// Can only be published once as read/write, at any given time.
SINGLE_WRITER = 1;
// Can only be published once as readonly, at any given time.
SINGLE_READER_ONLY = 2;
// Can be published as readonly multiple times simultaneously.
MULTI_READER_ONLY = 3;
// Can be published multiple times. Only one publication can be used
// as read/write. The rest will be readonly.
MULTI_SINGLE_WRITER = 4;
// Can be published as read/write multiple times.
MULTI_WRITER = 5;
```

In fact, I propose we rename the enums to match the above, updated names and documentation. The above changes more accurately reflect your use case @codenrhoden and still maintain the intent of the original names and descriptions.
This totally depends on the CO's implementation. Some CO might decide to just call

Take @clintkitson's EBS example: if you have two workloads (e.g., pods) that both want to use the same EBS volume, this is only possible if those two workloads are placed together on one node. As a result, you create an implicit constraint on the scheduling of those two workloads. From the CO's perspective, it'll be much simpler if there is no such implicit scheduling constraint. And the CO can perform validation early: if the volume does not support a multi-node access mode, don't allow multiple workloads to share it.
How about just expanding the enum into another dimension to cover that aspect? So you'd have a workload enum and, separately, a node enum? The combination of these enums should enable an orchestrator to know what it can and can't do with a volume.
As an example, due to AWS's limitation it would be limited to SINGLE_NODE_SINGLE_WRITER. Then it would have any of the 5 enums to pursue on that node, and would also know that it would need to schedule other workloads to this same node to share the volume?
So, my misunderstanding here is I took

So, I agree with this, except that when I interpreted the meaning of

With that misunderstanding cleared up, I do suggest that I will not be the only one to make this mistake, and perhaps there is clarification we can do around those access modes. I've been working with this spec for months now, and I've been misinterpreting those meanings from day 1. If I interpret the fields more in line with what @akutz laid out, it makes sense.

Would it be fair then to say that the CSI spec does not allow for any volumes to be attached to multiple nodes at the same time? That the access modes only apply to usage on a single node?
The current comments in the spec outline the following:
Note they all reference nodes, not targets. I still think the comments and the intent are completely at odds with each other. @jieyu's logic seems to interpret "nodes" as targets. I do need some clarification on how this is supposed to work. Does CSI intend to support volumes that can be mounted to multiple nodes, as in different physical/virtual servers, at the same time? Here is how I have been interpreting a simplified workflow...

I think this workflow works fine, but recognize that it is based on what can be mounted to nodes, and that multiple target paths are always fine. A CO may not want to use them, but prohibiting them outright seems silly to me.
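To make that workflow concrete, here is a minimal sketch, assuming Linux and `golang.org/x/sys/unix`; the paths, filesystem type, and helper name are illustrative assumptions, not anything mandated by the spec:

```go
package sketch

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// stageAndPublish mounts the device once under a plugin-private directory,
// then bind-mounts that directory into each workload's target_path, so many
// targets on one node share a single real mount.
func stageAndPublish(devicePath, privateDir string, targetPaths []string) error {
	if err := os.MkdirAll(privateDir, 0o750); err != nil {
		return err
	}
	// One real filesystem mount, owned by the plugin (ext4 is an assumption).
	if err := unix.Mount(devicePath, privateDir, "ext4", 0, ""); err != nil {
		return fmt.Errorf("staging mount: %w", err)
	}
	// Any number of bind mounts on the same node, one per workload.
	for _, tp := range targetPaths {
		if err := os.MkdirAll(tp, 0o750); err != nil {
			return err
		}
		if err := unix.Mount(privateDir, tp, "", unix.MS_BIND, ""); err != nil {
			return fmt.Errorf("bind mount %s: %w", tp, err)
		}
	}
	return nil
}
```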
@codenrhoden I think you misinterpreted my comments. The access modes currently reference nodes. If it's
The current spec disallows this. If the volume access mode is
I guess that is what I am protesting. But it is ultimately up to the COs. Have all COs said this is the paradigm they will follow? We have done a lot of work to handle multiple target paths gracefully, through the use of private mount areas and bind mounting, and none of that work is necessary if that capability will never be used. Earlier when you told @amaliaAvraham:

> it's weird that we allow two workloads to share a volume if they are on the same node, but don't have a way to do that if they are on different nodes

I don't think that's weird at all. One can be done with a regular filesystem (ext4, xfs); the other requires different technology altogether. Again, I understand it is ultimately how the COs use the spec, so I need to just adhere to it. I'm just surprised, and mostly disappointed, that this was not clear. I must have gotten incorrect information early on and just clung to it.
It's weird because it poses an implicit scheduling constraint on the workloads. The orchestrator will need to make sure those two workloads are placed together. If that's the case, why not just put them into the same pod, so that they are guaranteed to be able to share a volume on a node?
I'm sorry @jieyu, but this doesn't make any sense. First, I would like you and @saad-ali to please respond to my above comment. Secondly, if there was no intention of having a private mount area for volumes to be published to multiple target paths (workloads) on a single node, then why on earth did @saad-ali suggest
@akutz I think all the confusion comes from the fact that we don't have a way to express whether a volume can be published multiple times on a node. The current access mode captures whether a volume can be published to multiple nodes. I do believe that we probably need both. I liked @clintkitson's direction of separating these two:

(This specifies whether a volume can be published to multiple nodes or not)

and

(This specifies whether a volume can be published to multiple targets on a given node or not)

And we have a max of 5*5 combinations. Some probably do not make sense. For instance:
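A hypothetical Go sketch of that two-axis idea (the exact enum lists are elided above, so these names are invented for illustration only):

```go
package sketch

// NodeMode is the node axis: how many nodes a volume can be attached to.
type NodeMode int

const (
	SingleNode NodeMode = iota // attachable to one node at a time
	MultiNode                  // attachable to many nodes simultaneously
)

// TargetMode is the per-node axis: how the volume can be published on a node.
type TargetMode int

const (
	SingleWriter      TargetMode = iota // one rw publication on the node
	SingleReaderOnly                    // one ro publication on the node
	MultiReaderOnly                     // many ro publications on the node
	MultiSingleWriter                   // many publications, exactly one rw
	MultiWriter                         // many rw publications on the node
)

// A CO could validate a (NodeMode, TargetMode) pair up front and reject the
// combinations that, as noted above, probably do not make sense.
```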
Hi @jieyu, Doesn't my suggested approach from above handle that?

```protobuf
// Can only be published once as read/write, at any given time.
SINGLE_WRITER = 1;
// Can only be published once as readonly, at any given time.
SINGLE_READER_ONLY = 2;
// Can be published as readonly multiple times simultaneously.
MULTI_READER_ONLY = 3;
// Can be published multiple times. Only one publication can be used
// as read/write. The rest will be readonly.
MULTI_SINGLE_WRITER = 4;
// Can be published as read/write multiple times.
MULTI_WRITER = 5;
```

The difference between
@akutz OK, IIUC, what you are suggesting is that the CO should infer from the access mode whether a volume can be published to multiple nodes or not.

Can you clarify what
Hi @jieyu,
OK, so the following? I am fine with that if it captures all potential use cases.
I think there is a missing use case: having a volume published to a single node, but with multiple rw targets. There is nothing wrong with that. The application needs to deal with file locking, but the filesystem is fine. There is no technical barrier to it. I should be able to give the same volume to many containers on the same node as rw if I choose to, and that doesn't require a clustered filesystem. That's correct, right @cduchesne?
Hi @codenrhoden, While you can technically do that, I don't know if you ever should. That's essentially NFS you're describing, and so if the underlying FS doesn't know how to handle shared access you're kind of hosed.
@akutz @codenrhoden - this is the way Docker, Kubernetes, and Mesos operate today. If the same volume is mapped to more than one container, it works just fine, but it is up to the application to deal with file locking. If you read the documentation for Kubernetes PV Access Modes, it specifically states that ReadWriteOnce doesn't allow a volume to be mapped to multiple nodes. This doesn't stop a volume from being mapped to 100 pods on the same node, though. Multiple pods accessing a volume is a perfectly valid scenario, such as some helper pod that is created/scheduled with a constraint to land it on the same node as the already-running pod. @akutz this isn't like NFS, because the local host OS is managing the primary mount of the block device. In essence the mounted filesystem is being shared to other pods on the same host, like NFS, because it can be.
Hi @cduchesne,
Oh, yeah, doh! Brain fart :)
For me, it is important that SINGLE_NODE_WRITER allows multiple WRITERS on the same NODE, because that's what it implies: the volume can be mounted to a single node and is writable.
For me, I don't know that there needs to be new definitions, just clarification. This all became crazy once we realized there was an explicit disallowing of multiple targets. I'd say, if a CO doesn't want to use multiple targets (and is therefore doing all the refcounting itself) then it shouldn't, but it shouldn't be blocked. I see the definitions as the following:

SINGLE_NODE_WRITER - Can be attached on a single node, and has rw privileges. Can be published multiple times to targets on that node, as either rw or ro (controlled by the `readonly` field).

SINGLE_NODE_READER_ONLY - Can be attached on a single node. Mount must be ro. Can be published multiple times to targets, all as ro. If a

All "attach to one node at a time" use cases are covered by the above.

MULTI_NODE_READER_ONLY - Can be attached to multiple nodes at the same time, but all mounts have to be ro. Also can be published to multiple targets (or just one), and they have to be readonly. Storage providers like ScaleIO can do this - where they can attach a block device to multiple nodes, have a non-clustered filesystem, and make sure everything is readonly.

MULTI_NODE_SINGLE_WRITER - Can be attached to multiple nodes, but only one node can have rw privileges (and that node can publish to multiple targets, with any combo of ro and rw). I see this as the trickiest to handle, as it still really requires a clustered filesystem unless you are happy having stale data at your readers. It would also require a node to know if something is attached to another node (often easy to do), but also knowing how it has it mounted (no idea who can do this).

MULTI_NODE_MULTI_WRITER - Everything goes! Can be attached everywhere, with any combination of ro and rw. Definitely requires a clustered filesystem, or NFS running

I honestly thought this was pretty clear, but I was starting with a base assumption of allowing multiple target paths, and that the SPs deal with bind-mounting to those multiple targets. I still think it's fine for a CO to not do that, but why block it?
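For the ro/rw mixes described above, a hedged Linux sketch (same `golang.org/x/sys/unix` assumption as the earlier example): publishing a read-only target from a read-write staging mount is typically a bind mount followed by a read-only remount, because a plain bind mount ignores `MS_RDONLY`:

```go
package sketch

import "golang.org/x/sys/unix"

// publishReadOnly bind-mounts the staged volume at targetPath, then remounts
// the bind read-only; the second step is what actually enforces ro.
func publishReadOnly(privateDir, targetPath string) error {
	if err := unix.Mount(privateDir, targetPath, "", unix.MS_BIND, ""); err != nil {
		return err
	}
	return unix.Mount("", targetPath, "",
		unix.MS_BIND|unix.MS_REMOUNT|unix.MS_RDONLY, "")
}
```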
What @codenrhoden said is exactly how I would interpret these definitions. I too don't know how MULTI_NODE_SINGLE_WRITER is supportable without all agents being aware of each other's mount tables, but I'm okay with that.
This is a special case that both kubelet and the volume driver should support, because users might expect it. One Kubernetes mechanism to deploy pods like this is via pod affinity. However, strictly speaking the CSI spec does not allow this usage mode (see container-storage-interface/spec#150) and there is an on-going debate to enable it (see container-storage-interface/spec#178). Therefore this test gets skipped unless explicitly enabled for a driver. CSI drivers which create a block device for a remote volume in NodePublishVolume fail this test. They have to make the volume available in NodeStageVolume and then in NodePublishVolume merely do a bind mount (as for example in https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/pkg/gce-pd-csi-driver/node.go#L150).
Clarifies that `NodePublishVolume` can only be called multiple times on a node for a volume if that volume can be published on multiple nodes (i.e., has MULTI_NODE capability).

If a volume cannot be published on multiple nodes, it's weird that we allow it to be published multiple times on a single node. In other words, it's weird that we allow two workloads to share a volume if they are on the same node, but don't have a way to do that if they are on different nodes.