
spec: Clarify calling multiple NodePublishVolume #150

Merged

Conversation

@jieyu (Member) commented Nov 17, 2017

Clarifies that NodePublishVolume can only be called multiple times on
a node for a volume if that volume can be published on multiple nodes
(i.e., has MULTI_NODE capability).

If a volume cannot be published on multiple nodes, it's weird that we
allow it to be published multiple times on a single node. In other words,
it's weird that we allow two workloads to share a volume if they are on
the same node, but don't have a way to do that if they are on different
nodes.

@saad-ali (Member):

LGTM

@jdef (Member) commented Nov 17, 2017

I agree with the semantics of the proposed change. I'm a bit concerned that the description for this RPC is difficult to unpack for a plugin writer coming to the spec with fresh eyes: there's a lot here to digest.

Furthermore, if a plugin doesn't support MULTI_NODE_xxx but a CO attempts to invoke this RPC in a way that implies MULTI_NODE_xxx, what error code should be returned?

This RPC is called by the CO when a workload that wants to use the specified volume is placed (scheduled) on a node.
...
If this RPC failed, or the CO does not know if it failed or not, it MAY choose to call `NodePublishVolume` again, or choose to call `NodeUnpublishVolume`.

Scenario 1: plugin supports MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ the same target_path's.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK (idempotent)

Scenario 2: plugin supports MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ different target_path's.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/2, credentials=nil
** OK

Scenario 3: plugin supports MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ the same target_path's and different credentials.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials={foo:bar}
** OK

Scenario 4: plugin supports MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ different target_path's and different credentials.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/2, credentials={foo:bar}
** OK

Scenario 5: plugin does not support MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ the same target_path's.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK (idempotent)

Scenario 6: plugin does not support MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ different target_path's.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/2, credentials=nil
** ERROR

Scenario 7: plugin does not support MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ the same target_path's and different credentials.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials={foo:bar}
** ERROR

Scenario 8: plugin does not support MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ different target_path's and different credentials.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/2, credentials={foo:bar}
** ERROR
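
To make the plugin-side behavior in these scenarios concrete, here is a minimal Go sketch of the bookkeeping a Node Plugin might do before mounting. It is illustrative only: the request struct, the MultiNode flag, and the in-memory map are assumptions for this sketch, not the generated CSI types, and it returns the ABORTED code that this PR settles on for the non-MULTI_NODE case.

package main

import (
	"fmt"
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// nodePublishRequest is a simplified stand-in for the CSI NodePublishVolumeRequest.
type nodePublishRequest struct {
	VolumeID   string
	TargetPath string
	MultiNode  bool // true if the volume's access mode is one of the MULTI_NODE_* modes
}

type nodeServer struct {
	mu        sync.Mutex
	published map[string]map[string]bool // volume_id -> set of target_paths already published
}

// nodePublish mirrors the scenarios above: a repeat call with the same
// target_path is treated as an idempotent success; an additional target_path
// is only accepted when the volume has MULTI_NODE capability.
func (s *nodeServer) nodePublish(req *nodePublishRequest) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	targets := s.published[req.VolumeID]
	if targets == nil {
		targets = map[string]bool{}
		s.published[req.VolumeID] = targets
	}
	if targets[req.TargetPath] {
		return nil // scenarios 1 and 5: idempotent repeat, OK
	}
	if len(targets) > 0 && !req.MultiNode {
		// scenarios 6-8: a second publish on this node without MULTI_NODE capability
		return status.Error(codes.Aborted, "volume does not have MULTI_NODE capability")
	}

	// ... perform the actual mount at req.TargetPath here ...
	targets[req.TargetPath] = true
	return nil // scenarios 2 and 4: additional target_path with MULTI_NODE capability
}

func main() {
	s := &nodeServer{published: map[string]map[string]bool{}}
	fmt.Println(s.nodePublish(&nodePublishRequest{VolumeID: "A", TargetPath: "/plugin/targets/1"})) // <nil>
	fmt.Println(s.nodePublish(&nodePublishRequest{VolumeID: "A", TargetPath: "/plugin/targets/2"})) // Aborted
}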

@jieyu force-pushed the access_mode_node_publish branch from 53e6a70 to 348b8e8 on November 21, 2017 01:06
@jieyu (Member, Author) commented Nov 21, 2017

@jdef updated. The T1=T2, C1!=C2 case (same target_path, different credentials) is interesting. I feel that it's not OK.

@amaliaAvraham commented Nov 21, 2017

@jieyu - What if the user would like to mount the volume to two different paths on the same node? A storage system may support that scenario without supporting multiple nodes.

@jdef (Member) commented Nov 21, 2017

@jieyu you're talking about scenario 3 from my earlier comment? I think I was moving too fast: since it's the same target path the second call should probably generate an ERROR.

@jdef (Member) left a comment

Thanks, this is more clear. In light of these changes it probably makes sense to update the documentation for the AccessMode.Mode enumerated types: the SINGLE_xxx types are actually more restrictive than their documentation indicates.

spec.md Outdated
@@ -1070,6 +1079,7 @@ Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Volume does not exists | 5 NOT_FOUND | Indicates that a volume corresponding to the specified `volume_id` does not exist. | Caller MUST verify that the `volume_id` is correct and that the volume is accessible and has not been deleted before retrying with exponential back off. |
| Operation pending for volume | 9 FAILED_PRECONDITION | Indicates that there is a already an operation pending for the specified volume. In general the Cluster Orchestrator (CO) is responsible for ensuring that there is no more than one call "in-flight" per volume at a given time. However, in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The Plugin, SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls. | Caller SHOULD ensure that there are no other calls pending for the specified volume, and then retry with exponential back off. |
| Not supposed to be called | 10 ABORTED | Indicates that the CO is not supposed to call the RPC because the volume does not have MULTI_NODE capability. | Caller MAY retry at a higher-level by calling `ValidateVolumeCapabilities` to validate the volume capabilities, or wait for the volume to be unpublished on the node. |
Member:

"Not supposed to be called" isn't quite right. Maybe something more like "Exceeds capabilities"?

s/Indicates that the CO is not supposed to call the RPC/Indicates that the CO has exceeded the volume's capabilities/

Member Author:

That makes me wonder if RESOURCE_EXHAUSTED is more appropriate here?

Member:

I prefer ABORTED: #150 (comment)

Member Author:

OK

spec.md Outdated
If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability, the CO MUST guarantee that this RPC is called after `ControllerPublishVolume` is called for the given volume on the given node and returns a success.

This operation MUST be idempotent.
If this RPC failed, or the CO does not know if it failed or not, it MAY choose to call `NodePublishVolume` again, or choose to call `NodeUnpublishVolume`.

This RPC MAY be called by the CO multiple times on the same node for the same volume with possibly different `target_path` and/or auth credentials if the volume has MULTI_NODE capability (i.e., `volume_capability` is either `MULTI_NODE_READER_ONLY`, `MULTI_NODE_SINGLE_WRITER` or `MULTI_NODE_MULTI_WRITER`). The following table shows what the Plugin SHOULD return when receiving a second `NodePublishVolume` on the same volume on the same node:
Member:

nit: there are two sentences here --> should be two separate lines

Member Author:

Fixed

spec.md Outdated
This RPC MAY be called by the CO multiple times on the same node for the same volume with possibly different `target_path` and/or auth credentials if the volume has MULTI_NODE capability (i.e., `volume_capability` is either `MULTI_NODE_READER_ONLY`, `MULTI_NODE_SINGLE_WRITER` or `MULTI_NODE_MULTI_WRITER`). The following table shows what the Plugin SHOULD return when receiving a second `NodePublishVolume` on the same volume on the same node:
Member:

s/volume_capability/access_mode/

Member Author:

Fixed!

Member:

did you forget to push?

@jdef added this to the v0.1 milestone Nov 21, 2017
@jieyu (Member, Author) commented Nov 21, 2017

@amaliaAvraham As I mentioned in the description of this PR, if a volume cannot be published on multiple nodes, it's weird that we allow it to be published multiple times on a single node. In other words, it's weird that we allow two workloads to share a volume if they are on the same node, but don't have a way to do that if they are on different nodes.

I think it's always good to start with a stricter requirement. If we do have a use case later that requires lifting this requirement, we can always do that.

@jdef (Member) commented Nov 21, 2017 via email

@jieyu (Member, Author) commented Nov 21, 2017

@jdef can you take a look again? I addressed your comments.

@jdef (Member) left a comment

LGTM once make is re-run. Please squash prior to merge

Clarifies that `NodePublishVolume` can only be called multiple times on
a node for a volume if that volume can be published on multiple nodes
(i.e., has MULTI_NODE capability).

If a volume cannot be published on multiple nodes, it's weird that we
allow it to be published multiple times on a single node. In other words,
it's weird that we allow two workloads to share a volume if they are on
the same node, but don't have a way to do that if they are on different
nodes.
@jieyu force-pushed the access_mode_node_publish branch from 0dd9c04 to 52ed1bb on November 21, 2017 21:27
@jieyu merged commit 4ac2d13 into container-storage-interface:master Nov 21, 2017
@clintkitson:

@jieyu Let me add some practical color to this. An AWS EBS volume is not able to be shared among hosts, so with this change you're saying that multiple containers would never be able to share a volume with EBS. Why is this restrictive language useful?

@codenrhoden:

Apologies for chiming in much too late, but I am very surprised by the logic of not having a volume be published to multiple target paths unless it has MULTI_NODE capabilities. I think this is a very valid use case, but I may not understand the CO perspective very well.

If you wanted to have two containers on the same node sharing a volume, wouldn't you want those published to two different target paths? One for each workload? This is the assumption I have been working under.

Or even better -- two containers sharing the volume, but one is rw and the other is ro. This is perfectly reasonable and achievable with a volume formatted with xfs or ext4 -- something that can not be used on multiple nodes at the same time.

@akutz (Contributor) commented Nov 30, 2017

Hi @codenrhoden,

I think the issue is really just one of naming. I believe things make more sense when I alter the existing enum names and descriptions:

// Can only be published once as read/write, at any given time.
SINGLE_WRITER = 1;

// Can only be published once as readonly, at any given time.
SINGLE_READER_ONLY = 2;

// Can be published as readonly multiple times simultaneously.
MULTI_READER_ONLY = 3;

// Can be published multiple times. Only one publication can be used
// as read/write. The rest will be readonly.
MULTI_SINGLE_WRITER = 4;

// Can be published as read/write multiple times.
MULTI_WRITER = 5;

In fact, I propose we rename the enums to match the above updated names and descriptions. The above changes more accurately reflect your use case @codenrhoden and still maintain the intent of the original names and descriptions.

cc @jieyu @saad-ali @jdef @clintkitson,

@jieyu (Member, Author) commented Nov 30, 2017

@codenrhoden

If you wanted to have two containers on the same node sharing a volume, wouldn't you want those published to two different target paths? One for each workload? This is the assumption I have been working under.

This totally depends on the CO's implementation. Some COs might decide to just call NodePublishVolume once for each volume (one target_path) and do a bind mount for each workload, and some COs might want to call NodePublishVolume multiple times, one for each workload. The spec doesn't dictate which way the CO should go.

Or even better -- two containers sharing the volume, but one is rw and the other is ro. This is perfectly reasonable and achievable with a volume formatted with xfs or ext4 -- something that can not be used on multiple nodes at the same time.

Take @clintkitson's EBS example: if you have two workloads (e.g., pods) that both want to use the same EBS volume, this is only possible if those two workloads are placed together on one node. As a result, you create an implicit constraint on the scheduling of those two workloads. From the CO's perspective, it'll be much simpler if there is no such implicit scheduling constraint. And the CO can perform validation early: if the volume's access mode does not support multi-node publish, don't allow multiple workloads to share it.

@clintkitson:

How about just expanding the enum into another dimension to cover that aspect? So you'd have a workload enum and separately a node enum?

SINGLE_READER
SINGLE_WRITER
SINGLE_WRITER_MULTI_READER
MULTI_WRITER
MULTI_READER_ONLY
MULTIN_MULTI_WRITER 
MULTIN_SINGLE_WRITER 
SINGLEN_SINGLE_WRITER

The combination of these enums should enable an orchestrator to know what it can and can't do with a volume.

@clintkitson:

As an example, due to AWS's limitation it would be limited to SINGLEN_SINGLE_WRITER. Then it would have any of the 5 enums to pursue on that node, and would also know that it would need to schedule other workloads to this same node to share the volume.

@codenrhoden:

So, my misunderstanding here is I took MULTI_NODE to mean, ahem, multiple nodes, e.g., a volume that can be used on multiple nodes at once, like NFS or a clustered filesystem.

If you wanted to have two containers on the same node sharing a volume, wouldn't you want those published to two different target paths? One for each workload? This is the assumption I have been working under.

This totally depends on the CO's implementation. Some COs might decide to just call NodePublishVolume once for each volume (one target_path) and do a bind mount for each workload, and some COs might want to call NodePublishVolume multiple times, one for each workload. The spec doesn't dictate which way the CO should go.

So, I agree with this, except that because I interpreted MULTI_NODE to mean publishing to multiple nodes at the same time rather than multiple workloads on the same node, I thought the spec was specifically precluding it.

With that misunderstanding cleared up, I do suggest that I will not be the only one to make this mistake, and perhaps there is clarification we can do around those access modes. I've been working with this spec for months now, and I've been misinterpreting those meanings from day 1.

If I interpret the fields more in line with what @akutz laid out, it makes sense. Would it be fair then to say that the CSI spec does not allow for any volumes to be attached to multiple nodes at the same time? That the access modes only apply to usage on a single node?

@codenrhoden:

The current comments in the spec outline the following:

      // Can only be published once as read/write on a single node, at
      // any given time.
      SINGLE_NODE_WRITER = 1;

      // Can only be published once as readonly on a single node, at
      // any given time.
      SINGLE_NODE_READER_ONLY = 2;

      // Can be published as readonly at multiple nodes simultaneously.
      MULTI_NODE_READER_ONLY = 3;

      // Can be published at multiple nodes simultaneously. Only one of
      // the node can be used as read/write. The rest will be readonly.
      MULTI_NODE_SINGLE_WRITER = 4;

      // Can be published as read/write at multiple nodes
      // simultaneously.
      MULTI_NODE_MULTI_WRITER = 5;

Note they all reference nodes, not targets. I still think the comments and the intent are completely at odds with each other. @jieyu's logic seems to interpret "nodes" as targets. I do need some clarification on how this is supposed to work.

Does CSI intend to support volumes that can be mounted to multiple nodes, as in different physical/virtual servers, at the same time?

Here is how I have been interpreting a simplified workflow...

  • NodePublish request comes in for a volume with SINGLE_NODE_WRITER. It's previously been attached, and we mount the volume (formatting it if necessary). We mount it to a private mount area (remember the discussions around that?). Then we bind-mount it to the target as rw or ro, depending on the readonly flag. If the access mode had been READER_ONLY, we would have mounted the underlying filesystem as ro.
  • Any additional NodePublish requests to that same node with different targets can be satisfied just fine, with additional bind mounts to the new target. (I'm ignoring different credentials right now).
  • Any NodePublish requests to different nodes would fail, as the mode is SINGLE_NODE.
  • Any NodePublish request with readonly set to false and an access mode of READER_ONLY would fail.
  • Storage systems that can attach volumes to multiple hosts and have volumes formatted with xfs or ext4 should be able to support MULTI_NODE_READER_ONLY safely, so long as all mounts are done read-only. If you want to have a volume mounted on multiple nodes and have any writers, you need a clustered filesystem, NFS, or similar.

I think this workflow works fine, but recognize that it is based on what can be mounted to nodes, and that multiple target paths are always fine. A CO may not want to use them, but prohibiting them outright seems silly to me.
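
For readers less familiar with the private-mount-plus-bind-mount flow described above, here is a rough Go sketch of the Linux mechanics. The plugin name, paths, and the ext4 assumption are made up for illustration; real code would also check whether the private mount already exists, handle formatting, and clean up on error.

package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// bindPublish mounts the volume's device once into a plugin-private directory,
// then bind-mounts that directory to the workload's target_path, optionally read-only.
func bindPublish(device, volumeID, targetPath string, readonly bool) error {
	private := filepath.Join("/var/lib/hypothetical-plugin/private", volumeID)
	if err := os.MkdirAll(private, 0o750); err != nil {
		return err
	}
	// The one "real" filesystem mount for this volume on the node.
	// (A real plugin would first consult /proc/self/mountinfo and skip this if already mounted.)
	if err := unix.Mount(device, private, "ext4", 0, ""); err != nil {
		return fmt.Errorf("private mount: %w", err)
	}

	if err := os.MkdirAll(targetPath, 0o750); err != nil {
		return err
	}
	// Every additional target is just a bind mount of the private mount point.
	if err := unix.Mount(private, targetPath, "", unix.MS_BIND, ""); err != nil {
		return fmt.Errorf("bind mount: %w", err)
	}
	if readonly {
		// A bind mount needs a remount to pick up MS_RDONLY.
		if err := unix.Mount("", targetPath, "", unix.MS_BIND|unix.MS_REMOUNT|unix.MS_RDONLY, ""); err != nil {
			return fmt.Errorf("remount read-only: %w", err)
		}
	}
	return nil
}

func main() {
	if err := bindPublish("/dev/xvdf", "vol-A", "/plugin/targets/1", true); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}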

@jieyu (Member, Author) commented Dec 1, 2017

@codenrhoden I think you misinterpreted my comments. The access modes currently reference nodes. If it's SINGLE_NODE_WRITER, the volume cannot be published to multiple nodes simultaneously.

@jieyu (Member, Author) commented Dec 1, 2017

NodePublish request comes in for a volume with SINGLE_NODE_WRITER. It's previously been attached, and we mount the volume (formatting it if necessary). We mount it to a private mount area (remember the discussions around that?). Then we bind-mount it to the target as rw or ro, depending on the readonly flag. If the access mode had been READER_ONLY, we would have mounted the underlying filesystem as ro.

The current spec disallows this. If the volume access mode is SINGLE_NODE_WRITER, the CO won't call NodePublishVolume multiple times on a node with a different flag or target_path. If the CO mistakenly does that, the Plugin should return ABORTED.

@codenrhoden:

NodePublish request comes in for a volume with SINGLE_NODE_WRITER. It's previously been attached, and we mount the volume (formatting it if necessary). We mount it to a private mount area (remember the discussions around that?). Then we bind-mount it to the target as rw or ro, depending on the readonly flag. If the access mode had been READER_ONLY, we would have mounted the underlying filesystem as ro.

The current spec disallows this. If the volume access mode is SINGLE_NODE_WRITER, the CO won't call NodePublishVolume multiple times on a node with a different flag or target_path. If the CO mistakenly does that, the Plugin should return ABORTED.

I guess that is what I am protesting. But it is ultimately up to the COs. Have all COs said this is the paradigm they will follow? We have done a lot of work to handle multiple target paths gracefully, through the use of private mount areas and bind mounting, and none of that work is necessary if that capability will never be used.

Earlier when you told @amaliaAvraham:

In other words, it's weird that we allow two workloads to share a volume if they are on the same node, but don't have a way to do that if they are on different nodes.

I don't think that's weird at all. One can be done with a regular filesystem (ext4, xfs), the other requires different technology altogether.

Again, I understand it is ultimately how the COs use the spec, so I need to just adhere to it; I'm just surprised, and mostly disappointed, that this was not clear. I must have gotten incorrect information early on and just clung to it.

@jieyu (Member, Author) commented Dec 1, 2017

I don't think that's weird at all. One can be done with a regular filesystem (ext4, xfs), the other requires different technology altogether.

It's weird because it poses an implicit scheduling constraint on the workloads. The orchestrator will need to make sure those two workloads are placed together. If that's the case, why not just put them into the same pod so that they're guaranteed to be able to share a volume on a node?

@akutz (Contributor) commented Dec 1, 2017

I'm sorry @jieyu, but this doesn't make any sense. First, I would like you and @saad-ali to please respond to my above comment.

Secondly, if there was no intention of having a private mount area for volumes to be published to multiple target paths (workloads) on a single node then why on earth did @saad-ali suggest NodePublishDevice? That's only valuable in the aforementioned scenario.

@jieyu (Member, Author) commented Dec 1, 2017

@akutz I think all the confusion comes from the fact that we don't have a way to express whether a volume can be published multiple times on a node. The current access mode captures whether a volume can be published to multiple nodes. I do believe that we probably need both. I liked @clintkitson's direction of separating these two:

SINGLE_NODE_READER
SINGLE_NODE_WRITER
MULTI_NODE_READER_ONLY
MULTI_NODE_SINGLE_WRITER
MULTI_NODE_MULTI_WRITER

(This specifies if a volume can be published to multiple nodes or not)

and

SINGLE_PUBLISH_READER
SINGLE_PUBLISH_WRITER
MULTI_PUBLISH_READER_ONLY
MULTI_PUBLISH_SINGLE_WRITER
MULTI_PUBLISH_MULTI_WRITER

(This specifies if a volume can be published to multiple targets on a given node or not)

And we have a max of 5*5 combinations. Some probably do not make sense. For instance: SINGLE_NODE_READER + MULTI_PUBLISH_MULTI_WRITER does not make sense.

@akutz (Contributor) commented Dec 1, 2017

Hi @jieyu,

Doesn't my suggested approach from above handle that?

// Can only be published once as read/write, at any given time.
SINGLE_WRITER = 1;

// Can only be published once as readonly, at any given time.
SINGLE_READER_ONLY = 2;

// Can be published as readonly multiple times simultaneously.
MULTI_READER_ONLY = 3;

// Can be published multiple times. Only one publication can be used
// as read/write. The rest will be readonly.
MULTI_SINGLE_WRITER = 4;

// Can be published as read/write multiple times.
MULTI_WRITER = 5;

The difference between MULTI_SINGLE_WRITER and MULTI_WRITER indicate a non-clustered and clustered filesystem. MULTI_SINGLE_WRITER would enable the use case of publishing a volume with a non-clustered FS multiple times on a single Node Host with one publication as RW and all the others as RO.

@jieyu (Member, Author) commented Dec 1, 2017

@akutz ok, IIUC, what you are suggesting is that the CO should infer from the access mode whether a volume can be published to multiple nodes or not.

SINGLE_WRITER = SINGLE_NODE_WRITER + SINGLE_PUBLISH_WRITER
SINGLE_READER_ONLY = SINGLE_NODE_READER_ONLY + SINGLE_PUBLISH_READER_ONLY
MULTI_READER_ONLY = ???
MULTI_SINGLE_WRITER = SINGLE_NODE_WRITER + MULTI_PUBLISH_SINGLE_WRITER
MULTI_WRITER = MULTI_NODE_WRITER + MULTI_PUBLISH_MULTI_WRITER

Can you clarify what MULTI_READER_ONLY means?

@akutz (Contributor) commented Dec 1, 2017

Hi @jieyu,

MULTI_READER_ONLY is used in conjunction with MULTI_SINGLE_WRITER to enable the "mount rw once and mount ro multiple times on the same node" case.

@jieyu (Member, Author) commented Dec 1, 2017

OK, so the following? I am fine with that if it captures all potential use cases.

SINGLE_WRITER = SINGLE_NODE_WRITER + SINGLE_PUBLISH_WRITER
SINGLE_READER_ONLY = SINGLE_NODE_READER_ONLY + SINGLE_PUBLISH_READER_ONLY
MULTI_READER_ONLY = SINGLE_NODE_READER_ONLY + MULTI_PUBLISH_READER_ONLY
MULTI_SINGLE_WRITER = SINGLE_NODE_WRITER + MULTI_PUBLISH_SINGLE_WRITER
MULTI_WRITER = MULTI_NODE_WRITER + MULTI_PUBLISH_MULTI_WRITER

@codenrhoden:

I think there is a missing use case for having a volume published to a single node, but with multiple rw targets. There is nothing wrong with that. The application needs to deal with file locking, but the filesystem is fine. There is no technical barrier to it. I should be able to give the same volume to many containers on the same node as rw if I choose to, and that doesn't require a clustered filesystem. That's correct, right @cduchesne?

@akutz (Contributor) commented Dec 1, 2017

Hi @codenrhoden,

While you can technically do that, I don't know if you ever should. That's essentially NFS you're describing, and so if the underlying FS doesn't know how to handle shared access you're kind of hosed.

@cduchesne commented Dec 1, 2017

@akutz @codenrhoden - this is the way Docker, Kubernetes, and Mesos operate today. If the same volume is mapped to more than 1 container, it works just fine, but it is up to the application to deal with file locking. If you read the documentation for Kubernetes PV Access Modes, it specifically states ReadWriteOnce doesn't allow a volume to be mapped to multiple nodes. This doesn't stop a volume from being mapped to 100 pods on the same node, though.

Multiple pods accessing a volume is a perfectly valid scenario, such as some helper pod that is created/scheduled with a constraint to land it on the same node as the already-running pod.

@akutz this isn't like NFS because the local host OS is managing the primary mount of the block device. In essence the mounted filesystem is being shared with other pods on the same host, like NFS, because it can be.

@akutz (Contributor) commented Dec 1, 2017

Hi @cduchesne,

this isn't like NFS because the local host os is managing the primary mount of the block device. In essence the mounted filesystem is being shared to other pods on the same host like NFS because it can be.

Oh, yeah, doh! Brain fart :)

@cduchesne:

For me, it is important that SINGLE_NODE_WRITER allows multiple WRITERS on the same NODE because that's what it implies: The volume can be mounted to a single node and is writable.

@codenrhoden:

For me, I don't know that there need to be new definitions, just clarification. This all became crazy once we realized there was an explicit disallowing of multiple targets. I'd say, if a CO doesn't want to use multiple targets (and is therefore doing all the refcounting itself) then it shouldn't, but it shouldn't be blocked. I see the definitions as the following:

SINGLE_NODE_WRITER - Can be attached on a single node, and has rw privileges. Can be published multiple times to targets on that node, as either rw or ro (controlled by the readonly flag in the NodePublishRequest).

SINGLE_NODE_READER_ONLY - Can be attached on a single node. Mount must be ro. Can be published multiple times to targets, all as ro. If a NodePublishRequest comes in with readonly set to false, it's an error.

All "attach to one node at a time" use cases are covered by the above.

MULTI_NODE_READER_ONLY - Can be attached to multiple nodes at the same time, but all mounts have to be ro. Also can be published to multiple targets (or just one), and have to be readonly. Storage providers like ScaleIO can do this - where they can attach a block device to multiple nodes, have a non-clustered filesystem, and make sure everything is readonly.

MULTI_NODE_SINGLE_WRITER - Can be attached to multiple nodes, but only one node can have rw privileges (and that node can publish to multiple targets, with any combo of ro and rw). I see this as the trickiest to handle, as it still really requires a clustered filesystem unless you are happy having stale data at your readers. It would also require a node to know if something is attached to another node (often easy to do), but also how it has it mounted (no idea who can do this).

MULTI_NODE_MULTI_WRITER - Everything goes! Can be attached everywhere, with any combination of ro and rw. Definitely requires a clustered filesystem, or NFS running rpc.lockd (or whatever it is).

I honestly thought this was pretty clear, but I was starting with a base assumption of allowing multiple target paths, and that the SPs deal with bind-mounting to those multiple targets. I still think it's fine for a CO to not do that, but why block it?
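
To illustrate the readonly rule embedded in these definitions (a *_READER_ONLY volume must reject a publish with readonly=false), here is a tiny hypothetical Go helper; the mode names mirror the spec excerpt quoted earlier in the thread, but the function itself is not part of CSI.

package main

import (
	"errors"
	"fmt"
)

// Access modes, mirroring the spec excerpt quoted earlier in the thread.
type accessMode int

const (
	SINGLE_NODE_WRITER accessMode = iota + 1
	SINGLE_NODE_READER_ONLY
	MULTI_NODE_READER_ONLY
	MULTI_NODE_SINGLE_WRITER
	MULTI_NODE_MULTI_WRITER
)

// validatePublish applies the interpretation above: READER_ONLY modes may only
// be published with readonly=true; writable modes accept either setting.
func validatePublish(mode accessMode, readonly bool) error {
	switch mode {
	case SINGLE_NODE_READER_ONLY, MULTI_NODE_READER_ONLY:
		if !readonly {
			return errors.New("access mode is read-only, but readonly=false was requested")
		}
	}
	return nil
}

func main() {
	fmt.Println(validatePublish(MULTI_NODE_READER_ONLY, false)) // error
	fmt.Println(validatePublish(SINGLE_NODE_WRITER, true))      // <nil>: a writable volume published ro is fine
}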

@cduchesne commented Dec 1, 2017

What @codenrhoden said is exactly how I would interpret these definitions. I too don't know how MULTI_NODE_SINGLE_WRITER is supportable without all agents being aware of each other's mount tables, but I'm okay with that.

pohly added a commit to pohly/kubernetes that referenced this pull request Jan 17, 2019
This is a special case that both kubelet and the volume driver should
support, because users might expect it. One Kubernetes mechanism to
deploy pods like this is via pod affinity.

However, strictly speaking the CSI spec does not allow this usage
mode (see container-storage-interface/spec#150) and
there is an on-going debate to enable it (see
container-storage-interface/spec#178). Therefore
this test gets skipped unless explicitly enabled for a driver.

CSI drivers which create a block device for a remote volume in
NodePublishVolume fail this test. They have to make the volume
available in NodeStageVolume and then in NodePublishVolume merely do a
bind mount (as for example in
https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/pkg/gce-pd-csi-driver/node.go#L150).
pohly added a commit to pohly/kubernetes that referenced this pull request Feb 12, 2019