spec: Clarify calling multiple NodePublishVolume #150
Conversation
LGTM
I agree with the semantics of the proposed change. I'm a bit concerned that the description for this RPC is difficult to unpack for a plugin writer coming to the spec with fresh eyes: there's a lot here to digest. Furthermore, if a plugin doesn't support MULTI_NODE_xxx but a CO attempts to invoke this RPC in a way that implies MULTI_NODE_xxx, what error code should be returned?

> This RPC is called by the CO when a workload that wants to use the specified volume is placed (scheduled) on a node.
> ...
> If this RPC failed, or the CO does not know if it failed or not, it MAY choose to call `NodePublishVolume` again, or choose to call `NodeUnpublishVolume`.
Scenario 1: plugin supports MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ the same target_path's.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK (idempotent)
Scenario 2: plugin supports MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ different target_path's.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/2, credentials=nil
** OK
Scenario 3: plugin supports MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ the same target_path's and different credentials.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials={foo:bar}
** OK
Scenario 4: plugin supports MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ different target_path's and different credentials.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/2, credentials={foo:bar}
** OK
Scenario 5: plugin does not support MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ the same target_path's.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK (idempotent)
Scenario 6: plugin does not support MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ different target_path's.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/2, credentials=nil
** ERROR
Scenario 7: plugin does not support MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ the same target_path's and different credentials.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials={foo:bar}
** ERROR
Scenario 8: plugin does not support MULTI_NODE_xxx, CO issues multiple calls for the same volume (A) w/ different target_path's and different credentials.
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/1, credentials=nil
** OK
* CO -> Plugin: NodePublishVolume, volume_id=A, target_path=/plugin/targets/2, credentials={foo:bar}
** ERROR
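A minimal sketch (assuming a Go plugin and the gRPC `status`/`codes` packages; the `publishState` bookkeeping is hypothetical, not part of the spec) of how a node server might enforce the scenario table above. Note that a later comment from @jdef argues scenario 3 should probably be an ERROR as well:

```go
package nodeserver

import (
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// publishState is hypothetical bookkeeping: per volume, the target paths it
// is published at and the credentials supplied for each publish call.
type publishState struct {
	mu        sync.Mutex
	published map[string]map[string]string // volume_id -> target_path -> creds
}

// nodePublish applies the scenario table: identical repeat calls are
// idempotent; new target paths or changed credentials are accepted only
// when the volume has MULTI_NODE capability.
func (s *publishState) nodePublish(volumeID, targetPath, creds string, multiNode bool) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	if s.published == nil {
		s.published = map[string]map[string]string{}
	}
	targets := s.published[volumeID]
	if targets == nil {
		targets = map[string]string{}
		s.published[volumeID] = targets
	}

	if prev, ok := targets[targetPath]; ok {
		if prev == creds || multiNode {
			return nil // scenarios 1, 3 and 5: idempotent or MULTI_NODE OK
		}
		// scenario 7: same target_path, new credentials, no MULTI_NODE
		return status.Error(codes.Aborted, "already published with different credentials")
	}
	if len(targets) > 0 && !multiNode {
		// scenarios 6 and 8: a second target_path requires MULTI_NODE
		return status.Error(codes.Aborted, "volume does not have MULTI_NODE capability")
	}
	// first call for this target_path (scenarios 2 and 4): mount, then record
	targets[targetPath] = creds
	return nil
}
```

The `ABORTED` code here mirrors the error-table row proposed later in this PR.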
@jdef updated.
@jieyu - What if the user would like to mount the volume to two different paths on the same node? A storage system may support that scenario without supporting multiple nodes.
@jieyu you're talking about scenario 3 from my earlier comment? I think I was moving too fast: since it's the same target path, the second call should probably generate an ERROR.
Thanks, this is clearer. In light of these changes it probably makes sense to update the documentation for the `AccessMode.Mode` enumerated types: the SINGLE_xxx types are actually more restrictive than their documentation indicates.
spec.md
Outdated
@@ -1070,6 +1079,7 @@ Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Volume does not exist | 5 NOT_FOUND | Indicates that a volume corresponding to the specified `volume_id` does not exist. | Caller MUST verify that the `volume_id` is correct and that the volume is accessible and has not been deleted before retrying with exponential back off. |
| Operation pending for volume | 9 FAILED_PRECONDITION | Indicates that there is already an operation pending for the specified volume. In general the Cluster Orchestrator (CO) is responsible for ensuring that there is no more than one call "in-flight" per volume at a given time. However, in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The Plugin SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls. | Caller SHOULD ensure that there are no other calls pending for the specified volume, and then retry with exponential back off. |
+| Not supposed to be called | 10 ABORTED | Indicates that the CO is not supposed to call the RPC because the volume does not have MULTI_NODE capability. | Caller MAY retry at a higher level by calling `ValidateVolumeCapabilities` to validate the volume capabilities, or wait for the volume to be unpublished on the node. |
"Not supposed to be called" isn't quite right. Maybe something more like "Exceeds capabilities"?
s/Indicates that the CO is not supposed to call the RPC/Indicates that the CO has exceeded the volume's capabilities/
That makes me wonder whether `RESOURCE_EXHAUSTED` is more appropriate here?
I prefer ABORTED: #150 (comment)
OK
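A hedged CO-side sketch of the recovery behavior described in that table row: only an `ABORTED` publish failure triggers re-validation before a retry. `validateCaps` is a hypothetical wrapper around `ValidateVolumeCapabilities`, not a spec API:

```go
package orchestrator

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// retryAfterAborted reports whether a failed NodePublishVolume call may be
// retried: only ABORTED errors qualify, and only after the volume's
// capabilities have been re-validated by the caller-supplied validateCaps.
func retryAfterAborted(publishErr error, validateCaps func() (bool, error)) (bool, error) {
	if status.Code(publishErr) != codes.Aborted {
		return false, publishErr // some other failure; surface it unchanged
	}
	supported, err := validateCaps()
	if err != nil {
		return false, err
	}
	if !supported {
		return false, publishErr // volume really lacks MULTI_NODE; don't retry
	}
	return true, nil // OK to retry, with exponential back off
}
```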
spec.md
Outdated
If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability, the CO MUST guarantee that this RPC is called after `ControllerPublishVolume` is called for the given volume on the given node and returns a success.

This operation MUST be idempotent.
If this RPC failed, or the CO does not know if it failed or not, it MAY choose to call `NodePublishVolume` again, or choose to call `NodeUnpublishVolume`.

This RPC MAY be called by the CO multiple times on the same node for the same volume with possibly different `target_path` and/or auth credentials if the volume has MULTI_NODE capability (i.e., `volume_capability` is either `MULTI_NODE_READER_ONLY`, `MULTI_NODE_SINGLE_WRITER` or `MULTI_NODE_MULTI_WRITER`). The following table shows what the Plugin SHOULD return when receiving a second `NodePublishVolume` on the same volume on the same node:
nit: there are two sentences here --> should be two separate lines
Fixed
spec.md
Outdated
If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability, the CO MUST guarantee that this RPC is called after `ControllerPublishVolume` is called for the given volume on the given node and returns a success.

This operation MUST be idempotent.
If this RPC failed, or the CO does not know if it failed or not, it MAY choose to call `NodePublishVolume` again, or choose to call `NodeUnpublishVolume`.

This RPC MAY be called by the CO multiple times on the same node for the same volume with possibly different `target_path` and/or auth credentials if the volume has MULTI_NODE capability (i.e., `volume_capability` is either `MULTI_NODE_READER_ONLY`, `MULTI_NODE_SINGLE_WRITER` or `MULTI_NODE_MULTI_WRITER`). The following table shows what the Plugin SHOULD return when receiving a second `NodePublishVolume` on the same volume on the same node:
s/`volume_capability`/`access_mode`/
Fixed!
did you forget to push?
@amaliaAvraham As I mentioned in the description of this PR, if a volume cannot be published on multiple nodes, it's weird that we allow it to be published multiple times on a single node. In other words, it's weird that we allow two workloads to share a volume if they are on the same node, but have no way to do that if they are on different nodes. I think it's always good to start with a stricter requirement. If we later have a use case that requires lifting this requirement, we can always do that.
I'm pretty happy w/ ABORTED because the CO is violating capabilities that it should have been aware of, vs consuming some resource that the node just happened to run out of.
@jdef can you take a look again? I addressed your comments.
LGTM once `make` is re-run. Please squash prior to merge.
Clarifies that `NodePublishVolume` can only be called multiple times on a node for a volume if that volume can be published on multiple nodes (i.e., has MULTI_NODE capability). If a volume cannot be published on multiple nodes, it's weird that we allow it to be published multiple times on a single node. In other words, it's weird that we allow two workloads to share a volume if they are on the same node, but don't have a way to do that if they are on different nodes.
@jieyu Let me add some practical color to this. An AWS EBS volume cannot be shared among hosts, so with this change you're saying that multiple containers would never be able to share a volume with EBS. Why is this restrictive language useful?
Apologies for chiming in much too late, but I am very surprised by the logic of not having a volume be published to multiple target paths unless it has MULTI_NODE capability.

If you wanted to have two containers on the same node sharing a volume, wouldn't you want those published to two different target paths? One for each workload? This is the assumption I have been working under. Or even better: two containers sharing the volume, but one is read-only?
Hi @codenrhoden, I think the issue is really just one of naming. I believe things make more sense when I alter the existing enum names and descriptions:

```protobuf
// Can only be published once as read/write, at any given time.
SINGLE_WRITER = 1;
// Can only be published once as readonly, at any given time.
SINGLE_READER_ONLY = 2;
// Can be published as readonly multiple times simultaneously.
MULTI_READER_ONLY = 3;
// Can be published multiple times. Only one publication can be used
// as read/write. The rest will be readonly.
MULTI_SINGLE_WRITER = 4;
// Can be published as read/write multiple times.
MULTI_WRITER = 5;
```

In fact, I propose we rename the enums to match the above, updated names and documentation. The above changes more accurately reflect your use case @codenrhoden and still maintain the intent of the original names and descriptions.
This totally depends on the CO's implementation. Some CO might decide to just call

Take @clintkitson's EBS example: if you have two workloads (e.g., pods) that both want to use the same EBS volume, this is only possible if those two workloads are placed together on one node. As a result, you create an implicit constraint on the scheduling of those two workloads. From the CO's perspective, it'll be much simpler if there is no such implicit scheduling constraint. And the CO can perform validation early: if the volume does not support a multi-node access mode, don't allow multiple workloads to share it.
How about just expanding the enum into another dimension to cover that aspect? So you'd have a workload enum and, separately, a node enum? The combination of these enums should enable an orchestrator to know what it can and can't do with a volume.
As an example, due to AWS's limitation it would be limited to SINGLE_NODE_SINGLE_WRITER. Then it would have any of the 5 enums to pursue on that node, and would also know that it would need to schedule other workloads to this same node to share the volume?
So, my misunderstanding here is I took

So, I agree with this, except that when I interpreted the meaning of

With that misunderstanding cleared up, I do suggest that I will not be the only one to make this mistake, and perhaps there is clarification we can do around those access modes. I've been working with this spec for months now, and I've been misinterpreting those meanings from day 1. If I interpret the fields more in line with what @akutz laid out, it makes sense.

Would it be fair then to say that the CSI spec does not allow for any volumes to be attached to multiple nodes at the same time? That the access modes only apply to usage on a single node?
The current comments in the spec outline the following:
Note they all reference nodes, not targets. I still think the comments and the intent are completely at odds with each other. @jieyu's logic seems to interpret "nodes" as targets. I do need some clarification on how this is supposed to work. Does CSI intend to support volumes that can be mounted to multiple nodes, as in different physical/virtual servers, at the same time? Here is how I have been interpreting a simplified workflow...

I think this workflow works fine, but recognize that it is based on what can be mounted to nodes, and that multiple target paths are always fine. A CO may not want to use them, but prohibiting them outright seems silly to me.
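To make that workflow concrete, here is a minimal sketch, assuming Linux and `golang.org/x/sys/unix`; the paths, filesystem type, and helper name are illustrative assumptions, not anything mandated by the spec:

```go
package sketch

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// stageAndPublish mounts the device once under a plugin-private directory,
// then bind-mounts that directory into each workload's target_path, so many
// targets on one node share a single real mount.
func stageAndPublish(devicePath, privateDir string, targetPaths []string) error {
	if err := os.MkdirAll(privateDir, 0o750); err != nil {
		return err
	}
	// One real filesystem mount, owned by the plugin (ext4 is an assumption).
	if err := unix.Mount(devicePath, privateDir, "ext4", 0, ""); err != nil {
		return fmt.Errorf("staging mount: %w", err)
	}
	// Any number of bind mounts on the same node, one per workload.
	for _, tp := range targetPaths {
		if err := os.MkdirAll(tp, 0o750); err != nil {
			return err
		}
		if err := unix.Mount(privateDir, tp, "", unix.MS_BIND, ""); err != nil {
			return fmt.Errorf("bind mount %s: %w", tp, err)
		}
	}
	return nil
}
```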
@codenrhoden I think you misinterpreted my comments. The access modes currently reference nodes. If it's
The current spec disallows this. If the volume access mode is
I guess that is what I am protesting. But it is ultimately up to the COs. Have all COs said this is the paradigm they will follow? We have done a lot of work to handle multiple target paths gracefully, through the use of private mount areas and bind mounting, and none of that work is necessary if that capability will never be used. Earlier when you told @amaliaAvraham:

> it's weird that we allow two workloads to share a volume if they are on the same node, but don't have a way to do that if they are on different nodes

I don't think that's weird at all. One can be done with a regular filesystem (ext4, xfs); the other requires different technology altogether. Again, I understand it is ultimately how the COs use the spec, so I need to just adhere to it. I'm just surprised, and mostly disappointed, that this was not clear. I must have gotten incorrect information early on and just clung to it.
It's weird because it poses an implicit scheduling constraint on the workloads. The orchestrator will need to make sure those two workloads are placed together. If that's the case, why not just put them into the same pod, so that they are guaranteed to be able to share a volume on a node?
I'm sorry @jieyu, but this doesn't make any sense. First, I would like you and @saad-ali to please respond to my above comment. Secondly, if there was no intention of having a private mount area for volumes to be published to multiple target paths (workloads) on a single node, then why on earth did @saad-ali suggest
@akutz I think all the confusion comes from the fact that we don't have a way to express whether a volume can be published multiple times on a node. The current access mode captures whether a volume can be published to multiple nodes. I do believe that we probably need both. I liked @clintkitson's direction of separating these two:

(This specifies whether a volume can be published to multiple nodes or not)

and

(This specifies whether a volume can be published to multiple targets on a given node or not)

And we have a max of 5*5 combinations. Some probably do not make sense. For instance:
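A hypothetical Go sketch of that two-axis idea (the exact enum lists are elided above, so these names are invented for illustration only):

```go
package sketch

// NodeMode is the node axis: how many nodes a volume can be attached to.
type NodeMode int

const (
	SingleNode NodeMode = iota // attachable to one node at a time
	MultiNode                  // attachable to many nodes simultaneously
)

// TargetMode is the per-node axis: how the volume can be published on a node.
type TargetMode int

const (
	SingleWriter      TargetMode = iota // one rw publication on the node
	SingleReaderOnly                    // one ro publication on the node
	MultiReaderOnly                     // many ro publications on the node
	MultiSingleWriter                   // many publications, exactly one rw
	MultiWriter                         // many rw publications on the node
)

// A CO could validate a (NodeMode, TargetMode) pair up front and reject the
// combinations that, as noted above, probably do not make sense.
```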
Hi @jieyu, Doesn't my suggested approach from above handle that?

```protobuf
// Can only be published once as read/write, at any given time.
SINGLE_WRITER = 1;
// Can only be published once as readonly, at any given time.
SINGLE_READER_ONLY = 2;
// Can be published as readonly multiple times simultaneously.
MULTI_READER_ONLY = 3;
// Can be published multiple times. Only one publication can be used
// as read/write. The rest will be readonly.
MULTI_SINGLE_WRITER = 4;
// Can be published as read/write multiple times.
MULTI_WRITER = 5;
```

The difference between
@akutz OK, IIUC, what you are suggesting is that the CO should infer from the access mode whether a volume can be published to multiple nodes or not.

Can you clarify what
Hi @jieyu,
OK, so the following? I am fine with that if it captures all potential use cases.
I think there is a missing use case: having a volume published to a single node, but with multiple rw targets. There is nothing wrong with that. The application needs to deal with file locking, but the filesystem is fine. There is no technical barrier to it. I should be able to give the same volume to many containers on the same node as rw if I choose to, and that doesn't require a clustered filesystem. That's correct, right @cduchesne?
Hi @codenrhoden, While you can technically do that, I don't know if you ever should. That's essentially NFS you're describing, and so if the underlying FS doesn't know how to handle shared access you're kind of hosed.
@akutz @codenrhoden - this is the way Docker, Kubernetes, and Mesos operate today. If the same volume is mapped to more than one container, it works just fine, but it is up to the application to deal with file locking. If you read the documentation for Kubernetes PV Access Modes, it specifically states that ReadWriteOnce doesn't allow a volume to be mapped to multiple nodes. This doesn't stop a volume from being mapped to 100 pods on the same node, though. Multiple pods accessing a volume is a perfectly valid scenario, such as some helper pod that is created/scheduled with a constraint to land it on the same node as the already-running pod. @akutz this isn't like NFS, because the local host OS is managing the primary mount of the block device. In essence the mounted filesystem is being shared to other pods on the same host, like NFS, because it can be.
Hi @cduchesne,
Oh, yeah, doh! Brain fart :)
For me, it is important that SINGLE_NODE_WRITER allows multiple WRITERS on the same NODE, because that's what it implies: the volume can be mounted to a single node and is writable.
For me, I don't know that there needs to be new definitions, just clarification. This all became crazy once we realized there was an explicit disallowing of multiple targets. I'd say, if a CO doesn't want to use multiple targets (and is therefore doing all the refcounting itself) then it shouldn't, but it shouldn't be blocked. I see the definitions as the following:

SINGLE_NODE_WRITER - Can be attached on a single node, and has rw privileges. Can be published multiple times to targets on that node, as either rw or ro (controlled by the `readonly` field).

SINGLE_NODE_READER_ONLY - Can be attached on a single node. Mount must be ro. Can be published multiple times to targets, all as ro. If a

All "attach to one node at a time" use cases are covered by the above.

MULTI_NODE_READER_ONLY - Can be attached to multiple nodes at the same time, but all mounts have to be ro. Also can be published to multiple targets (or just one), and they have to be readonly. Storage providers like ScaleIO can do this - where they can attach a block device to multiple nodes, have a non-clustered filesystem, and make sure everything is readonly.

MULTI_NODE_SINGLE_WRITER - Can be attached to multiple nodes, but only one node can have rw privileges (and that node can publish to multiple targets, with any combo of ro and rw). I see this as the trickiest to handle, as it still really requires a clustered filesystem unless you are happy having stale data at your readers. It would also require a node to know if something is attached to another node (often easy to do), but also knowing how it has it mounted (no idea who can do this).

MULTI_NODE_MULTI_WRITER - Everything goes! Can be attached everywhere, with any combination of ro and rw. Definitely requires a clustered filesystem, or NFS running

I honestly thought this was pretty clear, but I was starting with a base assumption of allowing multiple target paths, and that the SPs deal with bind-mounting to those multiple targets. I still think it's fine for a CO to not do that, but why block it?
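For the ro/rw mixes described above, a hedged Linux sketch (same `golang.org/x/sys/unix` assumption as the earlier example): publishing a read-only target from a read-write staging mount is typically a bind mount followed by a read-only remount, because a plain bind mount ignores `MS_RDONLY`:

```go
package sketch

import "golang.org/x/sys/unix"

// publishReadOnly bind-mounts the staged volume at targetPath, then remounts
// the bind read-only; the second step is what actually enforces ro.
func publishReadOnly(privateDir, targetPath string) error {
	if err := unix.Mount(privateDir, targetPath, "", unix.MS_BIND, ""); err != nil {
		return err
	}
	return unix.Mount("", targetPath, "",
		unix.MS_BIND|unix.MS_REMOUNT|unix.MS_RDONLY, "")
}
```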
What @codenrhoden said is exactly how I would interpret these definitions. I too don't know how MULTI_NODE_SINGLE_WRITER is supportable without all agents being aware of each other's mount tables, but I'm okay with that.
This is a special case that both kubelet and the volume driver should support, because users might expect it. One Kubernetes mechanism to deploy pods like this is via pod affinity. However, strictly speaking the CSI spec does not allow this usage mode (see container-storage-interface/spec#150) and there is an on-going debate to enable it (see container-storage-interface/spec#178). Therefore this test gets skipped unless explicitly enabled for a driver. CSI drivers which create a block device for a remote volume in NodePublishVolume fail this test. They have to make the volume available in NodeStageVolume and then in NodePublishVolume merely do a bind mount (as for example in https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/pkg/gce-pd-csi-driver/node.go#L150).
Clarifies that `NodePublishVolume` can only be called multiple times on a node for a volume if that volume can be published on multiple nodes (i.e., has MULTI_NODE capability).

If a volume cannot be published on multiple nodes, it's weird that we allow it to be published multiple times on a single node. In other words, it's weird that we allow two workloads to share a volume if they are on the same node, but don't have a way to do that if they are on different nodes.