Introduce concept of topology to CSI spec. #188
Conversation
CC @Jgoldick
Hi @saad-ali, right off the bat I have to caution against the use of additional, named fields, especially when I know you added it to
That is a good instinct and important for protecting the project from bloat. The bar I use for what should be added to the spec:
Opaque parameters are perfect for exposing SP-specific "knobs" that the CO does not need to manipulate or parse. But in this case the CO needs to be able to operate on topology information to make scheduling decisions, and therefore an opaque parameter would not suffice.
To be clear, in the case where the SP's topology is independent of the CO (the volume is equally accessible from all nodes), opaque params should be used. This PR only handles the case where volumes are not accessible everywhere in the cluster, so the CO necessarily needs to be aware of this information for proper operation.
@saad-ali writes:
Are you confident that this topology solution can be separated from the volume anti-affinity issue we'll need to address to support HA databases (#44)? I wrote them up separately in that hope, but it's not yet clear to me that the CO won't need non-opaque params in the case where a volume can be accessed from any node but the CO still needs to know when Volume A and Volume B are in the same fault domain.
@Jgoldick This design does not address automated spreading of volumes to account for storage-specific fault tolerance. It's possible to do so manually with opaque parameters, but doing so in an automated way is something I'd like to tackle next (independent of this). Something like node anti-affinity from Kubernetes would be very powerful here. This design is focused on addressing the "volume availability within cluster" gap.
I'm going to rework this proposal. As currently worded, it requires a CO to dictate the exact topological segment(s) the volume must be provisioned in. In some cases the CO may have a preference for a segment but permit a set of possibilities. So instead of dictating exactly where a volume MUST be provisioned, I will loosen the language and syntax to allow storage plugins to select from a list of possibilities, with a way for the CO to indicate its preference.
Force-pushed 68e08ce to 82e6db4.
Made the changes above. PTAL
Completed a first-pass review. Will think on this more.
spec.md
Outdated
// {"region":{"us-west1"}, "zone":{"us-west1-b"}}
// Indicates the node exists within the "region" "us-west1" and the
// "zone" "us-west1-b".
map<string, string> accessibility_requirements = 2;
Should the type of this field be map<string, ListOfString>?
I was thinking about this and decided against it, since I cannot think of any concrete case where a node would belong to multiple topological segments (e.g. a node that belongs to two different zones or two different racks does not make sense).
OK. Maybe we could update the example then?
// {"region":"us-west1", "zone":"us-west1-b"}
What do you mean? The example seems correct to me: it shows the node belongs to a single zone within a single region (as opposed to multiple zones, for example).
spec.md
Outdated
// TopologyRequirement captures the topological segment(s) the CO wants
// a volume to be provisioned in.
message TopologyRequirement {
Naming: consider TopologyPolicy instead? This message MAY define both requisite and preference topology segment lists, each of which is OPTIONAL. Given that a message of this type may contain only preferences, the suffix "Requirement" seems to overreach.
Naming is hard.
Given that a message of this type may only contain preferences
I don't think this is accurate. As currently worded, the TopologyRequirement message, if it exists, would contain the required set of possibilities along with which of those are preferred. It does not allow for a list of preferred options only. As worded, anything in preferred_topologies must be present in accepted_topologies, so I think "requirement" is the more accurate term.
However, I could loosen this to allow for preferred only -- for example, preferred_topologies could exist without accepted_topologies, meaning the SP should use preferred_topologies but, if that doesn't work, use whatever it wants. If we do that, then Policy would be a better name. But I'm not sure we want to enable that use case.
spec.md
Outdated
// Indicates the volume may be accessible from any combination of two
// "zones" in the "region" "us-west1": e.g. "us-west1-b" and
// "us-west1-c" or "us-west1-b" and "us-west1-d", etc.
map<string, ListOfString> accepted_topologies = 1;
nit: avoid API stutter. The prefix of this type is already "Topology"; we probably don't need it in the suffix as well. I would be perfectly happy with accepted, requisite, or mandatory here.
Good point, will remove suffix and leave prefixes only.
spec.md
Outdated
// accessible from the combination of the two "zones" "us-west1-c" and
// "us-west1-e" in the "region" "us-west1" and fall back to other
// combinations if that is not possible.
map<string, ListOfString> preferred_topologies = 2;
ditto, preferred reads better to my eyes
Will do!
spec.md
Outdated
@@ -1192,6 +1345,56 @@ message NodeGetIdResponse {
// CO in subsequent `ControllerPublishVolume`.
// This is a REQUIRED field.
string node_id = 1;

// The topological segment(s) the provisioned volume is accessible
This and the next 25 lines could probably be deleted - they frame this field from the perspective of a volume, and not a node. The first sentence of this comment should be the line that starts with "The topological segment(s) this node belongs..."
Yep, that was a copy paste I forgot to clean up. Will remove.
spec.md
Outdated
// {"region": {"us-west1"}, "zone": {"us-west1-b", "us-west1-c"}}
// Indicates the volume is accessible from both "zone" "us-west1-b"
// and "zone" "us-west1-c" in the "region" "us-west1".
map<string, ListOfString> accessibility_requirements = 4;
ditto, prefer accessible_from
Renaming.
spec.md
Outdated
// accessible from a given node when scheduling workloads.
// This field is OPTIONAL. If it is not specified, the CO MAY assume
// the node is not subject to any topological constraint and may
// schedule workloads referencing any volume without any topological
I think there's a couple of ways to read this sentence:
(1) the CO ... MAY schedule workloads that reference any volume V on this node, irrespective of any topological constraints declared for V.
(2) the CO ... MAY schedule workloads that reference any volume V, such that there are no topological constraints declared for V.
Which do you intend?
(2) was the intention. Thanks for the clarification; I will adopt your wording.
spec.md
Outdated
// from a given node when scheduling workloads.
// This field is OPTIONAL. If it is not specified, the CO MAY assume
// the volume is equally accessible from all nodes in the cluster and
// may schedule workloads referencing the volume on any available node
nit: end sentence w/ a period
Goes over the 72 char limit, but ok, created new line with one word. =P
Hey, if the COs flout the rules you cannot expect the community to observe them. Thanks for following the rules Saad :)
spec.md
Outdated
// from. A topological segment is a sub-division of a cluster, like
// region, zone, rack, etc.
// Each key must consist of alphanumeric characters, '-', '_' or '.'.
// Each value MUST contain 1 or more valid strings.
What constitutes a "valid" string for a value?
This is a good question. I will clarify.
spec.md
Outdated
// COs may use this information along with the topology information
// returned by NodeGetId to ensure that a given volume is accessible
// from a given node when scheduling workloads.
// This field is OPTIONAL. If it is not specified, the CO MAY assume
How are node-local volumes accounted for by this strategy? If this field is OPTIONAL but the volume is local (is tied to a particular node) then it doesn't make sense that it's "equally accessible from all nodes in the cluster".
Node local volumes should set a topological segment {"node": {"nodeName"}} and this would have to match between node and volume.
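The node-local case above can be illustrated with a small matching sketch. This is hypothetical Python, not part of the spec; it assumes a volume's topology is a map of segment key to allowed values (map<string, ListOfString>) and a node's topology is a map of key to a single value, as in the proposal.

```python
def volume_accessible_from(volume_topology, node_topology):
    """A node can access a volume if, for every segment key the volume
    declares, the node's value for that key is one of the volume's values.
    """
    for key, values in volume_topology.items():
        if node_topology.get(key) not in values:
            return False
    return True

# Hypothetical node-local volume: its only segment is the node itself.
local_volume = {"node": ["node-1"]}

assert volume_accessible_from(local_volume, {"node": "node-1"})
assert not volume_accessible_from(local_volume, {"node": "node-2"})
```

The same predicate handles zone- or region-scoped volumes; "node" is just another key the CO never interprets.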
csi.proto
Outdated
// accessible from the combination of the two "zones" "us-west1-c" and
// "us-west1-e" in the "region" "us-west1" and fall back to other
// combinations if that is not possible.
map<string, ListOfString> preferred_topologies = 2;
Due to the way new message types complicate things, can we not agree that ListOfString be a string that can have multiple values via some known separator, and have a size limit that exceeds that of a normal string?
Also, I assume you tried to use the repeated keyword with the map value? That’s not supported?
Disagree that ListOfStrings should be replaced with a simpler map of string,string. I quite prefer the explicitness of ListOfStrings and that no one has to write tokenizing logic or consider escape sequences (because someone wants to use a token separator in a segment name).
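The escaping concern can be demonstrated concretely. A hedged Python sketch (the "rack/a,b" segment name is a contrived example of a value that happens to contain the would-be separator):

```python
# Packing multiple values into one delimited string forces every consumer
# to agree on a separator AND an escaping scheme. If a value contains the
# separator, a naive round-trip is lossy:
values = ["rack/a,b", "rack/c"]
packed = ",".join(values)              # "rack/a,b,rack/c"
assert packed.split(",") != values     # splits into three tokens, not two

# An explicit list of strings (ListOfString) round-trips trivially,
# with no tokenizing logic or escape sequences required:
explicit = list(values)
assert explicit == ["rack/a,b", "rack/c"]
```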
James,
The reason for my suggestion is in part due to the way a custom type prevents the ability to create interface assertions on types in a language like Go without referencing the language bindings.
That’s fine. Has anyone tried a repeated keyword inside the map value?
There's a protobuf well-known type for a list of things (ListValue). It's less specific, and a bit more effort to work with, so I don't like it as much as ListOfStrings.
Was thinking about the naming constraint of alphanumeric characters, "-", "_", and ".". Looking at the most similar AWS concept, Placement Groups, you see the naming is 255 ASCII characters. Is there a reason not to accept this more relaxed definition? Looking at Azure Naming Conventions, the biggest issue would seem to be around case-insensitive comparisons. It'd be worth a note as to whether we are doing exact comparisons, as this causes much confusion for Windows admins.
I think it would also be useful to add an example where you actually put in a Rack constraint. I've often seen rack names use delimiters like "/" or ":" so that they provide some internal data center layout/organization information. It'd probably be worth asking around to see how folks are naming their racks. See Duplicate Rack Names for some useful examples out of the Digital Ocean world.
Even though the CSI spec as written allows us to specify multiple criteria and thereby define a specific rack within a zone, we can't group them. Taking this further, I don't know how I would specify a preferred_topology in the presence of non-unique rack names given multiple zones within a single data center. On the other hand, if racks are fully described and made unique, like /zone-name/cage#-row#-rack#, then it becomes clear how to satisfy/specify the constraint. Note that I used an obvious delimiter "/" to avoid name collisions between zones and racks.
Good questions @Jgoldick.
This interface ultimately has to work with Kubernetes (as well as other COs), so some constraints come from there: the key/value pairs returned by CSI will be surfaced as Kubernetes labels, whose keys and values are restricted in length and character set. I agree that these are a bit too strict, but I don't see a good way around this without modifying Kubernetes. So I'd like to adopt the same level of strictness for now, work to loosen the requirements on the Kubernetes side, and then do the same on the CSI side.
I believe proto maps in most (all?) languages are case sensitive. But,
The spec as defined allows specifying the fully qualified topology hierarchy for a volume or node, so cases like the one you linked should work nicely. I'll add some examples.
Force-pushed 82e6db4 to bf0a226.
Thanks for the feedback. PTAL!
Some initial thoughts on this. At a high level, this proposal suggests that the topology information about where the volume can be used should be specified by the SP, and will be opaque key/value based. Given that the CO needs this topology information to decide where the volume and the workload can be placed, does it mean that the CO has to perform scheduling based on vendor-specific topology information? That sounds very complex to me, especially for a CO (e.g. Mesos) that exposes a general scheduling API. I think a lot of the comments in this PR try to explain what the "matching" algorithm is for the CO.
I am actually leaning towards some pre-defined keywords (e.g., region/zone) and pre-defined semantics for the scheduling that the CO will do based on those keywords. Then, let the SPs figure out the mapping between their own internal topology information and those keywords, based on the scheduling semantics.
No, it does not. The way this is designed, COs remain agnostic to the specific topology of any given SP; as far as the CO is concerned, it is just matching labels on nodes. Let me clarify by walking through the different use cases. There are three scenarios a CO would be interested in supporting: 1) understanding which nodes a particular volume is accessible from, so that any workload referencing that volume is scheduled to the correct node (assumes volume provisioning before workload scheduling); 2) influencing where a volume is provisioned based on where the workload referencing that volume is scheduled (assumes volume provisioning after workload scheduling); and 3) allowing a cluster administrator to control where volumes are provisioned.
We should not have (and do not need) pre-defined keywords (e.g. region, zone, etc.). The goal of this design is to ensure both the CO and the CSI spec remain agnostic to SP topology while enabling arbitrary topologies. I'm happy to jump on a call to walk through this if that would help.
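To illustrate how a CO can match volumes to nodes while staying fully agnostic to the SP's keys, here is a rough Python sketch. The node and volume shapes are assumptions mirroring the topology returned by NodeGetId (key -> single value per node) and by CreateVolume (key -> allowed values per volume) in this proposal; the node names are made up.

```python
def nodes_for_volume(volume_topology, nodes):
    """Return names of nodes whose advertised segments satisfy the
    volume's accessibility topology. The CO never interprets the keys;
    it only compares opaque key/value pairs supplied by the SP.
    """
    return [
        name for name, segments in nodes.items()
        if all(segments.get(k) in v for k, v in volume_topology.items())
    ]

# Hypothetical cluster: each node's SP-defined topology (from NodeGetId).
nodes = {
    "n1": {"region": "us-west1", "zone": "us-west1-b"},
    "n2": {"region": "us-west1", "zone": "us-west1-c"},
    "n3": {"region": "us-east1", "zone": "us-east1-a"},
}
# Hypothetical volume topology (from CreateVolume response).
volume = {"region": ["us-west1"], "zone": ["us-west1-b", "us-west1-c"]}

assert nodes_for_volume(volume, nodes) == ["n1", "n2"]
```

Whether the keys are "zone"/"region" or something vendor-specific, the matching logic is identical, which is the point of keeping the keywords undefined by the spec.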
Assume that, for a CO, each node already has some notion of "topology" information. For instance, in Mesos a system administrator can configure the domain of a node, and in Kubernetes there are some well-known labels related to topology. Let's assume that there are two workloads (e.g., pods) that the user wants to schedule, and the user wants to make sure that these two workloads are in the same zone according to the CO's node topology information. Each of the workloads needs a persistent volume that must be dynamically provisioned.
Assuming workloads are placed first (case (2) above), workload 1 is scheduled in zone A and workload 2 is scheduled in zone B according to the CO's topology information. Now, the CO needs to dynamically provision a volume for workload 1. Obviously, the volume has to be accessible from the CO's zone A. IIUC, the logic you propose for the CO is to scan all nodes in the CO's zone A, find the corresponding CSI-plugin-specific topology information obtained from the
However, it is possible that the CSI-plugin-specific topology information you picked and fed into
We plan to replace that with something more generic so Kubernetes also remains agnostic (no hard coded labels).
Ok
Why if the user constraint limits both to the same zone are they scheduled to different zones?
No. On Kubernetes the first workload will get scheduled to some zone, let's say zone A. The second workload will have a constraint (
I think this may be the misunderstanding. The CO never feeds its own topology labels to
Force-pushed bf0a226 to 9d0217d.
Could add more details around how topology is enumerated by the CO and SP; even node-level topology is interesting in a cluster (let alone regions/zones), and the spec should handle that.
csi.proto
Outdated
// accessible from.
// A topological segment is a sub-division of a cluster, like region,
// zone, rack, etc.
// An SP SHALL advertise the requirements for topology in
How does the CO (or for that matter an SP) enumerate a region/zone or rack in the topology? Does the SP need to query and pull this data from the CO? How are changes in the topology, or access to a region or rack, notified either way? For example, the CO (or SP) detects a change in topology or loses access to a rack/zone, etc.
Is it possible for the user to configure sites/racks/zones and regions in Kubernetes that an SP can query, so that both the CO and SP have the same view of what topology elements are available? What would the definition include, and how does the SP figure out that it can access and provision volumes at a zone/region/rack, etc.?
How does the CO (or for that matter an SP) enumerate a region/zone or rack in the topology? Does the SP need to query and pull this data from the CO?
If the volumes that an SP makes available have availability constraints, the SP should have some mechanism to enumerate those topological segments. The SP communicates those to the CO via the NodeGetId call. So the CO "enumerates" the SP's topology via the NodeGetId call (i.e. it retrieves the node/zone/rack/etc., as defined by the SP, that each node belongs to).
How are changes in the topology, or access to a region or rack, notified either way? For example, the CO (or SP) detects a change in topology or loses access to a rack/zone, etc.
The topology for a node is retrieved via a NodeGetId call when the plugin is registered on the node. The topology for a volume is returned in the CreateVolume response. Once a volume or node is provisioned, its topology is not expected to change. In the odd chance that it does, reinitializing those elements will allow picking up the new topology info.
Is it possible for the user to configure sites/racks/zones and regions in Kubernetes that an SP can query, so that both the CO and SP have the same view of what topology elements are available?
In the proposed design, when a new plugin is registered, Kubernetes will use NodeGetId to retrieve the SP's topology information for each node and apply it as labels to the Kubernetes node object. These labels can then be used with all the familiar Kubernetes mechanisms (PodAffinity, etc.). Making the SP aware of any CO-specific topology information is not part of this design, because it is unnecessary.
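A rough sketch of that label-application step (illustrative Python only; the domain prefix "csi.example.com" is a hypothetical placeholder, since Kubernetes label keys are conventionally domain-prefixed, and is not defined by this proposal):

```python
def labels_from_topology(topology, prefix="csi.example.com"):
    """Turn an SP's NodeGetId topology segments into node labels.

    The CO copies the SP's opaque key/value pairs verbatim, only adding
    a namespacing prefix to avoid clashing with other label sources.
    """
    return {f"{prefix}/{key}": value for key, value in topology.items()}

# Hypothetical SP topology for one node:
node_topology = {"region": "us-west1", "zone": "us-west1-b"}

assert labels_from_topology(node_topology) == {
    "csi.example.com/region": "us-west1",
    "csi.example.com/zone": "us-west1-b",
}
```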
csi.proto
Outdated
// {"region":{"us-west1"}, "zone":{"us-west1-b"}}
// Indicates the node exists within the "region" "us-west1" and the
// "zone" "us-west1-b".
map<string, string> accessible_from = 2;
It's possible the SP can only make the volume available on specific nodes in a cluster. How would an SP figure out what nodes are in the cluster, and then what adapters are available on those nodes, to determine if the SP can expose volumes on them?
Topology should include host-adapter-level access for an SP to be able to make a volume available to a host where the container is going to run. Nodes can be provisioned/de-provisioned in the cluster.
It's possible that an SP may have multiple paths to a volume, but they may be active-active or active-passive. A volume may be created and the SP may return the active path (region/zone/site/rack) for the volume (in the create volume response). Later, if the passive path becomes active, how does the SP notify the change?
It's possible the SP can only make the volume available on specific nodes in a cluster.
Yes. The SP defines its own topology; it can segment nodes however it chooses.
How would an SP figure out what nodes are in the cluster, and then what adapters are available on those nodes, to determine if the SP can expose volumes on them?
SPs should have an internal mechanism to discover and identify nodes. COs use the NodeGetId call to map SP node information to CO node information. The SP defines topology, so it is up to the SP to specify which topological segment(s) a node falls into. The CO doesn't tell the SP what the topology is. Volume availability is an inherent trait of an SP, and therefore the SP shares its availability topology information with the CO so that the CO can make appropriate scheduling decisions.
Topology should include host adapter level access for an SP to be able to make a volume available to a host where the container is going to run. Nodes can be provisioned/de-provisioned in the cluster.
This doesn't make sense to me; could you clarify what you mean?
Its possible that an SP may have multiple paths to a volume but they be active-active or active-passive. A volume may be created and the SP may return the active path (region/zone/site/rack) for the volume (in the create volume response). Later if the passive path becomes active, how does the SP notify the change.
If it is manifested as limited availability within the CO's cluster, then the design here should suffice. If the volume is equally accessible from the CO's cluster, then that will be addressed by a follow up proposal which will address spreading of volumes across SP specific failure domains.
csi.proto
Outdated
// corresponding key in accepted topologies.
// This field is OPTIONAL. If it is not specified, the SP MAY choose
// from any of the set of possible values specified in
// accepted topologies.
Could clarify whether, while the CO specifies preferred value(s), the SP should be able to make the volume accessible from values in both the preferred and the accepted lists. Or is that not a requirement? In that case, once the volume is created, the CO can't expect the volume to be accessible outside of the topology segments in the preferred list.
Any values in the preferred map must exist in the accepted list. The accepted list defines the list of possibilities. The optional preferred list simply specifies which of those are preferred by the CO. The SP must not provision outside of what is dictated in the accepted list.
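That selection rule might look roughly like the following Python sketch (illustrative only; the flat zone lists and first-match ordering are assumptions, not spec text):

```python
def pick_segment(accepted, preferred):
    """SP-side selection: try preferred values first, in order, then fall
    back to any accepted value. Never return anything outside the
    accepted list, since accepted defines the full set of possibilities.
    """
    for value in preferred:
        if value in accepted:  # preferred values must also be accepted
            return value
    return accepted[0] if accepted else None

accepted = ["us-west1-b", "us-west1-c", "us-west1-e"]

# A workable preference wins:
assert pick_segment(accepted, ["us-west1-e", "us-west1-c"]) == "us-west1-e"
# No preference given: any accepted value is fine.
assert pick_segment(accepted, []) == "us-west1-b"
```

In a real SP the "is this value workable" check would also consult capacity and other backend state before committing to a preferred segment.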
OK, let's assume both workloads are scheduled in zone A. When Kubernetes fetches the CSI topology labels corresponding to that node (let's assume it's called
The more I think about this, the more I think that maybe the CO needs to inform the SP about its own topology information, rather than the other way around.
The key is that the CO's topology (or domain) information is never communicated to the SP. The CO only retrieves the SP's topology (or domain) information from the SP and passes it back to the SP (it remains agnostic to specific labels).
The CO's topology information is irrelevant to the storage system. The CO uses its own topology information to make a decision about where to schedule a workload, but once that decision is made it just needs to communicate to the SP that the provisioned volume must be accessible from the selected destination.
@jieyu and I discussed this offline. Jie's argument is that COs can have their own topology independent of the SP. If the SP is only aware of its own topology, and volume provisioning only specifies the SP's topology, then the provisioned volume may not be accessible from all nodes that are part of the CO's topology.
My counter-argument is that the SP must not be responsible for understanding the topology of the CO--that would be a leaky abstraction. The SP is not responsible for scheduling workloads; it is responsible for provisioning volumes. While provisioning volumes it should not be required to understand CO-specific topology information. It is the responsibility of the CO's scheduler to understand its own topology (along with any constraints due to volumes) and schedule accordingly. Jie says this would make it difficult for Mesos, but that he will think it over.
@julian-hj suggested adding a
Force-pushed 9d0217d to 3d65772.
@chhsia0 The difference in proposals boils down to this:

message Topology {
  map<string, string> segments = 1;
}

vs.

message Topology {
  string segments = 1;
}

@jdef preferred the second; I prefer the first, since losing key information will make for a painful user experience in Kubernetes, where labels are key/value pairs. I spoke with @jdef yesterday, and it sounds like he is OK with going with the map approach.
Yeah, I know that the difference is map vs. strings in the spec. I was trying to understand the conceptual difference, since the expressiveness (in terms of describing the relation between volumes and accessible nodes) of both formats seems equivalent to me. I came up with the above interpretation because Mesos is leaning towards a world where the topology value strings are completely opaque to the user, so the user won't know anything about the keys/values exposed by plugins.
spec.md
Outdated
// alphanumeric character with '-', '_', '.', or alphanumerics in
// between.
message Topology {
map<string, string> segments = 1; |
2 space indentation please.
Fixed.
spec.md
Outdated
// the volume to be accessible from multiple locations.
// This field is OPTIONAL. This field MAY only be set if the plugin
// advertises the ACCESSIBILITY_CONSTRAINTS capability.
repeated Topology accessible_topology = 4; |
This does not seem necessary given that `Volume` already has this information.
I don't understand: `Volume` isn't part of this RPC request
`Volume` is not part of `ValidateVolumeCapabilitiesRequest` (maybe it should be, but that is beyond the scope here), and it would be nice for a CO to be able to validate its assumptions about accessibility of the volume.
I am thinking about the workflow of this RPC. This RPC is used to validate if a volume has certain capabilities that the CO assumes. For instance, for a CO to deal with pre-existing volumes:

- `ListVolumes` is called to discover pre-existing volumes. The CO gets a list of `Volume` objects, which contain the topology information.
- The CO wants to validate if each of those volumes has certain capabilities, thus calling `ValidateVolumeCapabilities`.
In that workflow, clearly the topology information is not needed because `Volume` objects are returned from `ListVolumes`.
Is there an alternative workflow that requires topology information in this call?
I was debating why topology information of a volume is not part of `VolumeCapability`, but first class in the `Volume` object (similar to `capacity_bytes`). What's the fundamental difference between those two, and those in `VolumeCapability`?
In Kubernetes we allow the user to directly specify a pre-provisioned volume and all its attributes (including topology information). So basically, step 1 can be skipped. In that case, it would be nice to validate assumptions about accessibility too.
@saad-ali hum, OK. I just want to have a consistent story for the API. If that's the case, any reason we don't put `capacity_bytes` there? Or, can we put `topology` information in `VolumeCapability`?
+1 on the `capacity_bytes`. Honestly we should just pass the full volume object in addition to the `VolumeCapabilities`, but that is out of scope for this PR so I will not do that here.

can we put topology information in VolumeCapability?

I'm ok with that. @jdef?
@saad-ali, I had a chat with @jdef and @chhsia0 on this. The reason I was questioning this choice initially was due to consistency (I believe that if something is not consistent, usually something is missing or incorrect).

I think the reason we originally introduced `ValidateVolumeCapabilities`, instead of putting `VolumeCapability` in the `Volume` object, is because it's very hard to enumerate all possible combinations of supported volume capabilities. Instead, we let the CO drive the volume capability test using this call.
Ideally, the arguments in this call should be consistent with those in `CreateVolumeRequest` (i.e., `CapacityRange` and `TopologyRequirement`) because that's what COs should care about when dealing with pre-existing volumes the same way as if they're dealing with dynamically provisioned volumes. To map that to the k8s world, that means this call probably needs to be called when binding a PVC to a PV, rather than validating if a PV is valid or not. But I guess this is a much bigger change?

I think moving topology into volume capability is probably too big of a change (which I haven't thought through). At this stage, I think I am fine keeping it as it is right now, with the plan to add `capacity_bytes` to `ValidateVolumeCapabilities`. Those fields need to be optional; if not specified, the SP won't check those against the volume.
Ideally, the arguments in this call should be consistent with those in CreateVolumeRequest (i.e., CapacityRanage and TopologyRequirement) because that's what COs should care about when dealing with pre-existing volumes the same way as if they're dealing with dynamically provisioned volumes. To map that to the k8s world, that means this call probably needs to be called when binding a PVC to a PV, rather than validating if a PV is valid or not. But I guess this is a much bigger change?
Agreed. We should consider it pre-1.0 though. Opened #242 to track that.
I think moving topology into volume capability is probably too big of a change (which I haven't thought through). At this stage, I think I am fine keeping as it is right now, with the plan to add capacity_bytes to ValidateVolumeCapabilities. Those fields need to be optional, if not specified, the SP won't check those against the volume.
SGTM, will leave as is for now.
spec.md
Outdated
// volume is accessible from.
// A plugin that returns this field must also set the
// ACCESSIBILITY_CONSTRAINTS plugin capability.
// An SP may specify multiple topologies to indicates the volume is |
nit: s/indicates/indicate/
s/may/MAY/
Fixed.
spec.md
Outdated
|
// Specifies where (regions, zones, racks, etc.) the provisioned
// volume is accessible from.
// A plugin that returns this field must also set the |
s/must/MUST/
Fixed
spec.md
Outdated
// ACCESSIBILITY_CONSTRAINTS plugin capability.
// An SP may specify multiple topologies to indicates the volume is
// accessible from multiple locations.
// COs may use this information along with the topology information |
s/may/MAY/
Fixed
spec.md
Outdated
// This field is OPTIONAL. If TopologyRequirement is specified either
// permitted or preferred or both MUST be specified.
//
// If permitted is specified, the provisioned volume must be |
s/must/MUST/
Done
spec.md
Outdated
// An SP MUST attempt to make the provisioned volume available using
// the preferred topologies in order from first to last.
//
// If permitted is specified, all topologies in preferred volume must |
s/must/MUST/
Fixed
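The selection behavior being discussed in these threads can be sketched as follows (my illustration, not spec text): the SP tries the preferred topologies in order, then falls back to the remaining requisite ones ("requisite" is the name the permitted field was later renamed to); `can_provision` is a hypothetical backend capacity check.

```python
def pick_topology(requisite, preferred, can_provision):
    """Try preferred topologies in order, then any remaining requisite
    topologies; return the first one the backend can satisfy."""
    remaining = [t for t in requisite if t not in preferred]
    for topo in list(preferred) + remaining:
        if can_provision(topo):
            return topo
    return None  # caller would surface this as a provisioning failure

# Hypothetical requisite list: the volume may land in any of three zones.
zones = [{"zone": "Z1"}, {"zone": "Z2"}, {"zone": "Z3"}]
```

For example, if the CO prefers Z2 but the backend has no capacity there, the SP falls back to the first remaining requisite zone.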
spec.md
Outdated
// topological accessibility information supported by the SP. COs
// SHALL permit passing through topological accessibility information.
// This field is OPTIONAL.
// This field MAY be specified only if the SP has the |
This field SHALL NOT be specified unless the SP has the ACCESSIBILITY_CONSTRAINTS plugin capability.
Done.
spec.md
Outdated
// the volume to be accessible from multiple locations.
// This field is OPTIONAL. This field MAY only be set if the plugin
// advertises the ACCESSIBILITY_CONSTRAINTS capability.
repeated Topology accessible_topology = 4; |
I don't understand: `Volume` isn't part of this RPC request
spec.md
Outdated
// specified `accessibility_requirements`. These are the same
// `accessibility_requirements` the CO will use in a
// `CreateVolumeRequest`.
// This field is OPTIONAL. |
and ... "This field SHALL NOT be set unless the plugin advertises the ACCESSIBILITY_CONSTRAINTS capability."
Done.
spec.md
Outdated
|
// Specifies where (regions, zones, racks, etc.) the node is
// accessible from.
// A plugin that returns this field must also set the |
s/must/MUST/
Done
spec.md
Outdated
// accessible from.
// A plugin that returns this field must also set the
// ACCESSIBILITY_CONSTRAINTS plugin capability.
// COs may use this information along with the topology information |
s/may/MAY/
Done.
Feedback addressed. PTAL |
// Each string MUST be 63 characters or less and begin and end with an
// alphanumeric character with '-', '_', '.', or alphanumerics in
// between.
message Topology { |
in that case, please create a follow-up Issue in GH and link it to this comment
spec.md
Outdated
@@ -1176,10 +1176,10 @@ message ValidateVolumeCapabilitiesRequest {

// Specifies where (regions, zones, racks, etc.) the caller believes
// the volume is accessible from.
// An caller may specify multiple topologies to indicates they believe
// An caller MAY specify multiple topologies to indicate they believe
// An caller MAY specify multiple topologies to indicate they believe |
nit: s/An/A/
Fixed.
Feedback addressed PTAL |
LGTM. I'd wait for @jdef's approval before merging.
Sounds good. Will squash changes after @jdef's review. |
spec.md
Outdated
// volume is accessible from.
// A plugin that returns this field MUST also set the
// ACCESSIBILITY_CONSTRAINTS plugin capability.
// An SP may specify multiple topologies to indicate the volume is |
s/may/MAY/
spec.md
Outdated
// An SP MUST attempt to make the provisioned volume available using
// the preferred topologies in order from first to last.
//
// If permitted is specified, all topologies in preferred volume MUST |
"preferred volume" -- I don't think "volume" makes sense here?
Good catch. Changed to `list`.
spec.md
Outdated
// "Z3" or "Z5" in the "region" "R1". | ||
// | ||
// Example 3: | ||
// Given a volume should be accessible from TWO zones (because opaque |
"an opaque" ?
Done.
// A topological segment is a specific instance of a topological domain,
// like "zone3", "rack3", etc.
// For example {"com.company/zone": "Z1", "com.company/rack": "R3"}
// Valid keys have two segments: an optional prefix and name, separated |
Thanks, much better
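As a sanity check of the key rules quoted above (63-character limit, alphanumeric at both ends, '-', '_', '.' or alphanumerics in between, optional reverse-DNS prefix), here is an illustrative validator; this is mine, not from the spec, and the prefix check is a crude stand-in for full reverse-DNS validation:

```python
import re

# Name segment: <=63 chars, alphanumeric at both ends, with '-', '_',
# '.' or alphanumerics in between (length checked separately).
NAME_RE = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9._-]*[A-Za-z0-9])?$")

def valid_topology_key(key: str) -> bool:
    """Validate an 'optional-prefix/name' topology key."""
    prefix, _, name = key.rpartition("/")
    if prefix:
        # Rough reverse-DNS check: every dot-separated label well-formed.
        if not all(NAME_RE.match(label) for label in prefix.split(".")):
            return False
    return len(name) <= 63 and bool(NAME_RE.match(name))
```

For example, `zone` and `com.company/zone` pass, while `-zone` (bad first character) fails.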
// in between, for example "zone".
// The key prefix MUST follow reverse domain name notation format
// (https://en.wikipedia.org/wiki/Reverse_domain_name_notation).
// The key prefix SHOULD include the plugin's host company name and/or |
What do you think about consistent prefixes for keys? i.e. I'd like all topo keys reported by the plugin (across all RPCs) to share the same prefix, or no prefix at all. It is an error for a plugin to return some keys w/ a prefix, and then some without. It is also an error for different keys to specify a different prefix.
Yes, I like that; will add wording
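The consistency rule proposed here (all topology keys share one prefix, or none has a prefix) could be checked roughly like this; an illustrative sketch, not spec text:

```python
def topology_prefix(key: str) -> str:
    """Return the reverse-DNS prefix of a topology key, or "" if absent."""
    return key.rsplit("/", 1)[0] if "/" in key else ""

def prefixes_consistent(keys) -> bool:
    """True when every key shares a single prefix, or no key has one.
    Mixing prefixed and unprefixed keys, or two different prefixes,
    fails the check."""
    return len({topology_prefix(k) for k in keys}) <= 1
```

So `["com.company/zone", "com.company/rack"]` passes, while `["com.company/zone", "rack"]` would be rejected as an error.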
spec.md
Outdated
// `CreateVolumeRequest`.
// This field is OPTIONAL. This field SHALL NOT be set unless the
// plugin advertises the ACCESSIBILITY_CONSTRAINTS capability.
TopologyRequirement accessibility_requirements = 7; |
What's the advantage of `TopologyRequirement` here vs. `Topology`? In other words, what benefit does an SP derive from seeing the CO's "preferred" topologies; would the reported capacity change based on the presence of the CO's preference?

Thinking about this RPC and `ValidateVolumeCapabilitiesRequest` and wondering why the topo accessibility fields are of different types.
Good point, will change to Topology.
s/permitted/requisite/ based on feedback from @cpuguy83 |
Can we add some more details on error handling (e.g. codes to use) when topology requirements can't be met?
Thinking `failed precondition` may be a good fit for this case.
Done. I went with RESOURCE_EXHAUSTED, which seemed more appropriate. |
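The recovery behavior that goes with RESOURCE_EXHAUSTED ("retry with exponential backoff" once the underlying condition, e.g. quota, is addressed) could look roughly like this on the CO side. This is an illustrative sketch only; `rpc` is a hypothetical stand-in for a CreateVolume call that returns a gRPC status code:

```python
import time

RESOURCE_EXHAUSTED = 8  # gRPC status code
OK = 0

def call_with_backoff(rpc, max_attempts=5, base_delay=0.1):
    """Retry an RPC with exponential backoff while it reports
    RESOURCE_EXHAUSTED; return the final status code observed."""
    delay = base_delay
    code = rpc()
    for _ in range(max_attempts - 1):
        if code != RESOURCE_EXHAUSTED:
            break
        time.sleep(delay)  # back off before the next attempt
        delay *= 2
        code = rpc()
    return code
```

A real CO would cap the delay and give up with an error to the user once `max_attempts` is exhausted.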
LGTM
I had one other question. Otherwise LGTM
spec.md
Outdated
// volume MUST be accessible from.
// An SP SHALL advertise the requirements for topological
// accessibility information in documentation. COs SHALL only specify
// topological accessibility information supported by the SP. COs |
"COs SHALL permit passing through topological accessibility information."
If I'm writing a CO, what does this really mean? Do we actually need this sentence?
This is trying to say that a cluster admin should be able to specify topology constraints. I'm ok dropping that requirement.
Or rewording it.
It feels strange. I'd rather we drop it. COs may phase in support for topology at different times.
SGTM, will drop.
Feedback addressed. Commits squashed. |
typo, otherwise LGTM
spec.md
Outdated
@@ -793,7 +1002,8 @@ The CO MUST implement the specified error recovery behavior when it encounters t
| Condition | gRPC Code | Description | Recovery Behavior |
|-----------|-----------|-------------|-------------------|
| Volume already exists but is incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified volume `name` already exists but is incompatible with the specified `capacity_range`, `volume_capabilities` or `parameters`. | Caller MUST fix the arguments or use a different `name` before retrying. |
| Operation pending for volume | 10 ABORTED | Indicates that there is a already an operation pending for the specified volume. In general the Cluster Orchestrator (CO) is responsible for ensuring that there is no more than one call "in-flight" per volume at a given time. However, in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The Plugin, SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls. | Caller SHOULD ensure that there are no other calls pending for the specified volume, and then retry with exponential back off. |
| Unable to provision in `accessible_topology` | 8 RESOURCE_EXHAUSTED | Indicates that although the `accessible_topology` field is valid, a new volume can not be provisioned with the specified topology constraints. More human-readable information MAY be provided in the gRPC `status.message` field. | Caller MUST ensure that whatever is preventing volumes from being provisioned in the specified location (e.g. quota issues) is addressed before retrying with exponential backoff. |
| Operation pending for volume | 10 ABORTED | Indicates that there is a already an operation pending for the specified volume. In general the Cluster Orchestrator (CO) is responsible for ensuring that there is no more than 66one call "in-flight" per volume at a given time. However, in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The Plugin, SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls. | Caller SHOULD ensure that there are no other calls pending for the specified volume, and then retry with exponential back off. | |
s/66one/one/
Good catch :)
LGTM
Thanks everyone! Rebased. |
Volumes made available by an arbitrary storage provider (SP) may be partitioned such that the volumes are not equally available throughout the cluster the CO is managing. Instead, volumes may be constrained to some subset of node(s) in the cluster. We will refer to the segmenting of these clusters (into "racks" in on-premises clusters or into "regions" and "zones" in cloud environments, for example) as topology.
This PR introduces the concept of topology to the CSI spec.
Modern Container Orchestration Systems (COs) enable dynamic provisioning of volumes on demand when a workload is scheduled, and also enable pre-provisioned volumes to be used by workloads. In order to make effective scheduling and provisioning decisions, COs need to be able to both influence and act on topology information for each volume.
If volumes exposed by an SP are equally available from everywhere in the cluster that the CO is managing, then the CO doesn't need to influence where a workload is scheduled or where a volume is provisioned, and no topology information needs to be shared between the SP and the CO.
An SP may also have a cluster topology different from (independent of) the topology of the cluster where work is being scheduled. In that case, because the volumes are still available everywhere in the cluster that the CO is managing, the CO doesn't need to influence where a workload is scheduled or where a volume is provisioned. Regardless, the CO or end user may still want to influence where a volume is provisioned. To enable this, the SP can simply expose a mechanism via an opaque parameter in the `CreateVolume` call (the Container Storage Interface enables SPs to expose an arbitrary set of configuration options during dynamic provisioning by allowing opaque parameters to be passed from cluster admins to the storage plugins, to control things like `diskType` or, in this case, perhaps some storage-system-specific topology).

For many storage systems, volumes are not equally accessible from all nodes in the cluster the CO is managing. The CO must therefore be able to retrieve the topological location of a volume and use it to influence where the workload referencing it can be scheduled or, vice versa, use the location of a workload to influence where a volume for that workload should be provisioned. Therefore, a standardized mechanism for sharing topological information between COs and SPs must be defined.
There exists an arbitrary number of topologies that may differ from cluster to cluster, and a single cluster may contain multiple topologies, so this solution must be able to handle topology generically (not hard-code implementation-specific concepts like regions, zones, etc., as proposed in #7).
Closes #43