
Introduce concept of topology to CSI spec. #188

Merged May 31, 2018 (1 commit)

Conversation

@saad-ali (Member) commented Feb 9, 2018

Volumes made available by an arbitrary storage provider (SP) may be partitioned such that the volumes are not equally available throughout the cluster the CO is managing. Instead, volumes may be constrained to some subset of node(s) in the cluster. We will refer to the segmenting of these clusters (into "racks" in on-premise clusters or into "regions" and "zones" in cloud environments, for example) as topology.

This PR introduces the concept of topology to the CSI spec.

Modern Container Orchestration Systems (COs) enable dynamic provisioning of volumes on demand when a workload is scheduled, and also enable pre-provisioned volumes to be used by workloads. In order to make effective scheduling and provisioning decisions, COs need to be able to both influence and act on topology information for each volume.

If volumes exposed by an SP are equally available from everywhere in the cluster that the CO is managing, then the CO doesn't need to influence where a workload is scheduled or where a volume is provisioned, and no topology information needs to be shared between SP and CO.

An SP may also have a cluster topology different from (independent of) the topology of the cluster where work is being scheduled. In that case, because the volumes are still available everywhere in the cluster that the CO is managing, the CO doesn't need to influence where a workload is scheduled or where a volume is provisioned. Regardless, the CO or end user may still want to influence where a volume is provisioned. To enable this, the SP can simply expose a mechanism via an opaque parameter in the CreateVolume call (the Container Storage Interface enables SPs to expose an arbitrary set of configuration options during dynamic provisioning by allowing opaque parameters to be passed from cluster admins to the storage plugins, to control things like diskType or, in this case, maybe some storage-system-specific topology).
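
For reference, the opaque mechanism referred to here is the existing parameters field of CreateVolumeRequest; a simplified sketch, with other fields elided and the field number and example values purely illustrative:

message CreateVolumeRequest {
  // ... other fields elided ...

  // Opaque, SP-specific provisioning parameters passed through from
  // the cluster admin, e.g. {"diskType": "ssd"}. The CO does not
  // parse or interpret these.
  map<string, string> parameters = 4;
}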

For many storage systems, however, volumes are not equally accessible from all nodes in the cluster the CO is managing. The CO must therefore be able to retrieve the topological location of a volume and use it to influence where the workload referencing it can be scheduled or, vice versa, use the location of a workload to influence where a volume for that workload should be provisioned. Hence, a standardized mechanism for sharing topological information between COs and SPs must be defined.

There exist an arbitrary number of topologies that may differ from cluster to cluster, and a single cluster may contain multiple topologies, so this solution must handle topology generically (not hard-code implementation-specific concepts like regions, zones, etc., as proposed in #7).

Closes #43

@saad-ali (Member Author) commented Feb 9, 2018

CC @Jgoldick

@akutz (Contributor) commented Feb 9, 2018

Hi @saad-ali,

Right off the bat I have to caution against the use of additional, named fields -- especially when CreateVolume already has the parameters field. I thought we've previously agreed that we both like the opaque nature of CSI and that codifying too many specifics just paves the way for possible misuse or errors. What is the justification for not leaving this information in parameters and making it opaque to the CO and SP-specific? Because the only thing I see that is any different is the name of the field, and thus anything placed in topologies could be placed in parameters or attributes.

I know you added it to NodeGetIDResponse as well, but to me that's just an argument for introducing a generic map for every request/response for generic data.

@saad-ali (Member Author) commented Feb 9, 2018

Right off the bat I have to caution against the use of additional, named fields.

That is good instinct and important for protecting the project from bloat.

The bar that I use for what should be added to the spec:

  1. The CO needs to be able to parse or manipulate it for orchestration operations.
  2. It will be used by multiple storage systems.

Opaque parameters are perfect for exposing SP-specific "knobs" that the CO does not need to manipulate or parse.

But in this case the CO needs to be able to operate on topology information to make scheduling decisions, and therefore an opaque parameter would not suffice.

What is the justification for not leaving this information in parameters and making it opaque to the CO and SP-specific

To be clear, in the case that the topology of the SP is independent of the CO (the volume is equally accessible from all nodes), opaque params should be used. This PR only handles the case where volumes are not accessible everywhere in the cluster; therefore the CO necessarily needs to be aware of this information for proper operation.

@Jgoldick commented Feb 10, 2018

@saad-ali writes:

To be clear, in the case that the topology of the SP is independent of the CO (the volume is equally accessible from all nodes), opaque params should be used.

Are you confident that this topology solution can be separated from the volume anti-affinity issue we'll need to address to support HA databases (#44)? I wrote them up separately in that hope, but it's not yet clear to me that the CO won't need non-opaque params in the case where a volume can be accessed from any node but the CO still needs to know when Volume A and Volume B are in the same fault domain.

@saad-ali (Member Author) commented Feb 10, 2018

@Jgoldick This design does not address automated spreading of volumes to account for storage specific fault tolerance. It's possible to do so manually with opaque parameters. But doing so in an automated way will be something I'd like to tackle next (independent of this). Something like node anti-affinity from Kubernetes would be very powerful here.

This design is focused on addressing the "volume availability within cluster" gap.

I'm going to rework this proposal. As it is currently worded, it requires a CO to dictate the exact topological segment(s) the volume must be provisioned in. In some cases the CO may have a preference for a segment but permit a set of possibilities. So instead of dictating exactly where a volume MUST be provisioned, I will loosen the language and syntax to allow storage plugins to select from a list of possibilities, with a way for the CO to indicate its preference, roughly as sketched below.
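
A minimal sketch of that loosened shape, using this revision's field names (they were renamed before merge):

message ListOfString {
  repeated string values = 1;
}

message TopologyRequirement {
  // The set of possibilities: the volume MUST be provisioned in one
  // of these topological segments.
  map<string, ListOfString> accepted_topologies = 1;

  // Which of those possibilities the CO prefers; MUST be a subset of
  // accepted_topologies.
  map<string, ListOfString> preferred_topologies = 2;
}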

@saad-ali force-pushed the csiTopology branch 2 times, most recently from 68e08ce to 82e6db4, on February 13, 2018

@saad-ali (Member Author):

Made the changes above. PTAL

@jdef (Member) left a comment:

Completed a first-pass review. Will think on this more.

spec.md Outdated
// {"region":{"us-west1"}, "zone":{"us-west1-b"}}
// Indicates the node exists within the "region" "us-west1" and the
// "zone" "us-west1-b".
map<string, string> accessibility_requirements = 2;

Member:

should the type of this field be map<string, ListOfString>?

Member Author:

I was thinking about this and decided against it, since I cannot think of any concrete case where a node would belong to multiple topological segments (e.g. a node that belongs to two different zones or two different racks does not make sense).

Member:

OK. Maybe we could update the example then?

//   {"region":"us-west1", "zone":"us-west1-b"}

Member Author:

What do you mean? The example seems correct to me: it shows the node belongs to a single zone within a single region (as opposed to multiple zones, for example).

spec.md Outdated

// TopologyRequirement captures the topological segment(s) the CO wants
// a volume to be provisioned in.
message TopologyRequirement {

Member:

naming: consider TopologyPolicy instead? This message MAY define both requisite and preference topology segment lists, each of which is OPTIONAL. Given that a message of this type may only contain preferences, the suffix "Requirement" seems to overreach.

Member Author:

Naming is hard.

Given that a message of this type may only contain preferences

I don't think this is accurate. As it is currently worded, the TopologyRequirement message, if it exists, would contain the required set of possibilities along with which of those are preferred. It does not allow for a list of preferred options only. As worded, anything that is in preferred_topologies must be present in accepted_topologies. Therefore I think "requirement" seems more accurate.

However, I could loosen this to allow for preferred only -- so, for example, preferred_topologies could exist without accepted_topologies, meaning the SP should use preferred_topologies, but if that doesn't work, use whatever it wants. If we do this then Policy would be a better name. But I'm not sure we want to enable that use case.

spec.md Outdated
// Indicates the volume may be accessible from any combination of two
// "zones" in the "region" "us-west1": e.g. "us-west1-b" and
// "us-west1-c" or "us-west1-b" and "us-west1-d", etc.
map<string, ListOfString> accepted_topologies = 1;

Member:

nit: avoid API stutter. The prefix of this type is already "Topology", we probably don't need it in the suffix as well. I would be perfectly happy with accepted or requisite or mandatory here.

Member Author:

Good point, will remove suffix and leave prefixes only.

spec.md Outdated
// accessible from the combination of the two "zones" "us-west1-c" and
// "us-west1-e" in the "region" "us-west1" and fall back to other
// combinations if that is not possible.
map<string, ListOfString> preferred_topologies = 2;

Member:

ditto, preferred reads better to my eyes

Member Author:

Will do!

spec.md Outdated
@@ -1192,6 +1345,56 @@ message NodeGetIdResponse {
// CO in subsequent `ControllerPublishVolume`.
// This is a REQUIRED field.
string node_id = 1;


// The topological segment(s) the provisioned volume is accessible

Member:

This and the next 25 lines could probably be deleted - they frame this field from the perspective of a volume, and not a node. The first sentence of this comment should be the line that starts with "The topological segment(s) this node belongs..."

Member Author:

Yep, that was a copy paste I forgot to clean up. Will remove.

spec.md Outdated
// {"region": {"us-west1"}, "zone": {"us-west1-b", "us-west1-c"}}
// Indicates the volume is accessible from both "zone" "us-west1-b"
// and "zone" "us-west1-c" the "region" "us-west1".
map<string, ListOfString> accessibility_requirements = 4;

Member:

ditto, prefer accessible_from

Member Author:

Renaming.

spec.md Outdated
// accessible from a given node when scheduling workloads.
// This field is OPTIONAL. If it is not specified, the CO MAY assume
// the node is not subject to any topological constraint and may
// schedule workloads referencing any volume without any topological

Member:

I think there's a couple of ways to read this sentence:

(1) the CO ... MAY schedule workloads that reference any volume V on this node, irrespective of any topological constraints declared for V.
(2) the CO ... MAY schedule workloads that reference any volume V, such that there are no topological constraints declared for V.

Which do you intend?

Member Author:

2 was the intention. Thanks for the clarification. I will modify to your wording.

spec.md Outdated
// from a given node when scheduling workloads.
// This field is OPTIONAL. If it is not specified, the CO MAY assume
// the volume is equally accessible from all nodes in the cluster and
// may schedule workloads referencing the volume on any available node

Member:

nit: end sentence w/ a period

Member Author:

Goes over the 72 char limit, but ok, created new line with one word. =P

Contributor:

Hey, if the COs flout the rules you cannot expect the community to observe them. Thanks for following the rules, Saad :)

spec.md Outdated
// from. A topological segment is a sub-division of a cluster, like
// region, zone, rack, etc.
// Each key must consist of alphanumeric characters, '-', '_' or '.'.
// Each value MUST contain 1 or more valid strings.

Member:

What constitutes a "valid" string for a value?

Member Author:

This is a good question. I will clarify.

spec.md Outdated
// COs may use this information along with the topology information
// returned by NodeGetId to ensure that a given volume is accessible
// from a given node when scheduling workloads.
// This field is OPTIONAL. If it is not specified, the CO MAY assume

Member:

How are node-local volumes accounted for by this strategy? If this field is OPTIONAL but the volume is local (is tied to a particular node) then it doesn't make sense that it's "equally accessible from all nodes in the cluster".

Member Author:

Node-local volumes should set a topological segment, e.g. {"node": {"nodeName"}}, and this would have to match between node and volume.

csi.proto Outdated
// accessible from the combination of the two "zones" "us-west1-c" and
// "us-west1-e" in the "region" "us-west1" and fall back to other
// combinations if that is not possible.
map<string, ListOfString> preferred_topologies = 2;

Contributor:

Due to the way new message types complicate things, can we not agree that ListOfString be a string that can have multiple values via some known separator and have a size value that exceeds that of a normal string?

Contributor:

Also, I assume you tried to use the repeated keyword with the map value? That’s not supported?

Member:

Disagree that ListOfStrings should be replaced with a simpler map of string,string. I quite prefer the explicitness of ListOfStrings and that no one has to write tokenizing logic or consider escape sequences (because someone wants to use a token separator in a segment name).

Contributor:

James,

The reason for my suggestion is in part due to the way a custom type prevents the ability to create interface assertions on types in a language like Go without referencing the language bindings.

Contributor:

That’s fine. Has anyone tried a repeated keyword inside the map value?

Contributor:

Nevermind.


I still believe using an existing type rather than defining a new one is preferred for downstream language bindings.
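
For context, the wrapper is needed because of a proto3 restriction: map value types cannot be repeated. A minimal sketch of the workaround under discussion (TopologyExample is a placeholder name):

// Not legal proto3:
//   map<string, repeated string> topologies = 1;
//
// Wrapping the list in a message works around the restriction:
message ListOfString {
  repeated string values = 1;
}

message TopologyExample {
  map<string, ListOfString> topologies = 1;
}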

@jdef (Member) commented Feb 13, 2018 via email

@Jgoldick commented Feb 14, 2018

Was thinking about the naming constraint of alphanumeric characters "-", "_", and ".".

Looking at the most similar AWS concept, Placement Groups, you see the naming allows up to 255 ASCII characters. Is there a reason not to accept this more relaxed definition?

Looking at Azure Naming Conventions, the biggest issue would seem to be around case-insensitive comparisons. It'd be worth a note as to whether we are doing exact comparisons, as this causes much confusion for Windows admins.

I think it would also be useful to add an example where you actually put in a rack constraint. I've often seen rack names use delimiters like "/" or ":" so that they provide some internal data center layout/organization information. It'd probably be worth asking around to see how folks are naming their racks. See Duplicate Rack Names for some useful examples out of the Digital Ocean world. Even though the CSI spec as written allows us to specify multiple criteria and thereby define a specific rack within a zone, we can't group them. Taking this further, I don't know how I would specify a preferred_topology in the presence of non-unique rack names given multiple zones within a single data center. On the other hand, if racks are fully described and made unique, like /zone-name/cage#-row#-rack#, then it becomes clear how to satisfy/specify the constraint. Note that I used an obvious delimiter "/" to avoid name collisions between zones and racks.

@saad-ali (Member Author):

Good questions @Jgoldick.

Is there a reason not to accept this more relaxed definition?

This interface has to ultimately work with Kubernetes (as well as other COs), so some constraints come from there. In this case, key/value pairs returned by NodeGetId(...) need to be associated with Kubernetes Node objects somehow. The plan is to do so by applying these key/values as Labels on Kubernetes node objects.
The keys and values for Kubernetes labels are restricted (see the Kubernetes documentation on labels).

Specifically, values are limited to 63 characters or less and must be empty or begin and end with an alphanumeric character ([a-z0-9A-Z]), with dashes (-), underscores (_), dots (.), and alphanumerics between. Keys are a combination of an optional prefix and a name, separated by a slash. The name segment is limited to 63 characters or less, beginning and ending with an alphanumeric character ([a-z0-9A-Z]), with dashes (-), underscores (_), dots (.), and alphanumerics between.

The plan is that the key returned by CSI (e.g. zone) would become the "name" part of the Kubernetes label key, and the prefix of the label key would contain the driver name. So, for example, "zone": "us-west-1a" would translate to the Kubernetes label "com.google.csi.gcepd/zone": "us-west-1a".

I agree that these are a bit too strict. But I don't see a good way to get around this without modifying Kubernetes. So I'd like to adopt the same level of strictness for now, work to loosen the requirements on the Kubernetes side, and then do the same on the CSI side.

It'd be worth a note as to whether we are doing exact comparisons as this causes much confusion for Windows admins

I believe proto maps in most (all?) languages are case sensitive. But Rack and rack should not be two separate keys, so I added a requirement that keys be treated as case insensitive (the caller will be responsible for ensuring this is the case, and the receiver can reject the call as invalid if it contains the same key with different casing).

I think it would also be useful to add an example where you actually put in a Rack constraint.
I don't know how I would specify a preferred_topology in the presence of non-unique rack names given multiple zones within a single data center.

The spec as defined allows specifying the fully qualified topology hierarchy for a volume or node. So cases like the one you linked should work nicely with the defined case. I'll add some examples.

@saad-ali (Member Author):

Thanks for the feedback. PTAL!

@jieyu (Member) commented Feb 14, 2018

Some initial thoughts on this. At a high level, this proposal suggests that the topology information about where the volume can be used should be specified by the SP, and will be opaque key/value based.

Given that the CO needs this topology information to decide where the volume and the workload can be placed, does it mean that the CO has to perform scheduling based on vendor specific topology information? That sounds very complex to me, especially for some CO (e.g. Mesos) that exposes a general scheduling API.

I think a lot of the comments in this PR try to explain what the "matching" algorithm is for the CO. I am actually leaning towards some pre-defined keywords (e.g., region/zone), and pre-defined semantics in terms of scheduling that the CO will do based on those keywords. Then, let the SPs figure out the mapping from their own internal topology information to those keywords based on the scheduling semantics.

@saad-ali (Member Author):

Some initial thoughts on this. At a high level, this proposal suggests that the topology information about where the volume can be used should be specified by the SP, and will be opaque key/value based.

Given that the CO needs this topology information to decide where the volume and the workload can be placed, does it mean that the CO has to perform scheduling based on vendor specific topology information?

No, it does not. The way this is designed, COs shall remain agnostic to the specific topology of any given SP. As far as the CO is concerned, it is just matching labels on nodes. Let me clarify by walking through the different use cases. There are three scenarios a CO would be interested in: 1) understanding which nodes a particular volume is accessible from, so that any workload referencing that volume is scheduled to the correct node (assumes volume provisioning before workload scheduling), 2) influencing where a volume is provisioned based on where the workload referencing that volume is scheduled (assumes volume provisioning after workload scheduling), and 3) allowing a cluster administrator to control where volumes are provisioned.
The most critical part is that NodeGetId also returns the opaque key/values that correspond to a node. So for 1 above, assuming a volume is already provisioned, the CO would compare the volume's opaque key/value pairs to the opaque key/values corresponding to the node to verify the volume is accessible from that node. For 2, assuming the workload is scheduled first, the CO would pull the opaque key/values corresponding to the node that the workload was scheduled to and pass them in to the CreateVolume call to ensure the volume is provisioned in the same location. And 3 can be accomplished by exposing a CO API to allow cluster admins to specify the list of possible key/values that a volume is allowed to be provisioned to (again, this would be opaque to the CO -- pass-through).
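
A compact sketch of case 2, in the notation used elsewhere in this thread (values hypothetical):

  NodeGetIdResponse for the selected node: {"zone": {"us-west1-b"}}
  CreateVolumeRequest accessibility:       {"zone": {"us-west1-b"}}

The CO copies the SP-generated segments verbatim from one call to the other; it never interprets them.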

We should not have (and do not need) pre-defined keywords (e.g. region, zone, etc.). The goal of this design is to ensure both the CO and the CSI spec remain agnostic to SP topology while enabling arbitrary topologies. I'm happy to jump on a call to walk through this if that would help.

@jieyu (Member) commented Feb 15, 2018

@saad-ali

Assume that for a CO, each node already has some notion of "topology" information. For instance, in Mesos, a system administrator can configure the domain of a node. And in Kubernetes, there are some well known labels related to topology.

Let's assume that there are two workloads (e.g., pods) that the user wants to schedule, and the user wants to make sure that these two workloads (pods) are in the same zone according to the CO's node topology information. Each workload needs a persistent volume that needs to be dynamically provisioned.

Assuming workloads are placed first (case (2) above), workload 1 is scheduled in zone A and workload 2 is scheduled in zone B according to CO's topology information. Now, the CO needs to dynamically provision a volume for workload 1. Obviously, the volume has to be accessible from CO's zone A. IIUC, the logic you propose for the CO is to scan all nodes in CO's zone A, find the corresponding CSI plugin specific topology information that you obtained from the GetNodeId call, and feed that into the CreateVolumeRequest.

However, it is possible that the CSI plugin specific topology information you picked and fed into CreateVolumeRequest is not aligned with the CO's topology information. In other words, it's possible that the volume the CO provisioned is not accessible in the CO's zone A, because the accessibility is determined by the SP's topology information, not the CO's. Unless you make sure all nodes in the CO's zone A have the same SP-specific topology information.

@saad-ali (Member Author):

And in Kubernetes, there are some well known labels related to topology.

We plan to replace that with something more generic so Kubernetes also remains agnostic (no hard coded labels).

Let's assume that there are two workloads (e.g., pods) that the user wants to schedule, and the user wants to make sure that these two workloads (pods) are in the same zone

Ok

workload 1 is scheduled in zone A and workload 2 is scheduled in zone B

If the user constraint limits both to the same zone, why are they scheduled to different zones?

IIUC, the logic you propose for the CO is to scan all nodes in CO's zone A, find the corresponding CSI plugin specific topology information that you obtained from the GetNodeId call, and feed that into the CreateVolumeRequest.

No. On Kubernetes the first workload will get scheduled to some zone, let's say zone A. The second workload will have a constraint (PodAffinity) that will tell the scheduler to schedule workload in the same zone as the other pod, so it will also get scheduled to zone A. For the first pod, once it is scheduled to a node, Kubernetes will fetch the CSI topology labels corresponding to that node (previously retrieved and saved by GetNodeId) and pass those into the CreateVolumeRequest (note there is no need to "scan all nodes in CO's zone") to provision a volume for that workload. Similarly once the second pod is scheduled, Kubernetes will fetch the CSI topology labels corresponding to that node and pass them to CreateVolumeRequest to provision another volume for the second workload.

However, it is possible that the CSI plugin specific topology information you picked and fed into CreateVolumeRequest is not aligned with the CO's topology information.

I think this may be the misunderstanding. The CO never feeds its own topology labels to CreateVolumeRequest. It discovers the SP's topology labels corresponding to each node on initialization. Then, when a workload is scheduled to a node, it takes those (SP-generated) labels and passes them in to CreateVolumeRequest.

@govint left a comment:

Could add more details around how topology is enumerated by the CO and SP; even node-level topology is interesting in a cluster (let alone regions/zones), and the spec should handle that.

csi.proto Outdated
// accessible from.
// A topological segment is a sub-division of a cluster, like region,
// zone, rack, etc.
// An SP SHALL advertise the requirements for topology in

govint:

How does the CO (or for that matter an SP) enumerate a region/zone or rack in the topology? Does the SP need to query and pull this data from the CO? How are changes in the topology or access to a region or rack notified either way? For example, the CO (or SP) detects a change in topology or loses access to a rack/zone etc.

govint:

Is it possible for the user to configure sites/racks/zones and regions in Kubernetes that an SP can query, so that both the CO and SP have the same view of what topology elements are available? What would the definition include, and how does the SP figure out that it can access and provision volumes at a zone/region/rack etc.?

Member Author:

How does the CO (or for that matter an SP) enumerate a region/zone or rack in the topology? Does the SP need to query and pull this data from the CO?

If the volumes that an SP makes available have availability constraints, the SP should have some mechanism to enumerate those topological segments. The SP communicates those to the CO via the NodeGetId call. So the CO "enumerates" the SP's topology via the NodeGetId call (i.e. retrieves the node/zone/rack/etc., as defined by the SP, that each node belongs to).

How are changes in the topology or access to a region or rack notified either way? For example, the CO (or SP) detects a change in topology or loses access to a rack/zone etc.

The topology for a node is retrieved via a NodeGetId call when the plugin is registered on the node. The topology for a volume is returned in the CreateVolume response. Once a volume or node is provisioned, its topology is not expected to change. In the odd chance that it does, reinitializing those elements will allow picking up the new topology info.

Is it possible for the user to configure sites/racks/zones and regions in Kubernetes that an SP can query, so that both the CO and SP have the same view of what topology elements are available?

In the proposed design, when a new plugin is registered, Kubernetes will use NodeGetId to retrieve the SP's topology information for each node and apply it as labels to the Kubernetes node object. These labels can then be used with all the familiar Kubernetes mechanisms (PodAffinity, etc.).

Making SP aware of any CO specific topology information is not part of this design, because it is unnecessary.

csi.proto Outdated
// {"region":{"us-west1"}, "zone":{"us-west1-b"}}
// Indicates the node exists within the "region" "us-west1" and the
// "zone" "us-west1-b".
map<string, string> accessible_from = 2;

govint:

It's possible the SP can only make the volume available on specific nodes in a cluster.

How would an SP figure out what nodes are in the cluster, and then what adapters are available on those, to determine if the SP can expose volumes on those nodes?

Topology should include host-adapter-level access for an SP to be able to make a volume available to a host where the container is going to run. Nodes can be provisioned/de-provisioned in the cluster.

It's possible that an SP may have multiple paths to a volume, but they may be active-active or active-passive. A volume may be created and the SP may return the active path (region/zone/site/rack) for the volume (in the create volume response). Later, if the passive path becomes active, how does the SP notify the change?

Member Author:

It's possible the SP can only make the volume available on specific nodes in a cluster.

Yes, the SP defines its own topology; it can segment nodes however it chooses.

How would an SP figure out what nodes are in the cluster, and then what adapters are available on those, to determine if the SP can expose volumes on those nodes?

SPs should have an internal mechanism to discover and identify nodes. COs use the NodeGetId call to map SP node information to CO node information. The SP defines topology, so it is up to the SP to specify which topological segment(s) a node falls into. The CO doesn't tell the SP what the topology is. Volume availability is an inherent trait of an SP, and therefore the SP shares its availability topology information with the CO so that the CO can make appropriate scheduling decisions.

Topology should include host-adapter-level access for an SP to be able to make a volume available to a host where the container is going to run. Nodes can be provisioned/de-provisioned in the cluster.

This doesn't make sense; could you clarify what you mean?

It's possible that an SP may have multiple paths to a volume, but they may be active-active or active-passive. A volume may be created and the SP may return the active path (region/zone/site/rack) for the volume (in the create volume response). Later, if the passive path becomes active, how does the SP notify the change?

If it is manifested as limited availability within the CO's cluster, then the design here should suffice. If the volume is equally accessible from everywhere in the CO's cluster, then that will be addressed by a follow-up proposal which will address spreading of volumes across SP-specific failure domains.

csi.proto Outdated
// corresponding key in accepted topologies.
// This field is OPTIONAL. If it is not specified, the SP MAY choose
// from any of the set of possible values specified in
// accepted topologies.

govint:

Could you clarify whether, while the CO specifies preferred value(s), the SP should be able to make the volume accessible from values in both the preferred and the accepted list? Or is that not a requirement, in which case, once the volume is created, the CO can't expect the volume to be accessible outside of the topology segments in the preferred list?

Member Author:

Any values in the preferred map must exist in the accepted list. The accepted list defines the list of possibilities. The optional preferred list simply specifies which of those are preferred by the CO. The SP must not provision outside of what is dictated in the accepted list.
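
Illustrative values for that rule (zones hypothetical), in the example notation used in the spec:

  accepted:  {"zone": {"Z1", "Z2", "Z3"}}
  preferred: {"zone": {"Z2"}}

Valid SP behavior: try Z2 first, then fall back to Z1 or Z3. Invalid: provisioning in any zone not listed under accepted.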

@jieyu (Member) commented Feb 15, 2018

@saad-ali

No. On Kubernetes the first workload will get scheduled to some zone, let's say zone A. The second workload will have a constraint (PodAffinity) that will tell the scheduler to schedule workload in the same zone as the other pod, so it will also get scheduled to zone A. For the first pod, once it is scheduled to a node, Kubernetes will fetch the CSI topology labels corresponding to that node (previously retrieved and saved by GetNodeId) and pass those into the CreateVolumeRequest (note there is no need to "scan all nodes in CO's zone") to provision a volume for that workload. Similarly once the second pod is scheduled, Kubernetes will fetch the CSI topology labels corresponding to that node and pass them to CreateVolumeRequest to provision another volume for the second workload.

OK, let's assume both workloads are scheduled in zone A. Kubernetes fetches the CSI topology labels corresponding to that node (let's assume they're "rack":{"foo"}) and passes those into the CreateVolumeRequest. As a result, the volume provisioned by the SP can be used in the SP's domain "rack":{"foo"}. However, this does not mean that the volume is accessible in the CO's zone A. There might be a node in the SP's domain "rack":{"foo"} that is not in the CO's zone A.

The more I think about this, the more I think that maybe the CO needs to inform the SP about its own topology information, rather than the other way around.

@saad-ali (Member Author):

OK, let's assume both workloads are scheduled in zone A. Kubernetes fetches the CSI topology labels corresponding to that node (let's assume they're "rack":{"foo"}) and passes those into the

NodeGetId must return all the labels that apply to that node, so it would be something like: {"zone": {"A"}, "rack": {"foo"}}. And the full set would be passed on to CreateVolumeRequest, indicating that the provisioned volume must be accessible from zone A and rack foo.

As a result, the volume provisioned by the SP can be used in the SP's domain "rack":{"foo"}. However, this does not mean that the volume is accessible in the CO's zone A. There might be a node in the SP's domain "rack":{"foo"} that is not in the CO's zone A.

The key is that the CO's topology (or domain) information is never communicated to the SP. The CO only retrieves the SP's topology (or domain) information from the SP and passes it back to the SP (it remains agnostic to specific labels).

The more I think about this, the more I think that maybe the CO needs to inform the SP about its own topology information, rather than the other way around.

The CO's topology information is irrelevant to the storage system. The CO uses its own topology information to make a decision about where to schedule a workload, but once that decision is made it just needs to communicate to the SP that the provisioned volume must be accessible from the selected destination.

@jieyu (Member) commented Feb 15, 2018

NodeGetId must return all the labels that apply to that node, so it would be something like: {"zone": {"A"}, "rack": {"foo"}}. And the full set would be passed on to CreateVolumeRequest, indicating that the provisioned volume must be accessible from zone A and rack foo.

The zone: A here is the SP's concept; it has nothing to do with the CO's zone: A.

@saad-ali (Member Author) commented Feb 16, 2018

@jieyu and I discussed this offline. Jie's argument is that COs can have their own topology independent of the SP. If the SP is only aware of its own topology, and volume provisioning only specifies the SP's topology, then the provisioned volume may not be accessible from all nodes that are part of the CO's topology.

My counter-argument is that the SP must not be responsible for understanding the topology of the CO -- that would be a leaky abstraction. The SP is not responsible for scheduling workloads; it is responsible for provisioning volumes. While provisioning volumes it should not be required to understand CO-specific topology information. It is the responsibility of the CO's scheduler to understand its own topology (along with any constraints due to volumes), and schedule accordingly.

Jie says this would make it difficult for Mesos, but that he will think it over.

@akutz (Contributor) commented Feb 16, 2018

The rest of us are watching this PR like :)


@saad-ali (Member Author):

@julian-hj suggested adding a PluginCapability to advertise if a plugin has availability domains. This would allow a CO that doesn't have the logic to handle topology to reject such plugins until the CO can be updated. I'll go ahead and add that.
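
That capability landed as a new service type on PluginCapability; roughly the shape in the merged revision (sketch):

message PluginCapability {
  message Service {
    enum Type {
      UNKNOWN = 0;
      CONTROLLER_SERVICE = 1;
      // Set when the SP constrains volume accessibility by topology.
      // A CO without topology support MAY reject such a plugin.
      ACCESSIBILITY_CONSTRAINTS = 2;
    }
    Type type = 1;
  }
  oneof type {
    Service service = 1;
  }
}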

@saad-ali (Member Author):

@chhsia0 The difference in proposals boils down to this:

message Topology {
    map<string, string> segments = 1;
}

vs.

message Topology {
    string segments = 1;
}

@jdef preferred the second, I prefer the first, since losing key information will make for a painful user experience in Kubernetes where labels are key/value pairs. I spoke with @jdef yesterday, and it sounds like he is ok with going with the map approach.

@chhsia0 (Contributor) commented May 25, 2018

Yeah, I know that the difference is map vs. strings in the spec. I was trying to understand the conceptual difference, since the expressiveness (in terms of describing the relation between volumes and accessible nodes) of both formats seems equivalent to me. I came up with the above interpretation because Mesos is leaning towards a world where the topology value strings are completely opaque to the user, so the user won't know anything about the keys/values exposed by plugins.

spec.md Outdated
// alphanumeric character with '-', '_', '.', or alphanumerics in
// between.
message Topology {
map<string, string> segments = 1;

Member:

2 space indentation please.

Member Author:

Fixed.

spec.md Outdated
// the volume to be accessible from multiple locations.
// This field is OPTIONAL. This field MAY only be set if the plugin
// advertises the ACCESSIBILITY_CONSTRAINTS capability.
repeated Topology accessible_topology = 4;

Member:

This does not seem necessary given that Volume already has this information.

Member:

I don't understand: Volume isn't part of this RPC request

Member Author:

Volume is not part of ValidateVolumeCapabilitiesRequest (maybe it should be, but that is beyond the scope here), and it would be nice for a CO to be able to validate its assumptions about accessibility of the volume.

Member:

I am thinking about the workflow of this RPC. This RPC is used to validate if a volume has certain capabilities that the CO assumes. For instance, for the CO to deal with pre-existing volumes:

  1. ListVolumes is called to discover pre-existing volumes. The CO gets a list of Volume objects, which contains the topology information.
  2. The CO wants to validate whether each of those volumes has certain capabilities, thus calling ValidateVolumeCapabilities.

In that workflow, clearly the topology information is not needed, because Volume objects are returned from ListVolumes.

Is there an alternative workflow that requires topology information in this call?

I was debating why topology information of a volume is not part of VolumeCapability, but first-class in the Volume object (similar to capacity_bytes). What's the fundamental difference between those two and those in VolumeCapability?

Member Author:

In Kubernetes we allow the user to directly specify a pre-provisioned volume and all its attributes (including topology information). So basically, step 1 can be skipped. In that case, it would be nice to validate assumptions about accessibility too.

Member:

@saad-ali hum, OK. I just want to have a consistent story for the API. If that's the case, any reason we don't put capacity_bytes there? Or, can we put topology information in VolumeCapability?

Member Author:

+1 on the capacity_bytes. Honestly, we should just pass the full Volume object in addition to the VolumeCapabilities, but that is out of scope for this PR so I will not do that here.

can we put topology information in VolumeCapability?

I'm ok with that. @jdef?

Member:

@saad-ali, i had a chat with @jdef and @chhsia0 on this. The reason I was questioning this choice initially was due to consistency (I believe that if something is not consistent, usually something is missing or incorrect).

I think the reason we originally introduced ValidateVolumeCapabilities, instead of putting VolumeCapability in the Volume object, is that it's very hard to enumerate all possible combinations of supported volume capabilities. Instead, we let the CO drive the volume capability test using this call.

Ideally, the arguments in this call should be consistent with those in CreateVolumeRequest (i.e., CapacityRange and TopologyRequirement), because that's what COs should care about when dealing with pre-existing volumes the same way as if they're dealing with dynamically provisioned volumes. To map that to the k8s world, that means this call probably needs to be called when binding a PVC to a PV, rather than validating if a PV is valid or not. But I guess this is a much bigger change?

I think moving topology into volume capability is probably too big of a change (which I haven't thought through). At this stage, I think I am fine keeping it as it is right now, with the plan to add capacity_bytes to ValidateVolumeCapabilities. Those fields need to be optional; if not specified, the SP won't check those against the volume.

Member Author:

Ideally, the arguments in this call should be consistent with those in CreateVolumeRequest (i.e., CapacityRange and TopologyRequirement), because that's what COs should care about when dealing with pre-existing volumes the same way as if they're dealing with dynamically provisioned volumes. To map that to the k8s world, that means this call probably needs to be called when binding a PVC to a PV, rather than validating if a PV is valid or not. But I guess this is a much bigger change?

Agreed. We should consider it pre-1.0 though. Opened #242 to track that.

I think moving topology into volume capability is probably too big of a change (which I haven't thought through). At this stage, I think I am fine keeping it as it is right now, with the plan to add capacity_bytes to ValidateVolumeCapabilities. Those fields need to be optional; if not specified, the SP won't check those against the volume.

SGTM, will leave as is for now.

spec.md Outdated
// volume is accessible from.
// A plugin that returns this field must also set the
// ACCESSIBILITY_CONSTRAINTS plugin capability.
// An SP may specify multiple topologies to indicates the volume is

Member:

nit: s/indicates/indicate/

Member:

s/may/MAY/

Member Author:

Fixed.

spec.md Outdated

// Specifies where (regions, zones, racks, etc.) the provisioned
// volume is accessible from.
// A plugin that returns this field must also set the

Member:

s/must/MUST/

Member Author:

Fixed

spec.md Outdated
// ACCESSIBILITY_CONSTRAINTS plugin capability.
// An SP may specify multiple topologies to indicates the volume is
// accessible from multiple locations.
// COs may use this information along with the topology information

Member:

s/may/MAY/

Member Author:

Fixed

spec.md Outdated
// This field is OPTIONAL. If TopologyRequirement is specified either
// permitted or preferred or both MUST be specified.
//
// If permitted is specified, the provisioned volume must be

Member:

s/must/MUST/

Member Author:

Done

spec.md Outdated
// An SP MUST attempt to make the provisioned volume available using
// the preferred topologies in order from first to last.
//
// If permitted is specified, all topologies in preferred volume must

Member:

s/must/MUST/

Member Author:

Fixed

spec.md Outdated
// topological accessibility information supported by the SP. COs
// SHALL permit passing through topological accessibility information.
// This field is OPTIONAL.
// This field MAY be specified only if the SP has the

Member:

This field SHALL NOT be specified unless the SP has the ACCESSIBILITY_CONSTRAINTS plugin capability.

Member Author:

Done.


spec.md Outdated
// specified `accessibility_requirements`. These are the same
// `accessibility_requirements` the CO will use in a
// `CreateVolumeRequest`.
// This field is OPTIONAL.

Member:

and ... "This field SHALL NOT be set unless the plugin advertises the ACCESSIBILITY_CONSTRAINTS capability."

Member Author:

Done.

spec.md Outdated

// Specifies where (regions, zones, racks, etc.) the node is
// accessible from.
// A plugin that returns this field must also set the

Member:

s/must/MUST/

Member Author:

Done

spec.md Outdated
// accessible from.
// A plugin that returns this field must also set the
// ACCESSIBILITY_CONSTRAINTS plugin capability.
// COs may use this information along with the topology information

Member:

s/may/MAY/

Member Author:

Done.

@saad-ali (Member Author):

Feedback addressed. PTAL

// Each string MUST be 63 characters or less and begin and end with an
// alphanumeric character with '-', '_', '.', or alphanumerics in
// between.
message Topology {

Member:

in that case, please create a follow-up Issue in GH and link it to this comment

spec.md Outdated
@@ -1176,10 +1176,10 @@ message ValidateVolumeCapabilitiesRequest {

// Specifies where (regions, zones, racks, etc.) the caller believes
// the volume is accessible from.
// An caller may specify multiple topologies to indicates they believe
// An caller MAY specify multiple topologies to indicate they believe

Member:

nit: s/An/A/

Member Author:

Fixed.

@saad-ali (Member Author):

Feedback addressed. PTAL!

@jieyu (Member) left a comment:

LGTM. I'd wait for @jdef's approval before merging.

@saad-ali (Member Author):

Sounds good. Will squash changes after @jdef's review.

spec.md Outdated
// volume is accessible from.
// A plugin that returns this field MUST also set the
// ACCESSIBILITY_CONSTRAINTS plugin capability.
// An SP may specify multiple topologies to indicate the volume is

Member:

s/may/MAY/

spec.md Outdated
// An SP MUST attempt to make the provisioned volume available using
// the preferred topologies in order from first to last.
//
// If permitted is specified, all topologies in preferred volume MUST

Member:

"preferred volume" -- I don't think "volume" makes sense here?

Member Author:

Good catch. Changed to list.

spec.md Outdated
// "Z3" or "Z5" in the "region" "R1".
//
// Example 3:
// Given a volume should be accessible from TWO zones (because opaque

Member:

"an opaque" ?

Member Author:

Done.

// A topological segment is a specific instance of a topological domain,
// like "zone3", "rack3", etc.
// For example {"com.company/zone": "Z1", "com.company/rack": "R3"}
// Valid keys have two segments: an optional prefix and name, separated
Member:

Thanks, much better

// in between, for example "zone".
// The key prefix MUST follow reverse domain name notation format
// (https://en.wikipedia.org/wiki/Reverse_domain_name_notation).
// The key prefix SHOULD include the plugin's host company name and/or
Member:

What do you think about consistent prefixes for keys? i.e. I'd like all topo keys reported by the plugin (across all RPCs) to share the same prefix, or no prefix at all. It is an error for a plugin to return some keys w/ a prefix, and then some without. It is also an error for different keys to specify a different prefix.

Member Author:

Yes, I like that; will add wording.
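
To make the key rules concrete, a minimal Go sketch of the segment-format and consistent-prefix checks discussed here (`validateKeys` is a hypothetical helper, not part of the spec; it does not attempt to validate the reverse-domain form of the prefix itself):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nameRE encodes the per-segment rule: 63 characters or less, beginning
// and ending with an alphanumeric character, with '-', '_', '.', or
// alphanumerics in between.
var nameRE = regexp.MustCompile(`^[a-zA-Z0-9]([a-zA-Z0-9._-]{0,61}[a-zA-Z0-9])?$`)

// validateKeys checks each key's optional prefix/name form and enforces
// the "consistent prefix" rule proposed in this review: all keys share
// one prefix, or none of them has a prefix.
func validateKeys(segments map[string]string) error {
	var prefix string
	first := true
	for key := range segments {
		p, name := "", key
		if i := strings.LastIndex(key, "/"); i >= 0 {
			p, name = key[:i], key[i+1:]
		}
		if !nameRE.MatchString(name) {
			return fmt.Errorf("invalid key name %q", name)
		}
		if first {
			prefix, first = p, false
		} else if p != prefix {
			return fmt.Errorf("inconsistent key prefix %q (expected %q)", p, prefix)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateKeys(map[string]string{
		"com.company/zone": "Z1",
		"com.company/rack": "R3",
	})) // <nil>
}
```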

spec.md Outdated
// `CreateVolumeRequest`.
// This field is OPTIONAL. This field SHALL NOT be set unless the
// plugin advertises the ACCESSIBILITY_CONSTRAINTS capability.
TopologyRequirement accessibility_requirements = 7;
Member:

What's the advantage of TopologyRequirement here vs. Topology? In other words, what benefit does an SP derive from seeing the CO's "preferred" topologies; would the reported capacity change based on the presence of the CO's preference?
Thinking about this RPC and ValidateVolumeCapabilitiesRequest and wondering why the topo accessibility fields are of different types.

Member Author:

Good point, will change to Topology.
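
If the field does become a plain `Topology`, a capacity query is scoped to exactly one segment rather than a requirement set. A sketch, assuming later generated Go bindings (the `csi.ControllerClient` interface and the `AccessibleTopology` field name are assumptions here):

```go
package sketch

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// zoneCapacity asks the plugin how much capacity remains in a single
// topological segment. With a plain Topology (rather than a
// TopologyRequirement) there is exactly one segment to report on.
func zoneCapacity(ctx context.Context, ctrl csi.ControllerClient, zone string) (int64, error) {
	resp, err := ctrl.GetCapacity(ctx, &csi.GetCapacityRequest{
		AccessibleTopology: &csi.Topology{
			Segments: map[string]string{"com.company/zone": zone},
		},
	})
	if err != nil {
		return 0, err
	}
	return resp.GetAvailableCapacity(), nil
}
```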

@saad-ali (Member Author):

s/permitted/requisite/ based on feedback from @cpuguy83

@cpuguy83 (Contributor) left a comment:

Can we add some more details on error handling (e.g. codes to use) when topology requirements can't be met?

Thinking failed precondition may be a good fit for this case.

@saad-ali (Member Author):

> Can we add some more details on error handling (e.g. codes to use) when topology requirements can't be met?
>
> Thinking failed precondition may be a good fit for this case.

Done. I went with RESOURCE_EXHAUSTED; it seemed more appropriate.
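
For reference, a plugin written in Go could surface this with the standard gRPC status package (`errTopologyUnsatisfiable` is a hypothetical helper, shown only to illustrate the chosen code):

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errTopologyUnsatisfiable rejects a CreateVolume call whose topology
// requirement cannot currently be satisfied, using the code agreed on
// above.
func errTopologyUnsatisfiable(zone string) error {
	return status.Errorf(codes.ResourceExhausted,
		"cannot provision volume in zone %q: capacity exhausted", zone)
}

func main() {
	err := errTopologyUnsatisfiable("Z2")
	fmt.Println(status.Code(err)) // ResourceExhausted
}
```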

@cpuguy83 (Contributor) left a comment:

LGTM

@jdef (Member) left a comment:

I had one other question. Otherwise LGTM

spec.md Outdated
// volume MUST be accessible from.
// An SP SHALL advertise the requirements for topological
// accessibility information in documentation. COs SHALL only specify
// topological accessibility information supported by the SP. COs
Member:

"COs SHALL permit passing through topological accessibility information."

If I'm writing a CO, what does this really mean? Do we actually need this sentence?

Member Author:

This is trying to say that a cluster admin should be able to specify topology constraints. I'm OK with dropping that requirement.

Member Author:

Or rewording it.

Member:

It feels strange. I'd rather we drop it. COs may phase in support for topology at different times.

Member Author:

SGTM, will drop.

@saad-ali (Member Author):

Feedback addressed. Commits squashed.

@jdef (Member) left a comment:

typo, otherwise LGTM

spec.md Outdated
@@ -793,7 +1002,8 @@ The CO MUST implement the specified error recovery behavior when it encounters t
| Condition | gRPC Code | Description | Recovery Behavior |
|-----------|-----------|-------------|-------------------|
| Volume already exists but is incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified volume `name` already exists but is incompatible with the specified `capacity_range`, `volume_capabilities` or `parameters`. | Caller MUST fix the arguments or use a different `name` before retrying. |
| Operation pending for volume | 10 ABORTED | Indicates that there is a already an operation pending for the specified volume. In general the Cluster Orchestrator (CO) is responsible for ensuring that there is no more than one call "in-flight" per volume at a given time. However, in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The Plugin, SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls. | Caller SHOULD ensure that there are no other calls pending for the specified volume, and then retry with exponential back off. |
| Unable to provision in `accessible_topology` | 8 RESOURCE_EXHAUSTED | Indicates that although the `accessible_topology` field is valid, a new volume can not be provisioned with the specified topology constraints. More human-readable information MAY be provided in the gRPC `status.message` field. | Caller MUST ensure that whatever is preventing volumes from being provisioned in the specified location (e.g. quota issues) is addressed before retrying with exponential backoff. |
| Operation pending for volume | 10 ABORTED | Indicates that there is a already an operation pending for the specified volume. In general the Cluster Orchestrator (CO) is responsible for ensuring that there is no more than 66one call "in-flight" per volume at a given time. However, in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The Plugin, SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls. | Caller SHOULD ensure that there are no other calls pending for the specified volume, and then retry with exponential back off. |
Member:

s/66one/one/

Member Author:

Good catch :)
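
For the recovery behavior in the new RESOURCE_EXHAUSTED row, a caller-side sketch of retrying with exponential backoff (assuming generated `csi` Go bindings; `createWithBackoff` is hypothetical and retries only this one error code):

```go
package sketch

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// createWithBackoff retries CreateVolume with exponential backoff when
// the plugin reports that the requested topology cannot currently be
// satisfied, per the recovery behavior in the table above. It retries
// until the context is cancelled.
func createWithBackoff(ctx context.Context, ctrl csi.ControllerClient, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	delay := time.Second
	for {
		resp, err := ctrl.CreateVolume(ctx, req)
		if status.Code(err) != codes.ResourceExhausted {
			return resp, err // success, or an error we do not retry here
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(delay):
			delay *= 2
		}
	}
}
```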

@julian-hj (Contributor) left a comment:

LGTM

@saad-ali (Member Author):

Thanks everyone! Rebased.
