
Topology aware dynamic provisioning design #1857

Closed

Conversation

msau42 (Member, Author) commented Feb 27, 2018:

This design is for the 1.11 work to support topology-aware dynamic provisioning.

Update volume topology design doc with:

  • Defining terms used to describe topology
  • New APIs for dynamic provisioning
  • In-tree and external provisioning interface changes
  • Scheduler implementation details
  • Examples for local, zonal, and multi-zonal volumes

@kubernetes/sig-storage-pr-reviews

/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Feb 27, 2018
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 27, 2018
msau42 (Member, Author) commented Feb 27, 2018:

/assign @jsafrane @saad-ali @thockin @lichuqiang

@k8s-github-robot k8s-github-robot added the kind/design Categorizes issue or PR as related to design. label Feb 27, 2018
msau42 (Member, Author) commented Feb 27, 2018:

/assign @bsalamat
@kubernetes/sig-scheduling-pr-reviews

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Feb 27, 2018
Review thread on:

```
type StorageClassCapacity struct {
    TopologyKey string
```

lichuqiang commented Feb 27, 2018:

I'm still concerned about the topology key in StorageClassCapacity. If we place it in the status field and leave it to provisioners to specify, we run into an edge case:

When the scheduler finds a StorageClass with an empty StorageClassCapacity, it is hard to tell whether the storage truly has no capacity limit, or the capacity simply has not been initialized by the provisioners yet.

If we move the topology key out of the status field and leave it to the admin, it would be filled in at StorageClass creation time, and the edge case above could be avoided.

msau42 (Member, Author):

That opens up a new failure mode, where the administrator specifies the wrong topologyKey and the provisioner's values won't match it.

If a pod gets scheduled before provisioners have reported their capacity, then:

  • Provisioning could fail. In that case we should retry scheduling.
  • Provisioning could succeed. Then we will not have marked the PVC with provisioned capacity information, so our calculations for used capacity will be incorrect. Is there a way we can handle this? I.e., have a StorageClass modify handler so that when StorageClassCapacity is initialized, we initialize the capacity cache?

lichuqiang commented Feb 27, 2018:

I would prefer the scheduler be able to recognize the StorageClass from the very beginning, rather than handle things by attempting provisioning. That makes things complex and may affect performance.

If you are worried that the admin may specify a wrong key, then a simple flag indicating whether the StorageClass has a capacity limit might also work. I think the admin should at least know whether the StorageClass they create has a capacity limit.

msau42 (Member, Author):

How will the scheduler know that all the provisioners have initialized and reported capacity? (in the case of local volume provisioners)

msau42 (Member, Author):

Never mind; in that case, once the topologyKey is specified, capacity is zero until otherwise updated.

lichuqiang commented Feb 27, 2018:

> How will the scheduler know that all the provisioners have initialized and reported capacity? (in the case of local volume provisioners)

I think it doesn't have to know. What the scheduler cares about is:

  1. Whether the StorageClass has a capacity limit
  2. If so, whether its capacity can satisfy the requirement.

Whether its capacity is smaller than the requirement or it simply hasn't been initialized yet, either way it is unavailable from the scheduler's point of view.

msau42 (Member, Author):

I discussed with @saad-ali on this. Our consensus is that it's better for topologyKey to be reported by the provisioner. Depending on the user to specify the topologyKey, or even a flag, opens up the possibility that they could set it incorrectly. And they don't have any choice in what to set as the value: for local storage, it must always be set to "kubernetes.io/hostname"; no other value will work.

In the worst case, there will be a period of time during cluster initialization where the provisioners may not have reported capacity yet. The consequences are not fatal. Provisioning could fail until the provisioners are up and running (which is expected behavior in general). Provisioning could succeed, and in that case we'll have to resync the capacity cache later. We could try to mitigate this scenario by recommending that provisioners do not process any provisioning requests until they have fully initialized (including reporting capacity).
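As a concrete illustration of this consensus, a provisioner-reported status for a local-storage class might look like the sketch below. This is only a sketch assuming the field names proposed in this doc; the class and provisioner names are hypothetical:

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-fast                            # hypothetical
provisioner: example.com/local-provisioner    # hypothetical external provisioner
status:
  capacity:
    topologyKey: kubernetes.io/hostname   # reported by the provisioner, not the admin
    capacities: {}                        # empty until provisioners report; missing
                                          # entries are treated as zero capacity
```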

msau42 (Member, Author) commented Feb 27, 2018:

TODO: Add section about predicate equivalence cache (ecache) invalidation and active pod queues

msau42 (Member, Author) commented Feb 27, 2018:

/assign @verult

Review thread on:

> Some volume types have capacity limitations per topology domain. For example,
> local volumes have limited capacity per node. Provisioners for these volume
> types can report these capacity limitations through a new StorageClass status
> field.

Contributor:

I've been warned before against putting constantly changing metrics (like resource usage) in an API status field, and one argument is that it could easily overwhelm the API server. Maybe the capacity information should be plumbed to the scheduler through existing resource usage monitoring pipelines.

Might be good to get a review from sig-instrumentation?

msau42 (Member, Author):

The capacity reported by the provisioner should be total capacity, not available capacity. I don't anticipate total capacity changing very frequently.

The challenge with existing resource usage tracking is that it is done on a per node level, whereas volume resource limits may not necessarily be at the node boundary.

Contributor:

There are plans to eventually plumb PVC usage stats through Metrics Server, as part of the work to get these stats exposed to kubectl. I believe the scheduler also talks to the Metrics Server.

Currently there's a limitation that PVC usage stats can only be measured when the underlying volume is mounted in a pod, but maybe we can expand this feature and add the capability of having volume plugins report these stats as well.

@jingxu97

msau42 (Member, Author):

I believe PVC usage stats are still tied to a Pod that is scheduled to a specific node; that way you can hit all the kubelet metrics endpoints. For this design to get capacity from metrics, we would need to collect the metrics from a central entity rather than per node, in order to handle volumes with topologies spanning more than a single node.

I am unaware of the scheduler talking to the metrics server. My understanding is it uses resource capacity reported in the Node object. @bsalamat could you clarify?

But besides the initial StorageClass creation, I don't anticipate reported capacity being updated very frequently.

msau42 (Member, Author):

Here, we also have the challenge that external provisioners need to report their capacity somewhere, whether on an API object (like proposed) or to some central server.

Comment:

Yeah, since I'm also focusing on the design of local storage provisioning, I wonder what talking to the metrics server would look like.

I guess users are more likely to look forward to things they are familiar with.

msau42 (Member, Author):

I discussed a bit with @verult, and we agreed that going the metrics server approach would require a lot of extra plumbing that doesn't exist today. So we're still leaning towards the current approach of reporting total (not used) capacity in the StorageClass and calculating and caching used from provisioned PVCs.

Review thread on:

> kube-scheduler, kube-controller-manager, and all kubelets.
>
> StorageClass.AllowedTopology and StorageClassCapacity fields will be controlled
> by the DynamicProvisioningScheduling feature gate, and must be configured in the

Comment:

It seems the two are not completely independent; maybe we should control both binding and provisioning with the same feature gate?

If we enable VolumeScheduling and disable DynamicProvisioningScheduling, what's the expected behavior in the PV controller? For PVCs with mode WaitForFirstConsumer, skip binding but execute provisioning immediately?

msau42 (Member, Author):

The challenge is that static local PVs are beta and dynamic provisioning will be alpha, so we can't have one feature gate control both. I see VolumeScheduling as a superset, so if it is disabled, then delayed dynamic provisioning is also disabled. Once delayed dynamic provisioning goes to beta, we can combine the two feature gates.

Review thread on:

```
type VolumeProvisioningTopology struct {
    Required map[string]VolumeTopologyValues
```

msau42 (Member, Author):

@saad-ali found a use case that this cannot represent:
An administrator wants to restrict provisioning to either "zone A, rack 1" or "zone B, rack 1".

To fix this, we could go back to using a restricted node selector (supporting only the IN operator), or follow the CSI proposed representation, which sort of does the reverse of the node selector: OR, then AND.

A restricted NodeSelector would be more familiar to Kubernetes users, so I'm leaning towards that approach. As long as we restrict the operators to just IN, then it can be translated internally into the CSI representation.

msau42 (Member, Author):

I'm flip-flopping again. A node selector specifies a subset of a set of labels on the node object. But in this case, we are not selecting from a set of labels on an existing object; we want to limit the topology of a PV object that will be created in the future. So I think following the CSI representation as a list of maps makes more sense, rather than specifying a selector on nothing.

Comment:

+1

verult (Contributor) commented Mar 27, 2018:

+1. Another semantic difference is: VolumeTopology is used to select a set of nodes at which a PV is available, whereas NodeSelector is used to select a single node.

liggitt (Member):

> NodeSelector is used to select a single node

That's a little surprising... selectors typically select a set of things.

verult (Contributor):

@liggitt you are right, I was using the wrong frame of thinking, so I take that back.

Here's another point, similar to what @msau42 said, that helped me understand the difference: ultimately the scheduler needs a set of nodes to choose from; a NodeSelector directly specifies that set, whereas VolumeTopology indirectly specifies it by providing a superset which is then refined further by the volume plugin. Please correct me if this is the wrong understanding.

Review thread on:

```
1. For static PV binding:
   1. Prebind the PV by updating the `PersistentVolume.ClaimRef` field.
   2. If the prebind fails, revert the cache updates.
2. For in-tree and external dynamic provisioning:
   1. Set `annStorageProvisioner` on the PVC.
   1. Set `annScheduledNode` on the PVC.
```

Comment:

annSelectedNode


Review thread on:

> ## API Upgrades
>
> ### Zonal PV Upgrade

msau42 (Member, Author):

Add an example for multi-zonal PV upgrade because the zone label value format is special.

Review thread on:

> #### Zonal Volume
> In this example, the volume can only be accessed from nodes that have the
> label key `failure-domain.beta.kubernetes.io/zone' and label value

Contributor:

nit: tick instead of apostrophe

msau42 (Member, Author):

Done, thanks.

Review thread on:

```
...
status:
  capacity:
    topologyKey: zone,rack
```

msau42 (Member, Author):

Should be `topologyKeys`.


Review thread on:

> In order for the scheduler to calculate provisioned capacity on PVs that have
> been Released (PVC has been deleted), when binding a PV to a PVC, the PV also
> needs to copy the PVC's `AnnProvisionerTopology` annotation, if it exists.

Comment:

AnnProvisionedTopology

Review thread on:

```
parameters:
  type: pd-standard
allowedTopologies:
  requiredToplogies:
```

Comment:

requiredTopologies

Review thread on:

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  Name: zonal-class
```

Comment:

name

Review thread on:

> * StorageClass will be required to get the new binding behavior, even if dynamic
>   provisioning is not used (in the case of local storage).
> * We have to maintain two different code paths for volume binding.
> * We will be depending on the storage admin to correctly configure the

Member:

This is a pretty significant downside, IMO. What this section doesn't really cover is WHY we need both, other than transition. Or rather, why there isn't a less permanent way to transition.

I know Jan put forth a use-case of very slow-to-bind volumes - if we still hold that as a use-case for this, can we detail it a bit more? API fields are forever, and this is one I think we'd rather not have if we don't NEED it.

E.g. if we could get away with an annotation for some transitional period, that would be better. I'd like to see an explanation of why that doesn't work.

msau42 (Member, Author):

I'm not sure if we can change this behavior to the default, unless we go v2, for the following reasons:

  • Users may have already developed automation/processes assuming the old binding behavior
  • Some big volumes can take hours to provision, so provisioning is a planned event well in advance of workload creation
  • Globally accessible storage doesn't benefit from this change in behavior

jsafrane (Member):

I don't have any special use case for very slow-to-bind volumes; it would just break existing behavior. Users (and automation) are used to their PVCs being bound relatively quickly and they will be confused.

Someone on SIG Storage said that in some storage backends provisioning can take hours and is a well-planned operation, so users want to create PVCs and let them provision before they start a pod. @kangarlou, was that you?

kangarlou:

@jsafrane No, it wasn't me, but here are my two cents:

Today for dynamic provisioners, binding happens as soon as the provisioning is done. For the NFS and iSCSI volumes that we manage on NetApp storage, provisioning happens very quickly. To keep it fast, our provisioner doesn't format the iSCSI volumes and relies on kubelet to perform this functionality during attach. This is not a concern for us, but I can see cases where provisioning takes a long time for some plugins/provisioners (e.g., formatting large drives at the time of volume creation, or provisioning from snapshots).

Having said that, I don't think we should maintain two code paths. As far as external automation is concerned, because provisioning can take an arbitrarily long time (and in the worst case can fail), any external tooling should inspect the PVC status and not rely on the immediate binding behavior. I agree with @jsafrane that some users prefer to provision their PVCs before they start the pods, but for provisioners that take a long time, I don't think the proposed changes affect the user experience much, as such users see provisioning and binding as one step, not two separate steps.

Review thread on:

```
metadata:
  name: standard
provisioner: kubernetes.io/gce-pd
volumeBindingMode: WaitForFirstConsumer
```

Member:

wrt naming: does the extra word "volume" here add anything? In context it seems redundant. Would bindingMode, bindMode, or bindEvent be shorter and thus easier to use?

msau42 (Member, Author):

I was thinking that since this is a StorageClass, and not a volume object, it may not be obvious what the binding is referring to. But considering we don't have any sort of StorageClass binding concept today, shortening it to `bindMode` is fine with me.

Review thread on:

> to us-central1-a and us-central1-b. Topologies that are incompatible with the
> storage provider parameters will be enforced by the provisioner.
>
> apiVersion: storage.k8s.io/v1

Member:

I don't see the difference between this and the previous?

msau42 (Member, Author):

The UX is the same, but underneath, the provisioner will actually be choosing 2 zones out of the list. I guess I'm also implying here that if the user only specifies 1 zone, then the provisioner will fail because it requires a minimum of 2. I'll add that as another example.

Review thread on:

> have the following labels:
>
> * "zone: us-central1-a" and "rack: rack1" or,
> * "zone: us-central1-b" and "rack: rack1" or,

Member:

Throughout the system we have a mixed usage of string-string maps to represent label matches and more expressive selectors. Is this better as a map (which requires 3 options), or would it be better expressed as a full selector, e.g.

"zone in [us-central1-a]" and "rack in [rack1]" or,
"zone in [us-central1-b]" and "rack in [rack1, rack2]"

Did we consider it? I am not sure what the guidelines are for when to use one or the other.

msau42 (Member, Author):

Yes, I struggled with this and flipped back and forth many times. The main reason I went with this is that I didn't want to support selector operators other than "In". But I agree, this is more verbose because you have to explicitly spell out every combination. I'm willing to flip back to some new TopologySelector as I described above.

Review thread on:

```
// Each entry in the map is ANDed.
MatchNodeLabels map[string]string
}
```

Member:

I have to say that I am not sure I am happy with the API expression here. It feels very complicated and the names are not self-describing enough.

Why do we need a struct for the lowest level? Why not []map[string]string? E.g.

```
StorageClass:
  allowedTopologies:
    requiredTopologies:
      - rack: "1"
        zone: "a"
      - rack: "2"
        zone: "b"
```

Does that parse?

For that matter, why do we need required at all? Do you plan to really expand there? What are the likely expansions?

Language-wise "allowed" and "required" seem at odds, which makes it hard to look at and comprehend what it really means.

Are we sure we want to express the match as LABELS or as topologies? We're implementing topology as labels for now, but it has been proposed at least once that maybe topology should be its own concept. Also see question below about maps vs set selectors.

All told, I see something like:

```
StorageClass:
  topologyConstraints:
    - zone in [a, b]
      rack in ["1", "2"]
```

to be self-justifying. Anything beyond that needs robust rationalization. :)

msau42 (Member, Author):

The reason I shied away from selectors is that I did not see an easy way for provisioners to parse selector operators other than "In". But I would be open to creating a new TopologySelector that doesn't have an operator field, and only supports the "In" behavior of other selectors. Something like:

```
type TopologySelector struct {
    MatchLabelExpressions []TopologySelectorLabelRequirement
}

type TopologySelectorLabelRequirement struct {
    Key    string
    Values []string
}
```

And if we do end up having a non-label-based topology field in the future, then we can add a MatchTopologyExpressions field.

Regarding required, I wanted to leave room for "preferred" in the future, but I guess it could be an entirely new "preferredTopologies" field if we need to.
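For illustration, a StorageClass using this proposed TopologySelector shape could look like the sketch below. This is only a sketch under the proposal above; the exact field casing and the class name are assumptions, and the zone values come from the earlier examples:

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-restricted        # hypothetical
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
allowedTopologies:
  - matchLabelExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        values:
          - us-central1-a
          - us-central1-b
```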

verult (Contributor) commented May 14, 2018:

> StorageClass:
>   topologyConstraints:
>     - zone in [a, b]
>       rack in ["1", "2"]

Wouldn't this allow "zone=b,rack=1"? To entirely capture the original topology, you'd need something like:

```
StorageClass:
  topologyConstraints:
    - zone in [a]
      rack in ["1"]
    - zone in [b]
      rack in ["2"]
```

One inconvenience with the selector approach is that there are multiple ways to specify certain kinds of topology. For example, the following 3 are equivalent:

```
StorageClass:
  topologyConstraints:
    - zone in [a]
      rack in ["1"]
    - zone in [b]
      rack in ["1"]
    - zone in [b]
      rack in ["2"]

StorageClass:
  topologyConstraints:
    - zone in [a,b]
      rack in ["1"]
    - zone in [b]
      rack in ["2"]

StorageClass:
  topologyConstraints:
    - zone in [a]
      rack in ["1"]
    - zone in [b]
      rack in ["1", "2"]
```

I could potentially see users being confused about which form Kubernetes expects, when in reality any of them works. To me it seems better to have one layer of ORs instead of two, to guarantee that there's exactly 1 representation.

msau42 (Member, Author):

My updated example here correctly represents (one way) the example topology. I think having multiple representations is fine as long as they evaluate the same, and it is consistent with how selectors are specified for other Kubernetes objects.

Review thread on:

```
// (free and used) for provisioning at each topology value.
//
// The key of the map is the topology value. It is a flattened,
// comma-separated string of label values corresponding with
```

Member:

Why comma-separated rather than structured? We mostly prefer structured fields over string-encoded fields.

msau42 (Member, Author):

I didn't think that a struct could be represented as a map key in our API.

Review thread on:

```
// comma-separated string of label values corresponding with
// each label key specified in TopologyKeys in the same order.
//
// The value of the map is the total capacity for that topology
```

Member:

I don't really understand this structure. The example below doesn't seem to match this and this doesn't seem to have enough information. You have topo key, topo value, and capacity you want to report, but the map can only carry 2 of those 3?

msau42 (Member, Author):

Yeah, it's complicated... The topology keys are specified in a separate field, the map key is the flattened topology values, and the map value is the capacity.
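To make the shape concrete, here is a sketch of the proposed status (using the plural `topologyKeys` per the correction above; the zone and rack values are purely illustrative):

```
status:
  capacity:
    topologyKeys: ["zone", "rack"]
    capacities:
      "us-central1-a,rack1": 100Gi   # map key = flattened topology values, in the
      "us-central1-b,rack1": 200Gi   # same order as topologyKeys; map value = total
                                     # capacity for that topology domain
```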

Review thread on:

```
type StorageClass struct {
    ...
    Status *StorageClassStatus
```

Member:

Who is going to consume this? Is it required that all provisioners provide it or is it just optional? How to transition to using it? What are consumers going to do with it?

msau42 (Member, Author):

The scheduler will consume this, and it is optional. The scheduler looks at the capacity available in that topology domain to determine if dynamic provisioning is possible. The only volume plugin that will use this initially is local volumes; I don't have plans to use this for cloud zonal volumes. Those we can treat as having "infinite" capacity.

Review thread on:

```
//
// The value of the map is the total capacity for that topology
// value.
Capacities map[string]resource.Quantity
```

Member:

What fidelity is required? How fine-grained are the values? How often are they synchronized against reality? What happens if they are wrong in either direction? Who will consume this and how will they use it?

Review thread on:

> limitations. A missing map entry for a topology domain will be treated
> as zero capacity.
>
> A scheduler predicate will use the reported capacity to determine if dynamic

Member:

Why? This is a LOT of complexity for what seems like best-effort results that may not be accurate.

Member:

Also it will scale badly for large clusters.

Member:

Also it is insufficient to express whether a given volume can actually fit (e.g. could be fragmented).

msau42 (Member, Author):

This is similar to how a node today reports total CPU/memory. It doesn't actually report used vs available; instead the scheduler keeps track of used based on all the Pods that have been scheduled to that node. This way you avoid the latency issue of waiting for reported values to be updated after they are allocated.

So we're trying to follow a similar concept here, except the challenge with volumes is that their topology is not always scoped to a single node, so it doesn't make sense to report it as part of node capacity. The reported capacity would only change if you change the capacity of the storage pool, not when you allocate/free volumes from it. I expect the former will happen less frequently than the latter, so the scaling limits should be similar to how we process CPU/memory capacity.

But, as you mentioned, this does have some limitations in order to be accurate:

  • The storage pool has to be dedicated to this StorageClass and cannot be shared with other StorageClasses
  • Provisioning from the storage pool cannot be prone to fragmentation
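As a rough illustration of this accounting model (all names and numbers below are hypothetical, and the scheduler-side cache is internal state, not an API object):

```
# One topology domain as the scheduler's capacity cache might track it
domain: "node-1"
reportedTotal: 100Gi          # total capacity from the StorageClass status; changes rarely
provisionedPVCs:              # derived by the scheduler from already-provisioned claims
  pvc-a: 30Gi
  pvc-b: 20Gi
availableForProvisioning: 50Gi   # 100Gi - (30Gi + 20Gi)
```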

msau42 (Member, Author):

Here's an alternative idea. Considering that only local volumes plan to use the capacity reporting API at the moment, and there are some restrictions on how you deploy the storage pools to ensure that this capacity handling is actually accurate, we could try going with a CRD + scheduler extender approach.

The downsides are:

  • Deployment is more complicated.
  • We can't include this in Kubernetes core. The provisioner is already out of tree, so maybe this is not an issue.
  • The predicate becomes best-effort, due to the delay in the extender seeing the PVC provisioning API update. The fact that the scheduler round-robins on equal-priority nodes could help mitigate this, so that the same node is not chosen again for the next Pod.

Advantages are:

  • Since it's a CRD specific to local volumes, the API could be simpler and more flexible because we don't need to make it generic to all volume topologies.
  • You could potentially support scenarios like storage pools being shared across multiple StorageClasses, since it doesn't need to be strictly tied to a StorageClass anymore.
  • You can encode special behavior/requirements of your specific storage system into the extender logic.
  • It keeps StorageClass lean, and we don't have to regret not having put a Spec/Status in from the beginning.

@lichuqiang @jsafrane @thockin wdyt?
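Purely to illustrate this alternative (it is not part of the proposal), a local-volume capacity CRD instance might look something like the sketch below; the group, kind, and every field name here are hypothetical:

```
apiVersion: storage.example.com/v1alpha1   # hypothetical CRD group/version
kind: LocalStorageCapacity                 # hypothetical kind
metadata:
  name: node-1
spec:
  nodeName: node-1
  storageClassName: local-fast             # hypothetical StorageClass
  totalCapacity: 500Gi
```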

Comment:

I'm fine with the idea. But I think the key point is whether we need a generic solution.

If we can see that some storage types need the capacity capability (even potentially, not just now), then it should absolutely be made in-tree as a generic solution.

But if we can decide the feature will only be needed by local volumes, then it's reasonable to make it out-of-tree.

msau42 force-pushed the storage-topology-design branch 2 times, most recently from ab884c8 to 6f22849 on May 12, 2018 at 01:10
msau42 (Member, Author) commented May 12, 2018:

I squashed the PR into 3 commits:

  • Design up to the most recent review iteration
  • Addressed feedback from the most recent review, changing allowedTopologies to use a new TopologySelector field instead of an array of maps
  • Separated out the capacity reporting design, so we can discuss it separately from the rest

msau42 (Member, Author) commented May 14, 2018:

I removed the capacity API section, and will look at addressing it separately.

So the design proposal here includes:

  • Basic changes to scheduler and PV controller to delay dynamic provisioning and pass the selected node to the provisioner
  • API changes to allow an administrator to restrict topologies in the storageclass (replacement for plugin-specific zone/zones parameters today)

@saad-ali @thockin @jsafrane PTAL

Review thread on:

> When the feature gate is enabled, the PV controller needs to skip binding
> unbound PVCs with VolumBindingWaitForFirstConsumer and no prebound PVs
> to let it come through the scheduler path.
>
> Dynamic provisioning will also be skipped if
> VolumBindingWaitForFirstConsumer is set. The scheduler will signal to

Comment:

Dynamic provisioning will be skipped or delayed until the annSelectedNode annotation is set? Does this mean dynamic provisioners should look for this annotation, for storage classes with VolumeBindingMode set to VolumeBindingWaitForFirstConsumer, before they can act on the PVCs of that type?

msau42 (Member, Author):

  1. In-tree provisioners will be coordinated by the PV controller.
  2. The external-provisioner library in external-storage will be updated to also coordinate this.
  3. The CSI external-provisioner will coordinate this.
  4. If you don't use the external-provisioner library, then yes, you need to handle this in your own provisioner (see the sketch below).
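For provisioners that handle this themselves, the gist is to act on a PVC only once the scheduler has set the selected-node annotation. A sketch of such a PVC follows; the exact annotation key is an assumption based on the annSelectedNode name used in this design, and the claim and class names are hypothetical:

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-claim                               # hypothetical
  annotations:
    volume.kubernetes.io/selected-node: node-1    # assumed key for annSelectedNode;
                                                  # the provisioner waits for this
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-fast                    # class with volumeBindingMode: WaitForFirstConsumer
  resources:
    requests:
      storage: 100Gi
```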

Review thread on:

```
// Restrict the node topologies where volumes can be dynamically provisioned.
// Each volume plugin defines its own supported topology specifications.
// Each entry in AllowedTopologies is ORed.
AllowedTopologies *[]TopologySelector
```

lichuqiang commented May 16, 2018:

Does it have to be a pointer? I think it is enough to make it an array, and just leave it empty if we don't want to specify the field.

I tried to make it a pointer, but when running `make update`, it seems the generated code still treats the field as an array and tries to range over it:

```
# k8s.io/kubernetes/vendor/k8s.io/api/storage/v1
vendor/k8s.io/api/storage/v1/generated.pb.go:169:8: invalid argument m.AllowedTopologies (type *[]TopologySelector) for len
vendor/k8s.io/api/storage/v1/generated.pb.go:170:17: cannot range over m.AllowedTopologies (type *[]TopologySelector)
vendor/k8s.io/api/storage/v1/generated.pb.go:348:8: invalid argument m.AllowedTopologies (type *[]TopologySelector) for len
vendor/k8s.io/api/storage/v1/generated.pb.go:349:15: cannot range over m.AllowedTopologies (type *[]TopologySelector)
```

So if we have to make it this way, I guess we may need a fix in the update scripts.

msau42 (Member, Author):

Oops, yes, this does not need to be a pointer. I mistakenly left it from the last iteration, when it was a struct.

Comment:

so []*TopologySelector ?

msau42 (Member, Author):

`[]TopologySelector`

saad-ali (Member) left a comment:

Started reviewing. Haven't gotten very far yet.

Review thread on:

> provisioned PVs.
> * Allow a Pod to request one or more Persistent Volumes (PV) with topology that
>   are compatible with the Pod's other scheduling constraints, such as resource
>   requirements and affinity/anti-affinity policies.

saad-ali (Member):

nit: This still doesn't sound very clear to me.

Pods don't request anything. Pods specify things that users request. So, to me, this reads "Allow a user to specify in a pod one or more PVs with topology that is compatible with the pod's other constraints."

That still doesn't make sense to me. Do we prevent that now? Is that not possible now?

The following sound a lot clearer to me:

  • Allow the Kubernetes scheduler to influence where a volume is provisioned based on scheduling constraints on the pod referencing the volume, such as pod resource requirements and affinity/anti-affinity policies.
  • Ensure volume topology is used as a constraint when scheduling a pod referencing one or more Persistent Volumes (PV) with topology.

These may be implied, but I'd like to be very explicit about them.

Review thread on:

> * Allow a Pod to request one or more Persistent Volumes (PV) with topology that
>   are compatible with the Pod's other scheduling constraints, such as resource
>   requirements and affinity/anti-affinity policies.
> * Support arbitrary PV topology domains (i.e. node, rack, zone, foo, bar).

saad-ali (Member):

Maybe expand this to explicitly state that we want to do this without having to encode them as first-class objects in the k8s API.

Review thread on:

> are compatible with the Pod's other scheduling constraints, such as resource
> requirements and affinity/anti-affinity policies.
> * Support arbitrary PV topology domains (i.e. node, rack, zone, foo, bar).
> * Support topology for statically created PVs and dynamically provisioned PVs.

saad-ali (Member):

nit: personal preference "pre-provisioned" instead of "statically created". The symmetry with dynamically provisioned is nicer =)

msau42 (Member, Author) commented May 18, 2018:

This page is unicorning (failing to load) too frequently. I addressed all the comments here and opened a new PR: #2168

/close

jheiss commented May 21, 2018:

@msau42 Did you mean to refer to #2168 when closing this? (You referred to this same issue in your last comment.)

msau42 (Member, Author) commented May 21, 2018:

Thanks, updated!
