Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Topology aware dynamic provisioning design #1857

Closed
wants to merge 2 commits into from

Conversation

msau42
Copy link
Member

@msau42 msau42 commented Feb 27, 2018

This design is for 1.11 work to support dynamic provisioning.

Update volume topology design doc with:

  • Defining terms used to describe topology
  • Dynamic provisioning new APIs
  • In-tree and external provisioning interface changes
  • Scheduler implementation details
  • Examples for local, zonal, and multi-zonal volumes

@kubernetes/sig-storage-pr-reviews

/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Feb 27, 2018
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 27, 2018
@msau42
Copy link
Member Author

msau42 commented Feb 27, 2018

/assign @jsafrane @saad-ali @thockin @lichuqiang

@k8s-github-robot k8s-github-robot added the kind/design Categorizes issue or PR as related to design. label Feb 27, 2018
@msau42
Copy link
Member Author

msau42 commented Feb 27, 2018

/assign @bsalamat
@kubernetes/sig-scheduling-pr-reviews

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Feb 27, 2018
}

type StorageClassCapacity struct {
TopologyKey string
Copy link

@lichuqiang lichuqiang Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still I'm concerned about the Topology key in StorageClassCapacity, if we place it in the status field, and leave it to provisioners to specify, then for edge-case that:

When scheduler find a storage with StorageClassCapacity empty, it is hard to tell whether the storage is indeed without capacity limit, or it's just of a state that capacity has not been initialized by provisioners yet.

If we move the Topology key out of status field, and leave it to admin, it is thought to be fulfilled on StorageClass creation.
Thus, I think the edge-case above could be avoid.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then it will open up a new mode for failure, where administrator specifies the wrong topologyKey and the provisioner values won't match with it.

If a pod gets scheduled before provisioners have reported their capacity, then:

  • Provisioning could fail. In that case we should retry scheduling
  • Provisioning could succeed. Then we will not have marked the PVC with provisioned capacity information, so our calculations for used capacity will be incorrect. Is there a way we can handle this? Ie, have a StorageClass modify handler, when StorageClassCapacity is initialized, then initialize the capacity cache?

Copy link

@lichuqiang lichuqiang Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer the scheduler could recognize the StorageClass at the first beginning, but not manage things by trying Provisioning. That makes things complex and may have effect on performance.

If you are worried that admin may specify a wrong key, then a simple flag to imply whether the StorageClass is of capacity limit may also work.
I think the admin should at least know the StorageClass he creates is of capacity limit or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will the scheduler know that all the provisioners have initialized and reported capacity? (in the case of local volume provisioners)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nm, in that case, once the topologyKey is specified, then capacity is zero until otherwise updated.

Copy link

@lichuqiang lichuqiang Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will the scheduler know that all the provisioners have initialized and reported capacity? (in the case of local volume provisioners)

I think it doesn't have to know, what the scheduler cares are:

  1. Whether the StorageClass is of capacity limit
  2. If so, whether it's capacity could satisfy the requirement.

No matter it's capacity smaller than requirement, or it's in state of uninitialized, it's both in state of unavailable for scheduler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed with @saad-ali on this. Our consensus is that it's better for topologyKey to be reported by the provisioner. Depending on the user to specify the topologyKey, or even a flag, opens up the possibility that they could incorrectly set it. And they don't have any choice in what to set as the value. For local storage, it must always be set to "kubernetes.io/hostname", there is no other value that will work.

In the worst case, there will be a period of time during cluster initialization where the provisioners may not have reported capacity yet. The consequences are not fatal. Provisioning could fail until the provisioners are up and running (which is expected behavior in general). Provisioning could succeed, and in that case we'll have to resync the capacity cache later. We could try to mitigate this scenario, by recommending that provisioners do not process any provisioning requests until they have fully initialized (including reporting capacity).

@msau42
Copy link
Member Author

msau42 commented Feb 27, 2018

TODO: Add section about predicate ecache invalidation and active pod queues

@msau42
Copy link
Member Author

msau42 commented Feb 27, 2018

/assign @verult

Some volume types have capacity limitations per topology domain. For example,
local volumes have limited capacity per node. Provisioners for these volume
types can report these capacity limitations through a new StorageClass status
field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been warned before against having constantly changing metrics (like resource usage) be placed as an API status, and one argument is that it could easily overwhelm the API server. Maybe the capacity information should be plumbed to the scheduler through existing resource usage monitoring pipelines.

Might be good to get a review from sig-instrumentation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The capacity reported by the provisioner should be total capacity, not available capacity. I don't anticipate total capacity changing very frequently.

The challenge with existing resource usage tracking is that it is done on a per node level, whereas volume resource limits may not necessarily be at the node boundary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are plans to eventually plumb PVC usage stats through Metrics Server, as part of the work to get these stats exposed to kubectl. I believe the scheduler also talks to the Metrics Server.

Currently there's a limitation that PVC usage stats can only be measured when the underlying volume is mounted in a pod, but maybe we can expand this feature and add the capability of having volume plugins report these stats as well.

@jingxu97

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe for PVC usage stats, it's still tied to a Pod that is scheduled to a specific node. That way you can hit all the kubelet metrics endpoints. For this design to handle getting capacity from metrics, we would need to collect all the metrics from a central entity, and not per node to handle volumes with topologies spanning more than a single node.

I am unaware of the scheduler talking to the metrics server. My understanding is it uses resource capacity reported in the Node object. @bsalamat could you clarify?

But besides the initial StorageClass creation, I don't anticipate reported capacity to be updated very frequently.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we have also have the challenge that external provisioners need to report their capacity somewhere, whether on an API object (like proposed), or to some central server.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, as I'm also focusing on design of local storage provisioning, I wonder what the way to talk to the metrics server looks like.

I guess users are more likely to look forward to things they are familiar

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed a bit with @verult, and we agreed that going the metrics server approach would require a lot of extra plumbing that doesn't exist today. So we're still leaning towards the current approach of reporting total (not used) capacity in the StorageClass and calculating and caching used from provisioned PVCs.

kube-scheduler, kube-controller-manager, and all kubelets.

StorageClass.AllowedTopology and StorageClassCapacity fields will be controlled
by the DynamicProvisioningScheduling feature gate, and must be configured in the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seem the two are not completely independent, maybe we should control both binding and provisioning with a same Feature Gate?

If we enable VolumeScheduling and disable DynamicProvisioningScheduling, then what's the expected behavior in pv controller? for PVCs of mode WaitForFirstConsumer, skip binding but exec provisioning directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Challenge is static local PVs is beta, and dynamic provisioning will be alpha, so we can't have one feature gate control both. I see VolumeScheduling as a superset, so if that is disabled, then the delayed dynamic provisioning is also disabled. Once delayed dynamic provisioning goes to beta, we can combine the two feature gates.

}

type VolumeProvisioningTopology struct {
Required map[string]VolumeTopologyValues
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@saad-ali found a use case that this cannot represent:
An administrator wants to restrict provisioning to either "zone A, rack 1" or "zone B, rack 1".

To fix this, we could go back to using a restricted node selector (only support the IN operator), or follow the CSI proposed representation which sort of does the reverse of the node selector: OR, then AND.

A restricted NodeSelector would be more familiar to Kubernetes users, so I'm leaning towards that approach. As long as we restrict the operators to just IN, then it can be translated internally into the CSI representation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm flip flopping again. A node selector specifies a subset off of a set of labels on the node object. But in this case, we are not selecting from a set of labels on an existing object. In this case, we want to limit the topology of a PV object that will be created in the future. So I think following the CSI representation as a list of maps makes more sense, rather then specifying a selector on nothing.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@verult verult Mar 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Another semantic difference is: VolumeTopology is used to select a set of nodes at which a PV is available, whereas NodeSelector is used to select a single node.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NodeSelector is used to select a single node

that's a little surprising... selectors typically select a set of things

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt you are right, I was using a wrong frame of thinking, so I take that back.

Here's another point, similar to what @msau42 said, that helped me understand the difference: ultimately the scheduler needs a set of nodes to choose from; a NodeSelector directly specifies that set, whereas VolumeTopology indirectly specifies it by providing a super set which is then refined further by the volume plugin. Please correct me if this is the wrong understanding.

```
1. For static PV binding:
1. Prebind the PV by updating the `PersistentVolume.ClaimRef` field.
2. If the prebind fails, revert the cache updates.
2. For in-tree and external dynamic provisioning:
1. Set `annStorageProvisioner` on the PVC.
1. Set `annScheduledNode` on the PVC.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

annSelectedNode


## API Upgrades

### Zonal PV Upgrade
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an example for multi-zonal PV upgrade because the zone label value format is special.

@@ -141,6 +185,9 @@ spec:
```

#### Zonal Volume
In this example, the volume can only be accessed from nodes that have the
label key `failure-domain.beta.kubernetes.io/zone' and label value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: tick instead of apostrophe

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done thanks

...
status:
capacity:
topologyKey: zone,rack
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keys


In order for the scheduler to calculate provisioned capacity on PVs that have
been Released (PVC has been deleted), when binding a PV to a PVC, the PV also
needs to copy the PVC's `AnnProvisionerTopology` annotation, if it exists.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AnnProvisionedTopology

parameters:
type: pd-standard
allowedTopologies:
requiredToplogies:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requiredTopologies

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
Name: zonal-class

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name

* StorageClass will be required to get the new binding behavior, even if dynamic
provisioning is not used (in the case of local storage).
* We have to maintain two different paths for volume binding.
* We have to maintain two different code paths for volume binding.
* We will be depending on the storage admin to correctly configure the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty significant downside, IMO. What this section doesn't really cover is WHY we need both, other than transition. Or rather, why there isn't a less permanent way to transition.

I know Jan put forth a use-case of very slow-to-bind volumes - if we still hold that as a use-case for this, can we detail it a bit more? API fields are forever, and this is one I think we'd rather not have if we don't NEED it.

e.g. if we could get away with an annotation for some transitionary period, that would be better. I'd like to se an explanation of why that doesn't work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we can change this behavior to the default, unless we go v2, for the following reasons:

  • Users may have already developed automation/processes assuming the old binding behavior
  • Some big volumes can take hours to provision, so provisioning is a planned event well in advance of workload creation
  • Globally accessible storage doesn't benefit from this change in behavior

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any special use case for very slow-to-bind volumes, it would just break existing behavior. Users (and automation) are used to their PVCs being bound relatively quickly and they will be confused.

Someone on SIG-storage said that in some storage backends provisioning can take hours and it's well planned operation, so users want to create PVCs and let them provision before they start a pod. @kangarlou, was that you?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsafrane No, it wasn't me, but here is my two cents:
Today for dynamic provisioners, binding happens as soon as the provisioning is done. For the NFS and iSCSI volumes that we manage on NetApp storage, provisioning happens very quickly. To keep it fast, our provisioner doesn't format the iSCSI volumes and relies on kubelet to perform this functionality during attach. This is not a concern for us, but I can see cases where provisioning takes a long time for some plugins/provisioners (e.g., formatting large drives at the time of volume creation, provisioning from snapshots). Having said that, I don't think we should maintain two code paths. As far as external automation is concerned, because provisioning can take arbitrarily long time (in the worst case, the provisioning can fail), any external tooling should inspect the PVC status and not rely on the immediate binding behavior. I agree with @jsafrane that some users prefer to provision their PVCs before they start the pods, but for provisioners that take a long time, I don't think the proposed changes affect the user experience much as such users see provisioning and binding as one step, not as two separate steps.

metadata:
name: standard
provisioner: kubernetes.io/gce-pd
volumeBindingMode: WaitForFirstConsumer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrt naming: does the extra word "volume" here add anything? In context it seems redundant. Would bindingMode or bindMode or bindEvent or something beshorter and thus easier to use?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking since this is a storageclass, and not a volume object, it may not be obvious what the binding is referring to. But considering we don't have any sort of storageclass binding concept today, shortening it to `bindMode" is fine with me.

to us-central1-a and us-central1-b. Topologies that are incompatible with the
storage provider parameters will be enforced by the provisioner.
```
apiVersion: storage.k8s.io/v1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the difference between this and the previous?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UX is the same, but underneath, the provisioner will actually be choosing 2 zones out of the list. I guess I'm also implying here that if the user only specifies 1 zone, then the provisioner will fail because it requires a min of 2. I'll add that as another example

have the following labels:

* "zone: us-central1-a" and "rack: rack1" or,
* "zone: us-central1-b" and "rack: rack1" or,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throughout the system we have a mixed-usage of string-string maps to represent label matches and more expressive selectors. Is this better as a map (which requires 3 options) or would it better expressed as a full selector, eg.

"zone in [us-central1-a]" and "rack in [rack1]" or,
"zone in [us-central1-b]" and "rack in [rack1, rack2]"

Did we consoider it? I am not sure what the guidelines are for when to use one or the other.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I struggled with this and flipped back and forth many times. The main reason I went with this is because I didn't want to support selector operators other than "In". But I agree, this is more verbose because you have to explicitly spell out every combination. I'm willing to flip back to some new TopologySelector as I described above.

// Each entry in the map is ANDed.
MatchNodeLabels map[string]string
}
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to say that I am not sure I am happy with the API expression here. It feels very complicated and the names are not self-describing enough.

Why do we need a struct for the lowest level? Why not []map[string]string ? E.g.

StorageClass:
  allowedTopologies:
    requiredTopologies:
      - rack: "1"
        zone: "a"
      - rack: "2"
        zone: "b"

Does that parse?

For that matter, why do we need required at all? Do you plan to really expand there? What are the likely expansions?

Language-wise "allowed" and "required" seem at odds, so it make it hard to look at and comprehend what it really means.

Are we sure we want to express the match as LABELS or as topologies? We're implementing topology as labels for now, but it has ben proposed at least once that maybe topology should be its own concept. Also see question below about maps vs set selectors.

all told, I see something like:

StorageClass:
  topologyConstraints:
    - zone in [a, b]
      rack in ["1", "2"]

to be self-justifying. Anything beyond that needs robust rationalization. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I shied away from selectors is because I did not see an easy way for provisioners to be able to parse selector operators other than "In". But I would be open to creating a new TopologySelector that doesn't have an operator field, and only supports the "In" behavior of other selectors. Something like:

type TopologySelector {
   MatchLabelExpressions []TopologySelectorLabelRequirement
}

type TopologySelectorLabelRequirement {
  Key string
  Values []string
}

And if we do end up having a non-label-based topology field in the future, then we can add a MatchTopologyExpressions field.

Regarding required, I wanted to leave room for "preferred" in the future, but I guess it could be an entirely new "preferredTopologies" field if we need to.

Copy link
Contributor

@verult verult May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StorageClass:
 topologyConstraints:
   - zone in [a, b]
     rack in ["1", "2"]

Wouldn't this allow "zone=b,rack=1"? To entirely capture the original topology, you'd need something like:

StorageClass:
  topologyConstraints:
    - zone in [a]
      rack in ["1"]
    - zone in [b]
      rack in ["2"]

One inconvenience with the selector approach is there are multiple ways to specify certain kinds of topology. For example, the following 3 are equivalent:

StorageClass:
  topologyConstraints:
    - zone in [a]
      rack in ["1"]
    - zone in [b]
      rack in ["1"]
    - zone in [b]
      rack in ["2"]
StorageClass:
  topologyConstraints:
    - zone in [a,b]
      rack in ["1"]
    - zone in [b]
      rack in ["2"]
StorageClass:
  topologyConstraints:
    - zone in [a]
      rack in ["1"]
    - zone in [b]
      rack in ["1", "2"]

I could potentially see users being confused which form Kubernetes expects, when in reality either of them work. To me it seems better to have one layer of ORs instead of two, to guarantee that there's exactly 1 representation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My updated example here correctly represents (one way) the example topology. I think having multiple representations is fine as long as they evaluate the same, and it is consistent with how selectors are specified for other Kubernetes objects.

// (free and used) for provisioning at each topology value.
//
// The key of the map is the topology value. It is a flattened,
// comma-separated string of label values corresponding with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why comma seperated rather than structured? We mostly prefer structured fiedls over string-encoded fields..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think that a struct could be represented as a map key in our API.

// comma-separated string of label values corresponding with
// each label key specified in TopologyKeys in the same order.
//
// The value of the map is the total capacity for that topology
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand this structure. The example below doesn't seem to match this and this doesn't seem to have enough information. You have topo key, topo value, and capacity you want to report, but the map can only carry 2 of those 3?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's complicated... The topo key is specified as a separate field. And the map key is the topo values. And the map value is the capacity.

type StorageClass struct {
...

Status *StorageClassStatus
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who is going to consume this? Is it required that all provisioners provide it or is it just optional? How to transition to using it? What are consumers going to do with it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scheduler will consume this and it is optional. The scheduler looks at the capacity available in that topology domain to determine if dynamic provisioning is possible. The only volume plugin that will use this initially is local volumes. I don't have plans to use this for cloud zonal volumes. Those we can treat with "infinite" capacity.

//
// The value of the map is the total capacity for that topology
// value.
Capacities map[string]resource.Quantity
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What fidelity is required? How fine-grained are the values? How often are they syncronized against reality? What happens if they are wrong in either direction? Who will consume this and how will they use it?

limitations. A missing map entry for a topology domain will be treated
as zero capacity.

A scheduler predicate will use the reported capacity to determine if dynamic
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? This is a LOT of complexity for what seems like best-effort results that may not be accurate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it will scale badly for large clusters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it is insufficient to express whether a given volume can actually fit (e.g. could be fragmented).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is similar to how a node today reports total CPU/memory. It doesn't actually report used vs available, instead the scheduler keeps track of used based on all the Pods that have been scheduled to that node. This way you avoid the latency issue of waiting for reported values to updated after they are allocated.

So we're trying to follow a similar concept here, except the challenge with volumes is that their topology is not always scoped to a single node, so it doesn't make sense to report as part of node capacity. The reported capacity would only change if you change the capacity of the storage pool, but not when you allocate/free volumes from it. I expect the former will happen less frequently than the latter, so the scaling limits should be similar to how we process cpu/memory capacity.

But this does have some limitations to be accurate, like you mentioned:

  • The storage pool has to be dedicated to this StorageClass and cannot be shared with other StorageClasses
  • Provisioning from the storage pool cannot be prone to fragmentation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an alternative idea. Considering that only local volumes plan to use the capacity reporting api at the moment, and there are some restrictions on how you deploy the storage pools to ensure that this capacity handling is actually accurate, we could try going with a CRD + scheduler extender approach.

The downsides are:

  • Deployment is more complicated
  • We can't include this into Kubernetes core. The provisioner is already out of tree, maybe this not an issue.
  • The predicate will become best-effort, due to the delay of seeing the PVC provisioning API update in the extender. The fact that the scheduler round robins on equal priority nodes could help mitigate this, so that the same node is not chosen again for the next Pod.

Advantages are:

  • Since it's a CRD specific to local volumes, the API could be simpler and more flexible because we don't need to make it generic to all volume topologies
  • You potentially could support scenarios like storage pools being shared across multiple StorageClasses, since it doesn't need to be strictly tied to a StorageClass anymore
  • You can encode special behavior/requirements of your specific storage system into the extender logic.
  • Keeps StorageClass lean and we don't have to regret not putting a Spec/Status in the beginning

@lichuqiang @jsafrane @thockin wdyt?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with the idea. But I think the key point lies in whether we need a generic solution.

If we can see some storages are in need of the capacity ability (even potential but not for now), then it should absolutely be made in-tree and as a generic one.

But if we can decide the feature will only needed by local volumes. then it's reasonable to make it out-of-tree.

@msau42 msau42 force-pushed the storage-topology-design branch 2 times, most recently from ab884c8 to 6f22849 Compare May 12, 2018 01:10
@msau42
Copy link
Member Author

msau42 commented May 12, 2018

I squashed the PR into 3 commits:

  • Design up the most recent review iteration
  • Addressed feedback from the most recent review, changing allowedTopologies to use a new TopologySelector field instead of an array of maps
  • Separated out the capacity reporting design, so we can discuss it separately from the rest

@msau42 msau42 force-pushed the storage-topology-design branch from 6f22849 to 3949ced Compare May 14, 2018 17:38
@msau42 msau42 force-pushed the storage-topology-design branch from 3949ced to 023cb11 Compare May 14, 2018 17:41
@msau42
Copy link
Member Author

msau42 commented May 14, 2018

I removed the capacity api section, and will look at addressing it separately.

So the design proposal here includes:

  • Basic changes to scheduler and PV controller to delay dynamic provisioning and pass the selected node to the provisioner
  • API changes to allow an administrator to restrict topologies in the storageclass (replacement for plugin-specific zone/zones parameters today)

@saad-ali @thockin @jsafrane PTAL

When the feature gate is enabled, the PV controller needs to skip binding
unbound PVCs with VolumBindingWaitForFirstConsumer and no prebound PVs
to let it come through the scheduler path.

Dynamic provisioning will also be skipped if
VolumBindingWaitForFirstConsumer is set. The scheduler will signal to
VolumBindingWaitForFirstConsumer is set. The scheduler will signal to

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dynamic provisioning will be skipped or delayed until annSelectedNode annotation is set? Does this mean dynamic provisioners should look for this annotation for the storage classes with VolumeBindingMode set to VolumeBindingWaitForFirstConsumer before they can act on the PVCs of that type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. In-tree provisioners will be coordinated by the PV controller.
  2. The external-provisioner library in external-storage will be updated to also coordinate this.
  3. The CSI external-provisioner will coordinate this.
  4. If you don't use the external-provisioner library, then yes, you need to handle this in your own provisioner.

// Restrict the node topologies where volumes can be dynamically provisioned.
// Each volume plugin defines its own supported topology specifications.
// Each entry in AllowedTopologies is ORed.
AllowedTopologies *[]TopologySelector
Copy link

@lichuqiang lichuqiang May 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it have to be a pointer? I think it fairly enough to make it an array, and just leave it empty if we don't wanna specify the field.

I tried to make it a pointer, but when running make update, seems the update scripts still treat the field as array, and tried to range over it:

# k8s.io/kubernetes/vendor/k8s.io/api/storage/v1
vendor/k8s.io/api/storage/v1/generated.pb.go:169:8: invalid argument m.AllowedTopologies (type *[]TopologySelector) for len
vendor/k8s.io/api/storage/v1/generated.pb.go:170:17: cannot range over m.AllowedTopologies (type *[]TopologySelector)
vendor/k8s.io/api/storage/v1/generated.pb.go:348:8: invalid argument m.AllowedTopologies (type *[]TopologySelector) for len
vendor/k8s.io/api/storage/v1/generated.pb.go:349:15: cannot range over m.AllowedTopologies (type *[]TopologySelector)

So if we have to make it this way, I guess we may need a fix on the update scripts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops yes this does not need a pointer. I mistakenly left it from the last iteration when it was a struct

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so []*TopologySelector ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[] TopologySelector

Copy link
Member

@saad-ali saad-ali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Started reviewing. Haven't gotten very far yet.

provisioned PVs.
* Allow a Pod to request one or more Persistent Volumes (PV) with topology that
are compatible with the Pod's other scheduling constraints, such as resource
requirements and affinity/anti-affinity policies.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This still doesn't sound very clear to me.

Pods don't request anything. Pods specify things that users request. So, to me, this reads "Allow a user to to specify in a pod one or more PV with topology that is compatible with the pod's other constraints."

That still doesn't make sense to me. Do we prevent that now? Is that not possible now?

The following sound a lot more clear to me:

  • Allow the Kubernetes scheduler to influence where a volume is provisioned based on scheduling constraints on the pod referencing the volume, such as pod resource requirements and affinity/anti-affinity policies.
  • Ensure volume topology is used as a constraint when scheduling a pod referencing one or more Persistent Volumes (PV) with topology.

These may be implied, but I'd like to be very explicit about them.

* Allow a Pod to request one or more Persistent Volumes (PV) with topology that
are compatible with the Pod's other scheduling constraints, such as resource
requirements and affinity/anti-affinity policies.
* Support arbitrary PV topology domains (i.e. node, rack, zone, foo, bar).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe expand this to explicitly state that we want to this without having to encode as first class objects in the k8s API.

are compatible with the Pod's other scheduling constraints, such as resource
requirements and affinity/anti-affinity policies.
* Support arbitrary PV topology domains (i.e. node, rack, zone, foo, bar).
* Support topology for statically created PVs and dynamically provisioned PVs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: personal preference "pre-provisioned" instead of "statically created". The symmetry with dynamically provisioned is nicer =)

@msau42
Copy link
Member Author

msau42 commented May 18, 2018

This is unicorning too frequently. I addressed all the comments here, and opened up a new PR #2168

/close

@jheiss
Copy link

jheiss commented May 21, 2018

@msau42 Did you mean to refer to #2168 when closing this? (You referred to this same issue in your last comment.)

@msau42
Copy link
Member Author

msau42 commented May 21, 2018

Thanks, updated!

@msau42 msau42 deleted the storage-topology-design branch November 17, 2018 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/design Categorizes issue or PR as related to design. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.