# Dynamic attached volume limits

## Goals
Currently the number of volumes that can be attached to a node is either hard-coded or configurable only via an environment variable. In addition, the existing limits apply only to a few well-known volume types, such as AWS EBS and GCE PD, and are not available to all volume plugins.

This proposal enables any volume plugin to specify such limits and allows the same volume type to have different limits depending on the type of node.
## Implementation Design

### Prerequisite

* 1.11: This feature, along with the API and CLI changes it requires, will be protected by an alpha feature gate. We are planning to call the feature `AttachVolumeLimit`.
* 1.12: This feature will be behind a beta feature gate and enabled by default.
### API Changes

There is no new API needed for this feature. However, the existing `node.Status.Capacity` and `node.Status.Allocatable` fields will be extended to also report the volume limits available on the node.

The key name that stores the volume limit will start with the prefix `attachable-volumes-`. The volume limit key will respect the format restrictions applied to Kubernetes resource names. Volume limit keys for existing plugins might look like (a brief example of reading these values follows the list):
* `attachable-volumes-aws-ebs`
* `attachable-volumes-gce-pd`
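For illustration only, here is a minimal sketch of how these keys surface in the node API, reading every `attachable-volumes-*` entry from a node's allocatable resources; the helper name `printVolumeLimits` is hypothetical and not part of this proposal.

```go
package example

import (
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
)

// printVolumeLimits prints every attachable volume limit a node reports under
// node.Status.Allocatable, e.g. "attachable-volumes-aws-ebs: 39".
func printVolumeLimits(node *v1.Node) {
	for name, quantity := range node.Status.Allocatable {
		if strings.HasPrefix(string(name), "attachable-volumes-") {
			fmt.Printf("%s: %d\n", name, quantity.Value())
		}
	}
}
```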
The `IsScalarResourceName` check will be extended to cover volume attach limits:
```go
// IsStorageAttachLimit returns true if the resource name carries an attachable
// volume limit, i.e. it starts with the prefix described above
// (v1.ResourceStoragePrefix holds the "attachable-volumes-" prefix).
func IsStorageAttachLimit(name v1.ResourceName) bool {
	return strings.HasPrefix(string(name), v1.ResourceStoragePrefix)
}

// IsScalarResourceName covers Extended, Hugepages, prefixed native and
// attachable volume limit resources.
func IsScalarResourceName(name v1.ResourceName) bool {
	return IsExtendedResourceName(name) || IsHugePageResourceName(name) ||
		IsPrefixedNativeResource(name) || IsStorageAttachLimit(name)
}
```
The `attachable-volumes-*` prefix cannot be used as a resource name in pod specs, because it does not adhere to the naming rules enforced by the following function:
```go
func IsStandardContainerResourceName(str string) bool {
	return standardContainerResources.Has(str) || IsHugePageResourceName(core.ResourceName(str))
}
```

Additional validation tests will be added to make sure we do not accidentally break this.
#### Alternative to using the `attachable-volumes-` prefix

We also considered using the existing `GetPluginName` interface of volume plugins as the key in `node.Status.Capacity`. Ultimately we decided against it, because most in-tree plugin names start with `kubernetes.io/` and we needed a uniform way to identify storage-related capacity limits in `node.Status`.
### Changes to scheduler

The scheduler will retrieve the available attach limits of a node from `node.Status.Allocatable` and store them in the `nodeInfo` cache. Volume limits will be treated like any other scalar resource.

For the `AWS-EBS`, `GCE-PD` and `AzureDisk` volume types, the existing `MaxPD*` predicates will be updated to use the volume attach limits available from the node's allocatable property. To remain backward compatible, the scheduler will fall back to the older logic if no limit is set in `node.Status.Allocatable` for the AWS, GCE and Azure volume types.
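The sketch below illustrates the intended fallback behaviour for one volume type; the function name `fitsEBSVolumeLimit` and the legacy default of 39 volumes are illustrative assumptions, not the actual predicate code.

```go
package example

import (
	v1 "k8s.io/api/core/v1"
)

// fitsEBSVolumeLimit checks whether attaching newVolumes additional EBS volumes
// to a node that already has attachedVolumes would exceed the node's limit.
func fitsEBSVolumeLimit(node *v1.Node, attachedVolumes, newVolumes int) bool {
	limit := int64(39) // legacy hard-coded default, used when the node reports no limit
	if q, ok := node.Status.Allocatable[v1.ResourceName("attachable-volumes-aws-ebs")]; ok {
		limit = q.Value() // dynamic limit published by the kubelet
	}
	return int64(attachedVolumes+newVolumes) <= limit
}
```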
### Setting of limit for existing in-tree volume plugins

The volume limit for existing volume plugins will be set by querying the volume plugin. The following interface will be added for volume plugins. Note that the scheduler will also call this interface when counting in-use volumes on a node, to determine which limit key each volume counts against; today the scheduler does not call volume plugin interfaces, so this is a new dependency:
```go
type VolumePluginWithAttachLimits interface {
	// VolumeLimitKey returns the key name used for storing volume limits
	// inside node capacity; it must start with the attachable-volumes- prefix.
	VolumeLimitKey(spec *Spec) string
	// GetVolumeLimits returns the volume limits of the plugin for the current node.
	GetVolumeLimits() (map[string]int64, error)
}
```
For in-tree plugins, the plugin supplies the entire limit key; for CSI drivers, the `attachable-volumes-csi-` prefix is added by the system (see the CSI section below).

When queried, the plugin will use the `ProviderName` function of the cloud provider to check whether the plugin is usable on the node. For example, calling `GetVolumeLimits` on the `aws-ebs` plugin while the node is running with the `gce` cloud provider will return an error.
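A minimal sketch of that guard, assuming the in-tree EBS plugin; the receiver type, field names and the limit value shown are illustrative, not the actual implementation.

```go
package example

import (
	"fmt"

	"k8s.io/kubernetes/pkg/volume"
)

// awsElasticBlockStorePlugin stands in for the real in-tree EBS plugin.
type awsElasticBlockStorePlugin struct {
	host volume.VolumeHost
}

// GetVolumeLimits refuses to report limits when the node is not running on AWS.
func (p *awsElasticBlockStorePlugin) GetVolumeLimits() (map[string]int64, error) {
	cloud := p.host.GetCloudProvider()
	if cloud == nil || cloud.ProviderName() != "aws" {
		return nil, fmt.Errorf("expected aws cloud provider, plugin is not usable on this node")
	}
	return map[string]int64{
		"attachable-volumes-aws-ebs": 39, // default; adjusted per node type, see below
	}, nil
}
```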
The kubelet will query the volume plugins inside the `kubelet.initialNode` function and populate `node.Status` with the returned values. The limits are therefore set at node registration time, before pods can be scheduled on the node, and are expected to stay invariant throughout the lifetime of the node.
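A rough sketch of that population step, assuming the interface above; the helper name `setVolumeLimits` and the surrounding wiring are illustrative.

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/kubernetes/pkg/volume"
)

// setVolumeLimits copies plugin-reported attach limits into the node object
// that the kubelet registers with the API server.
func setVolumeLimits(node *v1.Node, plugins []volume.VolumePluginWithAttachLimits) {
	for _, plugin := range plugins {
		limits, err := plugin.GetVolumeLimits()
		if err != nil {
			// Plugin is not usable on this node (for example, wrong cloud provider).
			continue
		}
		for key, value := range limits {
			quantity := *resource.NewQuantity(value, resource.DecimalSI)
			node.Status.Capacity[v1.ResourceName(key)] = quantity
			node.Status.Allocatable[v1.ResourceName(key)] = quantity
		}
	}
}
```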
For GCE and AWS, `GetVolumeLimits` will return limits that depend on the node type. The plugin already has the node name accessible via the `VolumeHost` interface, so it can look up the node type and return the corresponding volume limits.
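For example, a node-type lookup for AWS could look roughly like the sketch below; the instance-type prefixes and the numbers 25 and 39 are illustrative assumptions about EBS attach limits, not authoritative values.

```go
package example

import "strings"

// ebsVolumeLimitForInstance returns the number of EBS volumes that can be
// attached to the given instance type.
func ebsVolumeLimitForInstance(instanceType string) int64 {
	// Newer Nitro-based instance families share attachment slots with other
	// devices and therefore support fewer volume attachments.
	for _, prefix := range []string{"m5", "c5", "r5", "t3"} {
		if strings.HasPrefix(instanceType, prefix) {
			return 25
		}
	}
	return 39
}
```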
We do not aim to cover all in-tree volume types. We will support the dynamic volume limits proposed here for the following volume types:

* GCE-PD
* AWS-EBS
* AzureDisk

We expect to add incremental support for other volume types.
### Changes for Kubernetes 1.12

For Kubernetes 1.12, we are adding support for CSI and moving the feature to beta.

#### CSI support

A new function will be added to `pkg/volume/util/attach_limit.go` which returns the CSI attach limit resource name.

The function's interface will be:
```go
import (
	"crypto/sha1"
	"encoding/hex"
)

const (
	// CSIAttachLimitPrefix is the prefix of CSI attach limit resource names.
	CSIAttachLimitPrefix = "attachable-volumes-csi-"

	// ResourceNameLengthLimit is the maximum length of a resource name.
	ResourceNameLengthLimit = 63
)

func GetCSIAttachLimitKey(driverName string) string {
	csiPrefixLength := len(CSIAttachLimitPrefix)
	totalkeyLength := csiPrefixLength + len(driverName)
	if totalkeyLength >= ResourceNameLengthLimit {
		// Take the first 23 characters of the driver name and append the
		// first 16 characters of the SHA-1 hash of the full driver name.
		charsFromDriverName := driverName[:23]
		hash := sha1.New()
		hash.Write([]byte(driverName))
		hashed := hex.EncodeToString(hash.Sum(nil))[:16]
		return CSIAttachLimitPrefix + charsFromDriverName + hashed
	}
	return CSIAttachLimitPrefix + driverName
}
```
This function will be used both on the node and in the scheduler to determine the CSI attach limit key. The value of the limit will be retrieved via the `GetNodeInfo` CSI RPC call and set on the node if it is non-zero. Handling overrides or changes to the limit after the initial node configuration is out of scope for this release and will be tackled in 1.13.
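As a minimal sketch of the kubelet-side conversion, assuming the function above lives in `pkg/volume/util`; the helper `csiAttachLimit` and the way the `GetNodeInfo` response value is passed in are illustrative.

```go
package example

import "k8s.io/kubernetes/pkg/volume/util"

// csiAttachLimit converts the max volumes-per-node value reported by a CSI
// driver's GetNodeInfo response into the limit entry published on the node.
func csiAttachLimit(driverName string, maxVolumesPerNode int64) map[string]int64 {
	limits := map[string]int64{}
	if maxVolumesPerNode > 0 { // zero means the driver does not report a limit
		limits[util.GetCSIAttachLimitKey(driverName)] = maxVolumesPerNode
	}
	return limits
}
```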
**Other options**

Alternatively, we considered storing the attach limit resource name in the `CSIDriver` object introduced as part of the https://github.com/kubernetes/community/pull/2514 proposal.

This would work, but it depends on acceptance of that proposal. We can always migrate attach limit resource names to values defined in the `CSIDriver` object in a later release. If a `CSIDriver` object is available and has an attach limit key, the kubelet could use that key; otherwise it will fall back to `GetCSIAttachLimitKey`.

The scheduler can likewise check for the presence of a `CSIDriver` object and the corresponding key in the node object, otherwise falling back to the `GetCSIAttachLimitKey` function.
##### Changes to scheduler

To support attach limits for CSI, a new predicate called `CSIMaxVolumeLimitChecker` will be added. It will use the `GetCSIAttachLimitKey` function defined above to derive the attach limit resource name.
The predicate will be a no-op if the feature gate is not enabled or if attach limits are not available from the node object.
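A simplified sketch of what the predicate's counting could look like; `countCSIVolumesFits`, the way volumes are grouped per driver, and the parameters are illustrative, not the actual `CSIMaxVolumeLimitChecker` implementation.

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/volume/util"
)

// countCSIVolumesFits returns true if adding the pod's CSI volumes (grouped by
// driver name) keeps every driver under the limit reported by the node.
func countCSIVolumesFits(node *v1.Node, inUse, requested map[string]int) bool {
	for driver, newCount := range requested {
		key := v1.ResourceName(util.GetCSIAttachLimitKey(driver))
		limit, ok := node.Status.Allocatable[key]
		if !ok {
			continue // node publishes no limit for this driver; predicate is a no-op
		}
		if int64(inUse[driver]+newCount) > limit.Value() {
			return false
		}
	}
	return true
}
```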
Handling delayed binding is out of scope for this proposal and will be addressed as part of the work on delayed binding and topology-aware dynamic provisioning.