Add a design proposal for dynamic volume limits #2051

Closed
`contributors/design-proposals/storage/dynamic_volume_limit.md` (157 additions, 0 deletions)
# Dynamic attached volume limits

## Goals

Currently, the number of volumes attachable to a node is either hard-coded or only configurable via an environment variable. Also,

**Review comment:**

Can you list those methods here for reference? I didn't know the limit existed somewhere.

existing limits only apply to well-known volume types like EBS and GCE, and are not available to all volume plugins.

This proposal enables any volume plugin to specify such limits and also allows the same volume type to have different volume
limits depending on the type of node.

## Implementation Design

### Prerequisite

* 1.11 This feature will be protected by an alpha feature gate, as will the API and CLI changes needed for it. We are planning to call
the feature `AttachVolumeLimit`.
* 1.12 This feature will be behind a beta feature gate and enabled by default.

### API Changes

There is no API change needed for this feature. However, the existing `node.Status.Capacity` and `node.Status.Allocatable` fields will
be extended to also cover the volume limits available on the node.

The key name that stores the volume limit will start with the prefix `attachable-volumes-`. The volume limit key will respect the
format restrictions applied to Kubernetes resource names. Volume limit keys for existing plugins might look like:


* `attachable-volumes-aws-ebs`

**Member:**

Given that we also want to add a max capacity limit, does "attachable-volumes" still make sense, or do we want to clarify that it is a count?

**Member Author:**

I was thinking of using a separate resource name for max attachable capacity, something like attachable-capacity-xxx, and keeping attachable-volumes-xx for representing the count. I think the capacity vs. volumes wording is a "good enough" indication.

**Member:**

I really dislike this sort of ad hoc syntax embedded in a string, and aside from that, I don't buy that this should be magical like this. Why can't we do something less stringy?

Just thinking out loud...

e.g. some volumes have per-node attachment limits. For those volumes, you install an admission controller that examines the volumes in play and decorates the pod as requiring certain resources, and then the scheduler simply schedules?

I feel like we have explored this, but if so, I don't recall why it was rejected.

**Member Author:**

There are two parts to it, right?

  1. The plugin must be able to specify a maximum attachment limit on a per-node basis. We chose to use the node's allocatable/capacity field for that. And when we want to express this limit in the node's allocatable, regardless of how we do the counting (via an admission controller or something else), the limit must have a key and a value. The key MUST be a string that adheres to resourceName naming conventions (allocatable is of type map[v1.ResourceName]resource.Quantity). So isn't the attachable-volumes-aws-ebs prefix chosen for expressing attach limits in the node's allocatable/capacity orthogonal to how we do counting?

  2. We could still use an admission controller to do the actual counting, but I think it was agreed that the scheduler is a better (or maybe easier) place for counting the actual volumes in use. Because:
    a. It can consider all pods on the node, not just the one being scheduled.
    b. I think the way volume limits interact with late binding and even distribution of pods with volume workloads (Balanced resource allocation priority to include volume count on nodes, kubernetes#60525) makes the scheduler a better place to do it.

And obviously an admission plugin may be disabled or not enabled (but the same goes for a scheduler predicate, I guess). I am not particularly opposed to using an admission controller for counting, but I think we did not see the benefits we thought we would see.

**Member:**

I am going to reformat for comments:

We chose to use node's allocatable/capacity field for that [to specify a maximum attachment limit] ... The key MUST be a string that adheres to resourceName naming conventions ... So isn't the attachable-volumes-aws-ebs prefix chosen for expressing attach limits in node's allocatable/capacity orthogonal to how we do counting

Yes it is orthogonal. Sorry for putting them in one message.

My first grump here is that we have invented non-obvious syntax without any specification for it. I detest string-embedded syntaxes (stringly-typed APIs) and even if we DO go that way, we can't tolerate under-specified ones. Where do these magical names come from? How do they not conflict? Jordan had some comments on that, too. At the very very least this needs a strong spec, and perhaps we actually need something different.

Regarding the scheduler, what I really don't want to do is build up a body of special cases that we dump on the scheduler, when the problem is actually a pretty general problem. How do we opaquely allocate and schedule resources WITHOUT changing the scheduler? We may need to make other changes to the system to accommodate that, but I think it should be a preferred choice. @bsalamat am I over-blowing this?

Late-binding is from a PVC to a PV via a class. Presumably a single class can't use multiple drivers, so it's still feasible to reify the PVC -> class -> driver somewhere? The fact that classes could, in theory, change is a problem that maybe we should think harder about, and maybe we need a different kind of plugin (e.g. maybe we need explicit scheduler pre-processor plugins). "Let's just hack the scheduler" seems like a very slippery slope to me.

And, BTW, I want this capability (and have for years), so please don't think I am poo-pooing the idea :)

**Member:**

I second what @thockin said about dumping a lot of special-case logic in the scheduler. That said, not everything can be checked at admission time. For example, dynamic volume binding happens after the scheduler finds a node to run a pod. I guess there is no choice other than checking the number of volumes attached to the node at that point.

**Member Author:**

Where do these magical names come from? How do they not conflict? Jordan had some comments on that, too. At the very very least this needs a strong spec, and perhaps we actually need something different.

In the current iteration, any resource name (in the node's allocatable) that uses the attachable-volumes- prefix refers to attach limits. We have added some tests to ensure that containers themselves aren't able to directly request this resource. It is, however, possible that someone can set a new attachable-volumes-xxx value that does not apply to volume limits (via patch).

I am not entirely sure about some external component inadvertently overriding any of the existing limits, though. When kubelet starts, it sets those limits and periodically syncs the values from the same source, so if an external component does override those values, the override will be wiped out.

I still see the need to spec them better and perhaps there is a better way. I need to think about it and will bring this up in sig-storage.

**@msau42 (Member), Sep 26, 2018:**

One of the earlier design proposals was to leverage the existing opaque resource counting feature that already exists in the scheduler. However there were some hurdles:

  1. Opaque resources are specified at the container level. We would need to extend the pod API to add pod-level resources.
  2. Users don't specify volume counts via resources. I think this is where the admission controller idea came in. Have an admission controller that parses pod.volumes and converts that into a pod-level volume count resource. I remember discussing this option in a sig-node meeting, and there were concerns about users being confused by a field they didn't set, and also about users potentially setting the fields themselves.

That being said, it may be worth revisiting this. It turns out we also want to support limiting volume attached capacity per node. If we follow the current design, that will require adding more special names/logic for this new scenario.

**Member:**

Previous discussion here

**Member Author:**

Just had a discussion with @jsafrane as well about this and we discussed two problematic things about moving the count outside the scheduler:

  1. For local volumes with delayed binding, storageClass could be dummy or unknown and hence counting unbound PVCs is going to be tricky outside the scheduler.
  2. Currently the scheduler counts unique volumes. If we specify volumes as a pod level countable resource and scheduler counts volumes the way it counts memory or other resources, it can't accurately count unique volumes.

**Member:**

It is, however, possible that someone can set a new attachable-volumes-xxx value that does not apply to volume limits (via patch).

What is the specification for xxx here? Is it the CSI driver name? Some other source? Does it get some auto-mutation? Are characters changed?

* `attachable-volumes-gce-pd`
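
As a hedged sketch (not code from the proposal) of how kubelet might publish such a limit in node status; the quantity of 39 is only an illustrative default for EBS:

```go
// Illustrative only: publish an attach limit for the AWS EBS plugin under the
// attachable-volumes- prefix in the node's capacity and allocatable maps.
limitKey := v1.ResourceName("attachable-volumes-aws-ebs")
node.Status.Capacity[limitKey] = *resource.NewQuantity(39, resource.DecimalSI)
node.Status.Allocatable[limitKey] = *resource.NewQuantity(39, resource.DecimalSI)
```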

`IsScalarResourceName` check will be extended to cover storage limits:

**Member:**

Does this function ensure that users cannot request this resource normally in their Pod?

**Member Author:**

This check in itself will not prevent users from requesting this inside pods. But since the name does not adhere to StandardResourceName conventions, users cannot request it. I will update validations to be doubly sure that users cannot request this resource.

**Member Author:**

Added a note about how we are making sure that this resource name cannot be used in pods.

**Contributor:**

How about NumberofAttachedVolumesLimit to make it more obvious?

**Member Author:**

The IsScalarResourceName function is already used for checking whether a resource can be added to the node's Capacity and Allocatable. We just have to extend the check to permit the new volume limit values. We can't define a new function without making a bigger change in the scheduler.

**Member Author:**

storage-limits- is only the prefix. The full name will depend on the plugin, so something like storage-limits-aws-ebs, etc.

**Review comment:**

My point was: what kind of storage limit is storage-limits-aws-ebs? Is it the size of volumes that can be created, the number of volumes attached to a node, or something else?

**@liggitt (Member), May 24, 2018:**

Does this function ensure that users cannot request this resource normally in their Pod?

I had the same question. Are we already using capacity/allocatable resources reported by nodes to represent resources not explicitly requested/consumed by pods/containers?

since the name does not adhere to StandardResourceName conventions, users cannot request it. I will update validations to be doubly sure that users cannot request this resource.

if I'm reading this correctly, this PR is adding a new use of allocatable resources reported by nodes:

|  | well-known | extension |
| --- | --- | --- |
| requested by pods | cpu, memory, hugepages-... | kubernetes.io/foo, acme.com/foobar |
| implicitly requested via volume use? | storage-limits-... | ? |

this is using allocatable resources to represent one aspect of a volume, which we don't want pod specs to request directly.

does it seem likely we would want to use allocatable resources to represent other aspects of volumes in the same way? (thinking of finite resources CSI impls would report on the node side, indicate a consumption of in a storageclass or CSI PV, and expect to affect PV/Pod scheduling).

flipping this around, for each resource the node reports, it seems important to be able to determine:

  • what object(s) are allowed to request the resource (for API validation purposes)?
    • as far as I know, only pods can request resources today, and all resources the node reports are requestable by pods
    • this PR adds a resource nothing can request explicitly
  • what causes consumption of the resource (for scheduling purposes)?
    • currently, scheduling a pod to a node consumes the resources requested by the pod
    • this PR adds a resource that is implicitly consumed by scheduling a pod which uses a particular volume type

is the ultimate goal to represent both the availability and the consumption of volume-related resources explicitly in the API? if so, where would the volume resource requests live? in the pod spec at the pod level? in the pod spec in the volume definitions? on a PV object?

**Member Author:**

Discussed this offline a bit. It is true that, apart from this resource, there is no other resource that is available on the node but not directly requestable by a pod, and it would be nice to segment resources that are available on the node but not directly requestable by pods.

However, I think we can revisit that decision once we know more. cc @derekwaynecarr

**Member:**

The max-pods-per-node limit is a node capacity resource that cannot be explicitly requested by pods.


```go
func IsStorageAttachLimit(name v1.ResourceName) bool {
	return strings.HasPrefix(string(name), v1.ResourceStoragePrefix)
}

// Extended and Hugepages resources
func IsScalarResourceName(name v1.ResourceName) bool {
	return IsExtendedResourceName(name) || IsHugePageResourceName(name) ||
		IsPrefixedNativeResource(name) || IsStorageAttachLimit(name)
}
```

The prefix `storage-attach-limits-*` cannot be used as a resource in pods, because it does not adhere to the specs defined in the following function:

**Member:**

Is this supposed to be the same prefix as above?



```go
func IsStandardContainerResourceName(str string) bool {
	return standardContainerResources.Has(str) || IsHugePageResourceName(core.ResourceName(str))
}
```

Additional validation tests will be added to make sure we don't accidentally break this.
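
For example, a minimal sketch of the kind of check such a test could make, using only the helper shown above (the test name and structure are illustrative, not from the proposal):

```go
// Illustrative test sketch: a resource name carrying the attach-limit prefix
// must not be accepted as a standard container resource, so pods cannot
// request it directly.
func TestAttachLimitResourceNotRequestable(t *testing.T) {
	name := "attachable-volumes-aws-ebs"
	if IsStandardContainerResourceName(name) {
		t.Errorf("%q should not be requestable as a container resource", name)
	}
}
```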

#### Alternative to using "storage-" prefix
We also considered using the currently defined `GetPluginName` interface (of volume plugins) as the key in `node.Status.Capacity`. Ultimately
we decided against it, because most in-tree plugins start with `kubernetes.io/` and we needed a uniform way to identify storage-related
capacity limits in `node.Status`.

### Changes to scheduler

The scheduler will retrieve the available attach limits on a node from `node.Status.Allocatable` and store them in the `nodeInfo` cache. Volume
limits will be treated like any other scalar resource.

For the `AWS-EBS`, `AzureDisk` and `GCE-PD` volume types, the existing `MaxPD*` predicates will be updated to use the volume attach limits available
from the node's allocatable property. To be backward compatible, the scheduler will fall back to the older logic if no limit is set in `node.Status.Allocatable` for the AWS, GCE and Azure volume types.
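
As a rough illustration of the counting such a predicate performs (a sketch under assumptions, not the scheduler's actual code; unique-volume counting is simplified here):

```go
// Illustrative sketch: decide whether scheduling a pod would exceed the attach
// limit for one volume type. volumesInUse holds unique volume names already
// attached via pods on the node; podVolumes are the candidate pod's volumes of
// the same type.
func exceedsAttachLimit(limit int64, volumesInUse map[string]bool, podVolumes []string) bool {
	unique := make(map[string]bool, len(volumesInUse))
	for name := range volumesInUse {
		unique[name] = true
	}
	for _, name := range podVolumes {
		unique[name] = true // the same volume used by multiple pods counts once
	}
	return int64(len(unique)) > limit
}
```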

### Setting of limit for existing in-tree volume plugins

The volume limit for existing volume plugins will be set by querying the volume plugin. The following interface
will be added for volume plugins:

```go
type VolumePluginWithAttachLimits interface {
```

**Member:**

Does the scheduler need to call this volume plugin interface? Or will the existing predicate just hardcode the names?

**Member Author:**

When counting "in-use" volumes on a node by iterating through pods, the scheduler will have to check whether any of the volumes used by the pod under consideration implements VolumePluginWithAttachLimits and count that volume.

**Member:**

We should explicitly state that here because the scheduler today does not call volume plugin interfaces.

**Member Author:**

fixed

```go
	// Return key name that is used for storing volume limits inside node Capacity
	// must start with storage- prefix
```

**Member:**

Should the plugin prefix the "storage-" part, or should the system prefix it?

**@gnufied (Member Author), May 23, 2018:**

It can go either way. For something like CSI the prefix will be added by the system. For in-tree plugins currently the entire volumeLimitKey is supplied by the plugin.

**Member:**

For something like CSI the prefix will be added by the system

that seems problematic for a couple reasons... first, we want to transition in-tree volume plugins to CSI plugins. that implies the resource name consumed by an attached volume should be able to be coherent between use of that volume type via in-tree volume sources and CSI volume sources (e.g. if I have 10 AWSElasticBlockStore volumes and 10 CSI volumes backed by AWS EBS volumes, they should both map to the same storage-volume-limit-aws-ebs resource)

second, you'll run into resource name length limits if you simply prefix storage-csi- to the CSI driver name (CSI drivers can be 63 chars long, just like resource names)

**Member Author:**

The PV specs of CSI EBS and in-tree EBS volumes are not interchangeable for various other reasons. In fact, we just went through a long discussion at the SIG f2f about how to handle this transition. I would not necessarily expect the volume limit key to be the same for the in-tree EBS plugin and the CSI plugin.

I was thinking that to work around the 63-char limit, we can take the first 13 chars from the sha1 hash of the plugin name.

```go
	VolumeLimitKey(spec *Spec) string
	// Return volume limits for plugin
	GetVolumeLimits() (map[string]int64, error)
}
```

When queried, the plugin will use the `ProviderName` function of the CloudProvider to check
whether the plugin is usable on the node. For example, querying `GetVolumeLimits` on the `aws-ebs` plugin with the `gce` cloud provider
will result in an error.

Kubelet will query the volume plugins inside the `kubelet.initialNode` function and populate `node.Status` with the returned values.

**Member:**

I am assuming that node.Status will be populated at the time of node registration, before the node is ready and before pods can be scheduled on it. I am assuming the limit won't change at runtime, right?

**Member Author:**

Yes, the limit will not change at runtime and should be invariant throughout the lifetime of the node.


For GCE and AWS, `GetVolumeLimits` will return limits depending on the node type. The plugin already has the node name accessible
via the `VolumeHost` interface, and hence it will check the node type and return the volume limits.
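
A hedged sketch of what such a plugin implementation could look like; the receiver type, the `GetInstanceType` helper, and the specific numbers are assumptions for illustration only:

```go
// Illustrative sketch: return a node-type-dependent attach limit for EBS.
func (p *ebsPlugin) GetVolumeLimits() (map[string]int64, error) {
	limits := map[string]int64{
		"attachable-volumes-aws-ebs": 39, // assumed default
	}
	instanceType, err := p.host.GetInstanceType() // hypothetical accessor
	if err != nil {
		return limits, nil // fall back to the default if the node type is unknown
	}
	if strings.HasPrefix(instanceType, "m5.") || strings.HasPrefix(instanceType, "c5.") {
		limits["attachable-volumes-aws-ebs"] = 25 // assumed lower limit for these families
	}
	return limits, nil
}
```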

We do not aim to cover all in-tree volume types. We will support the dynamic volume limits proposed here for the following volume types:

* GCE-PD
* AWS-EBS
* AzureDisk

We expect to add incremental support for other volume types.

### Changes for Kubernetes 1.12

For Kubernetes 1.12, we are adding support for CSI and moving the feature to beta.

#### CSI support

A new function will be added to `pkg/volume/util/attach_limit.go` which will return the CSI attach limit
resource name.

The interface of the function will be:

```go
const (
	// CSI attach prefix
	CSIAttachLimitPrefix = "attachable-volumes-csi-"

	// Resource Name length limit
	ResourceNameLengthLimit = 63
)

// GetCSIAttachLimitKey returns the resource name under which the attach limit
// for the given CSI driver is published. Long driver names are truncated and
// suffixed with a hash so the key stays within the resource name length limit.
// (Requires the crypto/sha1 and encoding/hex packages.)
func GetCSIAttachLimitKey(driverName string) string {
	csiPrefixLength := len(CSIAttachLimitPrefix)
	totalKeyLength := csiPrefixLength + len(driverName)
	if totalKeyLength >= ResourceNameLengthLimit {
		charsFromDriverName := driverName[:23]
		// compute SHA1 of driverName and use the first 16 hex chars
		hash := sha1.New()
		hash.Write([]byte(driverName))
		hashed := hex.EncodeToString(hash.Sum(nil))[:16]
		return CSIAttachLimitPrefix + charsFromDriverName + hashed
	}
	return CSIAttachLimitPrefix + driverName
}
```
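
For illustration (the example driver names are made up, not taken from the proposal):

```go
// Hypothetical usage of GetCSIAttachLimitKey:
//
//   GetCSIAttachLimitKey("ebs.csi.aws.com")
//   // -> "attachable-volumes-csi-ebs.csi.aws.com" (fits within the 63-char limit)
//
//   GetCSIAttachLimitKey("a-very-long-csi-driver-name.example.storage.vendor.io")
//   // -> "attachable-volumes-csi-" + first 23 chars of the driver name
//   //    + first 16 hex chars of its SHA-1 hash
```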

This function will be used both on the node and in the scheduler for determining the CSI attach limit key. The value of the
limit will be retrieved using the `GetNodeInfo` CSI RPC call and set if non-zero.

**Member:**

Are we going to handle the case of override and changing the limit on the node after the initial configuration?

**Member Author:**

No, not in this release. We will tackle this problem in 1.13.


**Other options**

Alternatively, we also considered storing the attach limit resource name in the `CSIDriver` object introduced as part
of the https://github.com/kubernetes/community/pull/2514 proposal.

This will work but depends on acceptance of that proposal. We can always migrate attach limit resource names to

**Member:**

CSI doesn't have a way to report some alternative key though, right?

**Member Author:**

I think there is a proposal to allow CSI drivers to register themselves and not use the auto-registration mechanism. In that case, it will be possible for a plugin to report an alternative key as part of the registration process.

values defined in the `CSIDriver` object in a later release. If a `CSIDriver` object is available and has an attach limit key,
then kubelet could use that key; otherwise it will fall back to `GetCSIAttachLimitKey`.

The scheduler can also check for the presence of the `CSIDriver` object and the corresponding key in the node object; otherwise it will
fall back to using the `GetCSIAttachLimitKey` function.

##### Changes to scheduler

To support attach limits for CSI volumes, a new predicate called `CSIMaxVolumeLimitChecker` will be added. It will use the `GetCSIAttachLimitKey`
function defined above for extracting the attach limit resource name.

The predicate will be a no-op if the feature gate is not enabled or when attach limits are not available from the node object.
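
A minimal sketch of that guard behavior, assuming the `AttachVolumeLimit` feature gate name from this proposal; the `exceedsCSILimits` helper and the exact function signature are illustrative:

```go
// Illustrative sketch: the CSI volume-limit predicate passes trivially when the
// feature gate is off or the node publishes no CSI attach limits.
func csiMaxVolumeLimitPredicate(pod *v1.Pod, node *v1.Node) (bool, error) {
	if !utilfeature.DefaultFeatureGate.Enabled(features.AttachVolumeLimit) {
		return true, nil // feature disabled: predicate is a no-op
	}
	hasCSILimit := false
	for name := range node.Status.Allocatable {
		if strings.HasPrefix(string(name), "attachable-volumes-csi-") {
			hasCSILimit = true
			break
		}
	}
	if !hasCSILimit {
		return true, nil // no CSI attach limits reported for this node
	}
	return !exceedsCSILimits(pod, node), nil // hypothetical counting helper
}
```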

Handling delayed binding is out of scope for this proposal and will be addressed as part of the delayed binding and topology-aware dynamic
provisioning work.