# Dynamic attached volume limits

## Goals
Currently the number of volumes that can be attached to a node is either hard-coded or configurable only via an environment variable. In addition, the existing limits apply only to a few well-known volume types, such as AWS EBS and GCE PD, and are not available to all volume plugins.

This proposal enables any volume plugin to specify such limits and allows the same volume type to have different limits depending on the type of node.
## Implementation Design

### Prerequisite

* 1.11: This feature, along with the API and CLI changes it requires, will be protected by an alpha feature gate. We are planning to call the feature `AttachVolumeLimit`.
* 1.12: This feature will be behind a beta feature gate and enabled by default.
### API Changes

There is no new API needed for this feature. However, the existing `node.Status.Capacity` and `node.Status.Allocatable` fields will be extended to also report the volume limits available on the node.

The key name that stores the volume limit will start with the prefix `attachable-volumes-`. The volume limit key will respect the format restrictions applied to Kubernetes resource names. Volume limit keys for existing plugins might look like (a brief example of reading these values follows the list):
* `attachable-volumes-aws-ebs`
* `attachable-volumes-gce-pd`
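For illustration only, here is a minimal sketch of how these keys surface in the node API, reading every `attachable-volumes-*` entry from a node's allocatable resources; the helper name `printVolumeLimits` is hypothetical and not part of this proposal.

```go
package example

import (
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
)

// printVolumeLimits prints every attachable volume limit a node reports under
// node.Status.Allocatable, e.g. "attachable-volumes-aws-ebs: 39".
func printVolumeLimits(node *v1.Node) {
	for name, quantity := range node.Status.Allocatable {
		if strings.HasPrefix(string(name), "attachable-volumes-") {
			fmt.Printf("%s: %d\n", name, quantity.Value())
		}
	}
}
```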
The `IsScalarResourceName` check will be extended to cover volume attach limits:
```go
// IsStorageAttachLimit returns true if the resource name carries an attachable
// volume limit, i.e. it starts with the prefix described above
// (v1.ResourceStoragePrefix holds the "attachable-volumes-" prefix).
func IsStorageAttachLimit(name v1.ResourceName) bool {
	return strings.HasPrefix(string(name), v1.ResourceStoragePrefix)
}

// IsScalarResourceName covers Extended, Hugepages, prefixed native and
// attachable volume limit resources.
func IsScalarResourceName(name v1.ResourceName) bool {
	return IsExtendedResourceName(name) || IsHugePageResourceName(name) ||
		IsPrefixedNativeResource(name) || IsStorageAttachLimit(name)
}
```
The `attachable-volumes-*` prefix cannot be used as a resource name in pod specs, because it does not adhere to the naming rules enforced by the following function:
```go
func IsStandardContainerResourceName(str string) bool {
	return standardContainerResources.Has(str) || IsHugePageResourceName(core.ResourceName(str))
}
```

Additional validation tests will be added to make sure we do not accidentally break this.
#### Alternative to using the `attachable-volumes-` prefix

We also considered using the existing `GetPluginName` interface of volume plugins as the key in `node.Status.Capacity`. Ultimately we decided against it, because most in-tree plugin names start with `kubernetes.io/` and we needed a uniform way to identify storage-related capacity limits in `node.Status`.
### Changes to scheduler

The scheduler will retrieve the available attach limits of a node from `node.Status.Allocatable` and store them in the `nodeInfo` cache. Volume limits will be treated like any other scalar resource.

For the `AWS-EBS`, `GCE-PD` and `AzureDisk` volume types, the existing `MaxPD*` predicates will be updated to use the volume attach limits available from the node's allocatable property. To remain backward compatible, the scheduler will fall back to the older logic if no limit is set in `node.Status.Allocatable` for the AWS, GCE and Azure volume types.
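The sketch below illustrates the intended fallback behaviour for one volume type; the function name `fitsEBSVolumeLimit` and the legacy default of 39 volumes are illustrative assumptions, not the actual predicate code.

```go
package example

import (
	v1 "k8s.io/api/core/v1"
)

// fitsEBSVolumeLimit checks whether attaching newVolumes additional EBS volumes
// to a node that already has attachedVolumes would exceed the node's limit.
func fitsEBSVolumeLimit(node *v1.Node, attachedVolumes, newVolumes int) bool {
	limit := int64(39) // legacy hard-coded default, used when the node reports no limit
	if q, ok := node.Status.Allocatable[v1.ResourceName("attachable-volumes-aws-ebs")]; ok {
		limit = q.Value() // dynamic limit published by the kubelet
	}
	return int64(attachedVolumes+newVolumes) <= limit
}
```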
### Setting of limit for existing in-tree volume plugins

The volume limit for existing volume plugins will be set by querying the volume plugin. The following interface will be added for volume plugins. Note that the scheduler will also call this interface when counting in-use volumes on a node, to determine which limit key each volume counts against; today the scheduler does not call volume plugin interfaces, so this is a new dependency:
```go
type VolumePluginWithAttachLimits interface {
	// VolumeLimitKey returns the key name used for storing volume limits
	// inside node capacity; it must start with the attachable-volumes- prefix.
	VolumeLimitKey(spec *Spec) string
	// GetVolumeLimits returns the volume limits of the plugin for the current node.
	GetVolumeLimits() (map[string]int64, error)
}
```
For in-tree plugins, the plugin supplies the entire limit key; for CSI drivers, the `attachable-volumes-csi-` prefix is added by the system (see the CSI section below).

When queried, the plugin will use the `ProviderName` function of the cloud provider to check whether the plugin is usable on the node. For example, calling `GetVolumeLimits` on the `aws-ebs` plugin while the node is running with the `gce` cloud provider will return an error.
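A minimal sketch of that guard, assuming the in-tree EBS plugin; the receiver type, field names and the limit value shown are illustrative, not the actual implementation.

```go
package example

import (
	"fmt"

	"k8s.io/kubernetes/pkg/volume"
)

// awsElasticBlockStorePlugin stands in for the real in-tree EBS plugin.
type awsElasticBlockStorePlugin struct {
	host volume.VolumeHost
}

// GetVolumeLimits refuses to report limits when the node is not running on AWS.
func (p *awsElasticBlockStorePlugin) GetVolumeLimits() (map[string]int64, error) {
	cloud := p.host.GetCloudProvider()
	if cloud == nil || cloud.ProviderName() != "aws" {
		return nil, fmt.Errorf("expected aws cloud provider, plugin is not usable on this node")
	}
	return map[string]int64{
		"attachable-volumes-aws-ebs": 39, // default; adjusted per node type, see below
	}, nil
}
```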
The kubelet will query the volume plugins inside the `kubelet.initialNode` function and populate `node.Status` with the returned values. The limits are therefore set at node registration time, before pods can be scheduled on the node, and are expected to stay invariant throughout the lifetime of the node.
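A rough sketch of that population step, assuming the interface above; the helper name `setVolumeLimits` and the surrounding wiring are illustrative.

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/kubernetes/pkg/volume"
)

// setVolumeLimits copies plugin-reported attach limits into the node object
// that the kubelet registers with the API server.
func setVolumeLimits(node *v1.Node, plugins []volume.VolumePluginWithAttachLimits) {
	for _, plugin := range plugins {
		limits, err := plugin.GetVolumeLimits()
		if err != nil {
			// Plugin is not usable on this node (for example, wrong cloud provider).
			continue
		}
		for key, value := range limits {
			quantity := *resource.NewQuantity(value, resource.DecimalSI)
			node.Status.Capacity[v1.ResourceName(key)] = quantity
			node.Status.Allocatable[v1.ResourceName(key)] = quantity
		}
	}
}
```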
For GCE and AWS, `GetVolumeLimits` will return limits that depend on the node type. The plugin already has the node name accessible via the `VolumeHost` interface, so it can look up the node type and return the corresponding volume limits.
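For example, a node-type lookup for AWS could look roughly like the sketch below; the instance-type prefixes and the numbers 25 and 39 are illustrative assumptions about EBS attach limits, not authoritative values.

```go
package example

import "strings"

// ebsVolumeLimitForInstance returns the number of EBS volumes that can be
// attached to the given instance type.
func ebsVolumeLimitForInstance(instanceType string) int64 {
	// Newer Nitro-based instance families share attachment slots with other
	// devices and therefore support fewer volume attachments.
	for _, prefix := range []string{"m5", "c5", "r5", "t3"} {
		if strings.HasPrefix(instanceType, prefix) {
			return 25
		}
	}
	return 39
}
```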
We do not aim to cover all in-tree volume types. We will support the dynamic volume limits proposed here for the following volume types:

* GCE-PD
* AWS-EBS
* AzureDisk

We expect to add incremental support for other volume types.
### Changes for Kubernetes 1.12

For Kubernetes 1.12, we are adding support for CSI and moving the feature to beta.

#### CSI support

A new function will be added to `pkg/volume/util/attach_limit.go` which returns the CSI attach limit resource name.

The function's interface will be:
```go
import (
	"crypto/sha1"
	"encoding/hex"
)

const (
	// CSIAttachLimitPrefix is the prefix of CSI attach limit resource names.
	CSIAttachLimitPrefix = "attachable-volumes-csi-"

	// ResourceNameLengthLimit is the maximum length of a resource name.
	ResourceNameLengthLimit = 63
)

func GetCSIAttachLimitKey(driverName string) string {
	csiPrefixLength := len(CSIAttachLimitPrefix)
	totalkeyLength := csiPrefixLength + len(driverName)
	if totalkeyLength >= ResourceNameLengthLimit {
		// Take the first 23 characters of the driver name and append the
		// first 16 characters of the SHA-1 hash of the full driver name.
		charsFromDriverName := driverName[:23]
		hash := sha1.New()
		hash.Write([]byte(driverName))
		hashed := hex.EncodeToString(hash.Sum(nil))[:16]
		return CSIAttachLimitPrefix + charsFromDriverName + hashed
	}
	return CSIAttachLimitPrefix + driverName
}
```
This function will be used both on the node and in the scheduler to determine the CSI attach limit key. The value of the limit will be retrieved via the `GetNodeInfo` CSI RPC call and set on the node if it is non-zero. Handling overrides or changes to the limit after the initial node configuration is out of scope for this release and will be tackled in 1.13.
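As a minimal sketch of the kubelet-side conversion, assuming the function above lives in `pkg/volume/util`; the helper `csiAttachLimit` and the way the `GetNodeInfo` response value is passed in are illustrative.

```go
package example

import "k8s.io/kubernetes/pkg/volume/util"

// csiAttachLimit converts the max volumes-per-node value reported by a CSI
// driver's GetNodeInfo response into the limit entry published on the node.
func csiAttachLimit(driverName string, maxVolumesPerNode int64) map[string]int64 {
	limits := map[string]int64{}
	if maxVolumesPerNode > 0 { // zero means the driver does not report a limit
		limits[util.GetCSIAttachLimitKey(driverName)] = maxVolumesPerNode
	}
	return limits
}
```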
**Other options**

Alternatively, we considered storing the attach limit resource name in the `CSIDriver` object introduced as part of the https://github.com/kubernetes/community/pull/2514 proposal.

This would work, but it depends on acceptance of that proposal. We can always migrate attach limit resource names to values defined in the `CSIDriver` object in a later release. If a `CSIDriver` object is available and has an attach limit key, the kubelet could use that key; otherwise it will fall back to `GetCSIAttachLimitKey`.

The scheduler can likewise check for the presence of a `CSIDriver` object and the corresponding key in the node object, otherwise falling back to the `GetCSIAttachLimitKey` function.
##### Changes to scheduler

To support attach limits for CSI, a new predicate called `CSIMaxVolumeLimitChecker` will be added. It will use the `GetCSIAttachLimitKey` function defined above to derive the attach limit resource name.
The predicate will be a no-op if the feature gate is not enabled or if attach limits are not available from the node object.
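A simplified sketch of what the predicate's counting could look like; `countCSIVolumesFits`, the way volumes are grouped per driver, and the parameters are illustrative, not the actual `CSIMaxVolumeLimitChecker` implementation.

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/volume/util"
)

// countCSIVolumesFits returns true if adding the pod's CSI volumes (grouped by
// driver name) keeps every driver under the limit reported by the node.
func countCSIVolumesFits(node *v1.Node, inUse, requested map[string]int) bool {
	for driver, newCount := range requested {
		key := v1.ResourceName(util.GetCSIAttachLimitKey(driver))
		limit, ok := node.Status.Allocatable[key]
		if !ok {
			continue // node publishes no limit for this driver; predicate is a no-op
		}
		if int64(inUse[driver]+newCount) > limit.Value() {
			return false
		}
	}
	return true
}
```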
Handling delayed binding is out of scope for this proposal and will be addressed as part of the work on delayed binding and topology-aware dynamic provisioning.