Add a design proposal for dynamic volume limits #2051
Conversation
cc @childsb
### Changes to scheduler

When a node is being considered for scheduling of a pod with volumes, the scheduler will count the volumes of currently running pods against the value of node.Status.AttachedVolumeLimit.
For in-tree volumes, how do you map the volume type to the node.status.attachedvolumelimit key? Is it just a hardcoded mapping?
Yeah, all volume types eventually resolve to something like https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/aws_ebs/aws_ebs.go#L54, and hence for in-tree volume plugins this will be a very small change in the scheduler. The key in attachedVolumeLimit will be the same as what is returned by GetPluginName for in-tree plugins.
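To make that hardcoded mapping concrete, here is a minimal sketch of resolving a pod volume to the same key that the in-tree plugin's GetPluginName returns. The helper name `inTreeLimitKey` is hypothetical, not code from the proposal; the plugin names shown are the ones the in-tree AWS EBS and GCE PD plugins report.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// inTreeLimitKey maps a pod volume to the limit key it would be counted
// against. The keys mirror what GetPluginName returns for these plugins.
func inTreeLimitKey(vol v1.Volume) (string, bool) {
	switch {
	case vol.AWSElasticBlockStore != nil:
		return "kubernetes.io/aws-ebs", true
	case vol.GCEPersistentDisk != nil:
		return "kubernetes.io/gce-pd", true
	}
	return "", false
}

func main() {
	vol := v1.Volume{
		Name: "data",
		VolumeSource: v1.VolumeSource{
			AWSElasticBlockStore: &v1.AWSElasticBlockStoreVolumeSource{VolumeID: "vol-123"},
		},
	}
	key, ok := inTreeLimitKey(vol)
	fmt.Println(key, ok) // kubernetes.io/aws-ebs true
}
```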
Right now the scheduler does not plumb in the volumeplugin interface
If a given cloudprovider implements this interface, kubelet will query the interface and set (via merge) the returned values in ...
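For illustration, a minimal sketch of the kind of optional interface the quoted text describes, plus the merge step. The names here (`VolumeLimiter`, `GetVolumeLimits`, `fakeProvider`) are assumptions for the sketch, not the actual interface from the design document.

```go
package main

import "fmt"

// VolumeLimiter is a hypothetical optional interface a cloud provider could
// implement to report per-volume-type attach limits for a node.
type VolumeLimiter interface {
	GetVolumeLimits(nodeName string) (map[string]int64, error)
}

// fakeProvider stands in for a cloud provider that knows its instance limits.
type fakeProvider struct{}

func (fakeProvider) GetVolumeLimits(nodeName string) (map[string]int64, error) {
	return map[string]int64{"kubernetes.io/aws-ebs": 39}, nil
}

// mergeVolumeLimits merges provider-reported limits into the node's existing
// limit map without dropping keys set elsewhere.
func mergeVolumeLimits(existing, reported map[string]int64) map[string]int64 {
	out := make(map[string]int64, len(existing)+len(reported))
	for k, v := range existing {
		out[k] = v
	}
	for k, v := range reported {
		out[k] = v
	}
	return out
}

func main() {
	var provider VolumeLimiter = fakeProvider{}
	reported, err := provider.GetVolumeLimits("node-1")
	if err != nil {
		panic(err)
	}
	limits := mergeVolumeLimits(map[string]int64{"kubernetes.io/gce-pd": 16}, reported)
	fmt.Println(limits)
}
```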
I believe there's a push towards moving cloud provider dependencies out of kubelet.
Could this be something the external cloud controller calls and updates for each node?
/cc @yujuhong
### API Changes

We propose the following changes to the node's status field:
Shouldn't this go under node.capacity with the other resource constraints?
+1
The drawback is that you have to request a resource type. But, according to @thockin, the device plugin people are pushing a model where an admission controller detects what type of volume you are requesting and decorates the pod with the correct type of resource request. Agreed that we should try to repurpose this design.
We had a discussion today in the sig-node call and there were valid concerns raised against using node.Capacity for scheduling. The main problem is that there is no corresponding pod-level field: memory, CPU and GPU devices are all specified at the container level, but volumes are counted at the pod level.
At this point we either heard some support for using node.Status.AttachedVolumeLimits or no strong opinion about it. cc @derekwaynecarr
If we choose to use node and pod capacity, then we will have to create a new field in the pod spec, and there was concern that storage might be the only subsystem using that field. We expect more comments as follow-up.
We discussed this in a follow-on:

- This feels closest to the "pods" per-node value.
- node.status.capacity reports "pods" as a resource.
- The container is not able to request a "pods" resource, as it's not a standard compute resource name.
- The scheduler special-cases "pods" as a special resource type and counts each pod against that value.
Additional clarification:

- "pods" never appears in a container resource requirement.
- If it did, it would be a problem.
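As a concrete illustration of the precedent being cited, here is a minimal sketch of how the special "pods" resource is advertised and counted. `podsFit` is a hypothetical helper, while `v1.ResourcePods` and the allocatable `ResourceList` are the real API types; the proposed `attachable-volumes-*` keys would follow the same pattern.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podsFit mimics the scheduler's special-casing of "pods": the node advertises
// the limit in status.allocatable, containers never request it, and the
// scheduler simply counts pods already bound to the node against it.
func podsFit(node v1.Node, scheduledPods int) bool {
	max := node.Status.Allocatable[v1.ResourcePods]
	return int64(scheduledPods) < max.Value()
}

func main() {
	node := v1.Node{}
	node.Status.Allocatable = v1.ResourceList{
		v1.ResourcePods: resource.MustParse("110"),
	}
	fmt.Println(podsFit(node, 42)) // true: room for more pods
}
```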
For CSI the name used in the key will be `kubernetes.io/csi/<driver_name>`
### Changes to CSI
Remove CSI call outs for now. I would like to move/remove 'defaults' from the volume plugins / CSI layer.
Can you clarify why? How else will CSI volumes be configured to have these limits, which are essential for the scheduler?
The design should move the 'default values' from the volume plugin to an admin modifiable location (such as CLI args). The admin should set them per node as part of configuration.
We should not conflate CSI spec with this design. Once we have the design correct for kubernetes, then we look at changes to CSI.
We had a sig-storage call today and the following items were discussed:
I would support modifying the CloudProvider interface to add support for this.
Can you clarify what this means? Does it mean modifying the in-tree plugins without modifying the cloud provider to figure out what the limit should be? (The plugin could maintain an internal map of limits, but the challenge I see with that is figuring out what type of machine it is running on without relying on cloud provider code.)
I believe the fallback plan is to have override flags in kubelet, and it would be up to the admin or the deployment manager to set those limits.
Adding to the interface we're trying to remove doesn't seem like a great approach. Making new features work from the outside in will drive actually figuring out how to configure the kubelet in various cloud envs, instead of just leaning on the existing in-tree interface.
To add to @liggitt and summarize a design discussion we had last week: the cloud provider could just as easily set the attach limits dynamically by updating the Node object or via some other mechanism. I feel the best first-step design for this feature is creating knobs to set attach limits by volume type, but not trying to populate or determine the limit 'dynamically'. The 'environment' (cloud provider or otherwise) can set the values as it determines, but the storage layer should not create new API or tighter coupling with the cloud provider. The onus would be on each cloud provider (or, in its absence, some startup script) to set appropriate values based on the environment.
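A minimal sketch of what "the environment sets the values" could look like in practice: an out-of-tree component builds a status patch carrying the limit. The resource key, the chosen limit, and the idea of applying this to the node's status subresource are illustrative assumptions, not part of the proposal text.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// attachLimitPatch builds a merge patch that publishes an attach limit under
// the node's capacity and allocatable maps.
func attachLimitPatch(resourceKey string, limit int64) ([]byte, error) {
	patch := map[string]interface{}{
		"status": map[string]interface{}{
			"capacity": map[string]interface{}{
				resourceKey: fmt.Sprintf("%d", limit),
			},
			"allocatable": map[string]interface{}{
				resourceKey: fmt.Sprintf("%d", limit),
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := attachLimitPatch("attachable-volumes-aws-ebs", 25)
	if err != nil {
		panic(err)
	}
	// The resulting JSON could then be applied to the node's status subresource
	// by whatever component the environment provides (cloud controller,
	// startup script, etc.).
	fmt.Println(string(body))
}
```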
For CSI the name used in the key will be `kubernetes.io/csi/<driver_name>`
This is a format unlike any other in the system. I don't think this is correct use of resource names.
### Setting of limit for existing in-tree volume plugins
The volume limit for existing volume plugins will be set by querying the cloudprovider. The following interface ...
I don't think this belongs in cloud provider - it's a facet of the storage driver.
+1
How do we know which plugins are available on a given node? The plugin initialization code runs on all node types; for example, we initialize the EBS plugin on a GCE cluster. So populating limits by querying the volume plugin will result in populating limits for all available in-tree volume types. Do we want to populate node.Status.Capacity with a relatively large dict? I do not know if there is a way to determine whether a certain plugin can be initialized on a certain node type.
> How do we know which plugins are available on a given node?

Plugins register themselves with the node on startup.

> The plugin initialization code gets initialized on all node types. For example - we initialize EBS plugin on a GCE cluster. So populating limits by querying the volume plugin will result in populating limits for all available in-tree volume types.

The in-tree plugin can have an additional check that verifies that it is running in that cloud provider.
We had a discussion with @saad-ali, @thockin and @cheftako and concluded that:
I do not like 'discovering' this at the Volume/CSI layer... It's impossible for the volume layer / CSI driver to have an accurate MAX that applies to most users / environments. Kube can place conservative MAX default values, but then admins will want to tweak them for performance. Kube can use the published performant MAX as the default, but users may have other agreements for unpublished values or find the performant values overload the cluster. I don't see a "set it and forget it" value for many storage systems, really. As a further counter-point to adding this to the volume layer, I don't know of any storage API which supports this discovery; it's all discoverable via a documentation matrix. Is the conclusion of the discussion @msau42 that it would be better UX for the admin to 'tweak' the value through the volume layer / CSI config?
For the scenario where the admin still wants to override the plugin, I can see two ways to override:
I agree that there's an issue that these limits for cloud volume types are usually only encoded in documentation, so it would be up to the CSI driver maintainers to update the limits table whenever the number changes. I think that's a problem we can't avoid either way. At least with CSI, since it's out of tree, we can push new versions faster.
@msau42 thank you for summing up the discussion. @thockin @saad-ali I would still have preferred that the CLI provide an option to override the limits. For example, even on EBS the performance of the disk and the limit depend somewhat on network bandwidth. The default documented limit is 40, but if one attaches 40 volumes to an m1.medium instance type, the instance is likely not going to perform optimally. In that case, an admin may want to override the limit to a value that he/she knows best. So my vote is: have some defaults but allow the admin to override them. I do not buy the argument that it is clumsy or error prone. We have the feature-gates flag as prior art, which uses a similar format and requires enabling on all nodes.
@gnufied the admin can still override the limit by directly setting node.status. Do you see issues with that method vs. having to log in to the node, change and persist new kubelet arguments, and restart kubelet?
@msau42 Modifying ... Lastly, how does a configuration management tool like puppet/ansible keep this value in sync? Does it need to watch the node object? Does it need to poll the node object and make sure the value is always set? It arguably seems so much worse. :(
I believe kubelet only does a Patch on the Node object, so a restart would not cause the value to be overridden or lost. I agree that in the case of managed instances, if the cloud provider deletes and creates a new instance, then those values would be lost. So perhaps, for this admin override case, a place to persistently store the override is needed. Could something like dynamic kubelet config help here?
A node restart where the node was down for a sufficiently long time will cause the node object to be recreated. On AWS etc. the node object is removed from the api-server when it is shut down.
I had another quick discussion with @saad-ali. In the long term, we can provide overrides through the CSI plugin. In the short term, this is still an alpha feature, and the admin could disable it if they run into issues with the new limits. I think it's better to not define a new flag/config parameter for something that we're going to replace soon.
Automatic merge from submit-queue (batch tested with PRs 64613, 64596, 64573, 64154, 64639). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Implement dynamic volume limits

Implement dynamic volume limits depending on node type. xref kubernetes/community#2051

```release-note
Add Alpha support for dynamic volume limits based on node type
```

Kubernetes-commit: e5686a366815cbb82ef91503151ef6b2e531e6f3
This PR implements the function to return the attachable volume limit based on machineType for GCE PD. This is part of the design in kubernetes/community#2051.
csiPrefixLength := len(CSIAttachLimitPrefix)
totalkeyLength := csiPrefixLength + len(driverName)
if totalkeyLength >= ResourceNameLengthLimit {
    // compute SHA1 of driverName and get first 36 chars
I think that keeping some prefix with the original name would help users decipher what the key is, e.g. reallyreallylongcsidrivername -> reallyreallylongcs-19e793491b1c6.
fixed.
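A sketch of the approach discussed above: keep a readable prefix of a long CSI driver name and append a hash of the full name so the resource name stays within the length limit. The constants and truncation lengths here are illustrative, not the exact values used in the implementation.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

const (
	csiAttachLimitPrefix    = "attachable-volumes-csi-"
	resourceNameLengthLimit = 63
)

// csiAttachLimitKey returns the limit key for a CSI driver, hashing long
// driver names while keeping a readable prefix of the original name.
func csiAttachLimitKey(driverName string) string {
	key := csiAttachLimitPrefix + driverName
	if len(key) < resourceNameLengthLimit {
		return key
	}
	// Keep the beginning of the driver name for readability, then append a
	// short hash of the full name to keep the key unique.
	hash := fmt.Sprintf("%x", sha1.Sum([]byte(driverName)))
	keep := resourceNameLengthLimit - len(csiAttachLimitPrefix) - 12 - 1
	return fmt.Sprintf("%s%s-%s", csiAttachLimitPrefix, driverName[:keep], hash[:12])
}

func main() {
	fmt.Println(csiAttachLimitKey("com.example.reallyreallylongcsidrivername.that.keeps.going.on"))
}
```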
format restrictions applied to Kubernetes Resource names. Volume limit key for existing plugins might look like:

* `storage-attach-limits-aws-ebs`
* `storage-attach-limits-gce-pd`
* `attachable-volumes-aws-ebs`
Given that we also want to add a max capacity limit, does "attachable-volumes" still make sense, or do we want to clarify that it is a count?
I was thinking of using a separate resource name for max attachable capacity, something like `attachable-capacity-xxx`, and keeping `attachable-volumes-xx` for representing the count. I think the capacity vs. volumes wording is a "good enough" indication.
We can also use kubernetes#2514 when it becomes available.
Alternately, we also considered storing the attach limit resource name in `CSIDriver`, introduced as part of the https://github.com/kubernetes/community/pull/2514 proposal.

This will work but depends on acceptance of that proposal. We can always migrate attach limit resource names to ...
CSI doesn't have a way to report some alternative key though right?
I think there is a proposal to allow CSI drivers to register themselves and not use the auto-registration mechanism. In that case, it will be possible for a plugin to report an alternative key as part of the registration process.
This function will be used both on the node and in the scheduler for determining the CSI attach limit key. The value of the limit will be retrieved using the `GetNodeInfo` CSI RPC call and set if non-zero.
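For illustration, a minimal sketch of how a node component could retrieve that limit from a CSI driver over its socket. It assumes the driver implements the CSI NodeGetInfo RPC (the call the text refers to as `GetNodeInfo`) and reports the limit in `MaxVolumesPerNode`; the socket path is a placeholder, and a zero value is treated as "no limit reported".

```go
package main

import (
	"context"
	"fmt"
	"log"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

// getCSIAttachLimit asks the driver for its node info and returns the
// per-node volume limit it reports (0 means no limit reported).
func getCSIAttachLimit(ctx context.Context, conn *grpc.ClientConn) (int64, error) {
	client := csi.NewNodeClient(conn)
	resp, err := client.NodeGetInfo(ctx, &csi.NodeGetInfoRequest{})
	if err != nil {
		return 0, err
	}
	return resp.GetMaxVolumesPerNode(), nil
}

func main() {
	// Placeholder socket path; real deployments use the driver's registered socket.
	conn, err := grpc.Dial("unix:///var/lib/kubelet/plugins/example.csi.driver/csi.sock", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	limit, err := getCSIAttachLimit(context.Background(), conn)
	if err != nil {
		log.Fatal(err)
	}
	if limit > 0 {
		fmt.Println("attach limit:", limit)
	}
}
```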
Are we going to handle the case of override and changing the limit on the node after the initial configuration?
No, not in this release. We will tackle this problem in 1.13.
Automatic merge from submit-queue (batch tested with PRs 68161, 68023, 67909, 67955, 67731). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Fix csi attach limit

Add support for volume limits for CSI. xref: kubernetes/community#2051

```release-note
Add support for volume attach limits for CSI volumes
```
The prefix `storage-attach-limits-*` cannot be used as a resource in pods, because it does not adhere to the specs defined in the following function:
Is this supposed to be the same prefix as above?
format restrictions applied to Kubernetes Resource names. Volume limit key for existing plugins might look like:

* `attachable-volumes-aws-ebs`
I really dislike this sort of ad hoc syntax embedded in a string, and aside from that, I don't buy that this should be magical like this. Why can't we do something less stringy?

Just thinking out loud... e.g. some volumes have per-node attachment limits. For those volumes, you install an admission controller that examines the volumes in play and decorates the pod as requiring certain resources, and then the scheduler simply schedules?

I feel like we have explored this, but if so, I don't recall why it was rejected.
There are two parts to it, right?

1. The plugin must be able to specify a maximum attachment limit on a per-node basis. We chose to use the node's allocatable/capacity field for that. And when we want to express this limit in the node's allocatable, regardless of how we do the counting (via admission controller or something else), the limit must have a key and a value. The key MUST be a string that adheres to resourceName naming conventions (allocatable is of type `map[v1.ResourceName]resource.Quantity`). So isn't the `attachable-volumes-aws-ebs` prefix chosen for expressing attach limits in the node's allocatable/capacity orthogonal to how we do counting?

2. We could still use an admission controller to do the actual counting, but I think it was agreed that the scheduler is a better (or maybe easier) place for counting the actual volumes in use, because:
   a. It can consider all pods on the node, not just the one being scheduled.
   b. I think the way volume limits interact with late binding and even distribution of pods with volume workloads (Balanced resource allocation priority to include volume count on nodes, kubernetes#60525) makes the scheduler a better place to do it (a rough sketch of that counting appears right after this comment).

   And obviously an admission plugin may be disabled or not enabled (but the same goes for a scheduler predicate, I guess). I am not particularly opposed to using an admission controller for counting, but I think we did not see the benefits we thought we would see.
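A rough, hypothetical sketch of the unique-volume counting argued for above, restricted to EBS volumes for brevity. The function names are illustrative, not the real scheduler predicate; the key point is that two pods sharing a volume consume only one attachment.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// uniqueEBSVolumes returns the set of distinct EBS volume IDs used by the pods.
func uniqueEBSVolumes(pods []*v1.Pod) map[string]struct{} {
	seen := map[string]struct{}{}
	for _, pod := range pods {
		for _, vol := range pod.Spec.Volumes {
			if vol.AWSElasticBlockStore != nil {
				seen[vol.AWSElasticBlockStore.VolumeID] = struct{}{}
			}
		}
	}
	return seen
}

// fitsVolumeLimit checks whether adding newPod to the node keeps the number of
// unique EBS attachments within the node's advertised limit.
func fitsVolumeLimit(newPod *v1.Pod, nodePods []*v1.Pod, limit int) bool {
	all := append(append([]*v1.Pod{}, nodePods...), newPod)
	return len(uniqueEBSVolumes(all)) <= limit
}

func main() {
	fmt.Println(fitsVolumeLimit(&v1.Pod{}, nil, 39)) // true: no EBS volumes at all
}
```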
I am going to reformat for comments:

> We chose to use node's allocatable/capacity field for that [to specify a maximum attachment limit] ... The key MUST be a string that adheres to resourceName naming conventions ... So isn't the attachable-volumes-aws-ebs prefix chosen for expressing attach limits in node's allocatable/capacity orthogonal to how we do counting?

Yes, it is orthogonal. Sorry for putting them in one message.

My first grump here is that we have invented non-obvious syntax without any specification for it. I detest string-embedded syntaxes (stringly typed APIs), and even if we DO go that way, we can't tolerate under-specified ones. Where do these magical names come from? How do they not conflict? Jordan had some comments on that, too. At the very very least this needs a strong spec, and perhaps we actually need something different.

Regarding the scheduler, what I really don't want to do is build up a body of special cases that we dump on the scheduler, when the problem is actually a pretty general problem. How do we opaquely allocate and schedule resources WITHOUT changing the scheduler? We may need to make other changes to the system to accommodate that, but I think it should be a preferred choice. @bsalamat am I over-blowing this?

Late binding is from a PVC to a PV via a class. Presumably a single class can't use multiple drivers, so it's still feasible to reify the PVC -> class -> driver somewhere? The fact that classes could, in theory, change is a problem that maybe we should think harder about, and maybe we need a different kind of plugin (e.g. maybe we need explicit scheduler pre-processor plugins). "Let's just hack the scheduler" seems like a very slippery slope to me.

And, BTW, I want this capability (and have for years), so please don't think I am poo-pooing the idea :)
I second what @thockin said about dumping a lot of special-case logic in the scheduler. That said, not everything can be checked at admission time. For example, dynamic volume binding happens after the scheduler finds a node to run a pod. I guess there is no choice other than checking the number of volumes attached to the node at that point.
> Where do these magical names come from? How do they not conflict? Jordan had some comments on that, too. At the very very least this needs a strong spec, and perhaps we actually need something different.

In the current iteration, any resource name (in the node's allocatable) that uses the `attachable-volumes-` prefix refers to attach limits. We have added some tests to ensure that containers themselves aren't able to directly request this resource. It is however possible that someone can set a new `attachable-volumes-xxx` value that does not apply to volume limits (via patch).

I am not entirely sure about some external component inadvertently overriding any of the existing limits, though. When kubelet starts, it sets those limits and it periodically syncs the value from the same source, so if an external component does override those values, the override will be wiped out.

I still see the need to spec them better, and perhaps there is a better way. I need to think about it and will bring this up in sig-storage.
One of the earlier design proposals was to leverage the opaque resource counting feature that already exists in the scheduler. However, there were some hurdles:
- Opaque resources are specified at the container level. We would need to extend the pod API to add pod-level resources.
- Users don't specify volume counts via resources. I think this is where the admission controller idea came in: have an admission controller that parses pod.volumes and converts that into a pod-level volume count resource. I remember discussing this option in a sig-node meeting, and there were concerns about users being confused by this field they didn't set, and also about users potentially setting the fields manually themselves.

That being said, it may be worth revisiting this. It turns out we also want to support limiting volume attached capacity per node. If we follow the current design, that will require adding more special names/logic for this new scenario.
Previous discussion here
Just had a discussion with @jsafrane as well about this, and we discussed two problematic things about moving the count outside the scheduler:
- For local volumes with delayed binding, the storageClass could be dummy or unknown, and hence counting unbound PVCs is going to be tricky outside the scheduler.
- Currently the scheduler counts unique volumes. If we specify volumes as a pod-level countable resource and the scheduler counts volumes the way it counts memory or other resources, it can't accurately count unique volumes.
> It is however possible that someone can set a new attachable-volumes-xxx value that does not apply to volume limits (via patch).

What is the specification for xxx here? Is it the CSI driver name? Some other source? Does it get some auto-mutation? Are characters changed?

Imagine we had a generic resource-finalizer plugin at the beginning of the scheduler. Its job is to take a pod and return a resources block. That resources set is the sum of all of the pod's containers (or whatever the policy is) plus anything the plugin wants to do to mutate it. In this case, add a volume-specific counted resource. We don't need to change the pod API per se. We would need to change the scheduler, in a pretty big way, but this would accommodate future situations, too.

As for unique volumes, this only impacts volumes that are ROX and used by multiple pods. I know some plugins allow shared mounting of RWO volumes, but all of those drivers are *broken*. Given that, this seems like a pretty niche bug, and one I could probably live with.
Add a proposal for dynamic volume limits.
cc @liggitt @saad-ali @derekwaynecarr
xref - kubernetes/enhancements#554
/sig storage