Enable dynamic configuration of max number of PDs allowed on a node based on machine type #53461
Conversation
@@ -54,6 +55,8 @@ const (
	replicaZoneURITemplateSingleZone = "%s/zones/%s" // {gce.projectID}/zones/{disk.Zone}
)

var DiskNumberLimit = []int{16, 32, 64, 128}
Can we make this a map instead, with key = num cpus and value = pd limit?
I am not sure how many different CPU counts we would need to support.
Or at least define an enum to represent the index?
added enum
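For illustration, a minimal sketch of the index-constant ("enum") approach discussed above; the constant names, CPU ranges, and package name are assumptions, not taken from the PR:

package gcepd

// Index constants for DiskNumberLimit; which machine classes map to which
// index is an assumption for illustration only.
const (
	diskLimitSmall  = iota // e.g. shared-core / 1 vCPU machines
	diskLimitMedium        // e.g. 2-4 vCPUs
	diskLimitLarge         // e.g. 5-8 vCPUs
	diskLimitXLarge        // larger machine types
)

// DiskNumberLimit holds the max PD count per machine class; the values
// correspond to the index constants above.
var DiskNumberLimit = []int{16, 32, 64, 128}

// maxPDForClass returns the limit for a class index, falling back to the
// smallest limit for out-of-range indexes.
func maxPDForClass(class int) int {
	if class < 0 || class >= len(DiskNumberLimit) {
		return DiskNumberLimit[diskLimitSmall]
	}
	return DiskNumberLimit[class]
}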
if node != nil {
	instanceType = node.ObjectMeta.Labels[kubeletapis.LabelInstanceType]
}
maxVolumes := c.maxPDCount(instanceType)
Can we pass in the Node object instead? Then each cloud provider can use whatever label or annotation it has to determine the limit.
changed to pass a node
Do we still want to support the ability for the user to override the limit by setting a flag?
@msau42 I checked again; there is no such flag for the user to override the number.
)

var DiskNumberLimit = []int{16, 32, 64, 128}
Add a comment that the values correspond to the indexes above
fixed
func MaxPDCount(node *v1.Node) int {
	machineType := ""
	if node != nil {
		machineType = node.ObjectMeta.Labels[kubeletapis.LabelInstanceType]
Could Labels be nil?
fixed
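For reference, a minimal sketch of the nil-safe lookup discussed above, reusing DiskNumberLimit and the index constants from the earlier sketch (import paths approximate the era of the PR). Indexing a nil map returns the zero value in Go, so the explicit guard mainly documents intent, and the fallback behavior is an assumption:

import (
	"k8s.io/api/core/v1"
	kubeletapis "k8s.io/kubernetes/pkg/kubelet/apis"
)

// MaxPDCount returns the max number of PDs allowed on the given node.
// A nil node, nil label map, or missing label falls back to the smallest
// limit (assumption).
func MaxPDCount(node *v1.Node) int {
	machineType := ""
	if node != nil && node.ObjectMeta.Labels != nil {
		machineType = node.ObjectMeta.Labels[kubeletapis.LabelInstanceType]
	}
	if machineType == "" {
		return DiskNumberLimit[diskLimitSmall]
	}
	// ... map machineType to one of the index constants and return the
	// corresponding entry of DiskNumberLimit ...
	return DiskNumberLimit[diskLimitSmall]
}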
	maxVols: 4,
	fits: true,
	test: "fits when node capacity >= new pod's EBS volumes",
},
{
	newPod: twoVolPod,
	existingPods: []*v1.Pod{oneVolPod},
	existingPods: []*v1.Pod{oneVolPod, twoVolPod, splitVolsPod},
Why do you need to add a splitVolsPod?
Is there no issue if newPod is also in existingPods?
@@ -1805,14 +1989,14 @@ func TestEBSVolumeCountConflicts(t *testing.T) {
}{
	{
		newPod: oneVolPod,
		existingPods: []*v1.Pod{twoVolPod, oneVolPod},
		existingPods: []*v1.Pod{twoVolPod},
		maxVols: 4,
This field should be removed since it's not used anymore
fixed
/unassign
// TODO: allow for generically parameterized scheduler predicates, because this is a bit ugly
maxVols := getMaxVols(aws.DefaultMaxEBSVolumes)
return predicates.NewMaxPDVolumeCountPredicate(predicates.EBSVolumeFilter, maxVols, args.PVInfo, args.PVCInfo)
return predicates.NewMaxPDVolumeCountPredicate(predicates.EBSVolumeFilter, aws.MaxPDCount, args.PVInfo, args.PVCInfo)
I found the code where you can override the default max pd count with the environment variable
All calls to 'getMaxVols', which reads the environment variable 'KUBE_MAX_PD_VOLS', are dropped in this PR.
Is this expected? If so, we'd better add a release note deprecating it, or switch back to still allow the override.
This is documented here: https://github.com/kubernetes/community/blob/43ce57ac476b9f2ce3f0220354a075e095a0d469/contributors/devel/scheduler_algorithm.md
I added the logic for checking the environment variable back to the code. Thanks!
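For reference, a minimal sketch of the KUBE_MAX_PD_VOLS override being discussed; the package placement, structure, and log messages approximate the existing getMaxVols helper rather than quoting it:

package predicates

import (
	"os"
	"strconv"

	"github.com/golang/glog"
)

// KubeMaxPDVols is the environment variable that lets users override the
// maximum number of PD volumes scheduled per node.
const KubeMaxPDVols = "KUBE_MAX_PD_VOLS"

// getMaxVols returns the value of KUBE_MAX_PD_VOLS when it is set to a
// positive integer, and defaultVal otherwise.
func getMaxVols(defaultVal int) int {
	if rawMaxVols := os.Getenv(KubeMaxPDVols); rawMaxVols != "" {
		if parsed, err := strconv.Atoi(rawMaxVols); err != nil {
			glog.Errorf("Unable to parse maximum PD volumes value, using default of %v: %v", defaultVal, err)
		} else if parsed <= 0 {
			glog.Errorf("Maximum PD volumes must be a positive value; using default of %v", defaultVal)
		} else {
			return parsed
		}
	}
	return defaultVal
}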
@karataliu could you look at this from the Azure perspective? Is the default max of 16 appropriate?
@jdumars That looks fine, since this PR only moves 'DefaultMaxAzureDiskVolumes' (16) from 'defaults.go' to 'azure.go', which won't cause a behavior change. I could create a separate PR to calculate the value based on node type. Also, once dynamic config is done, the following issue could be addressed: Azure/acs-engine#186
@karataliu that would be extremely helpful! Thank you for looking into this.
@msau42 comments are addressed. PTAL. Thanks!
existingPods: onePodList_15,
node: small_node,
fits: true,
test: "doesn't fit when node capacity < new pod's GCE volumes",
fix description
@@ -2006,7 +2251,8 @@ func TestEBSVolumeCountConflicts(t *testing.T) {
	expectedFailureReasons := []algorithm.PredicateFailureReason{ErrMaxVolumeCountExceeded}

	for _, test := range tests {
		pred := NewMaxPDVolumeCountPredicate(filter, test.maxVols, pvInfo, pvcInfo)
		os.Setenv(KubeMaxPDVols, strconv.Itoa(test.maxVols))
Do you need to restore the previous value like in the original test?
fixed, thanks!
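A minimal sketch of the save-and-restore pattern the reviewer is asking about; the test name and loop body are illustrative:

package predicates

import (
	"os"
	"strconv"
	"testing"
)

// TestMaxVolsEnvOverride captures any pre-existing KUBE_MAX_PD_VOLS value
// up front and restores it when the test finishes, so later tests see an
// unchanged environment.
func TestMaxVolsEnvOverride(t *testing.T) {
	previousValue := os.Getenv(KubeMaxPDVols)
	defer func() {
		if previousValue != "" {
			os.Setenv(KubeMaxPDVols, previousValue)
		} else {
			os.Unsetenv(KubeMaxPDVols)
		}
	}()

	for _, maxVols := range []int{2, 4, 16} {
		os.Setenv(KubeMaxPDVols, strconv.Itoa(maxVols))
		// ... construct the predicate here and assert on fits / failure reasons ...
	}
}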
@khenidak PTAL
@jingxu97 PR needs rebase
If I understand correctly, this change will require (1) linking the scheduler with cloud provider code and (2) making sure the scheduler is bootstrapped with cloud config (to allow the provider to work correctly). The first creates one more dependency everybody is trying to walk away from (via out-of-tree providers). The second will force users to revisit all existing clusters to upgrade to whatever version carries this change, in addition to visiting all the bootstrap tooling/scripts to enable it for new clusters.

Additionally, cloud providers (all of them) are constantly changing/adding VM sizes and shapes (with support for larger numbers of disks). With this change, users would have to wait for new Kubernetes release versions to support new sizes/shapes. With this limitation I really think we shouldn't go ahead with this PR.

What I am proposing is to keep the predicate side of the code, but instead of using the cloud provider to resolve max-pd-count, have the scheduler depend on a well-known ConfigMap in the kube-system namespace; this ConfigMap carries a tuple such as the following
Users can then modify the table as needed; alternatively, the cloud provider can publish an updated table (as a ConfigMap) in JSON format for users to apply on their clusters.

/sig azure
@jingxu97 I agree with @khenidak here. We should have the predicate use a ConfigMap. Not only does this provide better de-coupling, but it will also work for additional scenarios (like on-prem SAN, etc.) where we don't have an equivalent cloud provider. Can we rework this PR so that it uses a ConfigMap? Thanks! @kubernetes/sig-architecture-pr-reviews
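To make the ConfigMap idea concrete, here is a hypothetical sketch of a scheduler-side lookup against a well-known ConfigMap; the ConfigMap name, data layout, helper name, and use of a recent client-go are all assumptions for illustration:

package pdlimits

import (
	"context"
	"strconv"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// maxPDCountFromConfigMap reads a per-instance-type PD limit from a
// well-known ConfigMap in kube-system, falling back to defaultLimit when
// the ConfigMap or the entry is missing or malformed.
func maxPDCountFromConfigMap(client kubernetes.Interface, instanceType string, defaultLimit int) int {
	cm, err := client.CoreV1().ConfigMaps("kube-system").Get(
		context.TODO(), "scheduler-max-pd-limits", metav1.GetOptions{})
	if err != nil {
		return defaultLimit
	}
	if raw, ok := cm.Data[instanceType]; ok {
		if n, err := strconv.Atoi(raw); err == nil && n > 0 {
			return n
		}
	}
	return defaultLimit
}

Because the limits would live in a ConfigMap rather than in cloud provider code, users (or providers) could update them for new VM sizes without waiting for a Kubernetes release, which is the de-coupling argued for above.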
@bgrant0607 this one needs an eyeball
The cloudprovider API is frozen. How would this be done with an external cloud provider?
@brendanburns @khenidak @bgrant0607 thanks a lot for the comments. I will try to rework this PR based on your suggestions.
[MILESTONENOTIFIER] Milestone Pull Request: Labels Incomplete. Action required: This pull request requires label changes. If the required changes are not made within 1 day, the pull request will be moved out of the v1.9 milestone. kind: Must specify exactly one of
@dims done.
// DefaultMaxAzureVolumes defines the maximum number of PD Volumes for Azure
// Larger Azure VMs can actually have much more disks attached.
// TODO We should determine the max based on VM size
DefaultMaxAzureVolumes = 16
I have checked; on Azure, DefaultMaxAzureVolumes should be 32.
@jingxu97 Will you be able to do something for 1.10?
@ymsaout Yes, we will work on it for 1.10.
ping @jingxu97, is this still moving forward?
Is there a sticking point from your view, @jingxu97?
cc @cheftako, who is working on cloud provider extraction: https://github.com/kubernetes/community/blob/master/keps/0002-controller-manager.md
This looks stalled; since I would like to move forward with this, I would like to take it over. Ping @jingxu97
This is available in 1.11 as an alpha feature: kubernetes/enhancements#554
Does this PR still need to be open? @jingxu97
Currently, for cloud providers including GCE, AWS, and Azure, there is a
hardcoded number that limits the max number of PDs allowed on a node.
However, GCE has changed this number based on machine type. This PR
allows the scheduler to automatically get this number based on the
machine type of the given node.
fixes issue #24317
Release note: