
Enable dynamic configuration of max number of PDs allowed on a node based on machine type #53461

Closed

Conversation

@jingxu97 (Contributor) commented Oct 4, 2017

Currently, for cloud providers including GCE, AWS, and Azure, there is a hardcoded number that limits the maximum number of PDs allowed on a node. However, GCE now varies this number based on machine type. This PR allows the scheduler to automatically get this number based on the machine type of the given node.

fixes issue #24317

Release note:

Enable dynamic configuration of the maximum number of persistent disks allowed on a node, based on machine type, for the scheduler to use when checking the feasibility of assigning a pod to a node.
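
To make the mechanism concrete, here is a minimal, hypothetical sketch of the idea in Go: the scheduler derives the per-node limit from the node's instance-type label instead of a compile-time constant. The label key, tier values, and function name are illustrative, not the PR's exact code.

```go
package main

import "fmt"

// nodeLabels stands in for the Labels map on a v1.Node.
type nodeLabels map[string]string

// maxPDCountForNode sketches the idea described above: read the machine type
// from the node's instance-type label and map it to a PD limit. The label key
// and the tier values here are illustrative examples only.
func maxPDCountForNode(labels nodeLabels) int {
	switch labels["beta.kubernetes.io/instance-type"] {
	case "f1-micro", "g1-small":
		return 16 // hypothetical small-machine tier
	default:
		return 128 // hypothetical default tier
	}
}

func main() {
	small := nodeLabels{"beta.kubernetes.io/instance-type": "f1-micro"}
	fmt.Println(maxPDCountForNode(small)) // prints 16
}
```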

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 4, 2017
@jingxu97 jingxu97 added this to the v1.9 milestone Oct 4, 2017
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 4, 2017
@@ -54,6 +55,8 @@ const (
replicaZoneURITemplateSingleZone = "%s/zones/%s" // {gce.projectID}/zones/{disk.Zone}
)

var DiskNumberLimit = []int{16, 32, 64, 128}
Member: Can we make this a map instead? With key = num cpus and value = pd limit?

@jingxu97 (Contributor, Author) replied Oct 5, 2017: I am not sure how many different CPU counts we would need to support.

Member: Or at least define an enum to represent the index?

@jingxu97 (Contributor, Author): Added an enum.
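
For illustration, a minimal sketch of what an enum-indexed limit table could look like; the constant names are hypothetical, and only the DiskNumberLimit values come from the diff above.

```go
package gce // sketch only; placement mirrors the snippet above

// Machine-type tiers used to index DiskNumberLimit (hypothetical names; the
// PR's own enum may differ).
const (
	machineTypeSmall = iota // index into DiskNumberLimit for the smallest tier
	machineTypeMedium
	machineTypeLarge
	machineTypeXLarge
)

// DiskNumberLimit[i] holds the max number of PDs for the machine-type tier
// with enum index i.
var DiskNumberLimit = []int{16, 32, 64, 128}
```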

if node != nil {
	instanceType = node.ObjectMeta.Labels[kubeletapis.LabelInstanceType]
}
maxVolumes := c.maxPDCount(instanceType)
Member: Can we pass in the Node object instead? Then each cloud provider can use whatever label or annotation they have to determine the limit.

@jingxu97 (Contributor, Author): Changed to pass a node.

@msau42 (Member) commented Oct 4, 2017

Do we still want to support the ability for the user to override the limit by setting a flag?

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 5, 2017
@jingxu97 jingxu97 force-pushed the Oct/maxPD branch 2 times, most recently from 73b7435 to 7090665 Compare October 5, 2017 21:27
@jingxu97 (Contributor, Author) commented Oct 5, 2017

@msau42 I checked again; there is no such flag for the user to override the number.

)

var DiskNumberLimit = []int{16, 32, 64, 128}
Member: Add a comment that the values correspond to the indexes above.

@jingxu97 (Contributor, Author): Fixed.

func MaxPDCount(node *v1.Node) int {
machineType := ""
if node != nil {
machineType = node.ObjectMeta.Labels[kubeletapis.LabelInstanceType]
Member: Could Labels be nil?

@jingxu97 (Contributor, Author): Fixed.
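
A sketch of the nil-safe lookup being discussed, assuming a hypothetical diskLimitForMachineType helper and 1.9-era import paths; it is not the PR's exact code.

```go
package gce // sketch only

import (
	"k8s.io/api/core/v1"
	kubeletapis "k8s.io/kubernetes/pkg/kubelet/apis"
)

// MaxPDCount returns the PD limit for the node's machine type. Indexing a nil
// map is safe in Go, so the nil check on the node itself is the one that
// matters; the explicit Labels check is just defensive.
func MaxPDCount(node *v1.Node) int {
	machineType := ""
	if node != nil && node.ObjectMeta.Labels != nil {
		machineType = node.ObjectMeta.Labels[kubeletapis.LabelInstanceType]
	}
	return diskLimitForMachineType(machineType)
}

// diskLimitForMachineType is a hypothetical stand-in for the tier lookup.
func diskLimitForMachineType(machineType string) int {
	if machineType == "" {
		return 16 // assumed conservative default when the label is absent
	}
	return 128
}
```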

maxVols: 4,
fits: true,
test: "fits when node capacity >= new pod's EBS volumes",
},
{
newPod: twoVolPod,
existingPods: []*v1.Pod{oneVolPod},
existingPods: []*v1.Pod{oneVolPod, twoVolPod, splitVolsPod},
Member: Why do you need to add a splitVolsPod?

Member: There's no issue if newPod is also in existingPods?

@@ -1805,14 +1989,14 @@ func TestEBSVolumeCountConflicts(t *testing.T) {
}{
{
newPod: oneVolPod,
existingPods: []*v1.Pod{twoVolPod, oneVolPod},
existingPods: []*v1.Pod{twoVolPod},
maxVols: 4,
Member: This field should be removed since it's not used anymore.

@jingxu97 (Contributor, Author): Fixed.

@sttts (Contributor) commented Oct 6, 2017

/unassign

// TODO: allow for generically parameterized scheduler predicates, because this is a bit ugly
maxVols := getMaxVols(aws.DefaultMaxEBSVolumes)
return predicates.NewMaxPDVolumeCountPredicate(predicates.EBSVolumeFilter, maxVols, args.PVInfo, args.PVCInfo)
return predicates.NewMaxPDVolumeCountPredicate(predicates.EBSVolumeFilter, aws.MaxPDCount, args.PVInfo, args.PVCInfo)
Member: I found the code where you can override the default max PD count with the environment variable.

Contributor: All calls to getMaxVols, which reads the environment variable KUBE_MAX_PD_VOLS, are dropped in this PR. Is that expected? If so, we should add a release note deprecating it, or switch back to still allow the override.

This is documented here: https://github.com/kubernetes/community/blob/43ce57ac476b9f2ce3f0220354a075e095a0d469/contributors/devel/scheduler_algorithm.md

@jingxu97 (Contributor, Author): I added the logic for checking the environment variable back to the code. Thanks!
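
For reference, a rough sketch of the kind of override getMaxVols performs with KUBE_MAX_PD_VOLS; the logging and exact validation here are illustrative, not the in-tree code.

```go
package defaults // sketch only; not the exact in-tree file

import (
	"log"
	"os"
	"strconv"
)

// KubeMaxPDVols is the environment variable that lets users override the
// cloud provider's default PD limit.
const KubeMaxPDVols = "KUBE_MAX_PD_VOLS"

// getMaxVols returns the user override if it is set and valid, otherwise the
// provider default (the real helper uses the project's logger).
func getMaxVols(defaultVal int) int {
	if raw := os.Getenv(KubeMaxPDVols); raw != "" {
		parsed, err := strconv.Atoi(raw)
		if err != nil || parsed <= 0 {
			log.Printf("ignoring invalid %s=%q, using default %d", KubeMaxPDVols, raw, defaultVal)
			return defaultVal
		}
		return parsed
	}
	return defaultVal
}
```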

@jdumars (Member) commented Oct 11, 2017

@karataliu could you look at this from the Azure perspective? Is the default max of 16 appropriate?

@karataliu (Contributor) commented Oct 12, 2017

@jdumars That looks fine, since this PR only moves 'DefaultMaxAzureDiskVolumes' (16) from 'defaults.go' to 'azure.go', which won't cause a behavior change. I could create a separate PR to calculate the value based on node type.

Also, if dynamic config is done, the following issue could be addressed: Azure/acs-engine#186

@jdumars (Member) commented Oct 12, 2017

@karataliu that would be extremely helpful! Thank you for looking into this.

@jingxu97 (Contributor, Author): @msau42 comments are addressed. PTAL. Thanks!

existingPods: onePodList_15,
node: small_node,
fits: true,
test: "doesn't fit when node capacity < new pod's GCE volumes",
Member: Fix the description (it says "doesn't fit", but the test expects fits: true).

@@ -2006,7 +2251,8 @@ func TestEBSVolumeCountConflicts(t *testing.T) {
expectedFailureReasons := []algorithm.PredicateFailureReason{ErrMaxVolumeCountExceeded}

for _, test := range tests {
pred := NewMaxPDVolumeCountPredicate(filter, test.maxVols, pvInfo, pvcInfo)
os.Setenv(KubeMaxPDVols, strconv.Itoa(test.maxVols))
Member: Do you need to restore the previous value like in the original test?

@jingxu97 (Contributor, Author): Fixed, thanks!
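
A minimal sketch of the save-and-restore the reviewer is asking for, placed at the top of the test function; the names KubeMaxPDVols and test.maxVols come from the diff above, and the surrounding structure is illustrative.

```go
	// Remember the caller's KUBE_MAX_PD_VOLS and put it back when the test
	// ends, so the per-case os.Setenv calls below do not leak into other tests.
	previousValue := os.Getenv(KubeMaxPDVols)
	defer os.Setenv(KubeMaxPDVols, previousValue)

	for _, test := range tests {
		os.Setenv(KubeMaxPDVols, strconv.Itoa(test.maxVols))
		// ... construct the predicate and run the case as in the diff above ...
	}
```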

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2017
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2017
@jdumars (Member) commented Oct 26, 2017

@khenidak PTAL

@k8s-github-robot: @jingxu97 PR needs rebase

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 26, 2017
@khenidak (Contributor) commented Nov 2, 2017

If I understand correctly, this change will require (1) linking the scheduler with cloud provider code and (2) making sure that the scheduler is bootstrapped with the cloud config (to allow the provider to work correctly).

The first creates one more dependency that everybody is trying to walk away from (via out-of-tree providers). The second will force users to revisit all existing clusters to upgrade to whatever version carries this change, in addition to revisiting all the bootstrap tooling/scripts to enable it for new clusters.

Additionally, cloud providers (all of them) are constantly changing and adding VM sizes and shapes (with support for a larger number of disks). With this change, users would have to wait for new Kubernetes releases to support new sizes and shapes. Given this limitation, I really think we shouldn't go ahead with this PR.

What I am proposing is to keep the predicate side of the code, but instead of using the cloud provider to resolve max-pd-count, have the scheduler depend on a well-known ConfigMap in the kube-system namespace. This ConfigMap carries tuples like the following:

# with example data (the values below are for illustration, not accuracy)
PD Spec (Name) | Instance-type  | Count
azureDisk      | Standard_D2_v2 | 4
gcDisk         | n1-highcpu-2   | 4
awsDisk        | instance-name  | 6
• a default value for instances that are not in the table (possibly overridden, as is currently done, by an env var)

Users can then modify the table as needed, or alternatively the cloud provider can publish an updated table (as a ConfigMap) in JSON format for users to apply to their clusters using kubectl apply.
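
A hypothetical sketch of the lookup side of this proposal; the key layout ("<pd-spec>.<instance-type>"), the helper name, and the fallback handling are assumptions, not anything this PR implements.

```go
package main

import (
	"fmt"
	"strconv"
)

// maxPDCountFromConfigMapData resolves the limit for a given PD type and
// instance type from a ConfigMap's Data map, falling back to a default for
// instances not listed in the table.
func maxPDCountFromConfigMapData(data map[string]string, pdSpec, instanceType string, defaultLimit int) int {
	if raw, ok := data[pdSpec+"."+instanceType]; ok {
		if limit, err := strconv.Atoi(raw); err == nil && limit > 0 {
			return limit
		}
	}
	return defaultLimit
}

func main() {
	// Example data mirroring the sample table above.
	data := map[string]string{
		"azureDisk.Standard_D2_v2": "4",
		"gcDisk.n1-highcpu-2":      "4",
	}
	fmt.Println(maxPDCountFromConfigMapData(data, "azureDisk", "Standard_D2_v2", 16))  // 4
	fmt.Println(maxPDCountFromConfigMapData(data, "azureDisk", "Standard_D64_v3", 16)) // falls back to 16
}
```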

/sig azure
@jdumars @brendanburns

@brendandburns (Contributor) commented Nov 2, 2017

@jingxu97 I agree with @khenidak here.

We should have the predicate use a ConfigMap for configuration, rather than relying on an in-tree cloud-provider.

Not only does this provide better de-coupling, but it will also work for additional scenarios (like on-prem SAN, etc) where we don't have an equivalent cloud provider.

Can we rework this PR so that it uses a ConfigMap from the kube-system namespace instead of modifying the cloud provider?

Thanks!

@kubernetes/sig-architecture-pr-reviews

@jdumars (Member) commented Nov 2, 2017

@bgrant0607 this one needs an eyeball

@bgrant0607 (Member):

The cloudprovider API is frozen. How would this be done with an external cloud provider?

@bgrant0607 bgrant0607 removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 8, 2017
@jingxu97 (Contributor, Author) commented Nov 8, 2017

@brendanburns @khenidak @bgrant0607 Thanks a lot for the comments. I will try to rework this PR based on your suggestions.

@k8s-github-robot:

[MILESTONENOTIFIER] Milestone Pull Request Labels Incomplete

@jdef @jingxu97 @msau42

Action required: This pull request requires label changes. If the required changes are not made within 1 day, the pull request will be moved out of the v1.9 milestone.

kind: Must specify exactly one of kind/bug, kind/cleanup or kind/feature.
priority: Must specify exactly one of priority/critical-urgent, priority/important-longterm or priority/important-soon.


@dims (Member) commented Nov 16, 2017

@jingxu97 @jdumars Can we please move this out of 1.9?

@jdumars jdumars modified the milestones: v1.9, next-candidate Nov 16, 2017
@jdumars (Member) commented Nov 16, 2017

@dims done.

// DefaultMaxAzureVolumes defines the maximum number of PD volumes for Azure.
// Larger Azure VMs can actually have many more disks attached.
// TODO: We should determine the max based on VM size
DefaultMaxAzureVolumes = 16
Member: I have checked; on Azure, DefaultMaxAzureVolumes should be 32.

@ymsaout commented Dec 29, 2017

@jingxu97 Will you be able to do something for 1.10?

@jingxu97 (Contributor, Author) commented Jan 2, 2018

@ymsaout Yes, we will work on it for 1.10.

@khenidak (Contributor) commented Feb 6, 2018

Ping @jingxu97: is this still moving forward?

@ymsaout commented Feb 28, 2018

Is there a sticking point from your view, @jingxu97?

@bgrant0607 (Member):

cc @cheftako, who is working on cloud provider extraction

https://github.com/kubernetes/community/blob/master/keps/0002-controller-manager.md

@kkmsft (Contributor) commented Jun 22, 2018

This looks stalled, and since we would like to move forward with it, I would like to take it over. Ping @jingxu97.

@msau42 (Member) commented Jun 22, 2018

This is available in 1.11 as an alpha feature: kubernetes/enhancements#554

@jdumars (Member) commented Sep 2, 2018

Does this PR still need to be open? @jingxu97

@andyzhangx (Member):

@jdumars It's already resoved by #64154, and this feature is beta in v1.12

@jdumars jdumars closed this Sep 24, 2018
Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
milestone/incomplete-labels
needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)