
Support load aware scheduling #159

Merged

Conversation

@eahydra eahydra commented May 19, 2022

Ⅰ. Describe what this PR does

The scheduling plugin filters abnormal nodes and scores them according
to resource usage. The plugin extends the Filter/Score/Reserve/Unreserve
extension points defined in the Kubernetes scheduling framework.

FYI: docs/proposals/scheduling/20220510-load-aware-scheduling.md
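
For readers unfamiliar with the scheduler framework, the sketch below shows roughly what wiring a plugin into these four extension points looks like. It is a minimal illustration against the upstream framework interfaces, with placeholder method bodies standing in for the actual filtering and scoring logic in this PR:

```go
package loadaware

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const Name = "LoadAwareScheduling"

// Plugin implements the Filter/Score/Reserve/Unreserve extension points.
type Plugin struct{}

var (
	_ framework.FilterPlugin  = &Plugin{}
	_ framework.ScorePlugin   = &Plugin{}
	_ framework.ReservePlugin = &Plugin{}
)

func (p *Plugin) Name() string { return Name }

// Filter rejects nodes whose NodeMetric has expired or whose utilization
// already exceeds the configured usageThresholds.
func (p *Plugin) Filter(ctx context.Context, state *framework.CycleState, pod *corev1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	return nil // placeholder: the real checks live in this PR
}

// Score ranks the remaining nodes by estimated free capacity.
func (p *Plugin) Score(ctx context.Context, state *framework.CycleState, pod *corev1.Pod, nodeName string) (int64, *framework.Status) {
	return 0, nil // placeholder
}

func (p *Plugin) ScoreExtensions() framework.ScoreExtensions { return nil }

// Reserve/Unreserve maintain an assign cache of pods scheduled between
// NodeMetric reports, so their usage is estimated before metrics catch up.
func (p *Plugin) Reserve(ctx context.Context, state *framework.CycleState, pod *corev1.Pod, nodeName string) *framework.Status {
	return nil // placeholder
}

func (p *Plugin) Unreserve(ctx context.Context, state *framework.CycleState, pod *corev1.Pod, nodeName string) {
}
```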

Ⅱ. Does this pull request fix one issue?

Fix #95

Ⅲ. Describe how to verify it

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta2
      kind: LoadAwareSchedulingArgs
      filterExpiredNodeMetrics: true
      nodeMetricExpirationSeconds: 300
      resourceWeights:
        cpu: 2
        memory: 1
      usageThresholds:
        cpu: 75
        memory: 85
      estimatedScalingFactors:
        cpu: 80
        memory: 70
    name: LoadAwareScheduling
  plugins:
    filter:
      enabled:
        - name: LoadAwareScheduling
          weight: 0
    reserve:
      enabled:
        - name: LoadAwareScheduling
          weight: 0
    score:
      enabled:
        - name: LoadAwareScheduling
          weight: 1000
  schedulerName: koord-scheduler
```
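
As a rough illustration of how the configuration above feeds into scoring, the sketch below implements a weighted least-used scorer: each resource contributes the fraction of capacity still free, weighted by resourceWeights. This is an assumption-laden sketch, not the plugin's exact formula (see the proposal document for that):

```go
package loadaware

import corev1 "k8s.io/api/core/v1"

// maxNodeScore mirrors framework.MaxNodeScore (100).
const maxNodeScore int64 = 100

// loadAwareScore combines per-resource free capacity into one weighted
// score. Quantities are raw integer units (e.g. milli-CPU, bytes).
func loadAwareScore(usage, allocatable, weights map[corev1.ResourceName]int64) int64 {
	var weightedSum, weightSum int64
	for resource, weight := range weights {
		alloc := allocatable[resource]
		if alloc <= 0 || weight <= 0 {
			continue
		}
		used := usage[resource]
		if used > alloc {
			used = alloc
		}
		// Score the fraction of the resource that is still free.
		weightedSum += weight * maxNodeScore * (alloc - used) / alloc
		weightSum += weight
	}
	if weightSum == 0 {
		return 0
	}
	return weightedSum / weightSum
}
```

With the weights above (cpu: 2, memory: 1), a node at 50% CPU and 30% memory utilization would score (2×50 + 1×70) / 3 ≈ 56 under this sketch.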

Ⅳ. Special notes for reviews

@eahydra eahydra force-pushed the support_load_aware_scheduling branch 15 times, most recently from cfcc279 to a6aece2 Compare May 20, 2022 08:41
@eahydra eahydra marked this pull request as ready for review May 20, 2022 08:55
@eahydra eahydra requested a review from saintube May 20, 2022 09:28
@eahydra eahydra force-pushed the support_load_aware_scheduling branch from a6aece2 to a166f15 Compare May 20, 2022 09:35
```
@@ -1,3 +1,4 @@
reviewers:
- eahydra
```

Reviewer (Member): alphabetical order

hack/update-scheduler-codegen.sh (resolved review thread)
@eahydra eahydra requested a review from jasonliu747 May 22, 2022 10:22

@jasonliu747 jasonliu747 left a comment


I haven't gone through the core logic of the load-aware plugin implementation yet; here is what I've got for you so far ;)

cmd/koord-scheduler/app/options/options.go (outdated review thread, resolved)
cmd/koord-scheduler/app/server.go (outdated review thread, resolved)
pkg/scheduler/apis/config/doc.go (outdated review thread, resolved)
pkg/scheduler/apis/config/types_pluginargs.go (two outdated review threads, resolved)
```go
var (
	defaultNodeMetricExpirationSeconds int64 = 180
```

Reviewer (Member): Can you add some comments to briefly tell us about the story behind these default values?
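
For illustration, such comments might look like the sketch below. The reporting-interval rationale and the second default are assumptions invented for this example, not taken from the PR:

```go
var (
	// When filterExpiredNodeMetrics is enabled, a NodeMetric older than
	// this is considered expired and the node is filtered out. 180s
	// tolerates a few missed reports, assuming koordlet reports roughly
	// once per minute (an assumption made for this example).
	defaultNodeMetricExpirationSeconds int64 = 180

	// Hypothetical companion default: weight CPU and memory equally
	// unless resourceWeights overrides it.
	defaultResourceWeights = map[corev1.ResourceName]int64{
		corev1.ResourceCPU:    1,
		corev1.ResourceMemory: 1,
	}
)
```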

pkg/util/pod.go (outdated review thread, resolved)
@jasonliu747 jasonliu747 self-assigned this May 23, 2022
pkg/util/pod.go (outdated review thread, resolved)

```go
	var estimatedUsed int64
	switch resourceName {
	case corev1.ResourceCPU:
```

Reviewer (Member): What if it is a BestEffort Pod?

eahydra (Member, Author): A BestEffort Pod uses extended resources, so its usage can be calculated directly with Value().
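
To make that distinction concrete, here is a hypothetical sketch: regular pods declare CPU in cores, so MilliValue() is needed, while a BestEffort pod managed by koordinator requests extended resources whose integer Value() is already in the final unit. The batch-cpu resource name below is illustrative; the real names are defined in koordinator's extension API:

```go
package loadaware

import corev1 "k8s.io/api/core/v1"

// batchCPU is a hypothetical extended-resource name used for illustration.
const batchCPU corev1.ResourceName = "kubernetes.io/batch-cpu"

// estimatedCPUMilli returns an estimated CPU request in milli-cores.
func estimatedCPUMilli(requests corev1.ResourceList) int64 {
	// Regular pods: the CPU quantity is in cores, so convert to milli.
	if cpu, ok := requests[corev1.ResourceCPU]; ok {
		return cpu.MilliValue()
	}
	// BestEffort pods with extended batch resources: the integer value
	// is already in the final unit, so Value() suffices.
	if cpu, ok := requests[batchCPU]; ok {
		return cpu.Value()
	}
	return 0
}
```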

@eahydra eahydra force-pushed the support_load_aware_scheduling branch from a166f15 to 2e22a86 Compare May 24, 2022 01:28
@codecov-commenter commented May 24, 2022

Codecov Report

Merging #159 (d57735e) into main (cb90038) will increase coverage by 0.38%.
The diff coverage is 71.02%.

❗ Current head d57735e differs from pull request most recent head 6cb8fdf. Consider uploading reports for the commit 6cb8fdf to get more accurate results

```
@@            Coverage Diff             @@
##             main     #159      +/-   ##
==========================================
+ Coverage   57.82%   58.21%   +0.38%     
==========================================
  Files          91       93       +2     
  Lines        8159     8404     +245     
==========================================
+ Hits         4718     4892     +174     
- Misses       3029     3081      +52     
- Partials      412      431      +19     
```
| Flag | Coverage Δ |
|---|---|
| unittests | 58.21% <71.02%> (+0.38%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| pkg/util/pod.go | 9.85% <0.00%> (-0.15%) ⬇️ |
| pkg/scheduler/plugins/loadaware/pod_assign_cache.go | 71.18% <71.18%> (ø) |
| pkg/scheduler/plugins/loadaware/load_aware.go | 71.73% <71.73%> (ø) |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cb90038...6cb8fdf.

@eahydra eahydra force-pushed the support_load_aware_scheduling branch 2 times, most recently from 5817d17 to 637f78f Compare May 24, 2022 08:34
@eahydra (Member, Author) commented May 24, 2022

Support user-defined node resource utilization thresholds.

@eahydra eahydra force-pushed the support_load_aware_scheduling branch from 637f78f to d57735e Compare May 24, 2022 08:39
```go
func (p *Plugin) estimatedAssignedPodUsage(nodeName string, nodeMetric *slov1alpha1.NodeMetric) map[corev1.ResourceName]int64 {
	estimatedUsed := make(map[corev1.ResourceName]int64)
	nodeMetricReportInterval := getNodeMetricReportInterval(nodeMetric)
	p.podAssignCache.lock.RLock()
```

Reviewer (Member): Is it possible to reduce the lock granularity, since there is no correlation among different nodes?

eahydra (Member, Author): Fine-grained locking is not being considered for now.

Reviewer (Member): ok
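
For reference, the finer-grained alternative discussed above could shard the assign cache per node, so that estimating usage for different nodes contends only on per-node locks. A hypothetical sketch, not the merged implementation:

```go
package loadaware

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// nodeAssignCache holds pods assigned to one node that are not yet
// reflected in its NodeMetric, guarded by a per-node lock.
type nodeAssignCache struct {
	lock sync.RWMutex
	pods map[types.UID]*corev1.Pod
}

// shardedAssignCache locks globally only to look up the shard; work on
// different nodes then proceeds under independent per-node locks.
type shardedAssignCache struct {
	lock  sync.RWMutex // guards only the nodes map
	nodes map[string]*nodeAssignCache
}

func (c *shardedAssignCache) shard(nodeName string) *nodeAssignCache {
	c.lock.RLock()
	defer c.lock.RUnlock()
	return c.nodes[nodeName]
}
```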

pkg/scheduler/plugins/loadaware/load_aware.go (resolved review thread)
@eahydra eahydra force-pushed the support_load_aware_scheduling branch from d57735e to 6cb8fdf Compare May 25, 2022 13:07
@eahydra eahydra requested a review from saintube May 25, 2022 13:38
@saintube (Member) commented:

/lgtm
others PTAL

@hormes (Member) commented May 26, 2022

/approve

@koordinator-bot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hormes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@koordinator-bot koordinator-bot bot merged commit 6ef657a into koordinator-sh:main May 26, 2022
Successfully merging this pull request may close these issues.

[proposal] scheduler support node load balancing