
add fine-grained device scheduling proposal #322

Merged: 3 commits merged into koordinator-sh:main on Jul 11, 2022

Conversation

@buptcozy (Contributor) commented:

Signed-off-by: yangzhang bupt_cozy@126.com

Ⅰ. Describe what this PR does

add schedule-device-in-card-level.md

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

V. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

codecov bot commented Jun 29, 2022:

Codecov Report

Merging #322 (8a92f5a) into main (86097cc) will increase coverage by 0.31%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #322      +/-   ##
==========================================
+ Coverage   64.53%   64.85%   +0.31%     
==========================================
  Files         113      116       +3     
  Lines       11165    11451     +286     
==========================================
+ Hits         7205     7426     +221     
- Misses       3385     3440      +55     
- Partials      575      585      +10     
Flag Coverage Δ
unittests 64.85% <ø> (+0.31%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/koordlet/statesinformer/kubelet_stub.go 62.50% <0.00%> (-4.17%) ⬇️
.../koordlet/runtimehooks/hooks/groupidentity/rule.go 83.96% <0.00%> (-2.64%) ⬇️
pkg/koordlet/runtimehooks/runtimehooks.go 61.90% <0.00%> (-1.99%) ⬇️
pkg/koordlet/resmanager/cpu_burst.go 76.06% <0.00%> (-1.47%) ⬇️
pkg/runtimeproxy/resexecutor/cri/pod.go 23.40% <0.00%> (-0.51%) ⬇️
pkg/koordlet/runtimehooks/config.go 14.28% <0.00%> (ø)
...g/koordlet/runtimehooks/hooks/groupidentity/bvt.go 79.16% <0.00%> (ø)
...et/runtimehooks/hooks/groupidentity/interceptor.go 100.00% <0.00%> (ø)
pkg/koordlet/runtimehooks/hooks/cpuset/rule.go 95.83% <0.00%> (ø)
pkg/koordlet/runtimehooks/hooks/cpuset/cpuset.go 65.21% <0.00%> (ø)
... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 86097cc...8a92f5a.

@shinytang6 (Contributor) left a comment:

That's a great proposal, thanks for the work!
I've left some questions in my comments; please help answer them, thanks.

@buptcozy force-pushed the device-card-level branch 2 times, most recently from e9c163f to 7dbe9ee on June 29, 2022 13:24
@jasonliu747 (Member) left a comment:

Let's focus on GPU in this proposal first. And more details would be appreciated.

@hormes added this to the v0.6 milestone on Jun 30, 2022
@jasonliu747 changed the title from "add schedule-device-in-card-level.md" to "add fine-grained device scheduling proposal" on Jul 8, 2022
@buptcozy force-pushed the device-card-level branch 2 times, most recently from 26cdcef to c022d81 on July 8, 2022 14:17
@eahydra (Member) left a comment:

/lgtm

@hormes (Member) commented Jul 11, 2022:

/approve

@jasonliu747 (Member) left a comment:

/lgtm

@koordinator-bot commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eahydra, hormes, jasonliu747

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

koordinator-bot merged commit 7d46fad into koordinator-sh:main on Jul 11, 2022
jasonliu747 pushed a commit to jasonliu747/koordinator that referenced this pull request Jul 26, 2022
* add schedule-device-in-card-level.md

Signed-off-by: yangzhang <bupt_cozy@126.com>

* refactor fine-grained device scheduling

Signed-off-by: Joseph <joseph.t.lee@outlook.com>

If the user knows exactly, or can roughly estimate, the specific memory consumption of the workload, they can apply for GPU memory through `koordinator.sh/gpu-memory`. All details can be seen below.

Besides, when a dimension's value is greater than 100, it means the Pod needs multiple devices; currently this is only allowed when the value is divisible by 100.

If the value of a container's gpu-core is greater than 100 and cannot be divided by 100 (e.g. 101), will the pod be rejected?

Member:

yes, should be rejected.


is it rejected by the webhook, or by the scheduler?

@jasonliu747 (Member) commented Aug 18, 2022:

`koordinator.sh/gpu-core` should be something like 25, 51, 77, 100, 200, or 300; otherwise it will be rejected by the scheduler in the Prefilter step.
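
For illustration, a minimal sketch of that divisibility rule as a Prefilter-style check (the package, function name, and error messages here are hypothetical, not the actual plugin code):

```go
package example

import "fmt"

// validateGPUCore sketches the rule described above: a koordinator.sh/gpu-core
// request must be either at most 100 (a fraction of a single card) or an exact
// multiple of 100 (one or more whole cards).
func validateGPUCore(gpuCore int64) error {
	if gpuCore <= 0 {
		return fmt.Errorf("koordinator.sh/gpu-core must be positive, got %d", gpuCore)
	}
	if gpuCore > 100 && gpuCore%100 != 0 {
		// e.g. 25, 51, 77, 100, 200, 300 pass; 101 is rejected
		return fmt.Errorf("koordinator.sh/gpu-core %d exceeds 100 but is not a multiple of 100", gpuCore)
	}
	return nil
}
```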


if a pod is rejected by the scheduler, it will go into Pending phase, and the scheduler will keep retrying to schedule the pod. I think this retry may be useless, and may increase the load on the scheduler. would it be better to reject the pod in the webhook?

@buptcozy (Contributor, Author) commented Aug 19, 2022:

it would be better to reject the pod in the webhook.

Why do we need both `koordinator.sh/gpu-memory-ratio` and `koordinator.sh/gpu-memory`?
When a user applies for 0.5/0.25 of a GPU, the user does not know the exact total memory bytes per GPU and only wants to use
half or a quarter of the memory, so the user can request the GPU memory with `koordinator.sh/gpu-memory-ratio`.
When the scheduler assigns the Pod to a concrete node, it translates `koordinator.sh/gpu-memory-ratio` into `koordinator.sh/gpu-memory` by the formula ***allocatedMemory = totalMemoryOf(GPU) * `koordinator.sh/gpu-memory-ratio`***, so that GPU isolation can work.
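
As a rough sketch of that translation (assuming the ratio uses the same 100-per-card scale as `koordinator.sh/gpu-core`; the function name is illustrative, not the scheduler's actual code):

```go
package example

// translateRatioToMemory applies allocatedMemory = totalMemoryOf(GPU) * ratio,
// where ratio is koordinator.sh/gpu-memory-ratio on a 100-per-card scale and
// totalMemoryBytes is the memory capacity of the GPU picked on the node.
// For example, a 16Gi GPU with ratio 50 yields 8Gi of koordinator.sh/gpu-memory.
func translateRatioToMemory(totalMemoryBytes, ratio int64) int64 {
	return totalMemoryBytes * ratio / 100
}
```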

> scheduler will translate the koordinator.sh/gpu-memory-ratio to koordinator.sh/gpu-memory

Does this mean that the scheduler will call kube-apiserver to update the pod's spec?

@buptcozy (Contributor, Author):

it will update container resources.


##### Apply `koordinator.sh/gpu-core` and `koordinator.sh/gpu-memory` separately

if a container only requests gpu-core and gpu-memory, will the amount of gpu-memory-ratio resources for the node in the scheduler cache be incorrect? Because the gpu-memory-ratio resources may not be assumed.

@buptcozy (Contributor, Author):

we will translate gpu-memory-ratio based on gpu-memory on concrete node.
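
A sketch of that reverse direction under the same 100-per-card assumption (names are illustrative only):

```go
package example

// translateMemoryToRatio derives koordinator.sh/gpu-memory-ratio from a plain
// koordinator.sh/gpu-memory request once the concrete node, and hence the
// GPU's total memory, is known, so both dimensions stay consistent in the cache.
func translateMemoryToRatio(requestedMemoryBytes, totalMemoryBytes int64) int64 {
	if totalMemoryBytes <= 0 {
		return 0
	}
	return requestedMemoryBytes * 100 / totalMemoryBytes
}
```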

As we know, GPU scheduling on the kube-scheduler side is no different from other scalar resources. The concrete
device-level assignment is done by the kubelet and the GPU device plugin, which generate the container's GPU env.

Our design has no conflict with the above process. Our device reporter will report koordinator GPU resources for kubelet updating node resources.
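
For illustration only, a sketch of what the koordinator GPU extended resources could look like at the node level once reported (quantities are invented for the example; the actual reporting mechanism is discussed in the comments below and in #410):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Illustrative only: the koordinator GPU extended resources a node with two
// 16Gi GPUs could expose after the device reporter runs; 100 gpu-core or
// gpu-memory-ratio corresponds to one whole card.
func main() {
	nodeGPU := corev1.ResourceList{
		"koordinator.sh/gpu-core":         resource.MustParse("200"),
		"koordinator.sh/gpu-memory-ratio": resource.MustParse("200"),
		"koordinator.sh/gpu-memory":       resource.MustParse("32Gi"),
	}
	fmt.Println(nodeGPU)
}
```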

> our device reporter will report koordinator GPU resources for kubelet updating node resources

how does the device reporter report koordinator GPU resources to the kubelet?

does this mean that the device reporter still implements a device plugin for koordinator GPU resources?


resolved after reading #410

@caohe commented Aug 19, 2022:

@jasonliu747 @zwzhang0107 hello, I have figured out some of these questions after reading #410. However, I still have some doubts about the scheduler:

  1. Could you provide more details about the resource translation mechanism?
  2. How do you ensure that the quantities of node-level extended resources in the core cache are correct if a container only requests some kinds of these resources?

Thanks for any reply!

@jasonliu747 (Member) commented:
@caohe if it's ok for you, let's discuss this on WeChat or DingTalk. WDYT?

@caohe commented Aug 22, 2022:

@jasonliu747 sure, happy to discuss this on WeChat or DingTalk.

@jasonliu747 (Member) commented:
> @jasonliu747 sure, happy to discuss this on WeChat or DingTalk.

@caohe You can find our DingTalk QR code in the README. Please PM me once you join the group; below is my DingTalk avatar. Thanks.
[image: DingTalk avatar]
