
add fine-grained device scheduling proposal #322

Merged: 3 commits merged into koordinator-sh:main on Jul 11, 2022

Conversation

@buptcozy (Contributor) commented:

Signed-off-by: yangzhang bupt_cozy@126.com

Ⅰ. Describe what this PR does

add schedule-device-in-card-level.md

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

V. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

codecov bot commented Jun 29, 2022:

Codecov Report

Merging #322 (8a92f5a) into main (86097cc) will increase coverage by 0.31%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #322      +/-   ##
==========================================
+ Coverage   64.53%   64.85%   +0.31%     
==========================================
  Files         113      116       +3     
  Lines       11165    11451     +286     
==========================================
+ Hits         7205     7426     +221     
- Misses       3385     3440      +55     
- Partials      575      585      +10     
Flag Coverage Δ
unittests 64.85% <ø> (+0.31%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/koordlet/statesinformer/kubelet_stub.go 62.50% <0.00%> (-4.17%) ⬇️
.../koordlet/runtimehooks/hooks/groupidentity/rule.go 83.96% <0.00%> (-2.64%) ⬇️
pkg/koordlet/runtimehooks/runtimehooks.go 61.90% <0.00%> (-1.99%) ⬇️
pkg/koordlet/resmanager/cpu_burst.go 76.06% <0.00%> (-1.47%) ⬇️
pkg/runtimeproxy/resexecutor/cri/pod.go 23.40% <0.00%> (-0.51%) ⬇️
pkg/koordlet/runtimehooks/config.go 14.28% <0.00%> (ø)
...g/koordlet/runtimehooks/hooks/groupidentity/bvt.go 79.16% <0.00%> (ø)
...et/runtimehooks/hooks/groupidentity/interceptor.go 100.00% <0.00%> (ø)
pkg/koordlet/runtimehooks/hooks/cpuset/rule.go 95.83% <0.00%> (ø)
pkg/koordlet/runtimehooks/hooks/cpuset/cpuset.go 65.21% <0.00%> (ø)
... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 86097cc...8a92f5a.

@shinytang6 (Contributor) left a comment:

That's a great proposal, thanks for the work!
I've left some questions in my comments; please help answer them, thanks.

@buptcozy force-pushed the device-card-level branch 2 times, most recently from e9c163f to 7dbe9ee on June 29, 2022 13:24
@jasonliu747 (Member) left a comment:

Let's focus on GPU in this proposal first. And more details would be appreciated.

@hormes added this to the v0.6 milestone on Jun 30, 2022
@jasonliu747 changed the title from "add schedule-device-in-card-level.md" to "add fine-grained device scheduling proposal" on Jul 8, 2022
@buptcozy force-pushed the device-card-level branch 2 times, most recently from 26cdcef to c022d81 on July 8, 2022 14:17
@eahydra (Member) left a comment:

/lgtm

@hormes (Member) commented Jul 11, 2022:

/approve

@jasonliu747 (Member) left a comment:

/lgtm

@koordinator-bot commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eahydra, hormes, jasonliu747

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

koordinator-bot merged commit 7d46fad into koordinator-sh:main on Jul 11, 2022
jasonliu747 pushed a commit to jasonliu747/koordinator that referenced this pull request Jul 26, 2022
* add schedule-device-in-card-level.md

Signed-off-by: yangzhang <bupt_cozy@126.com>

* refactor fine-grained device scheduling

Signed-off-by: Joseph <joseph.t.lee@outlook.com>

If the user knows exactly, or can roughly estimate, the specific memory consumption of the workload, they can apply for GPU memory through `koordinator.sh/gpu-memory`. All details can be seen below.

Besides, when a dimension's value is greater than 100, it means the Pod needs multiple devices; currently this is only allowed when the value is divisible by 100.

If the value of a container's gpu-core is greater than 100 and cannot be divided by 100 (e.g. 101), will the pod be rejected?

Member:

yes, should be rejected.


is it rejected by the webhook, or by the scheduler?

@jasonliu747 (Member) commented Aug 18, 2022:

`koordinator.sh/gpu-core` should be something like 25, 51, 77, 100, 200, or 300; otherwise it will be rejected by the scheduler in the Prefilter step.
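
For illustration, a minimal sketch of that divisibility rule as a Prefilter-style check (the package, function name, and error messages here are hypothetical, not the actual plugin code):

```go
package example

import "fmt"

// validateGPUCore sketches the rule described above: a koordinator.sh/gpu-core
// request must be either at most 100 (a fraction of a single card) or an exact
// multiple of 100 (one or more whole cards).
func validateGPUCore(gpuCore int64) error {
	if gpuCore <= 0 {
		return fmt.Errorf("koordinator.sh/gpu-core must be positive, got %d", gpuCore)
	}
	if gpuCore > 100 && gpuCore%100 != 0 {
		// e.g. 25, 51, 77, 100, 200, 300 pass; 101 is rejected
		return fmt.Errorf("koordinator.sh/gpu-core %d exceeds 100 but is not a multiple of 100", gpuCore)
	}
	return nil
}
```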


if a pod is rejected by the scheduler, it will go into Pending phase, and the scheduler will keep retrying to schedule the pod. I think this retry may be useless, and may increase the load on the scheduler. would it be better to reject the pod in the webhook?

@buptcozy (Contributor, Author) commented Aug 19, 2022:

it would be better to reject the pod in the webhook.

Why do we need both `koordinator.sh/gpu-memory-ratio` and `koordinator.sh/gpu-memory`?
When a user applies for 0.5/0.25 of a GPU, the user does not know the exact total memory bytes per GPU and only wants to use
half or a quarter of the memory, so the user can request the GPU memory with `koordinator.sh/gpu-memory-ratio`.
When the scheduler assigns the Pod to a concrete node, it translates `koordinator.sh/gpu-memory-ratio` into `koordinator.sh/gpu-memory` by the formula ***allocatedMemory = totalMemoryOf(GPU) * `koordinator.sh/gpu-memory-ratio`***, so that GPU isolation can work.
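
As a rough sketch of that translation (assuming the ratio uses the same 100-per-card scale as `koordinator.sh/gpu-core`; the function name is illustrative, not the scheduler's actual code):

```go
package example

// translateRatioToMemory applies allocatedMemory = totalMemoryOf(GPU) * ratio,
// where ratio is koordinator.sh/gpu-memory-ratio on a 100-per-card scale and
// totalMemoryBytes is the memory capacity of the GPU picked on the node.
// For example, a 16Gi GPU with ratio 50 yields 8Gi of koordinator.sh/gpu-memory.
func translateRatioToMemory(totalMemoryBytes, ratio int64) int64 {
	return totalMemoryBytes * ratio / 100
}
```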

> scheduler will translate the koordinator.sh/gpu-memory-ratio to koordinator.sh/gpu-memory

Does this mean that the scheduler will call kube-apiserver to update the pod's spec?

@buptcozy (Contributor, Author):

it will update container resources.


##### Apply `koordinator.sh/gpu-core` and `koordinator.sh/gpu-memory` separately

if a container only requests gpu-core and gpu-memory, will the amount of gpu-memory-ratio resources for the node in the scheduler cache be incorrect? Because the gpu-memory-ratio resources may not be assumed.

@buptcozy (Contributor, Author):

we will translate gpu-memory-ratio based on gpu-memory on concrete node.
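
A sketch of that reverse direction under the same 100-per-card assumption (names are illustrative only):

```go
package example

// translateMemoryToRatio derives koordinator.sh/gpu-memory-ratio from a plain
// koordinator.sh/gpu-memory request once the concrete node, and hence the
// GPU's total memory, is known, so both dimensions stay consistent in the cache.
func translateMemoryToRatio(requestedMemoryBytes, totalMemoryBytes int64) int64 {
	if totalMemoryBytes <= 0 {
		return 0
	}
	return requestedMemoryBytes * 100 / totalMemoryBytes
}
```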

As we know, GPU scheduling on the kube-scheduler side is no different from other scalar resources. The concrete
device-level assignment is done by the kubelet and the GPU device plugin, which generate the container's GPU env.

Our design has no conflict with the above process. Our device reporter will report koordinator GPU resources for kubelet updating node resources.
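
For illustration only, a sketch of what the koordinator GPU extended resources could look like at the node level once reported (quantities are invented for the example; the actual reporting mechanism is discussed in the comments below and in #410):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Illustrative only: the koordinator GPU extended resources a node with two
// 16Gi GPUs could expose after the device reporter runs; 100 gpu-core or
// gpu-memory-ratio corresponds to one whole card.
func main() {
	nodeGPU := corev1.ResourceList{
		"koordinator.sh/gpu-core":         resource.MustParse("200"),
		"koordinator.sh/gpu-memory-ratio": resource.MustParse("200"),
		"koordinator.sh/gpu-memory":       resource.MustParse("32Gi"),
	}
	fmt.Println(nodeGPU)
}
```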

> our device reporter will report koordinator GPU resources for kubelet updating node resources

how does the device reporter report koordinator GPU resources to the kubelet?

does this mean that the device reporter still implements a device plugin for koordinator GPU resources?


resolved after reading #410

@caohe commented Aug 19, 2022:

@jasonliu747 @zwzhang0107 hello, I have figured out some of these questions after reading #410. However, I still have some doubts about the scheduler:

  1. Could you provide more details about the resource translation mechanism?
  2. How do you ensure that the quantities of node-level extended resources in the core cache are correct if a container only requests some kinds of these resources?

Thanks for any reply!

@jasonliu747 (Member) commented:
@caohe if it's ok for you, let's discuss this on WeChat or DingTalk. WDYT?

@caohe commented Aug 22, 2022:

@jasonliu747 sure, happy to discuss this on WeChat or DingTalk.

@jasonliu747 (Member) commented:
> @jasonliu747 sure, happy to discuss this on WeChat or DingTalk.

@caohe You can find our DingTalk QR code in the README. Please PM me once you join the group; below is my DingTalk avatar. Thanks.
[image: DingTalk avatar]
