This repository has been archived by the owner on May 25, 2023. It is now read-only.

[Features] improve scheduling performance on batch jobs #492

Closed
jiaxuanzhou opened this issue Dec 10, 2018 · 11 comments

Comments

@jiaxuanzhou
Contributor

jiaxuanzhou commented Dec 10, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
More of a feature request, but it surfaces as a bug when scheduling tasks with critical resources.

Uncomment only one, leave it on its own line:

/kind feature

What happened:
1. The scheduler loops over every node in the cluster in each scheduling pass for a single task.
2. Resources are allocated to some of a job's tasks even when the cluster's idle resources cannot satisfy all of the job's tasks.
What you expected to happen:
1. Taking the allocate action as an example:
https://github.com/kubernetes-sigs/kube-batch/blob/c4896a41a061cd2e3d071fc01b1dd15df06b84ea/pkg/scheduler/actions/allocate/allocate.go#L112
Before allocating resources to the tasks of a job, it would be better to filter the nodes and sum their idle resources to check whether they can satisfy the resource requests of the whole job; if not, just leave the job in the queue and continue with the next one (a rough sketch of this check follows the list).

2. And for https://github.com/kubernetes-sigs/kube-batch/blob/c4896a41a061cd2e3d071fc01b1dd15df06b84ea/pkg/scheduler/framework/session.go#L144
would it be better to release the resources of tasks that are not yet bound to a node when the cluster's idle resources cannot satisfy the whole job?
Currently, it often happens that two jobs stay pending even though the idle resources of the cluster could satisfy the most recently submitted one.
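
A minimal, self-contained sketch of the feasibility check proposed in point 1, assuming simplified placeholder types (Resource, Task, Node, Job) rather than kube-batch's actual API: before doing any per-task allocation, sum the job's requests and compare them against the cluster's idle resources, and skip the job if it cannot fit as a whole.

```go
package main

import "fmt"

// Resource is a simplified stand-in for kube-batch's resource type
// (CPU in millicores, memory in GB).
type Resource struct {
	MilliCPU float64
	Memory   float64
}

func (r Resource) Add(o Resource) Resource {
	return Resource{r.MilliCPU + o.MilliCPU, r.Memory + o.Memory}
}

// LessEqual reports whether r fits within o in every dimension.
func (r Resource) LessEqual(o Resource) bool {
	return r.MilliCPU <= o.MilliCPU && r.Memory <= o.Memory
}

type Task struct{ Request Resource }
type Node struct{ Idle Resource }
type Job struct {
	Name  string
	Tasks []Task
}

// totalRequest sums the requests of all tasks of a job.
func totalRequest(j Job) Resource {
	var total Resource
	for _, t := range j.Tasks {
		total = total.Add(t.Request)
	}
	return total
}

// clusterIdle sums the idle resources of all candidate nodes.
func clusterIdle(nodes []Node) Resource {
	var idle Resource
	for _, n := range nodes {
		idle = idle.Add(n.Idle)
	}
	return idle
}

// replicate builds n identical tasks with the given request.
func replicate(req Resource, n int) []Task {
	tasks := make([]Task, n)
	for i := range tasks {
		tasks[i] = Task{Request: req}
	}
	return tasks
}

func main() {
	// The scenario from the reproduction steps below: 100 cores / 100 GB idle.
	nodes := []Node{{Idle: Resource{MilliCPU: 100000, Memory: 100}}}
	jobs := []Job{
		{Name: "job-1", Tasks: replicate(Resource{MilliCPU: 30000, Memory: 40}, 4)}, // 120 cores, 160 GB total
		{Name: "job-2", Tasks: replicate(Resource{MilliCPU: 20000, Memory: 20}, 4)}, //  80 cores,  80 GB total
	}

	idle := clusterIdle(nodes)
	for _, job := range jobs {
		if !totalRequest(job).LessEqual(idle) {
			// The whole job cannot fit: leave it in the queue and handle
			// the next job instead of allocating it partially.
			fmt.Printf("%s does not fit as a whole, skipping\n", job.Name)
			continue
		}
		fmt.Printf("%s fits, proceeding with per-task allocation\n", job.Name)
		// ... per-task node selection and binding would happen here ...
	}
}
```

With such a check, job-1 in the scenario below would be skipped instead of being partially allocated, leaving room for job-2; a real implementation would also have to account for predicates, per-node fragmentation, and starvation of large jobs, as discussed in the comments below.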
How to reproduce it (as minimally and precisely as possible):
Suppose the idle resources of the cluster are 100 cores and 100 GB of memory.
1. Submit a first job with 4 pods, each requesting 30 cores and 40 GB of memory.
2. Submit a second job with 4 pods, each requesting 20 cores and 20 GB of memory.
The first job needs 120 cores / 160 GB in total, which exceeds the idle resources, so only part of it gets allocated; that partial allocation then leaves too little for the second job (80 cores / 80 GB in total, which would fit on its own), and both jobs end up pending.

Anything else we need to know?:

Environment:
kube-batch: master branch

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 10, 2018
@jiaxuanzhou jiaxuanzhou changed the title [Features] improve scheduling on batch jobs [Features] improve scheduling performance on batch jobs Dec 10, 2018
@jiaxuanzhou
Contributor Author

/cc @k82cn

@k82cn
Contributor

k82cn commented Dec 13, 2018

Would it be better to release the resources of tasks that are not yet bound to a node when the cluster's idle resources cannot satisfy the whole job?

We need to take care that a "huge job" is not starved by "smaller jobs"; the other part is OK with me :)

@k82cn
Contributor

k82cn commented Apr 15, 2019

/kind feature
/sig scheduling

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Apr 15, 2019
@dhzhuo

dhzhuo commented Apr 23, 2019

@jiaxuanzhou

Before allocating resources to the tasks of a job, it would be better to filter the nodes and sum their idle resources to check whether they can satisfy the resource requests of the whole job; if not, just leave the job in the queue and continue with the next one.

With this design, big jobs (i.e. jobs with many tasks) are more likely to be starved because we keep allocating resources to smaller jobs. If we add a starvation-prevention mechanism, big jobs will eventually be scheduled, but that is still not ideal because the mechanism kicks in only after a job has waited longer than the specified starvation threshold.

PR 821 proposes a slightly different approach: we always allocate resources to big jobs. If a big job is not ready to run, its allocated resources are released in the backfill phase and smaller jobs are scheduled as backfill jobs. In the following scheduling rounds, if the scheduler notices that the big job is conditionally ready, it will either 1) preempt the backfill jobs and start the big job right away (in preemption mode), or 2) disable backfilling, wait for the backfill jobs to finish, and then start the big job (in non-preemption mode). In either case, the big job is likely to start earlier.
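
The per-round decision flow described above might look roughly like the following sketch. This is not the actual code from PR 821; Job, ConditionallyReady, and schedulingRound are hypothetical placeholders for the behaviour described in this comment.

```go
package main

import "fmt"

// Job is an illustrative placeholder, not kube-batch's JobInfo.
type Job struct {
	Name               string
	Ready              bool // has enough bound resources to start right now
	ConditionallyReady bool // would be ready once the backfill jobs are gone
}

// schedulingRound sketches how a big job and backfill jobs could interact in
// one scheduling round; preemptionMode selects between the two behaviours
// described for PR 821.
func schedulingRound(big Job, backfill []Job, preemptionMode bool) {
	switch {
	case big.Ready:
		fmt.Printf("%s is ready, start it\n", big.Name)
	case big.ConditionallyReady && preemptionMode:
		// 1) preemption mode: evict the backfill jobs and start the big job right away.
		fmt.Printf("preempt %d backfill jobs and start %s\n", len(backfill), big.Name)
	case big.ConditionallyReady:
		// 2) non-preemption mode: stop admitting new backfill jobs, wait for the
		// running ones to finish, then start the big job.
		fmt.Printf("disable backfilling and wait before starting %s\n", big.Name)
	default:
		// The big job is not even conditionally ready: release its unbound
		// allocations in the backfill phase and schedule smaller jobs as backfill.
		fmt.Printf("release %s's allocations and schedule backfill jobs\n", big.Name)
	}
}

func main() {
	backfill := []Job{{Name: "small-1"}, {Name: "small-2"}}
	schedulingRound(Job{Name: "big-job", ConditionallyReady: true}, backfill, true)
}
```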

@jiaxuanzhou
Contributor Author

jiaxuanzhou commented Apr 24, 2019

@DonghuiZhuo Looks good; looking forward to the PR for the backfill part, thanks 👍

@k82cn
Contributor

k82cn commented Apr 24, 2019

There are two issues here: 1. fragmentation caused by the queue's algorithm, and 2. starvation. In this issue we only need to handle the first one :)

@dhzhuo

dhzhuo commented Apr 24, 2019

@jiaxuanzhou PR 805 implements backfill.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 23, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 22, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
