[Features] improve scheduling performance on batch jobs #492
/cc @k82cn
We need to take care of starvation of a "huge job" by "smaller jobs"; the other part is OK to me :)
/kind feature
With this design, big jobs (i.e., jobs with many tasks) are more likely to be starved because we keep allocating resources to smaller jobs. If we add a starvation-prevention mechanism, big jobs will eventually be scheduled, but this behavior is not ideal because the mechanism kicks in only after the job has waited for longer than the specified starvation threshold. PR 821 proposed a slightly different approach: we always allocate resources to big jobs, but if a big job is not ready to run, its allocated resources are released in the backfill phase, and smaller jobs are scheduled as backfill jobs. In the following scheduling rounds, if the scheduler notices that the big job is conditionally ready, it will either 1) preempt the backfill jobs and start the big job right away (in preemption mode), or 2) disable backfilling, wait for the backfill jobs to finish, and then start the big job (in non-preemption mode). In either case, the big job is likely to start earlier.
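For illustration, here is a minimal, self-contained Go sketch of the per-round decision described above. The `Job` type, the `nextAction` function, and its return strings are all hypothetical stand-ins; this is not the actual code from PR 821 or PR 805.

```go
package main

import "fmt"

// Job is a hypothetical stand-in for kube-batch's job abstraction;
// the real API in PR 805/821 differs.
type Job struct {
	Name  string
	Ready bool // "conditionally ready": enough resources reserved to run
}

// nextAction returns what the scheduler would do with a big job in one
// scheduling round, following the behavior described above.
func nextAction(big Job, runningBackfill int, preemptionMode bool) string {
	if !big.Ready {
		// Keep trying to reserve; in the meantime the reserved resources
		// are released in the backfill phase for smaller jobs.
		return "release allocation in backfill phase; run smaller jobs as backfill"
	}
	if preemptionMode {
		return "preempt backfill jobs and start the big job right away"
	}
	if runningBackfill > 0 {
		return "disable backfilling and wait for backfill jobs to finish"
	}
	return "start the big job"
}

func main() {
	big := Job{Name: "big-job", Ready: true}
	fmt.Println(nextAction(big, 2, true))  // preemption mode
	fmt.Println(nextAction(big, 2, false)) // non-preemption mode
}
```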
@DonghuiZhuo looks good, looking forward to the PR for the backfill part, thanks 👍
There are two issues here: 1. fragmentation caused by the queue's algorithm; 2. starvation. In this issue, we only need to handle the first one :)
@jiaxuanzhou PR 805 implements backfill.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Is this a BUG REPORT or FEATURE REQUEST?:
More of a feature request, but it shows up as a bug when scheduling tasks that need critical resources.
/kind feature
What happened:
1. The scheduler loops over all nodes in the cluster for every single task it tries to place.
2. Resources are allocated to some of a job's tasks even when the idle resources of the cluster cannot satisfy all of the job's tasks.
What you expected to happen:
1. Taking the allocate action as an example (https://github.com/kubernetes-sigs/kube-batch/blob/c4896a41a061cd2e3d071fc01b1dd15df06b84ea/pkg/scheduler/actions/allocate/allocate.go#L112): before allocating resources to the tasks of a job, it would be better to filter the nodes and sum up their idle resources to check whether they can satisfy the resource requests of the whole job; if not, just leave the job in the queue and continue to handle the next one (see the sketch after this list).
2. And for https://github.com/kubernetes-sigs/kube-batch/blob/c4896a41a061cd2e3d071fc01b1dd15df06b84ea/pkg/scheduler/framework/session.go#L144: would it be better to release the resources of tasks that are not yet bound to a node when the idle resources of the cluster cannot satisfy the whole job?
Currently it often happens that two jobs are both left pending even though the cluster's idle resources could satisfy the last-submitted one on its own.
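As a rough illustration of the feasibility pre-check proposed in point 1, here is a self-contained Go sketch. `Resource`, `jobFeasible`, and the field names are hypothetical stand-ins, not kube-batch's actual API.

```go
package main

import "fmt"

// Resource is a hypothetical requested/idle resource vector; kube-batch's
// real Resource type differs.
type Resource struct {
	MilliCPU float64
	MemoryGB float64
}

func (r Resource) Add(o Resource) Resource {
	return Resource{r.MilliCPU + o.MilliCPU, r.MemoryGB + o.MemoryGB}
}

func (r Resource) LessEqual(o Resource) bool {
	return r.MilliCPU <= o.MilliCPU && r.MemoryGB <= o.MemoryGB
}

// jobFeasible sums the requests of all pending tasks of a job and compares
// them against the total idle resources of the candidate nodes. If the job
// cannot fit as a whole, the caller skips it and tries the next job in the
// queue instead of allocating a partial set of tasks.
func jobFeasible(taskRequests, nodeIdle []Resource) bool {
	var want, have Resource
	for _, t := range taskRequests {
		want = want.Add(t)
	}
	for _, n := range nodeIdle {
		have = have.Add(n)
	}
	return want.LessEqual(have)
}

func main() {
	idle := []Resource{{MilliCPU: 100000, MemoryGB: 100}} // 100 cores, 100GB

	// 1st job: 4 pods x (30 cores, 40GB); 2nd job: 4 pods x (20 cores, 20GB).
	job1 := []Resource{{30000, 40}, {30000, 40}, {30000, 40}, {30000, 40}}
	job2 := []Resource{{20000, 20}, {20000, 20}, {20000, 20}, {20000, 20}}

	fmt.Println(jobFeasible(job1, idle)) // false: needs 120 cores, 160GB
	fmt.Println(jobFeasible(job2, idle)) // true: needs 80 cores, 80GB
}
```

Note that summing idle resources across nodes overestimates feasibility: a job may fit in aggregate but not under any concrete placement, which is exactly the fragmentation issue mentioned above, so a real check would still need per-node filtering.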
How to reproduce it (as minimally and precisely as possible):
Suppose the idle resources of the cluster are 100 cores and 100GB of memory.
1. Submit a 1st job with 4 pods, each requesting 30 cores and 40GB of memory.
2. Submit a 2nd job with 4 pods, each requesting 20 cores and 20GB of memory.
Anything else we need to know?:
Environment:
kube-batch: master branch