koord-scheduler: coscheduling support multiple gang match policy #1380

Merged
merged 1 commit into koordinator-sh:main from xlf2 on Jun 29, 2023

Conversation

@xulinfei1996 (Contributor) commented Jun 13, 2023

Ⅰ. Describe what this PR does

Expand the logic that determines whether a gang is satisfied with a new `match-policy` option.

In some scenarios, the Pods of completed tasks may no longer exist, so it is impossible to count the number of completed tasks. In this case, once the gang has been satisfied, all subsequent resource allocations are no longer constrained by gang rules. This policy is called once-satisfied and is the default `match-policy`.

For GPU topology, all tasks need to be scheduled at the same time to ensure the best performance, so the gang only considers whether the number of Pods waiting at the Permit stage reaches min-available. This policy is called only-waiting. It also helps when a workload supports failover but its controller can neither persist a new gangName nor recreate all Pods at once.

For elastic training tasks, new workers arrive at random times, so the gang only considers whether the sum of running Pods and waiting Pods reaches min-available. This policy is called waiting-and-running. For example, suppose a gang's total number is 5, min-available is 3, and 3 Pods are running at first. After two of the three Pods complete and a fourth Pod arrives, once-satisfied would schedule and bind the fourth Pod immediately, while waiting-and-running keeps it pending, because only 2 Pods are waiting or running, which is smaller than min-available.
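For illustration, here is a minimal sketch of how the three policies could differ when deciding whether a gang is satisfied, using the numbers from the example above. The struct, field names, constants, and helper function are hypothetical for this sketch; they are not the actual types in pkg/scheduler/plugins/coscheduling/core/gang.go.

```go
package main

import "fmt"

// Illustrative policy names taken from the description above;
// the real constants live in the coscheduling plugin.
const (
	policyOnceSatisfied     = "once-satisfied"
	policyOnlyWaiting       = "only-waiting"
	policyWaitingAndRunning = "waiting-and-running"
)

// gangState is a hypothetical snapshot of a gang's Pods.
type gangState struct {
	minAvailable  int
	waitingPods   int  // Pods currently waiting at the Permit stage
	runningPods   int  // Pods already bound and running
	onceSatisfied bool // true after the gang has been satisfied once
}

// satisfied sketches the decision each match-policy makes.
func satisfied(g gangState, matchPolicy string) bool {
	switch matchPolicy {
	case policyOnlyWaiting:
		// Only Pods waiting at Permit count, e.g. for GPU topology where
		// every task must be scheduled at the same time.
		return g.waitingPods >= g.minAvailable
	case policyWaitingAndRunning:
		// Waiting and running Pods both count, e.g. for elastic training
		// where new workers arrive at random times.
		return g.waitingPods+g.runningPods >= g.minAvailable
	default: // policyOnceSatisfied, the default behavior
		// Once the gang has been satisfied, later Pods are no longer
		// constrained by the gang rules.
		return g.onceSatisfied || g.waitingPods >= g.minAvailable
	}
}

func main() {
	// The scenario from the description: total=5, min-available=3,
	// 3 Pods ran, 2 of them completed, then a 4th Pod arrives at Permit.
	g := gangState{minAvailable: 3, waitingPods: 1, runningPods: 1, onceSatisfied: true}
	fmt.Println(satisfied(g, policyOnceSatisfied))     // true: the 4th Pod is scheduled and bound
	fmt.Println(satisfied(g, policyWaitingAndRunning)) // false: 1+1 < 3, the 4th Pod keeps pending
}
```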

Ⅱ. Does this pull request fix one issue?

fixes #940 (the Matching Policy part).

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

Ⅴ. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

@eahydra (Member) commented Jun 13, 2023

Please describe more user scenarios to explain why these policies are needed. Are these policies really necessary to implement in the scheduler?
WDYT @hormes @KunWuLuan

@codecov bot commented Jun 13, 2023

Codecov Report

Patch coverage: 64.00% and project coverage change: +0.04% 🎉

Comparison: base (fe0a750) at 64.36% vs. head (a00cd02) at 64.41%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1380      +/-   ##
==========================================
+ Coverage   64.36%   64.41%   +0.04%     
==========================================
  Files         337      337              
  Lines       34552    34594      +42     
==========================================
+ Hits        22241    22282      +41     
+ Misses      10676    10675       -1     
- Partials     1635     1637       +2     
| Flag | Coverage Δ |
|------|------------|
| unittests | 64.41% <64.00%> (+0.04%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|----------------|------------|
| ...cheduler/plugins/coscheduling/core/gang_summary.go | 0.00% <0.00%> (ø) |
| pkg/scheduler/plugins/coscheduling/core/core.go | 39.81% <7.69%> (+1.15%) ⬆️ |
| pkg/scheduler/plugins/coscheduling/core/gang.go | 83.09% <86.11%> (+1.88%) ⬆️ |

... and 1 file with indirect coverage changes


@jasonliu747 (Member) commented:

/cc @hormes @KunWuLuan

@KunWuLuan (Contributor) commented:

I also wonder in what situations we need these match policies. Could you please describe them in more detail? Thanks.

@xulinfei1996 (Contributor, Author) commented:

> Please describe more user scenarios to explain why these policies are needed. Are these policies really necessary to implement in the scheduler? WDYT @hormes @KunWuLuan

I edited the gang scheduling proposal and this PR's description to explain the scenarios where these three policies are needed; please take a look.


As mentioned above, `WaitTime` is the maximum wait time since the first Pod reaches the Permit stage. If `WaitTime` times out,
the scheduler will roll back all assumed Pods, update each Pod's annotation with `gang.scheduling.koordinator.sh/timeout=true`, and
won't schedule these Pods anymore. Users should pay attention to this status and delete the Pods in time.

#### Gang Match Policy
When using Gang scheduling, you can expand the logic that determines whether a gang is satisfied via `match-policy`.
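For illustration only, the sketch below shows how a plugin might read such a `match-policy` from a workload's annotations and fall back to the default once-satisfied behavior for compatibility. The annotation key and helper function used here are assumptions for this example, not necessarily what the PR or the proposal actually introduces.

```go
package main

import "fmt"

// Assumed annotation key for this sketch; the actual key defined by the PR may differ.
const annotationGangMatchPolicy = "gang.scheduling.koordinator.sh/match-policy"

// matchPolicyFromAnnotations returns the configured policy, defaulting to
// "once-satisfied" to stay compatible with the existing behavior.
func matchPolicyFromAnnotations(annotations map[string]string) string {
	switch p := annotations[annotationGangMatchPolicy]; p {
	case "only-waiting", "waiting-and-running", "once-satisfied":
		return p
	default:
		return "once-satisfied"
	}
}

func main() {
	anno := map[string]string{annotationGangMatchPolicy: "waiting-and-running"}
	fmt.Println(matchPolicyFromAnnotations(anno)) // waiting-and-running
	fmt.Println(matchPolicyFromAnnotations(nil))  // once-satisfied (default)
}
```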
Member commented:

I still don't quite understand the scenarios behind this. There are several issues to discuss:

  1. From these descriptions, it seems that after minAvailable is satisfied, you want to continue doing Gang Scheduling for subsequent Pods?
  2. Do these special policies have to be implemented in the scheduler? When the Job Controller (no matter which Job Controller it is) creates a new batch of Pods, can it not set new Gang parameters for the new batch of Pods?
  3. Which of these three policies is the default strategy, and is it compatible with the existing implementation?

Contributor (Author) commented:

To issue 1, the answer is yes.
To issue 2, if the Job Controller needs to recreate all Pods to improve fault tolerance but recreates them one by one, Koordinator currently schedules each recreated Pod immediately, which may not meet the requirement of scheduling all Pods at the same time to ensure the best performance.
To issue 3, since Koordinator's current implementation is the "once-satisfied" policy, the default strategy is "once-satisfied" to stay compatible with the existing implementation.

Contributor commented:

In what situation do we have to do Gang scheduling for subsequent Pods?
If the job's replica count is larger than min-available, subsequent Pods should pass.
If the job's replica count equals min-available, why would Pods come again? If Pods fail, the job status will be set to failed for TFJobs.

Contributor (Author) commented:

If the job's replica count equals min-available and Pods fail, but the job supports failover, the Pods will come again.

Contributor commented:

> If the job's replica count equals min-available and Pods fail, but the job supports failover, the Pods will come again.

If the job supports failover, why should we block the subsequently arriving Pods? I mean, if all Pods should be deleted and recreated, we can create a new gang; otherwise once-satisfied is enough.

Contributor (Author) commented:

Yes, creating a new gang is a solution. However, not every workload controller can persist the new gangName, and only-waiting may help in that case.

Contributor commented:

Is there any open source job controller that supports failover? How do they support failover and distinguish Pods from different rounds?
If a job supports failover and all Pods should be deleted and recreated, I think the better way is to include the round in the Pod name; this is not difficult.
@eahydra WDYT?

Signed-off-by: xulinfei.xlf <xulinfei.xlf@alibaba-inc.com>
@hormes (Member) commented Jun 27, 2023

/lgtm

@eahydra (Member) commented Jun 29, 2023

/lgtm
/approve

@koordinator-bot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eahydra, hormes

The full list of commands accepted by this bot can be found here.

The pull request process is described here


Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@koordinator-bot merged commit 6aa7472 into koordinator-sh:main on Jun 29, 2023
15 checks passed
@xulinfei1996 deleted the xlf2 branch on June 29, 2023 07:02