koord-scheduler: coscheduling support multiple gang match policy #1380

Merged
merged 1 commit into koordinator-sh:main from xlf2 on Jun 29, 2023

Conversation

@xulinfei1996 (Contributor) commented Jun 13, 2023

Ⅰ. Describe what this PR does

Expand the logic that determines whether a gang is satisfied with a new `match-policy` option.

In some scenarios, the Pods of completed tasks may no longer exist, so it is impossible to count the number of completed tasks. In this case, once the gang has been satisfied, all subsequent resource allocations are no longer constrained by gang rules. This policy is called once-satisfied and is the default `match-policy`.

For GPU topology, all tasks need to be scheduled at the same time to ensure the best performance, so the gang only considers whether the number of Pods waiting at the Permit stage reaches min-available. This policy is called only-waiting. It also helps when a workload supports failover but its controller can neither persist a new gangName nor recreate all Pods at once.

For elastic training tasks, new workers arrive at random times, so the gang only considers whether the sum of running Pods and waiting Pods reaches min-available. This policy is called waiting-and-running. For example, suppose a gang's total number is 5, min-available is 3, and 3 Pods are running at first. After two of the three Pods complete and a fourth Pod arrives, once-satisfied would schedule and bind the fourth Pod immediately, while waiting-and-running keeps it pending, because only 2 Pods are waiting or running, which is smaller than min-available.
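For illustration, here is a minimal sketch of how the three policies could differ when deciding whether a gang is satisfied, using the numbers from the example above. The struct, field names, constants, and helper function are hypothetical for this sketch; they are not the actual types in pkg/scheduler/plugins/coscheduling/core/gang.go.

```go
package main

import "fmt"

// Illustrative policy names taken from the description above;
// the real constants live in the coscheduling plugin.
const (
	policyOnceSatisfied     = "once-satisfied"
	policyOnlyWaiting       = "only-waiting"
	policyWaitingAndRunning = "waiting-and-running"
)

// gangState is a hypothetical snapshot of a gang's Pods.
type gangState struct {
	minAvailable  int
	waitingPods   int  // Pods currently waiting at the Permit stage
	runningPods   int  // Pods already bound and running
	onceSatisfied bool // true after the gang has been satisfied once
}

// satisfied sketches the decision each match-policy makes.
func satisfied(g gangState, matchPolicy string) bool {
	switch matchPolicy {
	case policyOnlyWaiting:
		// Only Pods waiting at Permit count, e.g. for GPU topology where
		// every task must be scheduled at the same time.
		return g.waitingPods >= g.minAvailable
	case policyWaitingAndRunning:
		// Waiting and running Pods both count, e.g. for elastic training
		// where new workers arrive at random times.
		return g.waitingPods+g.runningPods >= g.minAvailable
	default: // policyOnceSatisfied, the default behavior
		// Once the gang has been satisfied, later Pods are no longer
		// constrained by the gang rules.
		return g.onceSatisfied || g.waitingPods >= g.minAvailable
	}
}

func main() {
	// The scenario from the description: total=5, min-available=3,
	// 3 Pods ran, 2 of them completed, then a 4th Pod arrives at Permit.
	g := gangState{minAvailable: 3, waitingPods: 1, runningPods: 1, onceSatisfied: true}
	fmt.Println(satisfied(g, policyOnceSatisfied))     // true: the 4th Pod is scheduled and bound
	fmt.Println(satisfied(g, policyWaitingAndRunning)) // false: 1+1 < 3, the 4th Pod keeps pending
}
```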

Ⅱ. Does this pull request fix one issue?

fixes #940 (the Matching Policy part).

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

Ⅴ. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

@eahydra (Member) commented Jun 13, 2023

Please describe more user scenarios to explain why these policies are needed. Are these policies really necessary to implement in the scheduler?
WDYT @hormes @KunWuLuan

@codecov bot commented Jun 13, 2023

Codecov Report

Patch coverage: 64.00% and project coverage change: +0.04% 🎉

Comparison: base (fe0a750) at 64.36% vs. head (a00cd02) at 64.41%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1380      +/-   ##
==========================================
+ Coverage   64.36%   64.41%   +0.04%     
==========================================
  Files         337      337              
  Lines       34552    34594      +42     
==========================================
+ Hits        22241    22282      +41     
+ Misses      10676    10675       -1     
- Partials     1635     1637       +2     
| Flag | Coverage Δ |
|------|------------|
| unittests | 64.41% <64.00%> (+0.04%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|----------------|------------|
| ...cheduler/plugins/coscheduling/core/gang_summary.go | 0.00% <0.00%> (ø) |
| pkg/scheduler/plugins/coscheduling/core/core.go | 39.81% <7.69%> (+1.15%) ⬆️ |
| pkg/scheduler/plugins/coscheduling/core/gang.go | 83.09% <86.11%> (+1.88%) ⬆️ |

... and 1 file with indirect coverage changes


@jasonliu747 (Member) commented:

/cc @hormes @KunWuLuan

@KunWuLuan (Contributor) commented:

I also wonder in what situations we need these match policies. Could you please describe them in more detail? Thanks.

@xulinfei1996 (Contributor, Author) commented:

> Please describe more user scenarios to explain why these policies are needed. Are these policies really necessary to implement in the scheduler? WDYT @hormes @KunWuLuan

I edited the gang scheduling proposal and this PR's description to explain the scenarios where these three policies are needed; please take a look.


As mentioned above, `WaitTime` is the maximum wait time since the first Pod reaches the Permit stage. If `WaitTime` times out,
the scheduler will roll back all assumed Pods, update each Pod's annotation with `gang.scheduling.koordinator.sh/timeout=true`, and
won't schedule these Pods anymore. Users should pay attention to this status and delete the Pods in time.

#### Gang Match Policy
When using Gang scheduling, you can expand the logic that determines whether a gang is satisfied via `match-policy`.
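For illustration only, the sketch below shows how a plugin might read such a `match-policy` from a workload's annotations and fall back to the default once-satisfied behavior for compatibility. The annotation key and helper function used here are assumptions for this example, not necessarily what the PR or the proposal actually introduces.

```go
package main

import "fmt"

// Assumed annotation key for this sketch; the actual key defined by the PR may differ.
const annotationGangMatchPolicy = "gang.scheduling.koordinator.sh/match-policy"

// matchPolicyFromAnnotations returns the configured policy, defaulting to
// "once-satisfied" to stay compatible with the existing behavior.
func matchPolicyFromAnnotations(annotations map[string]string) string {
	switch p := annotations[annotationGangMatchPolicy]; p {
	case "only-waiting", "waiting-and-running", "once-satisfied":
		return p
	default:
		return "once-satisfied"
	}
}

func main() {
	anno := map[string]string{annotationGangMatchPolicy: "waiting-and-running"}
	fmt.Println(matchPolicyFromAnnotations(anno)) // waiting-and-running
	fmt.Println(matchPolicyFromAnnotations(nil))  // once-satisfied (default)
}
```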
Member commented:

I still don't quite understand the scenarios behind this. There are several issues to discuss:

  1. From these descriptions, it seems that after minAvailable is satisfied, you want to continue doing Gang Scheduling for subsequent Pods?
  2. Do these special policies have to be implemented in the scheduler? When the Job Controller (no matter which Job Controller it is) creates a new batch of Pods, can it not set new Gang parameters for the new batch of Pods?
  3. Which of these three policies is the default strategy, and is it compatible with the existing implementation?

Contributor (Author) commented:

To issue 1, the answer is yes.
To issue 2, if the Job Controller needs to recreate all Pods to improve fault tolerance but recreates them one by one, Koordinator currently schedules each recreated Pod immediately, which may not meet the requirement of scheduling all Pods at the same time to ensure the best performance.
To issue 3, since Koordinator's current implementation is the "once-satisfied" policy, the default strategy is "once-satisfied" to stay compatible with the existing implementation.

Contributor commented:

In what situation do we have to do Gang scheduling for subsequent Pods?
If the job's replica count is larger than min-available, subsequent Pods should pass.
If the job's replica count equals min-available, why would Pods come again? If Pods fail, the job status will be set to failed for TFJobs.

Contributor (Author) commented:

If the job's replica count equals min-available and Pods fail, but the job supports failover, the Pods will come again.

Contributor commented:

> If the job's replica count equals min-available and Pods fail, but the job supports failover, the Pods will come again.

If the job supports failover, why should we block the subsequently arriving Pods? I mean, if all Pods should be deleted and recreated, we can create a new gang; otherwise once-satisfied is enough.

Contributor (Author) commented:

Yes, creating a new gang is a solution. However, not every workload controller can persist the new gangName, and only-waiting may help in that case.

Contributor commented:

Is there any open source job controller that supports failover? How do they support failover and distinguish Pods from different rounds?
If a job supports failover and all Pods should be deleted and recreated, I think the better way is to include the round in the Pod name; this is not difficult.
@eahydra WDYT?

Signed-off-by: xulinfei.xlf <xulinfei.xlf@alibaba-inc.com>
@hormes (Member) commented Jun 27, 2023

/lgtm

@eahydra (Member) commented Jun 29, 2023

/lgtm
/approve

@koordinator-bot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eahydra, hormes

The full list of commands accepted by this bot can be found here.

The pull request process is described here


Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@koordinator-bot merged commit 6aa7472 into koordinator-sh:main on Jun 29, 2023
15 checks passed
@xulinfei1996 deleted the xlf2 branch on June 29, 2023 07:02