Proposal for Support of Pod Scheduling Readiness #3581

ykcai-daniel · 2024-07-10T09:48:42Z

Proposal to add support for Pod Scheduling Readiness as mentioned in #3555

volcano-sh-bot · 2024-07-10T09:48:46Z

Welcome @ykcai-daniel!

It looks like this is your first PR to volcano-sh/volcano.

Thank you, and welcome to Volcano. 😃

ykcai-daniel · 2024-07-10T09:54:29Z

@Monokaix Please have a look first. Will fix the CI problems soon.

Monokaix · 2024-07-11T01:18:11Z

You can sign off your commit with git commit -s: )

Monokaix · 2024-07-11T01:29:08Z

docs/design/pod-scheduling-readiness.md

+# Pod Scheduling Readiness
+## Motivation
+
+Pod Scheduling Readiness is a beta feature in Kubernetes v1.27. Users expect Volcano to be aware of it. By specifying/removing a Pod's `.spec.schedulingGates`, which is an array of strings, users can control when a Pod is ready to be considered for scheduling. For Pods with none-empty `schedulingGates`, it will only be removed by kube-scheduler once all the gates are removed.


it will only be removed by kube-scheduler once all the gates are removed.

Actually the schedulingGates field is usually removed by external controllers and kube-scheduler just check this filed to determine whether to schedule.

Monokaix · 2024-07-11T01:33:17Z

docs/design/pod-scheduling-readiness.md

+
+Pod Scheduling Readiness is a beta feature in Kubernetes v1.27. Users expect Volcano to be aware of it. By specifying/removing a Pod's `.spec.schedulingGates`, which is an array of strings, users can control when a Pod is ready to be considered for scheduling. For Pods with none-empty `schedulingGates`, it will only be removed by kube-scheduler once all the gates are removed.
+
+The following example illustrates why support for Scheduling Readiness is needed in Volcano. Suppose we have implemented an external quota manager responsible for reviewing all incoming pod requests for capacity/quota requirements. Only once these requests receive approval from the quota manager are they considered eligible for scheduling. The pods schedulingGates feature can be handy when implementing this funtionality


Please add a full stop of this passage.

Monokaix · 2024-07-11T01:45:06Z

docs/design/pod-scheduling-readiness.md

+```
+2. **Plugins**: Scheduler Plugins such as proportion  will register functions to determine whether a PodGroup is enqueueable or allocatable. These functions calculate the resources already used in the cluster based on states of each PodGroup. For example, the proportion plugin determines whether a task is enqueuable when `sum(Inqueue)+sum(Running)+cur_job<TotalResource+elastic`. Since the scheduling gated jobs do not occupy resources in the cluster, by having the scheduling gated job in `SchGated` rather than `Inqueue`, they will not be summed up when calculating total used resources, which reflects the actual situation.
+3. **Actions**: Transition to `SchGated` happens to and from `Inqueue` first in the allocation action. Then, other actions will skip the gated pods.
+4. **Controllers**: One alternative design of state transition is to directly transition from `Pending` to `SchGated`. However, this is not possible because Controllers only create Pods for a job once it is inqueued (see [delayed-pod-creation](./delay-pod-creation.md)). We can only know a Pod is scheduling gated in the Inqueue state. Therefore, we need to transition to SchGated from Inqueue.


Therefore, we need to transition to SchGated from Inqueue.

We should clarify that only the first time pg is created should we transit pg to Inqueue from Pending, and when pods of current pg are created, we transit pg to SchGated from Inqueue.

Maybe we should also emphasize atomicity of Pod creation: Once the state of a pg is transitioned from Pending to Inqueue, I think all Pods of that pg will be created. In other words, the scheduler will never see a Job with partially created Pods. If this is not the case, the following situation might occur:

Consider a Job with four pods: p1,p2,p3,p4
Job enqueued ->p1,p2 created with gates -> p1,p2 gates removed -> Job allocated to node -> p3, p4 created with gates
We will end up with a Running Job but with scheduling gates

goog catch!

Monokaix · 2024-07-11T01:48:25Z

docs/design/pod-scheduling-readiness.md

+3. **Actions**: Transition to `SchGated` happens to and from `Inqueue` first in the allocation action. Then, other actions will skip the gated pods.
+4. **Controllers**: One alternative design of state transition is to directly transition from `Pending` to `SchGated`. However, this is not possible because Controllers only create Pods for a job once it is inqueued (see [delayed-pod-creation](./delay-pod-creation.md)). We can only know a Pod is scheduling gated in the Inqueue state. Therefore, we need to transition to SchGated from Inqueue.
+## Granularity of Pod Scheduling Readiness
+Pod scheduling readiness is a field of the spec of a pod. However, Volcano schedules Jobs, which consists of many tasks, each corresponding to a Pod. It is possible that some of these pods are scheduling gated while others are not. To align the granularity of the Pod Scheduling Readiness and Job, we see scheduling gates as a property of a Job: as long as there is one gated pod, the job is scheduling gated. This is consistent with the workload of Volcano: most Volcano Jobs need to run as a whole and cannot be partially run.


Maybe we can add that only voclano jobs have this limitation, and for normal pod/Deployment/Statefulset, etc, it's still a pod level gated feature.

Monokaix · 2024-07-11T01:52:29Z

Antoher part that we should concern is the observability, when a pod is scheduling gated, what's the behavior of a pod and its pg? should we reports some events to let users know what happen, maybe we can also refer to kube-scheduler to determine whethet we should report events and what's the frequency of the event reported.

ykcai-daniel · 2024-08-15T02:48:30Z

Proposal will be updated based on the new design, main changes include:

no longer add new state of PodGroup. Instead, update the plugins for correct inqueue resource computation
In allocation and preempt actions, skip scheduling gated tasks instead of the entire job

ykcai-daniel · 2024-08-16T03:07:44Z

@Monokaix design doc updated according to latest implementation #3658

ykcai-daniel · 2024-08-17T17:16:16Z

/assign @wpeng102

Monokaix · 2024-09-03T02:28:38Z

docs/design/pod-scheduling-readiness.md

+4. **Controllers**: K8S native resources like Deployment support template level removal of scheduling gates. In other words, if a Deployment has scheduling gates in its pod template, by patching the Deployment and remove the scheduling gates, all its pods will be deleted and recreated without gates. However, for Vcjob, currently the job controller cannot detect changes in PodTemplate and cannot support this feature. Despite this, it is uncommon to remove scheduling gates from PodTemplate and the Pod Scheduling Gates feature is usually used at pod level. What's more, scheduling gates are often added by webhooks instead of in the Job template (more details in K8S KEP). Therefore, we choose to not align this behavior with K8S.
+
+## Limitations
+1. **Vcjob support removing scheduling gates in template**: As mentioned above, currently, if we remove the scheduling gates field in a Vcjob, the gates of its pods are not removed. This behavior of Vcjob is different from native K8S resources.


These are limitations not our expectations, please modify them: )

Signed-off-by: ykcai-daniel <1155141377@link.cuhk.edu.hk>

Monokaix · 2024-09-06T03:07:07Z

/lgtm

volcano-sh-bot · 2024-09-06T03:34:18Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [william-wang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

volcano-sh-bot added the retest-not-required-docs-only label Jul 10, 2024

volcano-sh-bot requested review from archlitchi and lowang-bh July 10, 2024 09:48

volcano-sh-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jul 10, 2024

Monokaix reviewed Jul 11, 2024

View reviewed changes

ykcai-daniel force-pushed the sch-gates-proposal branch from 7707911 to 0539517 Compare July 11, 2024 02:22

This was referenced Jul 12, 2024

Add PodGroupSchGated State ykcai-daniel/apis#1

Closed

Add new state SchGated volcano-sh/apis#131

Closed

volcano-sh-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 15, 2024

ykcai-daniel force-pushed the sch-gates-proposal branch from 10b66e1 to 252aa26 Compare July 19, 2024 02:46

ykcai-daniel requested a review from Monokaix July 19, 2024 02:47

This was referenced Jul 19, 2024

Add Pod Scheduling Readiness in Volcano Scheduler #3611

Closed

Pod Scheduling Readiness Draft #3612

Closed

ykcai-daniel mentioned this pull request Aug 15, 2024

Pod Scheduling Readiness #3658

Merged

volcano-sh-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 16, 2024

volcano-sh-bot assigned wpeng102 Aug 17, 2024

Monokaix reviewed Sep 3, 2024

View reviewed changes

ykcai-daniel force-pushed the sch-gates-proposal branch from f9a4bc7 to ba86b30 Compare September 6, 2024 02:47

volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed retest-not-required-docs-only size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 6, 2024

sch gates proposal

307e5f0

Signed-off-by: ykcai-daniel <1155141377@link.cuhk.edu.hk>

ykcai-daniel force-pushed the sch-gates-proposal branch from ba86b30 to 307e5f0 Compare September 6, 2024 02:52

volcano-sh-bot added retest-not-required-docs-only size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 6, 2024

volcano-sh-bot assigned Monokaix Sep 6, 2024

volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 6, 2024

william-wang approved these changes Sep 6, 2024

View reviewed changes

volcano-sh-bot assigned william-wang Sep 6, 2024

volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 6, 2024

volcano-sh-bot merged commit 9ea246b into volcano-sh:master Sep 6, 2024
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal for Support of Pod Scheduling Readiness #3581

Proposal for Support of Pod Scheduling Readiness #3581

ykcai-daniel commented Jul 10, 2024

volcano-sh-bot commented Jul 10, 2024

ykcai-daniel commented Jul 10, 2024

Monokaix commented Jul 11, 2024 •

edited

Loading

Monokaix Jul 11, 2024

Monokaix Jul 11, 2024

Monokaix Jul 11, 2024

ykcai-daniel Jul 11, 2024

Monokaix Jul 11, 2024

Monokaix Jul 11, 2024

Monokaix commented Jul 11, 2024

ykcai-daniel commented Aug 15, 2024

ykcai-daniel commented Aug 16, 2024

ykcai-daniel commented Aug 17, 2024

Monokaix Sep 3, 2024

Monokaix commented Sep 6, 2024

volcano-sh-bot commented Sep 6, 2024


		Pod Scheduling Readiness is a beta feature in Kubernetes v1.27. Users expect Volcano to be aware of it. By specifying/removing a Pod's `.spec.schedulingGates`, which is an array of strings, users can control when a Pod is ready to be considered for scheduling. For Pods with none-empty `schedulingGates`, it will only be removed by kube-scheduler once all the gates are removed.

		The following example illustrates why support for Scheduling Readiness is needed in Volcano. Suppose we have implemented an external quota manager responsible for reviewing all incoming pod requests for capacity/quota requirements. Only once these requests receive approval from the quota manager are they considered eligible for scheduling. The pods schedulingGates feature can be handy when implementing this funtionality

Proposal for Support of Pod Scheduling Readiness #3581

Proposal for Support of Pod Scheduling Readiness #3581

Conversation

ykcai-daniel commented Jul 10, 2024

volcano-sh-bot commented Jul 10, 2024

ykcai-daniel commented Jul 10, 2024

Monokaix commented Jul 11, 2024 • edited Loading

Monokaix Jul 11, 2024

Choose a reason for hiding this comment

Monokaix Jul 11, 2024

Choose a reason for hiding this comment

Monokaix Jul 11, 2024

Choose a reason for hiding this comment

ykcai-daniel Jul 11, 2024

Choose a reason for hiding this comment

Monokaix Jul 11, 2024

Choose a reason for hiding this comment

Monokaix Jul 11, 2024

Choose a reason for hiding this comment

Monokaix commented Jul 11, 2024

ykcai-daniel commented Aug 15, 2024

ykcai-daniel commented Aug 16, 2024

ykcai-daniel commented Aug 17, 2024

Monokaix Sep 3, 2024

Choose a reason for hiding this comment

Monokaix commented Sep 6, 2024

volcano-sh-bot commented Sep 6, 2024

Monokaix commented Jul 11, 2024 •

edited

Loading