Proposal for Support of Pod Scheduling Readiness #3581
Conversation
Welcome @ykcai-daniel!

@Monokaix Please have a look first. Will fix the CI problems soon.

You can sign off your commit with
# Pod Scheduling Readiness

## Motivation

Pod Scheduling Readiness is a beta feature in Kubernetes v1.27. Users expect Volcano to be aware of it. By specifying or removing a Pod's `.spec.schedulingGates`, which is an array of strings, users can control when a Pod is ready to be considered for scheduling. A Pod with non-empty `schedulingGates` will only be scheduled by kube-scheduler once all the gates are removed.
> it will only be removed by kube-scheduler once all the gates are removed.

Actually, the `schedulingGates` field is usually removed by external controllers, and kube-scheduler just checks this field to determine whether to schedule.
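To make this division of labor concrete, here is a minimal, hypothetical sketch (not Volcano or Kubernetes code; the gate name `example.com/quota-approved` and all types are made up): the scheduler only *reads* `schedulingGates`, while an external controller is what actually removes entries.

```python
# Hypothetical sketch of the schedulingGates division of labor:
# an external controller removes gate entries; the scheduler only
# reads the field to decide whether a Pod may be considered.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Pod:
    name: str
    scheduling_gates: List[str] = field(default_factory=list)

def is_ready_for_scheduling(pod: Pod) -> bool:
    """What kube-scheduler effectively checks: the field must be empty."""
    return len(pod.scheduling_gates) == 0

def remove_gate(pod: Pod, gate: str) -> None:
    """What an external controller does once its condition is satisfied."""
    if gate in pod.scheduling_gates:
        pod.scheduling_gates.remove(gate)

pod = Pod("worker-0", scheduling_gates=["example.com/quota-approved"])
assert not is_ready_for_scheduling(pod)   # gated: scheduler skips it
remove_gate(pod, "example.com/quota-approved")
assert is_ready_for_scheduling(pod)       # all gates removed: schedulable
```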
The following example illustrates why support for Pod Scheduling Readiness is needed in Volcano. Suppose we have implemented an external quota manager responsible for reviewing all incoming pod requests for capacity/quota requirements. Only once these requests receive approval from the quota manager are they considered eligible for scheduling. The Pod `schedulingGates` feature can be handy when implementing this functionality.
Please add a full stop to this passage.
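The quota-manager flow described above can be sketched as follows. This is a toy model under stated assumptions: the gate name, request shape, and units are all illustrative, and real quota managers would patch Pods via the API server.

```python
# Toy sketch of the external quota manager described above: pods arrive
# gated, and only approved requests have their gate removed, making them
# eligible for scheduling. All names and units are illustrative.
from dataclasses import dataclass, field

QUOTA_GATE = "example.com/quota-approved"  # made-up gate name

@dataclass
class PodRequest:
    name: str
    cpu: int                                 # requested CPUs (illustrative)
    gates: list = field(default_factory=lambda: [QUOTA_GATE])

def review(requests, quota):
    """Approve requests in order while quota lasts; approval removes the gate."""
    for req in requests:
        if req.cpu <= quota:
            quota -= req.cpu
            req.gates.remove(QUOTA_GATE)     # approved: pod becomes schedulable
    return quota

reqs = [PodRequest("a", 4), PodRequest("b", 8), PodRequest("c", 2)]
remaining = review(reqs, quota=10)
schedulable = [r.name for r in reqs if not r.gates]
# "a" (4 CPUs) and "c" (2 CPUs) fit within quota 10; "b" (8 CPUs) stays gated.
```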
```
2. **Plugins**: Scheduler plugins such as proportion register functions to determine whether a PodGroup is enqueueable or allocatable. These functions calculate the resources already used in the cluster based on the state of each PodGroup. For example, the proportion plugin determines that a job is enqueueable when `sum(Inqueue)+sum(Running)+cur_job<TotalResource+elastic`. Since scheduling gated jobs do not occupy resources in the cluster, keeping them in `SchGated` rather than `Inqueue` means they are not counted when calculating total used resources, which reflects the actual situation.
3. **Actions**: Transitions to and from `SchGated` go through `Inqueue`, and happen first in the allocate action. Other actions then skip the gated pods.
4. **Controllers**: One alternative design of the state transition is to transition directly from `Pending` to `SchGated`. However, this is not possible because controllers only create Pods for a job once it is inqueued (see [delayed-pod-creation](./delay-pod-creation.md)). We can only know that a Pod is scheduling gated in the `Inqueue` state. Therefore, we need to transition to `SchGated` from `Inqueue`.
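The enqueue check in point 2 can be sketched as a simplified scalar model (not the actual proportion plugin code); the key property is that jobs parked in `SchGated` contribute nothing to the sums.

```python
# Simplified model of the enqueue check described above (not the actual
# proportion plugin): a job is enqueueable when
#   sum(Inqueue) + sum(Running) + cur_job < TotalResource + elastic.
# Jobs in SchGated are deliberately excluded from the sums.

def is_enqueueable(inqueue, running, cur_job, total, elastic=0):
    """All quantities are scalar resource amounts for simplicity."""
    return sum(inqueue) + sum(running) + cur_job < total + elastic

# 10 CPUs accounted for (4 inqueue + 6 running), 16 total, no elasticity:
# a 6-CPU job must wait, a 5-CPU job fits. Gated jobs do not appear in
# either list, no matter how much they would request.
assert not is_enqueueable(inqueue=[4], running=[6], cur_job=6, total=16)
assert is_enqueueable(inqueue=[4], running=[6], cur_job=5, total=16)
```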
> Therefore, we need to transition to SchGated from Inqueue.

We should clarify that only the first time the PodGroup is created do we transition it from `Pending` to `Inqueue`, and once the pods of the current PodGroup are created, we transition it from `Inqueue` to `SchGated`.
Maybe we should also emphasize the atomicity of Pod creation: once the state of a PodGroup has transitioned from `Pending` to `Inqueue`, I think all Pods of that PodGroup will be created. In other words, the scheduler will never see a Job with partially created Pods. If this is not the case, the following situation might occur:

Consider a Job with four pods: p1, p2, p3, p4.

Job enqueued -> p1, p2 created with gates -> p1, p2 gates removed -> Job allocated to node -> p3, p4 created with gates.

We would end up with a Running Job that still has scheduling gates.
good catch!
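The transition rules discussed in this thread can be summarized as a small state-machine sketch (hypothetical code, not the Volcano controller): `Pending -> Inqueue` happens only once, all pods of the group are created atomically on that transition, and `Inqueue <-> SchGated` is then driven by whether any created pod is still gated.

```python
# Hypothetical sketch of the PodGroup state transitions discussed above:
# Pending -> Inqueue (first time only, triggering atomic pod creation),
# then Inqueue -> SchGated while any created pod is gated, and back to
# Inqueue once all gates are removed. Not actual Volcano code.

class PodGroup:
    def __init__(self, pod_gates):
        self.state = "Pending"
        self.pod_gates = pod_gates   # one list of gate names per pod template
        self.pods = []               # created pods, each a list of gates

    def enqueue(self):
        # Pending -> Inqueue happens only once; all pods of the group are
        # created together here, so the scheduler never sees a partial job.
        if self.state == "Pending":
            self.state = "Inqueue"
            self.pods = [list(gates) for gates in self.pod_gates]

    def sync(self):
        # Inqueue <-> SchGated, driven purely by the created pods' gates.
        if not self.pods:
            return
        gated = any(len(gates) > 0 for gates in self.pods)
        if gated and self.state == "Inqueue":
            self.state = "SchGated"
        elif not gated and self.state == "SchGated":
            self.state = "Inqueue"

pg = PodGroup(pod_gates=[["example.com/gate"], ["example.com/gate"]])
pg.enqueue()
pg.sync()
assert pg.state == "SchGated"   # pods were created gated
for pod in pg.pods:
    pod.clear()                 # an external controller removes the gates
pg.sync()
assert pg.state == "Inqueue"    # eligible for allocation again
```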
## Granularity of Pod Scheduling Readiness

Pod scheduling readiness is a field in the spec of a Pod. However, Volcano schedules Jobs, each of which consists of many tasks, with each task corresponding to a Pod. It is possible that some of these Pods are scheduling gated while others are not. To align the granularity of Pod Scheduling Readiness with that of a Job, we treat scheduling gates as a property of the Job: as long as there is at least one gated Pod, the Job is scheduling gated. This is consistent with Volcano's workloads: most Volcano Jobs need to run as a whole and cannot run partially.
Maybe we can add that only Volcano Jobs have this limitation; for a normal Pod/Deployment/StatefulSet, etc., it is still a pod-level gating feature.

Another part we should consider is observability: when a Pod is scheduling gated, what is the behavior of the Pod and its PodGroup? Should we report some events to let users know what happened? Maybe we can also refer to kube-scheduler to determine whether we should report events and what the frequency of the reported events should be.
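The job-level rule above ("one gated pod gates the job") reduces to an `any` over the job's pods; a minimal sketch with illustrative names only:

```python
# Minimal sketch of the job-level gating rule described above: a Volcano
# Job is considered scheduling gated as long as at least one of its pods
# still carries a scheduling gate. Illustrative code only.

def job_is_gated(pod_gate_lists):
    """pod_gate_lists: one list of gate names per pod of the job."""
    return any(len(gates) > 0 for gates in pod_gate_lists)

assert job_is_gated([[], ["example.com/wait"], []])  # one gated pod gates the job
assert not job_is_gated([[], [], []])                # all gates removed
```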
The proposal will be updated based on the new design; main changes include:
/assign @wpeng102
4. **Controllers**: K8S native resources like Deployment support template-level removal of scheduling gates. In other words, if a Deployment has scheduling gates in its pod template, patching the Deployment to remove the scheduling gates causes all of its pods to be deleted and recreated without the gates. However, for Vcjob, the job controller currently cannot detect changes in the PodTemplate and so cannot support this feature. Despite this, it is uncommon to remove scheduling gates from the PodTemplate, and the Pod Scheduling Gates feature is usually used at the pod level. What's more, scheduling gates are often added by webhooks instead of in the Job template (more details in the K8S KEP). Therefore, we choose not to align this behavior with K8S.
## Limitations

1. **Removing scheduling gates from the Vcjob template is not supported**: As mentioned above, if we remove the schedulingGates field from a Vcjob template, the gates of its pods are not removed. This behavior of Vcjob differs from native K8S resources.
These are limitations, not our expectations; please modify them :)
Signed-off-by: ykcai-daniel <1155141377@link.cuhk.edu.hk>
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: william-wang. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
Proposal to add support for Pod Scheduling Readiness as mentioned in #3555