diff --git a/docs/design/delay-pod-creation.md b/docs/design/delay-pod-creation.md
new file mode 100644
index 0000000000..89e6470bef
--- /dev/null
+++ b/docs/design/delay-pod-creation.md
@@ -0,0 +1,116 @@
+# Delay Pod Creation
+
+@k82cn; Jan 7, 2019
+
+## Table of Contents
+
+   * [Delay Pod Creation](#delay-pod-creation)
+      * [Table of Contents](#table-of-contents)
+      * [Motivation](#motivation)
+      * [Function Detail](#function-detail)
+         * [State](#state)
+         * [Action](#action)
+         * [Admission Webhook](#admission-webhook)
+      * [Feature interaction](#feature-interaction)
+         * [Queue](#queue)
+         * [Quota](#quota)
+         * [Operator/Controller](#operatorcontroller)
+      * [Others](#others)
+         * [Compatibility](#compatibility)
+      * [Roadmap](#roadmap)
+      * [Reference](#reference)
+
+Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)
+
+## Motivation
+
+In a batch system, there are usually many pending jobs because resources and throughput are limited.
+Unlike other Kubernetes workload types, e.g. Deployment and DaemonSet, it is better for batch workloads
+to delay pod creation, which reduces pressure on the apiserver and speeds up scheduling (e.g. fewer
+pending pods to consider). This document introduces several enhancements that delay pod creation.
+
+## Function Detail
+
+### State
+
+A new state, named `InQueue`, is introduced to denote the phase in which a job is ready to have
+resources allocated. With `InQueue` added, the state transition map is updated as follows.
+
+| From          | To             | Reason  |
+|---------------|----------------|---------|
+| Pending       | InQueue        | When the scheduler is ready to allocate resources to the job |
+| InQueue       | Pending        | When there are no longer enough resources |
+| InQueue       | Running        | When all `spec.minMember` pods are running |
+
+`InQueue` is a new state between `Pending` and `Running`; it signals operators/controllers to start
+creating pods. If an error occurs, e.g. the pods are unschedulable, the job rolls back to `Pending`
+instead of staying `InQueue`, to avoid a retry loop.
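
The transition table above can be encoded as a small validation map. The sketch below is illustrative only; the type and function names are hypothetical and not taken from the kube-batch codebase:

```go
package main

import "fmt"

// PodGroupPhase mirrors the phases discussed above.
type PodGroupPhase string

const (
	Pending PodGroupPhase = "Pending"
	InQueue PodGroupPhase = "InQueue"
	Running PodGroupPhase = "Running"
)

// validTransitions encodes the table: InQueue sits between Pending and Running,
// and can roll back to Pending when resources run out.
var validTransitions = map[PodGroupPhase][]PodGroupPhase{
	Pending: {InQueue},
	InQueue: {Pending, Running},
}

// canTransition reports whether the table allows moving from one phase to another.
func canTransition(from, to PodGroupPhase) bool {
	for _, t := range validTransitions[from] {
		if t == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(Pending, InQueue)) // allowed
	fmt.Println(canTransition(Pending, Running)) // must pass through InQueue first
}
```

Note that `Pending -> Running` is deliberately absent: a job must be admitted via `InQueue` before its pods run.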
+
+### Action
+
+Currently, `kube-batch` supports several actions, e.g. `allocate` and `preempt`, but all of those
+actions are executed based on pending pods. To support the `InQueue` state, a new action, named
+`enqueue`, is introduced.
+
+By default, the `enqueue` action handles `PodGroup`s with an FCFS policy; `enqueue` goes through all
+PodGroups (ordered by creation timestamp) and updates a PodGroup's phase to `InQueue` if:
+
+* there are enough idle resources for the `spec.minResources` of the `PodGroup`
+* there is enough quota for the `spec.minResources` of the `PodGroup`
+
+As `kube-batch` admits a `PodGroup` based on its `spec.minResources`, the operator/controller may
+create more `Pod`s than `spec.minResources`; in that case, the `preempt` action will be enhanced to
+evict overused `PodGroup`s to release resources.
+
+### Admission Webhook
+
+To guarantee the transactionality of `spec.minResources`, a new `MutatingAdmissionWebhook`, named
+`PodGroupMinResources`, is introduced. `PodGroupMinResources` makes sure that:
+
+* the sum of all PodGroups' `spec.minResources` in a namespace does not exceed the `Quota`
+* resources reserved by `spec.minResources` cannot be used by others
+
+Generally, it is better to let the total `Quota` exceed the available resources in the cluster, as
+some pods may be unschedulable because of the scheduler's algorithms, e.g. predicates.
+
+## Feature interaction
+
+### Queue
+
+Resources are shared between `Queue`s by an algorithm, e.g. proportional sharing by default. If
+resources cannot be fully used because of fragmentation, the `backfill` action helps with that; if a
+`Queue` uses more resources than it deserves, the `reclaim` action helps rebalance them. Currently a
+pod cannot be evicted if the eviction would break `spec.minMember`; this will be enhanced with
+job-level eviction.
+
+### Quota
+
+To delay pod creation, both `kube-batch` and `PodGroupMinResources` watch `ResourceQuota` to decide
+which `PodGroup` should be enqueued first.
The decision may be invalid because of race conditions, e.g. other
+controllers creating Pods. In that case, `PodGroupMinResources` rejects the pod creation and keeps the
+`InQueue` state until `kube-batch` transitions the `PodGroup` back to `Pending`. To avoid the race
+condition, it is better to let `kube-batch`, rather than `Quota`, manage the number of `Pod`s and their
+resources (e.g. CPU, memory).
+
+### Operator/Controller
+
+The operator/controller should follow the above "protocol" to work together with the scheduler. A new
+component, named `PodGroupController`, will be introduced later to enforce this protocol if necessary.
+
+## Others
+
+### Compatibility
+
+To support this new feature, a new state and a new action are introduced; when the new `enqueue`
+action is disabled in the configuration, `kube-batch` keeps the same behaviour as before.
+
+## Roadmap
+
+* `InQueue` phase and `enqueue` action (v0.5+)
+* Admission Controller (v0.6+)
+
+## Reference
+
+* [Coscheduling](https://github.com/kubernetes/enhancements/pull/639)
+* [Delay Pod creation](https://github.com/kubernetes-sigs/kube-batch/issues/539)
+* [PodGroup Status](https://github.com/kubernetes-sigs/kube-batch/blob/master/doc/design/podgroup-status.md)
+* [Support 'spec.TotalResources' in PodGroup](https://github.com/kubernetes-sigs/kube-batch/issues/401)
+* [Dynamic Admission Control](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#write-an-admission-webhook-server)
+* [Add support for podGroup number limits for one queue](https://github.com/kubernetes-sigs/kube-batch/issues/452)
diff --git a/docs/design/drf.md b/docs/design/drf.md
new file mode 100644
index 0000000000..010d0a61c8
--- /dev/null
+++ b/docs/design/drf.md
@@ -0,0 +1,33 @@
+## Dominant Resource Fairness (DRF)
+
+## Introduction
+Dominant Resource Fairness (DRF) is a resource allocation policy that generalizes max-min fairness to multiple resource types.
+
+Dominant resource - the resource type (cpu, memory, gpu) that a given job demands most among the resources
+it needs, measured as a share of the total cluster capacity of that type.
+
+DRF computes the share of the dominant resource allocated to a job (the dominant share) and tries to
+maximize the smallest dominant share in the system: it schedules the next task from the job with the
+smallest dominant share.
+
+
+## Kube-Batch Implementation
+DRF calculates a share for each job. The share is the highest of the (allocated resource / total resource)
+ratios across the three resource types CPU, memory and GPU.
+This share value is used for job ordering and task preemption.
+
+#### 1. Job Ordering:
+   The job with the lowest share has the highest priority.
+   In the example below, tasks task1 and task2 of job1 and tasks task3 and task4 of job2 are already allocated in the cluster.
+   ![drfjobordering](./images/drfjobordering.png)
+
+
+   ##### 1.1 Gang Scheduling with DRF in job ordering ( Gang -> DRF)
+   Gang scheduling sorts jobs by whether a job already has at least **minAvailable** tasks
+   (allocated + successfully completed + pipelined) or not.
+   Jobs that have not met the minAvailable criteria have higher priority than jobs that have met it.
+
+   Jobs that have met the minAvailable criteria are then sorted according to DRF.
+
+   ![gangwithdrf](./images/gangwithdrf.png)
+
+#### 2. Task Preemption:
+
+The preemptor can preempt another task only if, after recalculating the resource allocation of the
+preemptor and the preemptee, the preemptor's share is less than the preemptee's share.
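
The share computation described above can be sketched as follows. This is an illustrative example, not code from kube-batch; the `Resource` type and function names are hypothetical:

```go
package main

import "fmt"

// Resource holds allocations or capacities for the three resource types
// considered by the share computation.
type Resource struct{ CPU, Memory, GPU float64 }

// dominantShare returns the largest ratio of allocated to total cluster
// resource across CPU, memory and GPU — the job's dominant share.
func dominantShare(allocated, total Resource) float64 {
	share := 0.0
	for _, r := range [][2]float64{
		{allocated.CPU, total.CPU},
		{allocated.Memory, total.Memory},
		{allocated.GPU, total.GPU},
	} {
		if r[1] > 0 && r[0]/r[1] > share {
			share = r[0] / r[1]
		}
	}
	return share
}

func main() {
	total := Resource{CPU: 10, Memory: 100, GPU: 4}
	job1 := Resource{CPU: 2, Memory: 10} // dominant resource: CPU, share 0.2
	job2 := Resource{CPU: 1, Memory: 40} // dominant resource: memory, share 0.4
	// job1 has the smaller dominant share, so it is ordered first.
	fmt.Println(dominantShare(job1, total) < dominantShare(job2, total)) // true
}
```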
diff --git a/docs/design/execution-flow.md b/docs/design/execution-flow.md
new file mode 100644
index 0000000000..8cb4331581
--- /dev/null
+++ b/docs/design/execution-flow.md
@@ -0,0 +1,48 @@
+## Execution of the Scheduler for Allocating the Workloads to Nodes
+
+The allocation of workloads to nodes happens in each scheduler session; the workflow of a session is illustrated in the diagram below.
+
+1. A session opens every 1 second.
+2. In every session, local copies of the Queues, JobsMap, PendingTasks and node list are created.
+3. For each job in the session:
+    1. If the job's queue ID exists in the session's queues:
+        1. Add the queue to the local copy of Queues.
+        2. If the queue ID exists in the local copy of JobsMap, push the job into it;
+           if not, add the queue ID as a key to the local JobsMap and add the job under it.
+    2. If not, emit a warning and continue with the next job in step 3.
+4. For each item in the local Queues:
+    1. Pop a queue from Queues.
+        1. If the queue is overused, continue with the next queue in step 4.
+        2. If not, get the job list from the JobsMap for that queue.
+            1. If the list is empty, continue with the next queue in step 4.
+            2. Otherwise, pop a job from the job list.
+                1. If the job does not exist in the local PendingTasks:
+                    1. Create a local task list: collect every task of the job that is in
+                       pending state, skip tasks whose resource request is empty, and add
+                       the rest to the local task list.
+                    2. Add the local task list to PendingTasks for that job.
+                2. If it does exist:
+                    1. For each task in the job's pending task list:
+                        1. Pop the task.
+                        2. Get the list of nodes that pass the predicates for the task.
+                        3. Score the predicate nodes and sort them.
+                        4. For each node in the sorted, predicated and scored list:
+                            1. If the resource required by the task is less than the idle resource of the node, allocate the task to the node.
+                            2. Otherwise, if the resource required by the task is less than the releasing resource of the node, add the task to the pipeline.
+                3. Check if the job is ready to be allocated:
+                    1. If yes, push the job.
+                    2. If not, add the queue back to the list.
+                    3. Continue until every job is processed.
+    2. Continue until every queue is processed.
+
+![Execution flow graph](../../images/AllocateDesign.png)
diff --git a/docs/design/drf - fairshare.md b/docs/design/fairshare.md
similarity index 100%
rename from docs/design/drf - fairshare.md
rename to docs/design/fairshare.md
diff --git a/docs/design/images/drfjobordering.png b/docs/design/images/drfjobordering.png
new file mode 100644
index 0000000000..24340cb7d0
Binary files /dev/null and b/docs/design/images/drfjobordering.png differ
diff --git a/docs/design/images/gangwithdrf.png b/docs/design/images/gangwithdrf.png
new file mode 100644
index 0000000000..9c33a51f23
Binary files /dev/null and b/docs/design/images/gangwithdrf.png differ
diff --git a/docs/design/metrics.md b/docs/design/metrics.md
new file mode 100644
index 0000000000..ba2e44af28
--- /dev/null
+++ b/docs/design/metrics.md
@@ -0,0 +1,39 @@
+## Scheduler Monitoring
+
+## Introduction
+Currently users can leverage controller logs and job events to monitor the scheduler. While useful for debugging, none of these options is particularly practical for monitoring kube-batch behaviour over time. There is also a requirement to monitor kube-batch in one view so that critical performance issues can be resolved in time [#427](https://github.com/kubernetes-sigs/kube-batch/issues/427).
+
+This document describes the metrics we want to add to kube-batch to better monitor performance.
+
+## Metrics
+To support metrics, kube-batch needs to expose a metrics endpoint which provides golang process metrics such as the number of goroutines, GC duration, and CPU and memory usage, as well as kube-batch custom metrics for the time taken by plugins and actions.
+
+All the metrics are prefixed with `kube_batch_`.
+
+### kube-batch execution
+These metrics track the execution of plugins and actions in the kube-batch scheduling loop.
+
+| Metric name | Metric type | Labels | Description |
+| ----------- | ----------- | ------ | ----------- |
+| e2e_scheduling_latency | histogram | | End-to-end scheduling latency in seconds |
+| plugin_latency | histogram | `plugin`=<plugin_name> | Scheduling latency for a plugin |
+| action_latency | histogram | `action`=<action_name> | Scheduling latency for an action |
+| task_latency | histogram | `job`=<job_id> `task`=<task_id> | Scheduling latency for each task |
+
+
+### kube-batch operations
+These metrics describe the internal state of kube-batch.
+
+| Metric name | Metric type | Labels | Description |
+| ----------- | ----------- | ------ | ----------- |
+| pod_schedule_errors | Counter | | The number of pods kube-batch failed to schedule due to an error |
+| pod_schedule_successes | Counter | | The number of pods kube-batch scheduled successfully |
+| pod_preemption_victims | Counter | | The number of selected preemption victims |
+| total_preemption_attempts | Counter | | Total preemption attempts in the cluster so far |
+| unschedule_task_count | Counter | `job`=<job_id> | The number of tasks that failed to schedule |
+| unschedule_job_counts | Counter | | The number of jobs that failed to schedule in each iteration |
+| job_retry_counts | Counter | `job`=<job_id> | The number of retries for a job |
+
+
+### kube-batch Liveness
+Expose the time of the last kube-batch activity for health checks, together with a timeout.
diff --git a/docs/design/node-priority.md b/docs/design/node-priority.md
new file mode 100644
index 0000000000..8b35dfe635
--- /dev/null
+++ b/docs/design/node-priority.md
@@ -0,0 +1,18 @@
+## Node Priority in Kube-Batch
+
+This feature allows `kube-batch` to schedule workloads based on the priority of nodes: workloads are scheduled on nodes with higher priority, and these priorities are calculated from different parameters such as `ImageLocality`, `Most/Least Requested Nodes`, etc.
+A basic flow for the node priority functions is depicted below.
+
+![Node Priority Flow](../images/Node-Priority.png)
+
+Currently in kube-batch a `Session` is opened every 1 second, and the workloads in the queue go through `Predicate` to find a suitable set of nodes where they can be scheduled; after that they go through the `Allocate` function, which assigns the pods to nodes, and then through `Preempt` if applicable.
+
+Node priority can be introduced into the current flow for the `Allocate` and `Preempt` functions. Once we have the set of nodes where the workloads can be scheduled, the flow goes through a `Prioritize` function which does the following:
+
+ - Run all the priority functions, in parallel go-routines, on the list of nodes returned by the `Predicate` function.
+ - Score each node based on whether the `Priority Rule` satisfies the workload's scheduling criteria.
+ - Once the scores are returned from all the `PriorityFn`s, aggregate the scores and identify the node with the highest score.
+ - Delegate the node selected in the last step to `AllocateFn` to bind the workload to the node.
+
+Currently there are multiple `PriorityFn`s available in the default Kubernetes scheduler. Going forward, with each release we will implement these priority functions in kube-batch in order of their importance to batch scheduling.
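
The Prioritize steps above can be sketched as follows. This is illustrative only: `PriorityFn` and `bestNode` are hypothetical names, and the real implementation runs the priority functions in parallel go-routines rather than sequentially as here:

```go
package main

import "fmt"

// PriorityFn scores a single node for the workload being placed.
type PriorityFn func(node string) float64

// bestNode aggregates the scores of all priority functions per node and
// returns the highest-scoring node, which would then be handed to AllocateFn.
func bestNode(nodes []string, fns []PriorityFn) string {
	best, bestScore := "", -1.0
	for _, n := range nodes {
		score := 0.0
		for _, fn := range fns {
			score += fn(n)
		}
		if score > bestScore {
			best, bestScore = n, score
		}
	}
	return best
}

func main() {
	// Toy stand-ins for priority functions such as ImageLocality.
	imageLocality := func(n string) float64 {
		if n == "node-b" { // pretend node-b already has the image
			return 10
		}
		return 0
	}
	leastRequested := func(n string) float64 { return 5 } // equal for both nodes
	fmt.Println(bestNode([]string{"node-a", "node-b"},
		[]PriorityFn{imageLocality, leastRequested})) // node-b
}
```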
+
diff --git a/docs/design/plugin-conf.md b/docs/design/plugin-conf.md
new file mode 100644
index 0000000000..c0b0d3c8f6
--- /dev/null
+++ b/docs/design/plugin-conf.md
@@ -0,0 +1,79 @@
+# Dynamic Plugins Configuration
+
+## Table of Contents
+
+   * [Dynamic Plugins Configuration](#dynamic-plugins-configuration)
+      * [Table of Contents](#table-of-contents)
+      * [Motivation](#motivation)
+      * [Function Detail](#function-detail)
+      * [Feature Interaction](#feature-interaction)
+         * [ConfigMap](#configmap)
+      * [Reference](#reference)
+
+Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)
+
+## Motivation
+
+There are several plugins and actions in `kube-batch` right now, and users may want to enable only
+some of them. This document introduces dynamic plugins configuration, so users can configure
+`kube-batch` for their scenario on the fly.
+
+## Function Detail
+
+The following YAML format will be introduced for dynamic plugin configuration:
+
+```yaml
+actions: "list_of_action_in_order"
+tiers:
+- plugins:
+  - name: "plugin_1"
+    disableJobOrder: true
+  - name: "plugin_2"
+- plugins:
+  - name: "plugin_3"
+    disableJobOrder: true
+```
+
+`actions` is a comma-separated list of the actions that `kube-batch` will execute, in order. Refer to
+the [tutorial](https://github.com/kubernetes-sigs/kube-batch/issues/434) for the list of supported
+actions in `kube-batch`. The actions are executed in the given order even if that order is
+inappropriate; `kube-batch` does not enforce a sensible ordering.
+
+`tiers` is a list of plugin tiers that are consulted by the related actions, e.g. `allocate`. Each
+tier lists its plugins under `plugins`; if a candidate fits the plugins of a higher-priority tier,
+the action does not go through the plugins in lower-priority tiers. Within a tier, a candidate is
+considered passed only if it fits all the plugins in `plugins.names`.
+
+The `options` define the detailed behaviour of each plugin, e.g.
whether preemption is enabled. If not
+specified, the default value is `true`. For now, `preemptable`, `jobOrder` and `taskOrder` are supported.
+
+Take the following example as a demonstration:
+
+1. The actions `"reclaim, allocate, backfill, preempt"` will be executed in order by `kube-batch`
+1. `"priority"` has a higher priority than `"gang, drf, predicates, proportion"`; a job with higher priority
+will preempt other jobs, even though it has already been allocated "enough" resources according to `"drf"`
+1. `"tiers.plugins.drf.disableTaskOrder"` is `true`, so `drf` will not impact the task ordering phase/action
+
+```yaml
+actions: "reclaim, allocate, backfill, preempt"
+tiers:
+- plugins:
+  - name: "priority"
+  - name: "gang"
+- plugins:
+  - name: "drf"
+    disableTaskOrder: true
+  - name: "predicates"
+  - name: "proportion"
+```
+
+## Feature Interaction
+
+### ConfigMap
+
+`kube-batch` reads the plugin configuration from the command line argument `--scheduler-conf`; users can
+mount a `ConfigMap` as a volume of the `kube-batch` pod during deployment to provide this file.
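
As a sketch of the `ConfigMap` approach, such a ConfigMap might look like the following; the ConfigMap name, namespace, key and mount path are illustrative, not mandated by kube-batch:

```yaml
# Illustrative ConfigMap carrying the scheduler configuration; mount it into
# the kube-batch pod as a volume and point --scheduler-conf at the mounted file.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-batch-conf    # hypothetical name
  namespace: kube-system
data:
  kube-batch.conf: |
    actions: "allocate, backfill"
    tiers:
    - plugins:
      - name: "priority"
      - name: "gang"
```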
+
+## Reference
+
+* [Add preemption by Job priority](https://github.com/kubernetes-sigs/kube-batch/issues/261)
+* [Support multiple tiers for Plugins](https://github.com/kubernetes-sigs/kube-batch/issues/484)
diff --git a/docs/design/podgroup-status.md b/docs/design/podgroup-status.md
new file mode 100644
index 0000000000..595c603d63
--- /dev/null
+++ b/docs/design/podgroup-status.md
@@ -0,0 +1,144 @@
+# PodGroup Status Enhancement
+
+@k82cn; Jan 2, 2019
+
+## Table of Contents
+
+* [Table of Contents](#table-of-contents)
+* [Motivation](#motivation)
+* [Function Detail](#function-detail)
+* [Feature Interaction](#feature-interaction)
+  * [Cluster AutoScale](#cluster-autoscale)
+  * [Operators/Controllers](#operatorscontrollers)
+* [Reference](#reference)
+
+## Motivation
+
+In the [Coscheduling v1alpha1](https://github.com/kubernetes/enhancements/pull/639) design, `PodGroup`'s status
+only includes counters of related pods, which is not enough for `PodGroup` lifecycle management. This design
+doc introduces more information about the PodGroup's status, e.g. `PodGroupPhase`, for lifecycle management.
+
+## Function Detail
+
+To include more information about the PodGroup's current status/phase, the following types are introduced:
+
+```go
+// PodGroupPhase is the phase of a pod group at the current time.
+type PodGroupPhase string
+
+// These are the valid phases of podGroups.
+const (
+	// PodGroupPending means the pod group has been accepted by the system, but the scheduler can not allocate
+	// enough resources to it.
+	PodGroupPending PodGroupPhase = "Pending"
+
+	// PodGroupRunning means the `spec.minMember` pods of the PodGroup are in running phase.
+	PodGroupRunning PodGroupPhase = "Running"
+
+	// PodGroupUnknown means part of the `spec.minMember` pods are running but the others can not
+	// be scheduled, e.g. not enough resource; the scheduler will wait for the related controller to recover it.
+ PodGroupUnknown PodGroupPhase = "Unknown" +) + +type PodGroupConditionType string + +const ( + PodGroupUnschedulableType PodGroupConditionType = "Unschedulable" +) + +// PodGroupCondition contains details for the current state of this pod group. +type PodGroupCondition struct { + // Type is the type of the condition + Type PodGroupConditionType `json:"type,omitempty" protobuf:"bytes,1,opt,name=type"` + + // Status is the status of the condition. + Status v1.ConditionStatus `json:"status,omitempty" protobuf:"bytes,2,opt,name=status"` + + // The ID of condition transition. + TransitionID string `json:"transitionID,omitempty" protobuf:"bytes,3,opt,name=transitionID"` + + // Last time the phase transitioned from another to current phase. + // +optional + LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty" protobuf:"bytes,4,opt,name=lastTransitionTime"` + + // Unique, one-word, CamelCase reason for the phase's last transition. + // +optional + Reason string `json:"reason,omitempty" protobuf:"bytes,5,opt,name=reason"` + + // Human-readable message indicating details about last transition. + // +optional + Message string `json:"message,omitempty" protobuf:"bytes,6,opt,name=message"` +} + +const ( + // PodFailedReason is probed if pod of PodGroup failed + PodFailedReason string = "PodFailed" + + // PodDeletedReason is probed if pod of PodGroup deleted + PodDeletedReason string = "PodDeleted" + + // NotEnoughResourcesReason is probed if there're not enough resources to schedule pods + NotEnoughResourcesReason string = "NotEnoughResources" + + // NotEnoughPodsReason is probed if there're not enough tasks compared to `spec.minMember` + NotEnoughPodsReason string = "NotEnoughTasks" +) + +// PodGroupStatus represents the current state of a pod group. +type PodGroupStatus struct { + // Current phase of PodGroup. + Phase PodGroupPhase `json:"phase,omitempty" protobuf:"bytes,1,opt,name=phase"` + + // The conditions of PodGroup. 
+	// +optional
+	Conditions []PodGroupCondition `json:"conditions,omitempty" protobuf:"bytes,2,opt,name=conditions"`
+
+	// The number of actively running pods.
+	// +optional
+	Running int32 `json:"running,omitempty" protobuf:"bytes,3,opt,name=running"`
+
+	// The number of pods which reached phase Succeeded.
+	// +optional
+	Succeeded int32 `json:"succeeded,omitempty" protobuf:"bytes,4,opt,name=succeeded"`
+
+	// The number of pods which reached phase Failed.
+	// +optional
+	Failed int32 `json:"failed,omitempty" protobuf:"bytes,5,opt,name=failed"`
+}
+
+```
+
+According to the PodGroup's lifecycle, the following phase/state transitions are reasonable, and the related
+reasons will be appended to the `Reason` field.
+
+| From    | To            | Reason  |
+|---------|---------------|---------|
+| Pending | Running       | When all `spec.minMember` pods are running |
+| Running | Unknown       | When some `spec.minMember` pods are restarted but can not be rescheduled |
+| Unknown | Pending       | When all pods (`spec.minMember`) of the PodGroup are deleted |
+
+## Feature Interaction
+
+### Cluster AutoScale
+
+[Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) is a tool that
+automatically adjusts the size of the Kubernetes cluster when one of the following conditions is true:
+
+* there are pods that failed to run in the cluster due to insufficient resources,
+* there are nodes in the cluster that have been underutilized for an extended period of time and their pods can be placed on other existing nodes.
+
+When Cluster Autoscaler scales out a new node, it leverages the scheduler's predicates to check whether the
+new node can be used for scheduling. But coscheduling is not implemented as a predicate for now, so it will
+not work well together with Cluster Autoscaler right now. An alternative solution will be proposed later.
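
The phase derivation implied by the transition table can be sketched as follows. This is an illustrative helper, not actual kube-batch code; `podGroupPhase` and its arguments are hypothetical:

```go
package main

import "fmt"

// podGroupPhase derives a PodGroup phase from pod counts, following the
// transition table above; minMember plays the role of spec.minMember.
func podGroupPhase(running, minMember int32, someUnschedulable bool) string {
	switch {
	case running >= minMember:
		// All of spec.minMember are running.
		return "Running"
	case running > 0 && someUnschedulable:
		// Part of spec.minMember is running, the rest can not be scheduled.
		return "Unknown"
	default:
		return "Pending"
	}
}

func main() {
	fmt.Println(podGroupPhase(3, 3, false)) // Running
	fmt.Println(podGroupPhase(1, 3, true))  // Unknown
	fmt.Println(podGroupPhase(0, 3, false)) // Pending
}
```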
+
+### Operators/Controllers
+
+The lifecycle of a `PodGroup` is managed by operators/controllers; the scheduler only probes the related
+state for the controllers. For example, if a `PodGroup` is `Unknown` for an MPI job, the controller needs
+to restart all pods in the `PodGroup`.
+
+## Reference
+
+* [Coscheduling](https://github.com/kubernetes/enhancements/pull/639)
+* [Add phase/conditions into PodGroup.Status](https://github.com/kubernetes-sigs/kube-batch/issues/521)
+* [Add Pod Condition and unblock cluster autoscaler](https://github.com/kubernetes-sigs/kube-batch/issues/526)
+
diff --git a/docs/design/preempt-action.md b/docs/design/preempt-action.md
new file mode 100644
index 0000000000..28b835813f
--- /dev/null
+++ b/docs/design/preempt-action.md
@@ -0,0 +1,53 @@
+# Preemption
+
+## Introduction
+
+The scheduler has 4 actions, `allocate`, `preempt`, `reclaim` and `backfill`, supported by
+plugins like `conformance`, `drf`, `gang`, `nodeorder` and more. These plugins provide the
+behavioural characteristics of how the scheduler makes scheduling decisions.
+
+## Preempt Action
+
+As discussed in the introduction, preempt is one of the actions in the kube-batch scheduler. The preempt
+action comes into play when a high-priority task arrives and the resources requested by that task are not
+available in the cluster; some tasks should then be evicted so that the new task gets resources to run.
+
+The preempt action uses multiple plugin functions:
+
+1. TaskOrderFn (Plugin: Priority),
+2. JobOrderFn (Plugin: Priority, DRF, Gang),
+3. NodeOrderFn (Plugin: NodeOrder),
+4. PredicateFn (Plugin: Predicates),
+5. PreemptableFn (Plugin: Conformance, Gang, DRF).
+
+### 1. TaskOrderFn:
+#### Priority:
+Compares the task priority set in the PodSpec and returns the result of the comparison between the two priorities.
+
+### 2. JobOrderFn:
+#### Priority:
+Compares the job priority set in the Spec (using PriorityClass) and returns the result of the comparison between the two priorities.
+
+#### DRF:
+The job with the lowest share has the highest priority.
+
+#### Gang:
+A job which is not yet ready (i.e. fewer than minAvailable tasks are in Bound, Binding, Running, Allocated, Succeeded or Pipelined state) has higher priority.
+
+### 3. NodeOrderFn:
+#### NodeOrder:
+NodeOrderFn returns the score of a particular node for a specific task by running through a set of priorities.
+
+### 4. PredicateFn:
+#### Predicates:
+PredicateFn returns whether a task can be bound to a node or not by running through a set of predicates.
+
+### 5. PreemptableFn:
+Checks whether a task can be preempted or not; it returns the set of tasks that can be preempted so that the new task can be deployed.
+#### Conformance:
+The conformance plugin checks whether a task is critical or running in the kube-system namespace, so that it can be excluded when computing the set of tasks that can be preempted.
+#### Gang:
+It checks whether evicting a task affects gang scheduling in kube-batch, i.e. whether evicting the task would make the number of running tasks of a job fall below the job's minAvailable requirement.
+#### DRF:
+The preemptor can preempt another task only if, after recalculating the resource allocation of the preemptor and the preemptee, the preemptor's share is less than the preemptee's share.
diff --git a/docs/design/reclaim-action.md b/docs/design/reclaim-action.md
new file mode 100644
index 0000000000..7fb3a669a4
--- /dev/null
+++ b/docs/design/reclaim-action.md
@@ -0,0 +1,62 @@
+# Reclaim
+
+## Introduction
+
+In kube-batch there are 4 actions, allocate, preempt, reclaim and backfill, supported by plugins like conformance, drf, gang, nodeorder and more. These plugins provide the behavioural characteristics of how the scheduler makes scheduling decisions.
+
+## Reclaim Action
+
+Reclaim is one of the actions in the kube-batch scheduler.
The reclaim action comes into play when
+a new queue is created and a new job arrives in that queue, but there are no (or not enough) resources
+in the cluster because the deserved share of the existing queues has changed.
+
+When a new queue is created, resources are divided among the queues according to their weight ratio.
+Consider two queues that are already present, with the entire cluster resource used by both. When a third queue
+is created, the deserved share of the previous two queues is reduced, since resources should be given to the third
+queue as well. Jobs/tasks in the old queues are not evicted until new jobs/tasks arrive in the new (third) queue. At that point,
+resources for the third queue should be reclaimed (i.e. a few tasks/jobs should be evicted) from the previous two queues, so that the new job in the third queue can
+be created.
+
+Reclaim is basically evicting tasks from other queues so that the present queue can make use of its entire deserved share for
+creating tasks.
+
+The reclaim action uses multiple plugin functions:
+
+1. TaskOrderFn (Plugin: Priority),
+2. JobOrderFn (Plugin: Priority, DRF, Gang),
+3. NodeOrderFn (Plugin: NodeOrder),
+4. PredicateFn (Plugin: Predicates),
+5. ReclaimableFn (Plugin: Conformance, Gang, Proportion).
+
+### 1. TaskOrderFn:
+#### Priority:
+Compares the task priority set in the PodSpec and returns the result of the comparison between the two priorities.
+
+### 2. JobOrderFn:
+#### Priority:
+Compares the job priority set in the Spec (using PriorityClass) and returns the result of the comparison between the two priorities.
+
+#### DRF:
+The job with the lowest share has the highest priority.
+
+#### Gang:
+A job which is not yet ready (i.e. fewer than minAvailable tasks are in Bound, Binding, Running, Allocated, Succeeded or Pipelined state) has higher priority.
+
+### 3. NodeOrderFn:
+#### NodeOrder:
+NodeOrderFn returns the score of a particular node for a specific task by running through a set of priorities.
+
+### 4. PredicateFn:
+#### Predicates:
+PredicateFn returns whether a task can be bound to a node or not by running through a set of predicates.
+
+### 5. ReclaimableFn:
+Checks whether a task can be evicted or not; it returns the set of tasks that can be evicted so that the new task can be created in the new queue.
+#### Conformance:
+The conformance plugin checks whether a task is critical or running in the kube-system namespace, so that it can be excluded when computing the set of tasks that can be evicted.
+#### Gang:
+It checks whether evicting a task affects gang scheduling in kube-batch, i.e. whether evicting the task would make the number of running tasks of a job fall below the job's minAvailable requirement.
+#### Proportion:
+It checks whether, by evicting a task, that task's queue would have fewer allocated resources than its deserved share. If so, the task
+is added as a victim that can be evicted so that resources can be reclaimed.
\ No newline at end of file
diff --git a/docs/design/reclaim-design.md b/docs/design/reclaim-design.md
new file mode 100644
index 0000000000..d8a7ef87a6
--- /dev/null
+++ b/docs/design/reclaim-design.md
@@ -0,0 +1,31 @@
+## Execution flow for Reclaim action
+
+Reclaim runs in each session; the workflow of the session is explained below with the help of a diagram.
+
+1. In every session, local copies of objects (**queues**, **queueMap**, **preemptorsMap**, **preemptorTasks**) are created.
+2. Range over all jobs:
+    1. If the job's queue is not found, move on to the next job.
+    2. If found, add the queue to the **queueMap** and **queues** local objects.
+    3. Check the job's pending tasks:
+        1. If there are no pending tasks, move on to the next job.
+        2. If the job has pending tasks, update the local objects.
+3. Check whether the **queues** object is empty:
+    1.
If the **queues** object is not empty:
+        1. Pop a queue out of the **queues** object.
+            1. If the queue is overused, move on to the next queue from the **queues** object.
+            2. If the queue is not overused, check for jobs which have pending tasks within that queue and select a preemptor task.
+4. Range over all nodes and run the predicateFn for the preemptor task:
+    1. If the predicates are not satisfied, move on to the next node.
+    2. If all the predicates are satisfied:
+        1. Range over all tasks running on that node, but from queues other than the preemptor task's queue, and find all **reclaimee** tasks.
+        2. Send the preemptor task and the set of **reclaimee** tasks to the ReclaimableFn, which has been loaded by plugins such as conformance, gang and proportion.
+5. The ReclaimableFn returns all possible victim tasks that can be evicted.
+6. If the number of victim tasks is zero, or the resource requirement of the preemptor task is greater than the total resources of all victim tasks, move on to the next node.
+7. If the resource requirement of the preemptor task is satisfied, evict tasks from the victim set one by one until the preemptor task can be pipelined.
+8. Run this until the **queues** object is empty.
+
+![Execution flow graph for Reclaim](../../images/ReclaimDesign.png)
\ No newline at end of file
diff --git a/docs/design/task-order.md b/docs/design/task-order.md
new file mode 100644
index 0000000000..83bf1724f5
--- /dev/null
+++ b/docs/design/task-order.md
@@ -0,0 +1,23 @@
+# Task Priority within Job
+
+## Introduction
+
+When a workload is presented to kube-batch in the form of jobs or tasks,
+kube-batch prioritizes those jobs/tasks, so that a job/task with high priority is
+handled first. In this doc, we look into how tasks within a job are prioritized.
+
+## Implementation
+
+Task priority in kube-batch is determined using the first applicable item of the following data:
+
+1. The task's priority given in the TaskSpec (i.e. the PodSpec as defined in the YAML)
+2. The task's creation time
+3.
The task's UID
+
+![taskordering](../../doc/images/task_order.png)
+
+If the priority plugin in kube-batch is loaded, priority is decided using the
+task's priority as provided in the TaskSpec.
+Otherwise the creation times of the tasks are checked: the task that was created
+first is given higher priority. If the creation times are also the same,
+the UIDs are compared to decide the priority.