Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add design #350

Merged
merged 1 commit into from
Jul 18, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
116 changes: 116 additions & 0 deletions docs/design/delay-pod-creation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,116 @@
# Delay Pod Creation

@k82cn; Jan 7, 2019

## Table of Contents

* [Delay Pod Creation](#delay-pod-creation)
* [Table of Contents](#table-of-contents)
* [Motivation](#motivation)
* [Function Detail](#function-detail)
* [State](#state)
* [Action](#action)
* [Admission Webhook](#admission-webhook)
* [Feature interaction](#feature-interaction)
* [Queue](#queue)
* [Quota](#quota)
* [Operator/Controller](#operatorcontroller)
* [Others](#others)
* [Compatibility](#compatibility)
* [Roadmap](#roadmap)
* [Reference](#reference)

Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)

## Motivation

For a batch system, there're always several pending jobs because of limited resources and throughput.
Different with other kubernetes type, e.g. Deployment, DaemonSet, it's better to delay pods creation for
batch workload to reduce apiserver pressure and speed up scheduling (e.g. less pending pods to consider).
In this document, several enhancements are introduced to delay pod creation.

## Function Detail

### State

A new state, named `InQueue`, will be introduced to denote the phase that jobs are ready to be allocated.
After `InQueue`, the state transform map is updated as follow.

| From | To | Reason |
|---------------|----------------|---------|
| Pending | InQueue | When it's ready to allocate resource to job |
| InQueue | Pending | When there's not enough resources anymore |
| InQueue | Running | When every pods of `spec.minMember` are running |

The `InQueue` is a new state between `Pending` and `Running`; and it'll let operators/controllers start to
create pods. If it meets errors, e.g. unschedulable, it rollbacks to `Pending` instead of `InQueue` to
avoid retry-loop.

### Action

Currently, `kube-batch` supports several actions, e.g. `allocate`, `preempt`; but all those actions are executed
based on pending pods. To support `InQueue` state, a new action, named `enqueue`, will be introduced.

By default, `enqueue` action will handle `PodGroup`s in FCFS policy; `enqueue` will go through all PodGroup
(by creation timestamp) and update PodGroup's phase to `InQueue` if:

* there're enough idle resources for `spec.minResources` of `PodGroup`
* there're enough quota for `spec.minResources` of `PodGroup`

As `kube-batch` handling `PodGroup` by `spec.minResources`, the operator/controller may create more `Pod`s than
`spec.minResources`; in such case, `preempt` action will be enhanced to evict overused `PodGroup` to release
resources.

### Admission Webhook

To guarantee the transaction of `spec.minResources`, a new `MutatingAdmissionWebhook`, named `PodGroupMinResources`,
is introduced. `PodGroupMinResources` make sure

* the summary of all PodGroups' `spec.minResources` in a namespace not more than `Quota`
* if resources are reserved by `spec.minResources`, the resources can not be used by others

Generally, it's better to let total `Quota` to be more than available resources in cluster, as some pods maybe
unschedulable because of scheduler's algorithm, e.g. predicates.

## Feature interaction

### Queue

The resources will be shared between `Queue`s algorithm, e.g. proportion by default. If the resources can not be
fully used because of fragment, `backfill` action will help on that. If `Queue` used more resources than its
deserved, `reclaim` action will help to balance resources. The Pod can not be evicted currently if eviction will
break `spec.minMember`; it'll be enhanced for job level eviction.

### Quota

To delay pod creation, both `kube-batch` and `PodGroupMinResources` will watch `ResourceQuota` to decide which
`PodGroup` should be in queue firstly. The decision maybe invalid because of race condition, e.g. other
controllers create Pods. In such case, `PodGroupMinResources` will reject `PodGroup` creation and keep `InQueue`
state until `kube-batch` transform it back to `Pending`. To avoid race condition, it's better to let `kube-batch`
manage `Pod` number and resources (e.g. CPU, memory) instead of `Quota`.

### Operator/Controller

The Operator/Controller should follow the above "protocol" to work together with scheduler. A new component,
named `PodGroupController`, will be introduced later to enforce this protocol if necessary.

## Others

### Compatibility

To support this new feature, a new state and a new action are introduced; so when the new `enqueue` action is
disabled in the configuration, it'll keep the same behaviour as before.

## Roadmap

* `InQueue` phase and `enqueue` action (v0.5+)
* Admission Controller (v0.6+)

## Reference

* [Coscheduling](https://github.com/kubernetes/enhancements/pull/639)
* [Delay Pod creation](https://github.com/kubernetes-sigs/kube-batch/issues/539)
* [PodGroup Status](https://github.com/kubernetes-sigs/kube-batch/blob/master/doc/design/podgroup-status.md)
* [Support 'spec.TotalResources' in PodGroup](https://github.com/kubernetes-sigs/kube-batch/issues/401)
* [Dynamic Admission Control](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#write-an-admission-webhook-server)
* [Add support for podGroup number limits for one queue](https://github.com/kubernetes-sigs/kube-batch/issues/452)
33 changes: 33 additions & 0 deletions docs/design/drf.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
## Dominant Resource Fairness (DRF)

## Introduction
Dominant Resource Fairness (DRF), a generalization of max-min fairness to multiple resource types is a resource allocation policy that handles multiple resource types.

Dominant resource - a resource of specific type (cpu, memory, gpu) which is most demanded by given job among other resources it needs. This resource is identified as a share of the total cluster resources of the same type.

DRF computes the share of dominant resource allocated to a job (dominant share) and tries to maximize the smallest dominant share in the system.
Schedule the task of the job with smallest dominant resource


## Kube-Batch Implementation
DRF calculate shares for each job. The share is the highest value of ratio of the (allocated resource/Total Resource) of the three resource types CPU, Memory and GPU.
This share value is used for job ordering and task premption.

#### 1. Job Ordering:
The job having the lowest share will have higher priority.
In the example below all the tasks task1, task2 of job1 and task3 and task4 of job2 is already allocated to the cluster.
![drfjobordering](./images/drfjobordering.png)


##### 1.1 Gang Scheduling with DRF in job ordering ( Gang -> DRF)
Gang scheduling sorts the job based on whether the job has atleast **minAvailable** task already (allocated + successfully completed + pipelined) or not.
Jobs which has not met the minAvailable criteria has higher priority than jobs which has met
the minAvailable criteria.

For the jobs which has met the minAvailable criteria will be sorted according to DRF.

![gangwithdrf](./images/gangwithdrf.png)

#### 2. Task Preemption:

The preemptor can only preempt other tasks only if the share of the preemptor is less than the share of the preemptee after recalculating the resource allocation of the premptor and preemptee.
48 changes: 48 additions & 0 deletions docs/design/execution-flow.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
## Execution of the Scheduler for Allocating the Workloads to Node

The Allocation of the workloads to the node in scheduler happens in each session and the workflow of the session is illustrated in the below diagram.

1. Session Opens every 1 sec
2. In very session local copies of Queues, JobsMap, PendingTasks and Node List is created.
3. For Each Jobs in the Session
1. If the Queued ID in the Job exists in the Local Copy of Queues then :
1. Add the Queue in the Local Copy of Queues
1. If the QueueID exists in local copyof JobsMap:
1. Push the Job in the JobsMap
2. If Not then Add the QueueID as the key in Local JobsMap and add the job in the Map.
2. If Not then Give warning and continue to Step 3
4. For Each Item in Local Queues
1. Pop an Queue from the Queues
1. Check if Queue is overused
1. If Yes then Continue to Step 4
2. If Not then get the list of JobsList from the JobsMap for the particular Queue.
1. If List is empty then continue to Step 4
2. If Yes then Pop a Job from the JobsList
1. If Job exits the Local PendingTasks
1. If Not then :
1. Create a Local Task List
1. Get the List of Each Tasks in the pending state for that job
1. If the required resource for the job is Empty then go back to previous step
2. If Not then Add the tasks in the Local TasksList
2. Add the local Tasks List to PendingTasks for that Job
2. If Yes then :
1. For each Tasks from the pendingTasksList for the Job.
1. Pop the tasks
2. Get the list of predicate nodes for the task.
3. Score the predicate nodes and sort it.
4. For each node in the sorted predicated + scored nodes
1. If the Resource required by task is less than Idle resource of Nodes
1. If Resource required by task is less than releasing resource of Node
1. Then Add the tasks in Pipeline
3. Check if Job is ready to be allocated
1. If yes the push the Job
2. If No then add the Queue back to the list.
3. Continue till all the Job is ready
2. Continue till each Queue is processed.






![Execution flow graph](../../images/AllocateDesign.png)
File renamed without changes.
Binary file added docs/design/images/drfjobordering.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/design/images/gangwithdrf.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
39 changes: 39 additions & 0 deletions docs/design/metrics.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
## Scheduler Monitoring

## Introduction
Currently users can leverage controller logs and job events to monitor scheduler. While useful for debugging, none of this options is particularly practical for monitoring kube-batch behaviour over time. There's also requirement like to monitor kube-batch in one view to resolve critical performance issue in time [#427](https://github.com/kubernetes-sigs/kube-batch/issues/427).

This document describes metrics we want to add into kube-batch to better monitor performance.

## Metrics
In order to support metrics, kube-batch needs to expose a metrics endpoint which can provide golang process metrics like number of goroutines, gc duration, cpu and memory usage, etc as well as kube-batch custom metrics related to time taken by plugins or actions.

All the metrics are prefixed with `kube_batch_`.

### kube-batch execution
This metrics track execution of plugins and actions of kube-batch loop.

| Metric name | Metric type | Labels | Description |
| ----------- | ----------- | ------ | ----------- |
| e2e_scheduling_latency | histogram | | E2e scheduling latency in seconds |
| plugin_latency | histogram | `plugin`=<plugin_name> | Schedule latency for plugin |
| action_latency | histogram | `action`=<action_name> | Schedule latency for action |
| task_latency | histogram | `job`=<job_id> `task`=<task_id> | Schedule latency for each task |


### kube-batch operations
This metrics describe internal state of kube-batch.

| Metric name | Metric type | Labels | Description |
| ----------- | ----------- | ------ | ----------- |
| pod_schedule_errors | Counter | | The number of kube-batch failed due to an error |
| pod_schedule_successes | Counter | | The number of kube-batch success in scheduling a job |
| pod_preemption_victims | Counter | | Number of selected preemption victims |
| total_preemption_attempts | Counter | | Total preemption attempts in the cluster till now |
| unschedule_task_count | Counter | `job`=<job_id> | The number of tasks failed to schedule |
| unschedule_job_counts | Counter | | The number of job failed to schedule in each iteration |
| job_retry_counts | Counter | `job`=<job_id> | The number of retry times of one job |


### kube-batch Liveness
Healthcheck last time of kube-batch activity and timeout
18 changes: 18 additions & 0 deletions docs/design/node-priority.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
## Node Priority in Kube-Batch

This feature allows `kube-batch` to schedule workloads based on the priority of the Nodes, Workloads will be scheduled on Nodes with higher priority and these priorities will be calculated based on different parameters like `ImageLocality`, `Most/Least Requested Nodes`...etc.
A basic flow for the Node priority functions is depicted below.

![Node Priority Flow](../images/Node-Priority.png)

Currently in kube-batch `Session` is opened every 1 sec and the workloads which are there in Queue goes through `Predicate` to find a suitable set of Nodes where workloads can be scheduled and after that it goes through `Allocate` function to assign the Pods to the Nodes and then goes to `Preempt` if applicable.

Node Priority can be introduced in the current flow for `Allocate` and `Preempt` function. Once we have set of Nodes where we can scheduled the workloads then flow will go through `Prioritize` function which will do the following things :

- Run all the priority functions on all the list Nodes which is given by `Predicate` function in a parallel go-routine.
- Score the Node based on whether the `Priority Rule` satisfies the Workload scheduling criteria.
- Once the scores are returned from all the `PriorityFn` then aggregate the scoring and identify the Node with highest scoring.
- Delegate this selected Node in last step to `AllocateFn` to Bind the workload to the Node.

Currently there are multiple `PriorityFn` available with default Scheduler of Kubernetes. Going forward with each release we will implement all the priority functions in kube-batch based on their importance to batch scheduling.

79 changes: 79 additions & 0 deletions docs/design/plugin-conf.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
# Dynamic Plugins Configuration

## Table of Contents

* [Dynamic Plugins Configuration](#dynamic-plugins-configuration)
* [Table of Contents](#table-of-contents)
* [Motivation](#motivation)
* [Function Detail](#function-detail)
* [Feature Interaction](#feature-interaction)
* [ConfigMap](#configmap)
* [Reference](#reference)

Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)

## Motivation

There are several plugins and actions in `kube-batch` right now; the users may want to only enable part of plugins and actions. This document is going to introduce dynamic plugins configuration, so the users can configure `kube-batch` according to their
scenario on the fly.

## Function Detail

The following YAML format will be introduced for dynamic plugin configuration:

```yaml
actions: "list_of_action_in_order"
tiers:
- plugins:
- name: "plugin_1"
disableJobOrder: true
- name: "plugin_2"
- plugins:
- name: "plugin_3"
disableJobOrder: true
```

The `actions` is a list of actions that will be executed by `kube-batch` in order, separated
by commas. Refer to the [tutorial](https://github.com/kubernetes-sigs/kube-batch/issues/434) for
the list of supported actions in `kube-batch`. Those actions will be executed in order, although
the "order" maybe incorrect; the `kube-batch` does not enforce that.

The `tiers` is a list of plugins that will be used by related actions, e.g. `allocate`. It includes
several tiers of plugin list by `plugins`; if it fits plugins in high priority tier, the action will not
go through the plugins in lower priority tiers. In each tier, it's considered passed if all the plugins are
fitted in `plugins.names`.

The `options` defines the detail behaviour of each plugins, e.g. whether preemption is enabled. If not
specific, `true` is default value. For now, `preemptable`, `jobOrder`, `taskOrder` are supported.

Takes following example as demonstration:

1. The actions `"reclaim, allocate, backfill, preempt"` will be executed in order by `kube-batch`
1. `"priority"` has higher priority than `"gang, drf, predicates, proportion"`; a job with higher priority
will preempt other jobs, although it's already allocated "enough" resource according to `"drf"`
1. `"tiers.plugins.drf.disableTaskOrder"` is `true`, so `drf` will not impact task order phase/action

```yaml
actions: "reclaim, allocate, backfill, preempt"
tiers:
- plugins:
- name: "priority"
- name: "gang"
- plugins:
- name: "drf"
disableTaskOrder: true
- name: "predicates"
- name: "proportion"
```

## Feature Interaction

### ConfigMap

`kube-batch` will read the plugin configuration from command line argument `--scheduler-conf`; user can
use `ConfigMap` to acesss the volume of `kube-batch` pod during deployment.

## Reference

* [Add preemption by Job priority](https://github.com/kubernetes-sigs/kube-batch/issues/261)
* [Support multiple tiers for Plugins](https://github.com/kubernetes-sigs/kube-batch/issues/484)
Loading