The design doc of PodGroup Phase in Status. #533

Merged · 1 commit · Jan 5, 2019
doc/design/podgroup-status.md · 121 additions, 0 deletions
# PodGroup Status Enhancement

@k82cn; Jan 2, 2019

## Table of Contents

* [Table of Contents](#table-of-contents)
* [Motivation](#motivation)
* [Function Detail](#function-detail)
* [Feature Interaction](#feature-interaction)
* [Cluster AutoScale](#cluster-autoscale)
* [Operators/Controllers](#operatorscontrollers)
* [Reference](#reference)

## Motivation

In the [Coscheduling v1alpha1](https://github.com/kubernetes/enhancements/pull/639) design, `PodGroup`'s status
only includes counters of related pods, which is not enough for `PodGroup` lifecycle management. This design doc
introduces more information about the PodGroup's status for lifecycle management, e.g. `PodGroupPhase`.

## Function Detail

To describe the PodGroup's current status/phase in more detail, the following types are introduced:

```go
// PodGroupPhase is the phase of a pod group at the current time.
type PodGroupPhase string

// These are the valid phases of a PodGroup.
const (
	// PodGroupPending means the pod group has been accepted by the system, but the scheduler
	// cannot allocate enough resources to it.
	PodGroupPending PodGroupPhase = "Pending"
	// PodGroupRunning means all `spec.minMember` pods of the PodGroup are in the running phase.
	PodGroupRunning PodGroupPhase = "Running"
	// PodGroupRecovering means some of the `spec.minMember` pods encountered an exception,
	// e.g. were killed; the scheduler will wait for the related controller to recover them.
	PodGroupRecovering PodGroupPhase = "Recovering"
	// PodGroupUnschedulable means some of the `spec.minMember` pods are running but the others
	// cannot be scheduled, e.g. because of insufficient resources; the scheduler will wait for
	// the related controller to recover them.
	PodGroupUnschedulable PodGroupPhase = "Unschedulable"
)

const (
	// PodFailedReason is probed if a pod of the PodGroup failed.
	PodFailedReason string = "PodFailed"
	// PodDeletedReason is probed if a pod of the PodGroup was deleted.
	PodDeletedReason string = "PodDeleted"
	// NotEnoughResourcesReason is probed if there are not enough resources to schedule the pods.
	NotEnoughResourcesReason string = "NotEnoughResources"
	// NotEnoughPodsReason is probed if there are not enough pods (tasks) compared to `spec.minMember`.
	NotEnoughPodsReason string = "NotEnoughTasks"
)

// PodGroupState contains details for the current state of this pod group.
type PodGroupState struct {
	// Current phase of the PodGroup.
	Phase PodGroupPhase `json:"phase,omitempty" protobuf:"bytes,1,opt,name=phase"`

	// Last time we probed this phase.
	// +optional
	LastProbeTime metav1.Time `json:"lastProbeTime,omitempty" protobuf:"bytes,2,opt,name=lastProbeTime"`
	// Last time the phase transitioned from another phase to the current one.
	// +optional
	LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty" protobuf:"bytes,3,opt,name=lastTransitionTime"`
	// Unique, one-word, CamelCase reason for the phase's last transition.
	// +optional
	Reason string `json:"reason,omitempty" protobuf:"bytes,4,opt,name=reason"`
	// Human-readable message indicating details about the last transition.
	// +optional
	Message string `json:"message,omitempty" protobuf:"bytes,5,opt,name=message"`
}

type PodGroupStatus struct {
	// ...... (existing fields unchanged)

	// +optional
	State PodGroupState `json:"state,omitempty" protobuf:"bytes,1,opt,name=state,casttype=State"`
}
```
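As a minimal sketch of how these fields could be maintained (the `updateState` helper is an assumption for illustration, not part of this design), a scheduler might refresh `LastProbeTime` on every probe and only reset `LastTransitionTime`, `Reason`, and `Message` when the phase actually changes:

```go
import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// updateState is a hypothetical helper: it refreshes LastProbeTime on every
// call and records the transition metadata only when the phase changes.
func updateState(s *PodGroupState, phase PodGroupPhase, reason, message string) {
	now := metav1.Now()
	s.LastProbeTime = now
	if s.Phase != phase {
		s.Phase = phase
		s.LastTransitionTime = now
		s.Reason = reason // e.g. PodFailedReason
		s.Message = message
	}
}
```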

According to the PodGroup's lifecycle, the following phase/state transitions are reasonable, and the related
reasons will be recorded in the `Reason` field. A sketch of these transitions follows the table.

| From | To | Reason |
|---------------|---------------|---------|
| Pending | Running | When all `spec.minMember` pods are running |
| Pending | Recovering | When only some of the `spec.minMember` pods are running and the others are rejected by the kubelet |
| Running | Recovering | When some of the `spec.minMember` pods encounter an exception, e.g. are killed |
| Recovering | Running | When the failed pods re-run successfully |
| Recovering | Unschedulable | When the replacement pods cannot be scheduled |
| Unschedulable | Pending | When all pods (`spec.minMember`) of the PodGroup are deleted |
| Unschedulable | Running | When the pending pods are eventually scheduled and all `spec.minMember` pods are running |
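The table can be read as a small state machine. The following Go sketch is a hypothetical illustration of it, not part of the API; the count parameters (`running`, `failed`, `unschedulable`) are assumptions about what a scheduler would track:

```go
// nextPhase illustrates the transition table above. minMember maps to
// `spec.minMember`; running/failed/unschedulable are counts of the
// PodGroup's pods in the corresponding conditions.
func nextPhase(cur PodGroupPhase, minMember, running, failed, unschedulable int32) PodGroupPhase {
	switch cur {
	case PodGroupPending:
		if running >= minMember {
			return PodGroupRunning // all minMember pods are running
		}
		if running > 0 && failed > 0 {
			return PodGroupRecovering // part running, part rejected by kubelet
		}
	case PodGroupRunning:
		if failed > 0 {
			return PodGroupRecovering // part of the group hit an exception
		}
	case PodGroupRecovering:
		if running >= minMember {
			return PodGroupRunning // the failed pods re-ran successfully
		}
		if unschedulable > 0 {
			return PodGroupUnschedulable // replacement pods cannot be scheduled
		}
	case PodGroupUnschedulable:
		if running+failed+unschedulable == 0 {
			return PodGroupPending // every pod of the group was deleted
		}
		if running >= minMember {
			return PodGroupRunning // pending pods were eventually scheduled
		}
	}
	return cur // no transition
}
```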


## Feature Interaction

### Cluster AutoScale

[Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) is a tool that
automatically adjusts the size of the Kubernetes cluster when one of the following conditions is true:

* there are pods that failed to run in the cluster due to insufficient resources,
* there are nodes in the cluster that have been underutilized for an extended period of time and their pods can be placed on other existing nodes.

When the Cluster Autoscaler scales out a new node, it leverages the scheduler's predicates to check whether pending
pods could be scheduled onto that node. But coscheduling is not implemented as a predicate for now, so it will not
work well together with the Cluster Autoscaler; an alternative solution will be proposed later.

### Operators/Controllers

The lifecycle of a `PodGroup` is managed by operators/controllers; the scheduler only probes and records the related
state for them. For example, if the `PodGroup` of an MPI job becomes `Unschedulable`, the controller needs to restart
all pods in the `PodGroup`, as the sketch below illustrates.
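A hedged sketch of the controller side, assuming a hypothetical `Controller` type with a `restartAllPods` helper (neither is defined by this design): an MPI-style controller would tear down the whole group once it observes `Unschedulable`, because partial progress is useless for gang workloads.

```go
// syncPodGroup is a hypothetical reconcile step for an MPI-style controller.
func (c *Controller) syncPodGroup(pg *PodGroup) error {
	if pg.Status.State.Phase == PodGroupUnschedulable {
		// Gang semantics: restart every pod and let the scheduler
		// retry the whole group together.
		return c.restartAllPods(pg)
	}
	return nil
}
```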

## Reference

* [Coscheduling](https://github.com/kubernetes/enhancements/pull/639)
* [Add phase/conditions into PodGroup.Status](https://github.com/kubernetes-sigs/kube-batch/issues/521)
* [Add Pod Condition and unblock cluster autoscaler](https://github.com/kubernetes-sigs/kube-batch/issues/526)