feat: Add MachineHealthCheck example template (#175)
joekr authored Sep 30, 2022
1 parent 6a30ad7 commit 7b59f48
Showing 3 changed files with 307 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/src/SUMMARY.md
@@ -18,6 +18,7 @@
- [Provision a management cluster with OKE](./gs/mgmt/mgmt-oke.md)
- [Install Cluster API for Oracle Cloud Infrastructure](./gs/install-cluster-api.md)
- [Create Workload Cluster](./gs/create-workload-cluster.md)
- [MachineHealthChecks](./gs/create-mhc-workload-cluster.md)
- [Create GPU Workload Cluster](./gs/create-gpu-workload-cluster.md)
- [Create Workload Templates](./gs/create-workload-templates.md)
- [Using externally managed infrastructure](./gs/externally-managed-cluster-infrastructure.md)
116 changes: 116 additions & 0 deletions docs/src/gs/create-mhc-workload-cluster.md
@@ -0,0 +1,116 @@
# Create a workload cluster with MachineHealthChecks (MHC)

To better understand MachineHealthChecks, please read through [the Cluster API book][mhc],
and make sure to read the [limitations][mhc-limitations] section.

## Create a new workload cluster with MHC

In the project's code repository we provide an [example template][mhc-template] that sets up two MachineHealthChecks
at workload-cluster creation time. Two separate MHCs are used so that the control plane and the workers can have
differing remediation values:

- `control-plane-unhealthy-5m` sets up a health check for the control-plane machines
- `md-unhealthy-5m` sets up a health check for the worker machines
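
Trimmed to just the fields that differ, the two MHCs in the example template look like this:

```yaml
# Excerpt from the example template: the two MHCs are identical except
# for their names and label selectors.
selector:                 # control-plane-unhealthy-5m
  matchLabels:
    controlplane.remediation: ""
---
selector:                 # md-unhealthy-5m
  matchLabels:
    machine.remediation: ""
```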

> NOTE: With the example template, the MHCs will start remediating nodes that are `Not Ready` after 10 minutes.
To prevent this side effect, make sure to [install your CNI][install-a-cni-provider] as soon as the API server
is available. This will move the machines into a `Ready` state.
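
The 10-minute figure comes from the template's `nodeStartupTimeout`; the timing-related fields shared by both MHCs in the example are:

```yaml
# Timing fields from the example template's MachineHealthChecks.
maxUnhealthy: 100%        # remediate regardless of how many machines are unhealthy
nodeStartupTimeout: 10m   # a machine whose node has not joined within 10m is remediated
unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s         # a node with Ready=Unknown for 5 minutes is considered unhealthy
  - type: Ready
    status: "False"
    timeout: 300s         # likewise for Ready=False
```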

## Add MHC to existing workload cluster

Another approach is to install the MHCs after the cluster is up and healthy (a Day-2 operation). This avoids
machine remediation while the cluster is still being set up.

### Add control-plane MHC

We need to add the `controlplane.remediation` label to the `KubeadmControlPlane`.

Create a file named `control-plane-patch.yaml` that has this content:
```yaml
spec:
  machineTemplate:
    metadata:
      labels:
        controlplane.remediation: ""
```
Then run `kubectl patch KubeadmControlPlane <your-cluster-name>-control-plane --patch-file control-plane-patch.yaml --type=merge`.

Then add the new label to any existing control-plane node(s):
`kubectl label node <control-plane-name> controlplane.remediation=""`. This prevents the `KubeadmControlPlane` from
provisioning new nodes once the MHC is deployed.
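
For reference, a Machine that the control-plane MHC will match looks roughly like this (the name is hypothetical; only the label matters for selection):

```yaml
# Hypothetical Machine (trimmed). The MHC selector matches on the
# controlplane.remediation label in metadata.labels.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: <your-cluster-name>-control-plane-x7k2p   # hypothetical generated name
  labels:
    controlplane.remediation: ""
```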

Create a file named `control-plane-mhc.yaml` that has this content:
```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: "<your-cluster-name>-control-plane-unhealthy-5m"
spec:
  clusterName: "<your-cluster-name>"
  maxUnhealthy: 100%
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      controlplane.remediation: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```

Then run `kubectl apply -f control-plane-mhc.yaml`.

Then run `kubectl get machinehealthchecks` to check that your MachineHealthCheck sees the expected machines.

### Add machine MHC

We need to add the `machine.remediation` label to the `MachineDeployment`.

Create a file named `machine-patch.yaml` that has this content:
```yaml
spec:
  template:
    metadata:
      labels:
        machine.remediation: ""
```

Then run `kubectl patch MachineDeployment <your-cluster-name>-md-0 --patch-file machine-patch.yaml --type=merge`.

Then add the new label to any existing worker node(s):
`kubectl label node <machine-name> machine.remediation=""`. This prevents the `MachineDeployment` from provisioning
new nodes once the MHC is deployed.

Create a file named `machine-mhc.yaml` that has this content:
```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: "<your-cluster-name>-md-unhealthy-5m"
spec:
  clusterName: "<your-cluster-name>"
  maxUnhealthy: 100%
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      machine.remediation: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```

Then run `kubectl apply -f machine-mhc.yaml`.

Then run `kubectl get machinehealthchecks` to check that your MachineHealthCheck sees the expected machines.

[install-a-cni-provider]: ../gs/create-workload-cluster.md#install-a-cni-provider
[mhc]: https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html
[mhc-limitations]: https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html#limitations-and-caveats-of-a-machinehealthcheck
[mhc-template]: https://github.com/oracle/cluster-api-provider-oci/blob/main/templates/cluster-template-healthcheck.yaml
190 changes: 190 additions & 0 deletions templates/cluster-template-healthcheck.yaml
@@ -0,0 +1,190 @@
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: "${CLUSTER_NAME}"
  name: "${CLUSTER_NAME}"
  namespace: "${NAMESPACE}"
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - ${POD_CIDR:="192.168.0.0/16"}
    serviceDomain: ${SERVICE_DOMAIN:="cluster.local"}
    services:
      cidrBlocks:
        - ${SERVICE_CIDR:="10.128.0.0/12"}
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: OCICluster
    name: "${CLUSTER_NAME}"
    namespace: "${NAMESPACE}"
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: "${CLUSTER_NAME}-control-plane"
    namespace: "${NAMESPACE}"
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: OCICluster
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: "${CLUSTER_NAME}"
  name: "${CLUSTER_NAME}"
spec:
  compartmentId: "${OCI_COMPARTMENT_ID}"
---
kind: KubeadmControlPlane
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
metadata:
  name: "${CLUSTER_NAME}-control-plane"
  namespace: "${NAMESPACE}"
spec:
  version: "${KUBERNETES_VERSION}"
  replicas: ${CONTROL_PLANE_MACHINE_COUNT}
  machineTemplate:
    metadata:
      labels:
        controlplane.remediation: ""
    infrastructureRef:
      kind: OCIMachineTemplate
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      name: "${CLUSTER_NAME}-control-plane"
      namespace: "${NAMESPACE}"
  kubeadmConfigSpec:
    clusterConfiguration:
      kubernetesVersion: ${KUBERNETES_VERSION}
      apiServer:
        certSANs: [localhost, 127.0.0.1]
      dns: {}
      etcd: {}
      networking: {}
      scheduler: {}
    initConfiguration:
      nodeRegistration:
        criSocket: /var/run/containerd/containerd.sock
        kubeletExtraArgs:
          cloud-provider: external
          provider-id: oci://{{ ds["id"] }}
    joinConfiguration:
      discovery: {}
      nodeRegistration:
        criSocket: /var/run/containerd/containerd.sock
        kubeletExtraArgs:
          cloud-provider: external
          provider-id: oci://{{ ds["id"] }}
---
kind: OCIMachineTemplate
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
metadata:
  name: "${CLUSTER_NAME}-control-plane"
  # labels:
  #   controlplane.remediation: ""
spec:
  template:
    spec:
      imageId: "${OCI_IMAGE_ID}"
      compartmentId: "${OCI_COMPARTMENT_ID}"
      shape: "${OCI_CONTROL_PLANE_MACHINE_TYPE=VM.Standard.E4.Flex}"
      shapeConfig:
        ocpus: "${OCI_CONTROL_PLANE_MACHINE_TYPE_OCPUS=1}"
      metadata:
        ssh_authorized_keys: "${OCI_SSH_KEY}"
      isPvEncryptionInTransitEnabled: ${OCI_CONTROL_PLANE_PV_TRANSIT_ENCRYPTION=true}
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: OCIMachineTemplate
metadata:
  name: "${CLUSTER_NAME}-md-0"
  # labels:
  #   machine.remediation: ""
spec:
  template:
    spec:
      imageId: "${OCI_IMAGE_ID}"
      compartmentId: "${OCI_COMPARTMENT_ID}"
      shape: "${OCI_NODE_MACHINE_TYPE=VM.Standard.E4.Flex}"
      shapeConfig:
        ocpus: "${OCI_NODE_MACHINE_TYPE_OCPUS=1}"
      metadata:
        ssh_authorized_keys: "${OCI_SSH_KEY}"
      isPvEncryptionInTransitEnabled: ${OCI_NODE_PV_TRANSIT_ENCRYPTION=true}
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: "${CLUSTER_NAME}-md-0"
spec:
  template:
    spec:
      joinConfiguration:
        nodeRegistration:
          kubeletExtraArgs:
            cloud-provider: external
            provider-id: oci://{{ ds["id"] }}
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: "${CLUSTER_NAME}-md-0"
  # labels:
  #   machine.remediation: ""
spec:
  clusterName: "${CLUSTER_NAME}"
  replicas: ${NODE_MACHINE_COUNT}
  selector:
    matchLabels:
  template:
    metadata:
      labels:
        machine.remediation: ""
    spec:
      clusterName: "${CLUSTER_NAME}"
      version: "${KUBERNETES_VERSION}"
      bootstrap:
        configRef:
          name: "${CLUSTER_NAME}-md-0"
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
      infrastructureRef:
        name: "${CLUSTER_NAME}-md-0"
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: OCIMachineTemplate
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: "${CLUSTER_NAME}-control-plane-unhealthy-5m"
spec:
  clusterName: "${CLUSTER_NAME}"
  maxUnhealthy: 100%
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      controlplane.remediation: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: "${CLUSTER_NAME}-md-unhealthy-5m"
spec:
  clusterName: "${CLUSTER_NAME}"
  maxUnhealthy: 100%
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      machine.remediation: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
