
MachineHealthCheck documentation should clarify which use cases covers #2861

Closed
jayunit100 opened this issue Apr 3, 2020 · 8 comments · Fixed by #2875
Labels
kind/bug: Categorizes issue or PR as related to a bug.
kind/documentation: Categorizes issue or PR as related to documentation.
Milestone
v0.3.x

Comments

@jayunit100
Contributor

jayunit100 commented Apr 3, 2020

Although this is a bug report, I think it might also correspond to a feature request: adding a status field to MachineHealthChecks that allows for easy inspection of whether the MHC is targeting a non-empty set of nodes (see the sketch below).
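To illustrate the request, something along these lines on the MachineHealthCheck object (purely a sketch; the field names and layout are illustrative and not necessarily the current API):

status:
  # illustrative: number of Machines the MHC believes it should be monitoring
  expectedMachines: 3
  # illustrative: number of those Machines currently considered healthy
  currentHealthy: 2
  # hypothetical field: names of the Machines currently matched by the selector
  targets:
  - smoke-test-1-28ztk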

What steps did you take and what happened:

  • I copied the labels for a cluster into an MHC YAML
  • I added the correct cluster-name
  • I deleted the corresponding machine in a CAPV cluster

What did you expect to happen:

A log message showing what nodes my MHC was targeting, and maybe another log message saying something along the lines of "this machine exceeded its failure timeout, recreating!".

But I saw neither...

Anything else you would like to add:

I also noticed that all logs for capi-controller-manager froze during this time, so I restarted it.
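For anyone else hitting this, the restart and log inspection can be done with something like the following (assuming the default capi-system namespace and deployment name from a clusterctl-based install):

~ » kubectl -n capi-system rollout restart deployment capi-controller-manager
~ » kubectl -n capi-system logs deployment/capi-controller-manager --follow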

First-time MHC user, so forgive me if I did something wrong.

Here's the list of machines; the smoke-test-1... machine is definitely out of commission.

~ » kubectl get machines | grep "smoke-test-1-"
smoke-test-1-28ztk                    vsphere://4230c2a0-32a6-3a03-13a8-ad27cc01ffef   Failed
smoke-test-1-md-0-6ddbcf577b-4mvr4    vsphere://42308e20-d8b1-c603-df06-d6f0005367d7   Running
smoke-test-1-md-0-6ddbcf577b-dbh9n    vsphere://4230d5f3-240e-ee7d-eef1-45fd3ffcee50   Running
smoke-test-1-md-0-6ddbcf577b-r5mmk    vsphere://42308c5c-8a09-15bb-277c-e772b060a266   Running
smoke-test-1-rgwj8                    vsphere://42302040-4cc1-f1ee-35b2-2fcd144b420d   Running
smoke-test-1-v2ppc                    vsphere://4230700e-e736-9663-a0c3-f02a208fb1da   Running

The health check which targeted the smoke-test-1 machines:

~ » kubectl get machineHealthCheck -o yaml
apiVersion: v1
items:
- apiVersion: cluster.x-k8s.io/v1alpha3
  kind: MachineHealthCheck
  metadata:
    creationTimestamp: "2020-04-03T20:28:32Z"
    generation: 1
    labels:
      cluster.x-k8s.io/cluster-name: smoke-test-1
    name: omg
    namespace: default
    ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1alpha3
      kind: Cluster
      name: smoke-test-1
      uid: df4fcc06-151c-4761-816e-d326092d535a
    resourceVersion: "438406"
    selfLink: /apis/cluster.x-k8s.io/v1alpha3/namespaces/default/machinehealthchecks/omg
    uid: f91016f4-5ae2-429e-8d4a-e5a705ba4957
  spec:
    clusterName: smoke-test-1
    maxUnhealthy: 40%
    nodeStartupTimeout: 10m0s
    selector:
      matchLabels:
        cluster.x-k8s.io/cluster-name: smoke-test-1
        cluster.x-k8s.io/control-plane: ""
        kubeadm.controlplane.cluster.x-k8s.io/hash: "4235374148"
    unhealthyConditions:
    - status: Unknown
      timeout: 5m0s
      type: Ready
    - status: "False"
      timeout: 30s
      type: Ready
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

The machine that I deleted manually, which I expected to be cleaned up and recreated:

~ » kubectl get machine smoke-test-1-28ztk -o yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: Machine
metadata:
  creationTimestamp: "2020-04-03T13:50:05Z"
  finalizers:
  - machine.cluster.x-k8s.io
  generation: 3
  labels: # <-- these are the labels which I think should be sufficient to target this machine
    cluster.x-k8s.io/cluster-name: smoke-test-1
    cluster.x-k8s.io/control-plane: ""
    kubeadm.controlplane.cluster.x-k8s.io/hash: "3005926658"
  name: smoke-test-1-28ztk
  namespace: default
  ownerReferences:
  - apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    blockOwnerDeletion: true
    controller: true
    kind: KubeadmControlPlane
    name: smoke-test-1
    uid: a5aaf30b-5b6d-459d-9147-723e07807269
  resourceVersion: "377948"
  selfLink: /apis/cluster.x-k8s.io/v1alpha3/namespaces/default/machines/smoke-test-1-28ztk
  uid: 6032954c-0d31-4d5d-b7ea-a0ce9895ff98
spec:
  bootstrap:
    configRef:
      apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
      kind: KubeadmConfig
      name: smoke-test-1-pmbts
      namespace: default
      uid: e5a358b8-ee29-44e6-9d55-5a80192e71fc
    dataSecretName: smoke-test-1-pmbts
  clusterName: smoke-test-1
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: VSphereMachine
    name: smoke-test-1-fcmhk
    namespace: default
    uid: 5cae6139-bb03-4cbb-bde3-9bf40a7a73f3
  providerID: vsphere://4230c2a0-32a6-3a03-13a8-ad27cc01ffef
  version: v1.17.3+vmware.1
status:
  addresses:
  - address: 192.168.3.52
    type: ExternalIP
  bootstrapReady: true
  failureMessage: 'Failure detected from referenced resource infrastructure.cluster.x-k8s.io/v1alpha3,
    Kind=VSphereMachine with name "smoke-test-1-fcmhk": Unable to find VM by BIOS
    UUID 4230c2a0-32a6-3a03-13a8-ad27cc01ffef. The vm was removed from infra'
  failureReason: UpdateError
  infrastructureReady: true
  lastUpdated: "2020-04-03T17:42:05Z"
  nodeRef:
    name: smoke-test-1-28ztk
    uid: 2ac76837-ecd0-4427-a110-5e3b428c3dc5
  phase: Failed
-----------------------

Environment:

  • Cluster-api version: v1alpha3
  • Kubernetes version (kubectl version): 1.17.3

/kind bug

@k8s-ci-robot added the kind/bug label Apr 3, 2020
@jayunit100 changed the title from "MachineHealthChecks not working" to "MachineHealthChecks not working : Should we have a MachineHealthChecks.Status field?" Apr 3, 2020
@ncdc
Contributor

ncdc commented Apr 3, 2020

As I wrote in Slack, MHC currently only reconciles Machines that have an OwnerReference to a MachineSet. We will be adding control plane remediation in a future release.
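For reference, a Machine that MHC will currently pick up carries a MachineSet owner reference roughly like the one below (the MachineSet name here is a guess inferred from the worker Machine names in the report; compare with the KubeadmControlPlane owner reference on the failed Machine above):

ownerReferences:
- apiVersion: cluster.x-k8s.io/v1alpha3
  blockOwnerDeletion: true
  controller: true
  kind: MachineSet
  name: smoke-test-1-md-0-6ddbcf577b   # assumed name, inferred from the worker Machines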

@vincepri
Member

vincepri commented Apr 5, 2020

If this isn't super clear in the docs, we should make it so. I also got confused at first and assumed it'd work with any Machine.

@vincepri
Member

vincepri commented Apr 5, 2020

/kind documentation
/milestone v0.3.x

@k8s-ci-robot added the kind/documentation label Apr 5, 2020
@k8s-ci-robot added this to the v0.3.x milestone Apr 5, 2020
@enxebre
Member

enxebre commented Apr 6, 2020

Related: #2836

@vincepri
Member

vincepri commented Apr 6, 2020

/retitle MachineHealthCheck documentation should clarify which use cases covers

@k8s-ci-robot changed the title from "MachineHealthChecks not working : Should we have a MachineHealthChecks.Status field?" to "MachineHealthCheck documentation should clarify which use cases covers" Apr 6, 2020
@JoelSpeed
Contributor

@jayunit100 I just had a quick read through this and noticed a couple of things about your examples. Not sure if this was mentioned in Slack or not, so I'll post here for posterity.

  1. The labels you've set on the MachineHealthCheck don't match the Machine you've deleted: the kubeadm.controlplane.cluster.x-k8s.io/hash value is different (4235374148 in the MHC selector vs 3005926658 on the Machine). The label selector follows normal label selector principles and must match all of the specified labels; see the sketch after this list for a quick way to check.

  2. Your example did not include a status section for the MachineHealthCheck. Was the status ever updated? Do you happen to have what the status was during your smoke test? That might help provide insight into what was happening at the time. E.g., there is a field expectedMachines which corresponds to the number of Machines the MHC thinks it should be monitoring.
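A quick way to check both points, as a sketch (the Machine and MHC names come from the report above):

# Compare the labels on the failed Machine with the MHC selector
kubectl get machine smoke-test-1-28ztk --show-labels
kubectl get machinehealthcheck omg -o jsonpath='{.spec.selector.matchLabels}{"\n"}'

# Inspect the MHC status (e.g. the expectedMachines counter mentioned above)
kubectl get machinehealthcheck omg -o jsonpath='{.status}{"\n"}'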

@vincepri Re docs: while this is already mentioned in the docs as part of the limitations and caveats section, I'm aware this has come up a few times, so I'm thinking not many people are making it to the bottom of the page. Do you think moving the limitations section higher up the page might be better?

Control Plane Machines are currently not supported and will not be remediated if they are unhealthy

@vincepri
Member

vincepri commented Apr 6, 2020

Yeah, either a warning or informational sign at the top would be great.
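For example, a callout along these lines near the top of the MHC documentation page (only a sketch of the kind of note meant here, not necessarily the wording used in the eventual PR):

> **Important**: Control Plane Machines are currently not supported by MachineHealthCheck
> and will not be remediated if they are unhealthy.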

@JoelSpeed
Contributor

I've added a PR to highlight the limitation at the top of the MHC docs: #2875
