
MachineHealthCheck documentation should clarify which use cases covers #2861

Closed
jayunit100 opened this issue Apr 3, 2020 · 8 comments · Fixed by #2875
Labels
kind/bug: Categorizes issue or PR as related to a bug.
kind/documentation: Categorizes issue or PR as related to documentation.
Milestone
v0.3.x

Comments

@jayunit100
Contributor

jayunit100 commented Apr 3, 2020

Although this is a bug report, I think it might also correspond to a feature request: adding a status field to MachineHealthChecks that allows for easy inspection of whether the MHC is targeting a non-empty set of nodes (see the sketch below).
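To illustrate the request, something along these lines on the MachineHealthCheck object (purely a sketch; the field names and layout are illustrative and not necessarily the current API):

status:
  # illustrative: number of Machines the MHC believes it should be monitoring
  expectedMachines: 3
  # illustrative: number of those Machines currently considered healthy
  currentHealthy: 2
  # hypothetical field: names of the Machines currently matched by the selector
  targets:
  - smoke-test-1-28ztk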

What steps did you take and what happened:

  • I copied the labels for a cluster into an MHC YAML
  • I added the correct cluster-name
  • I deleted the corresponding machine in a CAPV cluster

What did you expect to happen:

A log message showing what nodes my MHC was targeting, and maybe another log message saying something along the lines of "this machine exceeded its failure timeout, recreating!".

But I saw neither...

Anything else you would like to add:

I also noticed that all logs for capi-controller-manager froze during this time, so I restarted it.
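For anyone else hitting this, the restart and log inspection can be done with something like the following (assuming the default capi-system namespace and deployment name from a clusterctl-based install):

~ » kubectl -n capi-system rollout restart deployment capi-controller-manager
~ » kubectl -n capi-system logs deployment/capi-controller-manager --follow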

First-time MHC user, so forgive me if I did something wrong.

Here's the list of machines; the smoke-test-1... machine is definitely out of commission.

~ » kubectl get machines | grep "smoke-test-1-"
smoke-test-1-28ztk                    vsphere://4230c2a0-32a6-3a03-13a8-ad27cc01ffef   Failed
smoke-test-1-md-0-6ddbcf577b-4mvr4    vsphere://42308e20-d8b1-c603-df06-d6f0005367d7   Running
smoke-test-1-md-0-6ddbcf577b-dbh9n    vsphere://4230d5f3-240e-ee7d-eef1-45fd3ffcee50   Running
smoke-test-1-md-0-6ddbcf577b-r5mmk    vsphere://42308c5c-8a09-15bb-277c-e772b060a266   Running
smoke-test-1-rgwj8                    vsphere://42302040-4cc1-f1ee-35b2-2fcd144b420d   Running
smoke-test-1-v2ppc                    vsphere://4230700e-e736-9663-a0c3-f02a208fb1da   Running

The health check which targeted the smoke-test-1 machines:

~ » kubectl get machineHealthCheck -o yaml
apiVersion: v1
items:
- apiVersion: cluster.x-k8s.io/v1alpha3
  kind: MachineHealthCheck
  metadata:
    creationTimestamp: "2020-04-03T20:28:32Z"
    generation: 1
    labels:
      cluster.x-k8s.io/cluster-name: smoke-test-1
    name: omg
    namespace: default
    ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1alpha3
      kind: Cluster
      name: smoke-test-1
      uid: df4fcc06-151c-4761-816e-d326092d535a
    resourceVersion: "438406"
    selfLink: /apis/cluster.x-k8s.io/v1alpha3/namespaces/default/machinehealthchecks/omg
    uid: f91016f4-5ae2-429e-8d4a-e5a705ba4957
  spec:
    clusterName: smoke-test-1
    maxUnhealthy: 40%
    nodeStartupTimeout: 10m0s
    selector:
      matchLabels:
        cluster.x-k8s.io/cluster-name: smoke-test-1
        cluster.x-k8s.io/control-plane: ""
        kubeadm.controlplane.cluster.x-k8s.io/hash: "4235374148"
    unhealthyConditions:
    - status: Unknown
      timeout: 5m0s
      type: Ready
    - status: "False"
      timeout: 30s
      type: Ready
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

The machine that I deleted manually, which I expected to be cleaned up and recreated:

~ » kubectl get machine smoke-test-1-28ztk -o yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: Machine
metadata:
  creationTimestamp: "2020-04-03T13:50:05Z"
  finalizers:
  - machine.cluster.x-k8s.io
  generation: 3
  labels: # <-- these are the labels which I think should be sufficient to target this machine
    cluster.x-k8s.io/cluster-name: smoke-test-1
    cluster.x-k8s.io/control-plane: ""
    kubeadm.controlplane.cluster.x-k8s.io/hash: "3005926658"
  name: smoke-test-1-28ztk
  namespace: default
  ownerReferences:
  - apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    blockOwnerDeletion: true
    controller: true
    kind: KubeadmControlPlane
    name: smoke-test-1
    uid: a5aaf30b-5b6d-459d-9147-723e07807269
  resourceVersion: "377948"
  selfLink: /apis/cluster.x-k8s.io/v1alpha3/namespaces/default/machines/smoke-test-1-28ztk
  uid: 6032954c-0d31-4d5d-b7ea-a0ce9895ff98
spec:
  bootstrap:
    configRef:
      apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
      kind: KubeadmConfig
      name: smoke-test-1-pmbts
      namespace: default
      uid: e5a358b8-ee29-44e6-9d55-5a80192e71fc
    dataSecretName: smoke-test-1-pmbts
  clusterName: smoke-test-1
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: VSphereMachine
    name: smoke-test-1-fcmhk
    namespace: default
    uid: 5cae6139-bb03-4cbb-bde3-9bf40a7a73f3
  providerID: vsphere://4230c2a0-32a6-3a03-13a8-ad27cc01ffef
  version: v1.17.3+vmware.1
status:
  addresses:
  - address: 192.168.3.52
    type: ExternalIP
  bootstrapReady: true
  failureMessage: 'Failure detected from referenced resource infrastructure.cluster.x-k8s.io/v1alpha3,
    Kind=VSphereMachine with name "smoke-test-1-fcmhk": Unable to find VM by BIOS
    UUID 4230c2a0-32a6-3a03-13a8-ad27cc01ffef. The vm was removed from infra'
  failureReason: UpdateError
  infrastructureReady: true
  lastUpdated: "2020-04-03T17:42:05Z"
  nodeRef:
    name: smoke-test-1-28ztk
    uid: 2ac76837-ecd0-4427-a110-5e3b428c3dc5
  phase: Failed
-----------------------

Environment:

  • Cluster-api version: v1alpha3
  • Kubernetes version (kubectl version): 1.17.3

/kind bug

@k8s-ci-robot added the kind/bug label Apr 3, 2020
@jayunit100 changed the title from "MachineHealthChecks not working" to "MachineHealthChecks not working : Should we have a MachineHealthChecks.Status field?" Apr 3, 2020
@ncdc
Contributor

ncdc commented Apr 3, 2020

As I wrote in Slack, MHC currently only reconciles Machines that have an OwnerReference to a MachineSet. We will be adding control plane remediation in a future release.
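For reference, a Machine that MHC will currently pick up carries a MachineSet owner reference roughly like the one below (the MachineSet name here is a guess inferred from the worker Machine names in the report; compare with the KubeadmControlPlane owner reference on the failed Machine above):

ownerReferences:
- apiVersion: cluster.x-k8s.io/v1alpha3
  blockOwnerDeletion: true
  controller: true
  kind: MachineSet
  name: smoke-test-1-md-0-6ddbcf577b   # assumed name, inferred from the worker Machines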

@vincepri
Member

vincepri commented Apr 5, 2020

If this isn't super clear in the docs, we should make it so. I also got confused at first and assumed it'd work with any Machine.

@vincepri
Member

vincepri commented Apr 5, 2020

/kind documentation
/milestone v0.3.x

@k8s-ci-robot added the kind/documentation label Apr 5, 2020
@k8s-ci-robot added this to the v0.3.x milestone Apr 5, 2020
@enxebre
Member

enxebre commented Apr 6, 2020

Related: #2836

@vincepri
Member

vincepri commented Apr 6, 2020

/retitle MachineHealthCheck documentation should clarify which use cases covers

@k8s-ci-robot changed the title from "MachineHealthChecks not working : Should we have a MachineHealthChecks.Status field?" to "MachineHealthCheck documentation should clarify which use cases covers" Apr 6, 2020
@JoelSpeed
Contributor

@jayunit100 I just had a quick read through this and noticed a couple of things about your examples. Not sure if this was mentioned in Slack or not, so I'll post here for posterity.

  1. The labels you've set on the MachineHealthCheck don't match the Machine you've deleted: the kubeadm.controlplane.cluster.x-k8s.io/hash value is different (4235374148 in the MHC selector vs 3005926658 on the Machine). The label selector follows normal label selector principles and must match all of the specified labels; see the sketch after this list for a quick way to check.

  2. Your example did not include a status section for the MachineHealthCheck. Was the status ever updated? Do you happen to have what the status was during your smoke test? That might help provide insight into what was happening at the time. E.g., there is a field expectedMachines which corresponds to the number of Machines the MHC thinks it should be monitoring.
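A quick way to check both points, as a sketch (the Machine and MHC names come from the report above):

# Compare the labels on the failed Machine with the MHC selector
kubectl get machine smoke-test-1-28ztk --show-labels
kubectl get machinehealthcheck omg -o jsonpath='{.spec.selector.matchLabels}{"\n"}'

# Inspect the MHC status (e.g. the expectedMachines counter mentioned above)
kubectl get machinehealthcheck omg -o jsonpath='{.status}{"\n"}'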

@vincepri Re docs: while this is already mentioned in the docs as part of the limitations and caveats section, I'm aware this has come up a few times, so I'm thinking not many people are making it to the bottom of the page. Do you think moving the limitations section higher up the page might be better?

Control Plane Machines are currently not supported and will not be remediated if they are unhealthy

@vincepri
Member

vincepri commented Apr 6, 2020

Yeah, either a warning or informational sign at the top would be great.
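For example, a callout along these lines near the top of the MHC documentation page (only a sketch of the kind of note meant here, not necessarily the wording used in the eventual PR):

> **Important**: Control Plane Machines are currently not supported by MachineHealthCheck
> and will not be remediated if they are unhealthy.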

@JoelSpeed
Contributor

I've added a PR to highlight the limitation at the top of the MHC docs: #2875
