
✨ Add Health Check logic to MachineHealthCheck Reconciler #2250

Merged: 26 commits merged into kubernetes-sigs:master from the mhc-targets branch on Feb 28, 2020

Conversation

JoelSpeed (Contributor)

What this PR does / why we need it:

This PR adds logic to fetch targets from MachineHealthChecks and run the health check on them to determine whether they should be remediated. Remediation itself is not implemented yet; I will follow up with that logic in a separate PR once I have had time to work on it.

I ran this on a cluster and verified that it reacted correctly to events from Nodes/Machines and that, when a node was unhealthy, the MachineHealthChecker followed the expected paths and logged what I expected.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Machine health check targeting logic from #1990
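
For orientation, here is a rough sketch of the flow this PR introduces, based on the description above; the helper names (getTargetsFromMHC, healthCheckTargets, minDuration) are assumptions for illustration and may not match the exact code in the diff:

// Inside the MachineHealthCheck reconcile loop (illustrative sketch only):
targets, err := r.getTargetsFromMHC(mhc) // assumed helper: resolve the Machines/Nodes selected by the MHC
if err != nil {
	return ctrl.Result{}, err
}

needRemediation, nextCheckTimes := r.healthCheckTargets(targets) // assumed helper: evaluate the health of each target
for _, t := range needRemediation {
	logger.Info("Target has failed health check; remediation is not yet implemented", "target", t)
}

// Requeue so targets that are still within their timeout get re-checked.
return ctrl.Result{RequeueAfter: minDuration(nextCheckTimes)}, nil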

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 3, 2020
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 3, 2020
@JoelSpeed (Contributor Author)

/cc @ncdc (for continuity)

@ncdc ncdc added this to the v0.3.0 milestone Feb 4, 2020
@JoelSpeed force-pushed the mhc-targets branch 4 times, most recently from fe2d440 to da82841, on February 5, 2020 at 14:20
@JoelSpeed (Contributor Author)

Rebased to resolve conflicts

@vincepri (Member)

/milestone v0.3.0-rc.1

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.0, v0.3.0-rc.1 Feb 16, 2020
@vincepri (Member)

/milestone v0.3.0

bumping this from today's release

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.0-rc.1, v0.3.0 Feb 19, 2020
@vincepri (Member)

/milestone v0.3.0-rc.2

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.0, v0.3.0-rc.2 Feb 20, 2020
@vincepri (Member)

Reviewing this now

@vincepri (Member)

@ncdc did you want to take another look?

@ncdc (Contributor)

ncdc commented Feb 26, 2020

@vincepri yes I have a review in progress

if m.Spec.NodeStartupTimeout != nil && m.Spec.NodeStartupTimeout.Seconds() < 30 {
	allErrs = append(
		allErrs,
		field.Invalid(field.NewPath("spec", "nodeStartupTimeout"), m.Spec.NodeStartupTimeout, "must be greater at least 30s"),
Contributor:
close 😄 - remove "greater"

Contributor Author:
Doh 🤦‍♂ 😂
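
For reference, the corrected message from that exchange would simply drop the stray word:

field.Invalid(field.NewPath("spec", "nodeStartupTimeout"), m.Spec.NodeStartupTimeout, "must be at least 30s"),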

controller controller.Controller
recorder record.EventRecorder
scheme *runtime.Scheme
clusterNodeInformers *sync.Map
Contributor:
I don't believe this needs to be a pointer

Comment on lines 306 to 325
mhcList := &clusterv1.MachineHealthCheckList{}
if err := r.Client.List(
	context.Background(),
	mhcList,
	&client.ListOptions{Namespace: machine.Namespace},
	client.MatchingFields{mhcClusterNameIndex: machine.Spec.ClusterName},
); err != nil {
	r.Log.Error(err, "Unable to list MachineHealthChecks", "node", node.Name, "machine", machine.Name, "namespace", machine.Namespace)
	return nil
}

var requests []reconcile.Request
for k := range mhcList.Items {
	mhc := &mhcList.Items[k]
	if hasMatchingLabels(mhc.Spec.Selector, machine.Labels) {
		key := util.ObjectKey(mhc)
		requests = append(requests, reconcile.Request{NamespacedName: key})
	}
}
return requests
Contributor:
Would you want to extract this to a function for reuse here & in machineToMachineHealthCheck?

Contributor Author:
I've changed this method to call machineToMachineHealthCheck once it has the machine, I think that makes sense to do
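
A rough sketch of that refactor, with simplified signatures; getMachineFromNode is an assumed helper, and the real mapper functions take controller-runtime handler objects rather than typed arguments:

// nodeToMachineHealthCheck resolves the Machine that backs the Node and then
// delegates to machineToMachineHealthCheck instead of repeating the
// list-and-label-match logic shown above.
func (r *MachineHealthCheckReconciler) nodeToMachineHealthCheck(node *corev1.Node) []reconcile.Request {
	machine, err := r.getMachineFromNode(node.Name) // assumed helper
	if err != nil || machine == nil {
		return nil
	}
	return r.machineToMachineHealthCheck(machine)
}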

@vincepri vincepri modified the milestones: v0.3.0-rc.2, v0.3.0 Feb 26, 2020
@JoelSpeed (Contributor Author) left a comment

@ncdc Thanks for the review again, I've addressed all of your feedback and left a few comments where appropriate

}

// durations should all be less than 1 Hour
minDuration := time.Hour
Contributor Author:
Yeah, that's a good idea. I then started the range from durations[1:]; not sure if that's good or not from a readability perspective.
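
For context, a minimal sketch of the pattern under discussion (not necessarily the exact code in this PR): seed the minimum with the first element, then range over durations[1:]:

// minDuration returns the smallest duration in a non-empty slice.
func minDuration(durations []time.Duration) time.Duration {
	shortest := durations[0]
	for _, d := range durations[1:] {
		if d < shortest {
			shortest = d
		}
	}
	return shortest
}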

// Ensure that concurrent reconciles don't clash when setting up watches

key := util.ObjectKey(cluster)
if _, ok := r.loadClusterNodeInformer(key); ok {
Contributor Author:
I've gone back to a map and RWMutex; the logic is now:

  • Check if the informer exists in the map under RLock
  • If it doesn't, attempt to acquire the write Lock
  • Once the Lock is acquired, double-check no one else updated the map in the meantime
  • Still under the Lock, set up the informer and add it to the map

I broke this into a couple of smaller methods so the locks could be scoped nicely (sketched below); let me know what you think.
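
A sketch of what those smaller methods might look like, reusing the field names from the diff (clusterNodeInformers map[client.ObjectKey]cache.Informer guarded by a sync.RWMutex); createClusterNodeInformer and the informer construction are assumptions, not the PR's exact code:

// loadClusterNodeInformer checks for an existing informer under the read lock.
func (r *MachineHealthCheckReconciler) loadClusterNodeInformer(key client.ObjectKey) (cache.Informer, bool) {
	r.clusterNodeInformersLock.RLock()
	defer r.clusterNodeInformersLock.RUnlock()
	informer, ok := r.clusterNodeInformers[key]
	return informer, ok
}

// createClusterNodeInformer sets up the informer under the write lock,
// double-checking that a concurrent reconcile did not register it first.
func (r *MachineHealthCheckReconciler) createClusterNodeInformer(ctx context.Context, cluster *clusterv1.Cluster, key client.ObjectKey) error {
	r.clusterNodeInformersLock.Lock()
	defer r.clusterNodeInformersLock.Unlock()
	if _, ok := r.clusterNodeInformers[key]; ok {
		return nil
	}
	informer, err := r.buildNodeInformerForCluster(ctx, cluster) // assumed helper: builds a Node informer for the workload cluster
	if err != nil {
		return err
	}
	r.clusterNodeInformers[key] = informer
	return nil
}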

Comment on lines 152 to 153
// a node with only a name represents a
// not found node in the target
Contributor:
This is subtle, especially when combined with L100. Do you think it would be clearer if we added a nodeMissing field to healthCheckTarget?
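
Roughly what that suggestion would look like; the surrounding field names are assumed, with nodeMissing being the proposed addition:

type healthCheckTarget struct {
	MHC     *clusterv1.MachineHealthCheck
	Machine *clusterv1.Machine
	Node    *corev1.Node
	// nodeMissing makes the "Node no longer exists" case explicit, rather
	// than signalling it with a Node object that only has its Name set.
	nodeMissing bool
}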

recorder record.EventRecorder
scheme *runtime.Scheme
clusterNodeInformers map[client.ObjectKey]cache.Informer
clusterNodeInformersLock *sync.RWMutex
Contributor:
Suggested change:
-	clusterNodeInformersLock *sync.RWMutex
+	clusterNodeInformersLock sync.RWMutex

@k8s-ci-robot (Contributor)

@JoelSpeed: The following test failed, say /retest to rerun all failed tests:

Test name: pull-cluster-api-capd-e2e
Commit: 067f1e0
Rerun command: /test pull-cluster-api-capd-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ncdc (Contributor) left a comment

LGTM

r.controller = controller
r.recorder = mgr.GetEventRecorderFor("machinehealthcheck-controller")
r.scheme = mgr.GetScheme()
r.clusterNodeInformers = make(map[client.ObjectKey]cache.Informer)
r.clusterNodeInformersLock = sync.RWMutex{}
Contributor:
nit (non-blocking): the zero value of an RWMutex is ready for use, so this is not strictly necessary
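
A minimal illustration of the nit: the zero value of sync.RWMutex is an unlocked, ready-to-use mutex, so the explicit assignment above can simply be dropped.

var mu sync.RWMutex // never explicitly initialised
mu.Lock()           // valid: the zero value is an unlocked mutex
mu.Unlock()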

@ncdc (Contributor)

ncdc commented Feb 28, 2020

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed, ncdc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 28, 2020
@vincepri (Member) left a comment

One minor comment, can be tackled later.

Thank you for the great work @JoelSpeed!
:shipit: 🎉

/lgtm

}

var requests []reconcile.Request
for k := range mhcList.Items {
Member:
Given that we're not paginating, the list might not have all the results we need. Although this seems unlikely for now, I'd add at least a TODO, wdyt?

Contributor:
This is coming from the cache. I'm fairly certain we only need to paginate if we're doing live reads. Right?

Member:
Not 100% sure; what if there are more items? We should be able to test this one easily.

Contributor:
Follow-up from https://kubernetes.slack.com/archives/C0EG7JC6T/p1582925668071800: the apiserver does not force paging on clients; clients must request paging by setting options.Limit > 0. And the controller-runtime informer-based caching client we get from the Manager does a full List() without pagination.
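
For reference, a sketch of a client that opts into paging (assumed setup with a typed client-go clientset and current client-go signatures, not code from this PR): the server only returns partial results when Limit is set, and the Continue token fetches subsequent pages.

opts := metav1.ListOptions{Limit: 500} // opt in to server-side paging
for {
	nodes, err := clientset.CoreV1().Nodes().List(ctx, opts)
	if err != nil {
		return err
	}
	// ...process nodes.Items...
	if nodes.Continue == "" {
		break // no more pages
	}
	opts.Continue = nodes.Continue
}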

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2020
@vincepri vincepri modified the milestones: v0.3.0, v0.3.0-rc.3 Feb 28, 2020
@k8s-ci-robot k8s-ci-robot merged commit c340b22 into kubernetes-sigs:master Feb 28, 2020
@JoelSpeed JoelSpeed deleted the mhc-targets branch March 2, 2020 09:46