
✨ Add Health Check logic to MachineHealthCheck Reconciler #2250

Merged: 26 commits merged into kubernetes-sigs:master from the mhc-targets branch on Feb 28, 2020

Conversation

JoelSpeed (Contributor)

What this PR does / why we need it:

This PR adds logic to fetch targets from MachineHealthChecks and run the health check on them to determine whether they should be remediated. Remediation itself is not implemented yet; I will follow up with that logic in a separate PR once I have had time to work on it.

I ran this on a cluster and verified that it reacted correctly to events from Nodes/Machines and that, when a node was unhealthy, the MachineHealthChecker followed the expected paths and logged what I expected.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Machine health check targeting logic from #1990
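
For orientation, here is a rough sketch of the flow this PR introduces, based on the description above; the helper names (getTargetsFromMHC, healthCheckTargets, minDuration) are assumptions for illustration and may not match the exact code in the diff:

// Inside the MachineHealthCheck reconcile loop (illustrative sketch only):
targets, err := r.getTargetsFromMHC(mhc) // assumed helper: resolve the Machines/Nodes selected by the MHC
if err != nil {
	return ctrl.Result{}, err
}

needRemediation, nextCheckTimes := r.healthCheckTargets(targets) // assumed helper: evaluate the health of each target
for _, t := range needRemediation {
	logger.Info("Target has failed health check; remediation is not yet implemented", "target", t)
}

// Requeue so targets that are still within their timeout get re-checked.
return ctrl.Result{RequeueAfter: minDuration(nextCheckTimes)}, nil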

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 3, 2020
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 3, 2020
@JoelSpeed (Contributor Author)

/cc @ncdc (for continuity)

@ncdc ncdc added this to the v0.3.0 milestone Feb 4, 2020
@JoelSpeed force-pushed the mhc-targets branch 4 times, most recently from fe2d440 to da82841, on February 5, 2020 at 14:20
@JoelSpeed (Contributor Author)

Rebased to resolve conflicts

@vincepri (Member)

/milestone v0.3.0-rc.1

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.0, v0.3.0-rc.1 Feb 16, 2020
@vincepri (Member)

/milestone v0.3.0

bumping this from today's release

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.0-rc.1, v0.3.0 Feb 19, 2020
@vincepri (Member)

/milestone v0.3.0-rc.2

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.0, v0.3.0-rc.2 Feb 20, 2020
@vincepri (Member)

Reviewing this now

@vincepri (Member)

@ncdc did you want to take another look?

@ncdc (Contributor)

ncdc commented Feb 26, 2020

@vincepri yes I have a review in progress

if m.Spec.NodeStartupTimeout != nil && m.Spec.NodeStartupTimeout.Seconds() < 30 {
	allErrs = append(
		allErrs,
		field.Invalid(field.NewPath("spec", "nodeStartupTimeout"), m.Spec.NodeStartupTimeout, "must be greater at least 30s"),
Contributor:
close 😄 - remove "greater"

Contributor Author:
Doh 🤦‍♂ 😂
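
For reference, the corrected message from that exchange would simply drop the stray word:

field.Invalid(field.NewPath("spec", "nodeStartupTimeout"), m.Spec.NodeStartupTimeout, "must be at least 30s"),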

controller controller.Controller
recorder record.EventRecorder
scheme *runtime.Scheme
clusterNodeInformers *sync.Map
Contributor:
I don't believe this needs to be a pointer

Comment on lines 306 to 325
mhcList := &clusterv1.MachineHealthCheckList{}
if err := r.Client.List(
	context.Background(),
	mhcList,
	&client.ListOptions{Namespace: machine.Namespace},
	client.MatchingFields{mhcClusterNameIndex: machine.Spec.ClusterName},
); err != nil {
	r.Log.Error(err, "Unable to list MachineHealthChecks", "node", node.Name, "machine", machine.Name, "namespace", machine.Namespace)
	return nil
}

var requests []reconcile.Request
for k := range mhcList.Items {
	mhc := &mhcList.Items[k]
	if hasMatchingLabels(mhc.Spec.Selector, machine.Labels) {
		key := util.ObjectKey(mhc)
		requests = append(requests, reconcile.Request{NamespacedName: key})
	}
}
return requests
Contributor:
Would you want to extract this to a function for reuse here & in machineToMachineHealthCheck?

Contributor Author:
I've changed this method to call machineToMachineHealthCheck once it has the machine, I think that makes sense to do
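
A rough sketch of that refactor, with simplified signatures; getMachineFromNode is an assumed helper, and the real mapper functions take controller-runtime handler objects rather than typed arguments:

// nodeToMachineHealthCheck resolves the Machine that backs the Node and then
// delegates to machineToMachineHealthCheck instead of repeating the
// list-and-label-match logic shown above.
func (r *MachineHealthCheckReconciler) nodeToMachineHealthCheck(node *corev1.Node) []reconcile.Request {
	machine, err := r.getMachineFromNode(node.Name) // assumed helper
	if err != nil || machine == nil {
		return nil
	}
	return r.machineToMachineHealthCheck(machine)
}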

@vincepri vincepri modified the milestones: v0.3.0-rc.2, v0.3.0 Feb 26, 2020
@JoelSpeed (Contributor Author) left a comment

@ncdc Thanks for the review again, I've addressed all of your feedback and left a few comments where appropriate

}

// durations should all be less than 1 Hour
minDuration := time.Hour
Contributor Author:
Yeah, that's a good idea. I then started the range from durations[1:]; not sure if that's good or not from a readability perspective.
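
For context, a minimal sketch of the pattern under discussion (not necessarily the exact code in this PR): seed the minimum with the first element, then range over durations[1:]:

// minDuration returns the smallest duration in a non-empty slice.
func minDuration(durations []time.Duration) time.Duration {
	shortest := durations[0]
	for _, d := range durations[1:] {
		if d < shortest {
			shortest = d
		}
	}
	return shortest
}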

// Ensure that concurrent reconciles don't clash when setting up watches

key := util.ObjectKey(cluster)
if _, ok := r.loadClusterNodeInformer(key); ok {
Contributor Author:
I've gone back to a map and RWMutex; the logic is now:

  • Check if the informer exists in the map under RLock
  • If it doesn't, attempt to acquire the write Lock
  • Once the Lock is acquired, double-check no one else updated the map in the meantime
  • Still under the Lock, set up the informer and add it to the map

I broke this into a couple of smaller methods so the locks could be scoped nicely (sketched below); let me know what you think.
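
A sketch of what those smaller methods might look like, reusing the field names from the diff (clusterNodeInformers map[client.ObjectKey]cache.Informer guarded by a sync.RWMutex); createClusterNodeInformer and the informer construction are assumptions, not the PR's exact code:

// loadClusterNodeInformer checks for an existing informer under the read lock.
func (r *MachineHealthCheckReconciler) loadClusterNodeInformer(key client.ObjectKey) (cache.Informer, bool) {
	r.clusterNodeInformersLock.RLock()
	defer r.clusterNodeInformersLock.RUnlock()
	informer, ok := r.clusterNodeInformers[key]
	return informer, ok
}

// createClusterNodeInformer sets up the informer under the write lock,
// double-checking that a concurrent reconcile did not register it first.
func (r *MachineHealthCheckReconciler) createClusterNodeInformer(ctx context.Context, cluster *clusterv1.Cluster, key client.ObjectKey) error {
	r.clusterNodeInformersLock.Lock()
	defer r.clusterNodeInformersLock.Unlock()
	if _, ok := r.clusterNodeInformers[key]; ok {
		return nil
	}
	informer, err := r.buildNodeInformerForCluster(ctx, cluster) // assumed helper: builds a Node informer for the workload cluster
	if err != nil {
		return err
	}
	r.clusterNodeInformers[key] = informer
	return nil
}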

Comment on lines 152 to 153
// a node with only a name represents a
// not found node in the target
Contributor:
This is subtle, especially when combined with L100. Do you think it would be clearer if we added a nodeMissing field to healthCheckTarget?
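
Roughly what that suggestion would look like; the surrounding field names are assumed, with nodeMissing being the proposed addition:

type healthCheckTarget struct {
	MHC     *clusterv1.MachineHealthCheck
	Machine *clusterv1.Machine
	Node    *corev1.Node
	// nodeMissing makes the "Node no longer exists" case explicit, rather
	// than signalling it with a Node object that only has its Name set.
	nodeMissing bool
}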

recorder record.EventRecorder
scheme *runtime.Scheme
clusterNodeInformers map[client.ObjectKey]cache.Informer
clusterNodeInformersLock *sync.RWMutex
Contributor:
Suggested change:
-	clusterNodeInformersLock *sync.RWMutex
+	clusterNodeInformersLock sync.RWMutex

@k8s-ci-robot (Contributor)

@JoelSpeed: The following test failed, say /retest to rerun all failed tests:

Test name: pull-cluster-api-capd-e2e
Commit: 067f1e0
Rerun command: /test pull-cluster-api-capd-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ncdc (Contributor) left a comment

LGTM

r.controller = controller
r.recorder = mgr.GetEventRecorderFor("machinehealthcheck-controller")
r.scheme = mgr.GetScheme()
r.clusterNodeInformers = make(map[client.ObjectKey]cache.Informer)
r.clusterNodeInformersLock = sync.RWMutex{}
Contributor:
nit (non-blocking): the zero value of an RWMutex is ready for use, so this is not strictly necessary
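
A minimal illustration of the nit: the zero value of sync.RWMutex is an unlocked, ready-to-use mutex, so the explicit assignment above can simply be dropped.

var mu sync.RWMutex // never explicitly initialised
mu.Lock()           // valid: the zero value is an unlocked mutex
mu.Unlock()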

@ncdc (Contributor)

ncdc commented Feb 28, 2020

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed, ncdc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 28, 2020
@vincepri (Member) left a comment

One minor comment, can be tackled later.

Thank you for the great work @JoelSpeed!
:shipit: 🎉

/lgtm

}

var requests []reconcile.Request
for k := range mhcList.Items {
Member:
Given that we're not paginating, the list might not have all the results we need. Although this seems unlikely for now, I'd add at least a TODO, wdyt?

Contributor:
This is coming from the cache. I'm fairly certain we only need to paginate if we're doing live reads. Right?

Member:
Not 100% sure; what if there are more items? We should be able to test this one easily.

Contributor:
Follow-up from https://kubernetes.slack.com/archives/C0EG7JC6T/p1582925668071800: the apiserver does not force paging on clients; clients must request paging by setting options.Limit > 0. And the controller-runtime informer-based caching client we get from the Manager does a full List() without pagination.
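
For reference, a sketch of a client that opts into paging (assumed setup with a typed client-go clientset and current client-go signatures, not code from this PR): the server only returns partial results when Limit is set, and the Continue token fetches subsequent pages.

opts := metav1.ListOptions{Limit: 500} // opt in to server-side paging
for {
	nodes, err := clientset.CoreV1().Nodes().List(ctx, opts)
	if err != nil {
		return err
	}
	// ...process nodes.Items...
	if nodes.Continue == "" {
		break // no more pages
	}
	opts.Continue = nodes.Continue
}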

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2020
@vincepri vincepri modified the milestones: v0.3.0, v0.3.0-rc.3 Feb 28, 2020
@k8s-ci-robot k8s-ci-robot merged commit c340b22 into kubernetes-sigs:master Feb 28, 2020
@JoelSpeed JoelSpeed deleted the mhc-targets branch March 2, 2020 09:46