
✨ Adds healthcheck for workload clusters #2295

Merged: 1 commit merged into kubernetes-sigs:master on Feb 11, 2020

Conversation

@chuckha (Contributor) commented Feb 8, 2020

Signed-off-by: Chuck Ha chuckh@vmware.com

Co-authored-by: Daniel Lipovetsky dlipovetsky@d2iq.com

What this PR does / why we need it:
This PR adds, but does not yet use, etcd health checking for scaling up workload clusters. We decided to split #2193 into smaller chunks; this is the largest chunk. The next set of commits will integrate this work and fix the tests.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Related to #2243 #2241

/assign @dlipovetsky @detiber @randomvariable

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 8, 2020
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 8, 2020
@@ -156,16 +156,16 @@ func (c *Client) Members(ctx context.Context) ([]*Member, error) {
}

clusterID := response.Header.GetClusterId()
members := make([]*Member, len(response.Members))
Contributor Author (chuckha):

This removes nils from the members list.
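A standalone sketch (not code from this PR) of the behaviour being fixed: pre-sizing the slice with a length and skipping entries leaves nil slots, while a zero-length slice plus append keeps only real members. The Member type and the skipped entry here are invented for illustration.

```go
package main

import "fmt"

type Member struct{ Name string }

func main() {
	// Pretend one response entry gets skipped (e.g. it could not be parsed).
	raw := []*Member{{Name: "a"}, nil, {Name: "b"}}

	// Sized with a length and written by index: every skipped entry leaves a nil slot.
	withLen := make([]*Member, len(raw))
	for i, m := range raw {
		if m == nil {
			continue
		}
		withLen[i] = m
	}
	fmt.Println(len(withLen), withLen[1] == nil) // 3 true

	// Zero length plus append: only real members end up in the slice.
	members := make([]*Member, 0)
	for _, m := range raw {
		if m == nil {
			continue
		}
		members = append(members, m)
	}
	fmt.Println(len(members)) // 2
}
```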

@detiber (Member) left a comment:

Copied over a few comments from #2193

controlplane/kubeadm/internal/cluster.go: 2 resolved review threads (1 outdated)
@chuckha (Contributor, Author) commented Feb 10, 2020

@detiber Added some tests and flipped the logic. Good catch, thanks for pointing that out 👍

}
}

type fakeClient struct {
Contributor Author (chuckha):

The fake.NewClient is being deprecated, and this is a truer unit-test solution.

Member:

I wouldn't worry too much about the deprecation until it happens. We're also getting more involved upstream, so it remains to be seen whether it will go away or not.
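A minimal sketch of the hand-rolled fake pattern chuckha describes, with invented names (etcdClient, a Members method returning strings); the PR's real fakeClient fakes whatever interface the health check actually consumes.

```go
package internal

import "context"

// etcdClient is a hypothetical narrow interface, here only to illustrate the idea:
// the code under test depends on an interface, not on a concrete client.
type etcdClient interface {
	Members(ctx context.Context) ([]string, error)
}

// fakeClient returns canned data, so unit tests need neither a real etcd
// endpoint nor controller-runtime's fake.NewClient.
type fakeClient struct {
	members []string
	err     error
}

func (f *fakeClient) Members(_ context.Context) ([]string, error) {
	return f.members, f.err
}
```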

@chuckha (Contributor, Author) commented Feb 10, 2020

The e2e test failure is an unrelated flake: this PR does not touch any existing code, and the failure is not build-related.

@detiber (Member) commented Feb 10, 2020

/lgtm
/hold

Adding hold in case @dlipovetsky or @randomvariable want to chime in. Feel free to remove the hold when ready to merge.

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 10, 2020
@chuckha chuckha mentioned this pull request Feb 10, 2020
@vincepri (Member): Reviewing

@dlipovetsky (Contributor): Reviewing

@dlipovetsky (Contributor) left a comment:

Just some nits and a question.

controlplane/kubeadm/internal/cluster.go: 5 resolved review threads (all outdated)

// generateEtcdTLSClientBundle builds an etcd client TLS bundle from the Etcd CA for this cluster.
func (c *cluster) generateEtcdTLSClientBundle() (*tls.Config, error) {
clientCert, err := generateClientCert(c.etcdCACert, c.etcdCAkey)
Contributor:

Question: Would we want to create a long-lived client cert in a future PR?

Contributor Author (chuckha):

I'd be a little hesitant about over-optimizing. Until this becomes a bottleneck, I don't think we'll need a long-lived cert, unless there are other reasons to create one that I'm overlooking.
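For context on the cost being discussed, here is a rough standalone sketch (not the PR's generateClientCert) of what generating a client certificate from a CA on the fly involves; the common names, key type, and lifetimes are invented, and error handling is elided for brevity.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

func main() {
	// Stand-in CA; in the PR the CA cert and key come from the cluster's etcd CA secret.
	caTmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "etcd-ca"},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		KeyUsage:              x509.KeyUsageCertSign,
		BasicConstraintsValid: true,
	}
	caKey, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	caDER, _ := x509.CreateCertificate(rand.Reader, caTmpl, caTmpl, &caKey.PublicKey, caKey)
	caCert, _ := x509.ParseCertificate(caDER)

	// The per-call work: generate a fresh key pair and sign a short-lived
	// client certificate with the CA. This is what a long-lived cert would skip.
	clientTmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: "etcd-healthcheck-client"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(10 * time.Minute),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	clientKey, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	clientDER, _ := x509.CreateCertificate(rand.Reader, clientTmpl, caCert, &clientKey.PublicKey, caKey)
	fmt.Printf("generated a %d-byte client certificate\n", len(clientDER))
}
```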

controlplane/kubeadm/internal/cluster_test.go: resolved review thread (outdated)
@dlipovetsky (Contributor):
Thank you @chuckha for reorganizing #2193 into manageable pieces, and for the nice abstractions!

/lgtm

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2020
@@ -156,16 +156,16 @@ func (c *Client) Members(ctx context.Context) ([]*Member, error) {
}

clusterID := response.Header.GetClusterId()
members := make([]*Member, len(response.Members))
for i, m := range response.Members {
members := make([]*Member, 0)
Member:

Suggested change:
-	members := make([]*Member, 0)
+	members := make([]*Member, 0, len(response.Members))

If you wanted to pre-allocate the slots but still use append without growing the backing array.
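A standalone sketch of what the suggestion buys: with a zero length and a capacity hint, append fills the backing array that was allocated up front instead of reallocating as the slice grows (the Member type is a stand-in).

```go
package main

import "fmt"

type Member struct{ Name string }

func main() {
	n := 5

	// Length 0, capacity n: len counts only appended members,
	// and the backing array never needs to grow for up to n appends.
	members := make([]*Member, 0, n)
	fmt.Println(len(members), cap(members)) // 0 5

	for i := 0; i < n; i++ {
		members = append(members, &Member{Name: fmt.Sprintf("m%d", i)})
	}
	fmt.Println(len(members), cap(members)) // 5 5
}
```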

controlplane/kubeadm/internal/cluster.go: 2 resolved review threads (1 outdated)
if err != nil {
return err
}
errorList := []error{}
Member:

You can use var err error here and kerrors.NewAggregate in the for loop, like we do with reterr = kerrors.NewAggregate([]error{reterr, err}).

Contributor Author (chuckha):

Interesting, is this a style preference?

It doesn't appear to be the convention, because slightly farther down in the cluster_controller we're doing this:

errs := []error{}
for _, err := range reconciliationErrors {
	if requeueErr, ok := errors.Cause(err).(capierrors.HasRequeueAfterError); ok {
		// Only record and log the first RequeueAfterError.
		if !res.Requeue {
			res.Requeue = true
			res.RequeueAfter = requeueErr.GetRequeueAfter()
			logger.Error(err, "Reconciliation for Cluster asked to requeue")
		}
		continue
	}
	errs = append(errs, err)
}
return res, kerrors.NewAggregate(errs)

Member:

I hadn't seen it before working on the webhook conversions to CAPA, so maybe a new(ish) convention, and we should mop up elsewhere.

Member:

We can probably take care of it later

Member:

I'm not sure we should adopt that convention outside of defer; otherwise we risk overwriting or otherwise mishandling the aggregated err var.
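A standalone sketch (invented error values) contrasting the two aggregation patterns discussed in this thread; kerrors is k8s.io/apimachinery/pkg/util/errors, and NewAggregate drops nil entries, so folding into reterr on each iteration also works.

```go
package main

import (
	"errors"
	"fmt"

	kerrors "k8s.io/apimachinery/pkg/util/errors"
)

func main() {
	checks := []error{nil, errors.New("node a unhealthy"), errors.New("node b unhealthy")}

	// Pattern used in the PR and the cluster controller: collect the errors
	// in a slice and aggregate once at the end.
	errList := []error{}
	for _, err := range checks {
		if err != nil {
			errList = append(errList, err)
		}
	}
	fmt.Println(kerrors.NewAggregate(errList))

	// Pattern detiber points to: fold each error into a single aggregate as
	// you go. NewAggregate filters out nils, so a nil reterr is harmless.
	var reterr error
	for _, err := range checks {
		reterr = kerrors.NewAggregate([]error{reterr, err})
	}
	fmt.Println(reterr)
}
```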

controlplane/kubeadm/internal/cluster.go: resolved review thread
if machine.Status.NodeRef == nil {
return errors.Errorf("control plane machine %q has no node ref", machine.Name)
}
if _, ok := resp[machine.Status.NodeRef.Name]; !ok {
Member:

Is this just a safety check? I wouldn't expect a reference to be there without a Name.

Contributor Author (chuckha):

Ah, this part could use a comment; it's not super obvious.

This cross-checks the known machines against the nodes that were covered by the health check. It guards against etcd members that exist out-of-(cluster-api)-band.
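A hedged sketch of how the check plus the suggested comment might read; the function name, signature, and map value type are invented for illustration, and only the two conditionals mirror the snippet above.

```go
package internal

import (
	"github.com/pkg/errors"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
)

// verifyMachinesWereChecked is a hypothetical helper, not the PR's actual code.
// Every known control plane machine must map to a node that the health check
// covered; a miss means an etcd member exists outside Cluster API's knowledge.
func verifyMachinesWereChecked(machines []clusterv1.Machine, checked map[string]error) error {
	for _, machine := range machines {
		if machine.Status.NodeRef == nil {
			return errors.Errorf("control plane machine %q has no node ref", machine.Name)
		}
		// Cross-check the machine's node against the nodes covered by the health check.
		if _, ok := checked[machine.Status.NodeRef.Name]; !ok {
			return errors.Errorf("node %q was not health checked", machine.Status.NodeRef.Name)
		}
	}
	return nil
}
```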

controlplane/kubeadm/internal/cluster.go: 4 resolved review threads (2 outdated)
controlplane/kubeadm/internal/etcd/util/util.go: 1 resolved review thread (outdated)
Signed-off-by: Chuck Ha <chuckh@vmware.com>

Co-authored-by: Daniel Lipovetsky <dlipovetsky@d2iq.com>
@chuckha (Contributor, Author) commented Feb 11, 2020

@vincepri anything left to do before lgtm?

ResourceName: etcdStaticPodName(nodeName),
KubeConfig: c.restConfig,
TLSConfig: tlsConfig,
Port: 2379, // TODO: the pod doesn't expose a port. Is this a problem?
Member:

Not that I know of. Especially given the use of host networking, we can rule out behavioural differences across CNIs as a potential problem.

// This does not support external etcd.
p := proxy.Proxy{
Kind: "pods",
Namespace: "kube-system", // TODO, can etcd ever run in a different namespace?
Member:

It's hard-coded as a constant: metav1.NamespaceSystem.
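For reference, a tiny standalone example of the constant being pointed to; it comes from k8s.io/apimachinery and could replace the hard-coded "kube-system" string in the snippet above.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// metav1.NamespaceSystem is the upstream constant for the "kube-system" namespace.
	fmt.Println(metav1.NamespaceSystem) // kube-system
}
```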

@vincepri (Member) left a comment:

/approve
/hold cancel

/assign @randomvariable
for final LGTM

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 11, 2020
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chuckha, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@randomvariable (Member):

We'll need to review generating certificates on the fly all of the time from a performance perspective, but otherwise great as a first pass of this.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 11, 2020
@k8s-ci-robot k8s-ci-robot merged commit 4d89d56 into kubernetes-sigs:master Feb 11, 2020
@dlipovetsky (Contributor):

We'll need to review generating certificates on the fly all of the time from a performance perspective, but otherwise great as a first pass of this.

I had the same thought.

@chuckha (Contributor, Author) commented Feb 11, 2020

@randomvariable @dlipovetsky we should open an issue to track the performance concerns, but we'll need graphs, metrics, or both before it makes sense to try to optimize things.
