
✨ Adds healthcheck for workload clusters #2295

Merged: 1 commit merged into kubernetes-sigs:master on Feb 11, 2020

Conversation

@chuckha (Contributor) commented Feb 8, 2020

Signed-off-by: Chuck Ha chuckh@vmware.com

Co-authored-by: Daniel Lipovetsky dlipovetsky@d2iq.com

What this PR does / why we need it:
This PR adds, but does not yet use, etcd health checking for scaling up workload clusters. We decided to split #2193 into smaller chunks; this is the largest chunk. The next set of commits will integrate this work and fix the tests.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Related to #2243 #2241

/assign @dlipovetsky @detiber @randomvariable

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 8, 2020
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 8, 2020
@@ -156,16 +156,16 @@ func (c *Client) Members(ctx context.Context) ([]*Member, error) {
}

clusterID := response.Header.GetClusterId()
members := make([]*Member, len(response.Members))
Contributor Author (chuckha):

This removes nils from the members list.
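A standalone sketch (not code from this PR) of the behaviour being fixed: pre-sizing the slice with a length and skipping entries leaves nil slots, while a zero-length slice plus append keeps only real members. The Member type and the skipped entry here are invented for illustration.

```go
package main

import "fmt"

type Member struct{ Name string }

func main() {
	// Pretend one response entry gets skipped (e.g. it could not be parsed).
	raw := []*Member{{Name: "a"}, nil, {Name: "b"}}

	// Sized with a length and written by index: every skipped entry leaves a nil slot.
	withLen := make([]*Member, len(raw))
	for i, m := range raw {
		if m == nil {
			continue
		}
		withLen[i] = m
	}
	fmt.Println(len(withLen), withLen[1] == nil) // 3 true

	// Zero length plus append: only real members end up in the slice.
	members := make([]*Member, 0)
	for _, m := range raw {
		if m == nil {
			continue
		}
		members = append(members, m)
	}
	fmt.Println(len(members)) // 2
}
```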

@detiber (Member) left a comment:

Copied over a few comments from #2193

controlplane/kubeadm/internal/cluster.go: 2 resolved review threads (1 outdated)
@chuckha (Contributor, Author) commented Feb 10, 2020

@detiber Added some tests and flipped the logic. Good catch, thanks for pointing that out 👍

}
}

type fakeClient struct {
Contributor Author (chuckha):

The fake.NewClient is being deprecated, and this is a truer unit-test solution.

Member:

I wouldn't worry too much about the deprecation until it happens. We're also getting more involved upstream, so it remains to be seen whether it will go away or not.
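A minimal sketch of the hand-rolled fake pattern chuckha describes, with invented names (etcdClient, a Members method returning strings); the PR's real fakeClient fakes whatever interface the health check actually consumes.

```go
package internal

import "context"

// etcdClient is a hypothetical narrow interface, here only to illustrate the idea:
// the code under test depends on an interface, not on a concrete client.
type etcdClient interface {
	Members(ctx context.Context) ([]string, error)
}

// fakeClient returns canned data, so unit tests need neither a real etcd
// endpoint nor controller-runtime's fake.NewClient.
type fakeClient struct {
	members []string
	err     error
}

func (f *fakeClient) Members(_ context.Context) ([]string, error) {
	return f.members, f.err
}
```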

@chuckha (Contributor, Author) commented Feb 10, 2020

The e2e test failure is an unrelated flake: this PR does not touch any existing code, and the failure is not build-related.

@detiber (Member) commented Feb 10, 2020

/lgtm
/hold

Adding hold in case @dlipovetsky or @randomvariable want to chime in. Feel free to remove the hold when ready to merge.

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 10, 2020
@chuckha chuckha mentioned this pull request Feb 10, 2020
@vincepri (Member): Reviewing

@dlipovetsky (Contributor): Reviewing

@dlipovetsky (Contributor) left a comment:

Just some nits and a question.

controlplane/kubeadm/internal/cluster.go: 5 resolved review threads (all outdated)

// generateEtcdTLSClientBundle builds an etcd client TLS bundle from the Etcd CA for this cluster.
func (c *cluster) generateEtcdTLSClientBundle() (*tls.Config, error) {
clientCert, err := generateClientCert(c.etcdCACert, c.etcdCAkey)
Contributor:

Question: Would we want to create a long-lived client cert in a future PR?

Contributor Author (chuckha):

I'd be a little hesitant about over-optimizing. Until this becomes a bottleneck, I don't think we'll need a long-lived cert, unless there are other reasons to create one that I'm overlooking.
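For context on the cost being discussed, here is a rough standalone sketch (not the PR's generateClientCert) of what generating a client certificate from a CA on the fly involves; the common names, key type, and lifetimes are invented, and error handling is elided for brevity.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

func main() {
	// Stand-in CA; in the PR the CA cert and key come from the cluster's etcd CA secret.
	caTmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "etcd-ca"},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		KeyUsage:              x509.KeyUsageCertSign,
		BasicConstraintsValid: true,
	}
	caKey, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	caDER, _ := x509.CreateCertificate(rand.Reader, caTmpl, caTmpl, &caKey.PublicKey, caKey)
	caCert, _ := x509.ParseCertificate(caDER)

	// The per-call work: generate a fresh key pair and sign a short-lived
	// client certificate with the CA. This is what a long-lived cert would skip.
	clientTmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: "etcd-healthcheck-client"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(10 * time.Minute),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	clientKey, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	clientDER, _ := x509.CreateCertificate(rand.Reader, clientTmpl, caCert, &clientKey.PublicKey, caKey)
	fmt.Printf("generated a %d-byte client certificate\n", len(clientDER))
}
```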

controlplane/kubeadm/internal/cluster_test.go: resolved review thread (outdated)
@dlipovetsky (Contributor):
Thank you @chuckha for reorganizing #2193 into manageable pieces, and for the nice abstractions!

/lgtm

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2020
@@ -156,16 +156,16 @@ func (c *Client) Members(ctx context.Context) ([]*Member, error) {
}

clusterID := response.Header.GetClusterId()
members := make([]*Member, len(response.Members))
for i, m := range response.Members {
members := make([]*Member, 0)
Member:

Suggested change:
-	members := make([]*Member, 0)
+	members := make([]*Member, 0, len(response.Members))

If you wanted to pre-allocate the slots but still use append without growing the backing array.
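A standalone sketch of what the suggestion buys: with a zero length and a capacity hint, append fills the backing array that was allocated up front instead of reallocating as the slice grows (the Member type is a stand-in).

```go
package main

import "fmt"

type Member struct{ Name string }

func main() {
	n := 5

	// Length 0, capacity n: len counts only appended members,
	// and the backing array never needs to grow for up to n appends.
	members := make([]*Member, 0, n)
	fmt.Println(len(members), cap(members)) // 0 5

	for i := 0; i < n; i++ {
		members = append(members, &Member{Name: fmt.Sprintf("m%d", i)})
	}
	fmt.Println(len(members), cap(members)) // 5 5
}
```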

controlplane/kubeadm/internal/cluster.go: 2 resolved review threads (1 outdated)
if err != nil {
return err
}
errorList := []error{}
Member:

You can use var err error here and kerrors.NewAggregate in the for loop, like we do with reterr = kerrors.NewAggregate([]error{reterr, err}).

Contributor Author (chuckha):

Interesting, is this a style preference?

It doesn't appear to be the convention, because slightly farther down in the cluster_controller we're doing this:

errs := []error{}
for _, err := range reconciliationErrors {
	if requeueErr, ok := errors.Cause(err).(capierrors.HasRequeueAfterError); ok {
		// Only record and log the first RequeueAfterError.
		if !res.Requeue {
			res.Requeue = true
			res.RequeueAfter = requeueErr.GetRequeueAfter()
			logger.Error(err, "Reconciliation for Cluster asked to requeue")
		}
		continue
	}
	errs = append(errs, err)
}
return res, kerrors.NewAggregate(errs)

Member:

I hadn't seen it before working on the webhook conversions to CAPA, so maybe a new(ish) convention, and we should mop up elsewhere.

Member:

We can probably take care of it later

Member:

I'm not sure we should adopt that convention outside of defer; otherwise we risk overwriting or otherwise mishandling the aggregated err var.
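A standalone sketch (invented error values) contrasting the two aggregation patterns discussed in this thread; kerrors is k8s.io/apimachinery/pkg/util/errors, and NewAggregate drops nil entries, so folding into reterr on each iteration also works.

```go
package main

import (
	"errors"
	"fmt"

	kerrors "k8s.io/apimachinery/pkg/util/errors"
)

func main() {
	checks := []error{nil, errors.New("node a unhealthy"), errors.New("node b unhealthy")}

	// Pattern used in the PR and the cluster controller: collect the errors
	// in a slice and aggregate once at the end.
	errList := []error{}
	for _, err := range checks {
		if err != nil {
			errList = append(errList, err)
		}
	}
	fmt.Println(kerrors.NewAggregate(errList))

	// Pattern detiber points to: fold each error into a single aggregate as
	// you go. NewAggregate filters out nils, so a nil reterr is harmless.
	var reterr error
	for _, err := range checks {
		reterr = kerrors.NewAggregate([]error{reterr, err})
	}
	fmt.Println(reterr)
}
```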

controlplane/kubeadm/internal/cluster.go: resolved review thread
if machine.Status.NodeRef == nil {
return errors.Errorf("control plane machine %q has no node ref", machine.Name)
}
if _, ok := resp[machine.Status.NodeRef.Name]; !ok {
Member:

Is this just a safety check? I wouldn't expect a reference to be there without a Name.

Contributor Author (chuckha):

Ah, this part could use a comment; it's not super obvious.

This cross-checks the known machines against the nodes that were covered by the health check. It guards against etcd members that exist out-of-(cluster-api)-band.
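A hedged sketch of how the check plus the suggested comment might read; the function name, signature, and map value type are invented for illustration, and only the two conditionals mirror the snippet above.

```go
package internal

import (
	"github.com/pkg/errors"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
)

// verifyMachinesWereChecked is a hypothetical helper, not the PR's actual code.
// Every known control plane machine must map to a node that the health check
// covered; a miss means an etcd member exists outside Cluster API's knowledge.
func verifyMachinesWereChecked(machines []clusterv1.Machine, checked map[string]error) error {
	for _, machine := range machines {
		if machine.Status.NodeRef == nil {
			return errors.Errorf("control plane machine %q has no node ref", machine.Name)
		}
		// Cross-check the machine's node against the nodes covered by the health check.
		if _, ok := checked[machine.Status.NodeRef.Name]; !ok {
			return errors.Errorf("node %q was not health checked", machine.Status.NodeRef.Name)
		}
	}
	return nil
}
```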

controlplane/kubeadm/internal/cluster.go: 4 resolved review threads (2 outdated)
controlplane/kubeadm/internal/etcd/util/util.go: 1 resolved review thread (outdated)
Signed-off-by: Chuck Ha <chuckh@vmware.com>

Co-authored-by: Daniel Lipovetsky <dlipovetsky@d2iq.com>
@chuckha (Contributor, Author) commented Feb 11, 2020

@vincepri anything left to do before lgtm?

ResourceName: etcdStaticPodName(nodeName),
KubeConfig: c.restConfig,
TLSConfig: tlsConfig,
Port: 2379, // TODO: the pod doesn't expose a port. Is this a problem?
Member:

Not that I know of. Especially given the use of host networking, we can rule out behavioural differences across CNIs as a potential problem.

// This does not support external etcd.
p := proxy.Proxy{
Kind: "pods",
Namespace: "kube-system", // TODO, can etcd ever run in a different namespace?
Member:

It's hard-coded as a constant: metav1.NamespaceSystem.
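For reference, a tiny standalone example of the constant being pointed to; it comes from k8s.io/apimachinery and could replace the hard-coded "kube-system" string in the snippet above.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// metav1.NamespaceSystem is the upstream constant for the "kube-system" namespace.
	fmt.Println(metav1.NamespaceSystem) // kube-system
}
```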

@vincepri (Member) left a comment:

/approve
/hold cancel

/assign @randomvariable
for final LGTM

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 11, 2020
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chuckha, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@randomvariable (Member):

We'll need to review generating certificates on the fly all of the time from a performance perspective, but otherwise great as a first pass of this.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 11, 2020
@k8s-ci-robot k8s-ci-robot merged commit 4d89d56 into kubernetes-sigs:master Feb 11, 2020
@dlipovetsky (Contributor):

We'll need to review generating certificates on the fly all of the time from a performance perspective, but otherwise great as a first pass of this.

I had the same thought.

@chuckha (Contributor, Author) commented Feb 11, 2020

@randomvariable @dlipovetsky we should open an issue to track the performance concerns, but we'll need graphs, metrics, or both before it makes sense to try to optimize things.
