🌱 Use all available endpoints for etcd #2888

gab-satchi · 2020-04-09T20:50:32Z

What this PR does / why we need it:
This PR is an experiment to use all available etcd endpoints when creating an etcd client. KCP currently has three places where it needs to create a client:

- `EtcdIsHealthy`, which polls each node's etcd to confirm it is healthy before a scale up or down. This needs to keep connecting to a single node at a time.
- `ForwardEtcdLeadership`, which moves the leader role to another node if the leader is about to be scaled down. This call must be made against the leader endpoint, so `forLeader` is used here.
- `RemoveEtcdMemberForMachine`, which removes an etcd member during a scale down. Any node can serve this call, but it still requires consensus.

#2844
/hold

gab-satchi · 2020-04-09T21:24:09Z

controlplane/kubeadm/internal/cluster.go

@@ -107,7 +107,7 @@ func (m *Management) GetWorkloadCluster(ctx context.Context, clusterKey client.O
 		RootCAs:      caPool,
 		Certificates: []tls.Certificate{clientCert},
 	}
-
+	cfg.InsecureSkipVerify = true


This is being done because the endpoint used to be statically set to "127.0.0.1", which was in the SAN list for the etcd certs. With these changes, TLS verification was failing silently because we now use a dynamic set of endpoints.

Would it be possible to get an example where the endpoint advertised by the cluster isn't included in the SAN list for the etcd certs? kubeadm should be generating the certificates to include both localhost and the host IP in the listed SANs.

Kubeadm is doing the right thing, but because we have our own dialer, we don't use endpoints the way they were meant to be used. Right now, we send the etcd pod names as endpoints, and those are of course not included in the SAN list. Our dialer then uses the pod name to figure out which pod to set up port forwarding to.

Yup, what @gab-satchi said. The address being verified is pod.namespace. The API server itself provides some guarantee of routing us to the correct place, and mutual TLS authentication means we don't really need to verify the identity further.

vincepri · 2020-04-13T20:41:33Z

/milestone v0.3.x

Setting the milestone to after v0.3.4 for now, until we have better e2e in place for this change to go in

vincepri · 2020-04-16T14:02:12Z

/assign @detiber @randomvariable

for reviews

detiber · 2020-04-16T14:23:56Z

controlplane/kubeadm/internal/cluster.go

@@ -107,7 +107,7 @@ func (m *Management) GetWorkloadCluster(ctx context.Context, clusterKey client.O
 		RootCAs:      caPool,
 		Certificates: []tls.Certificate{clientCert},
 	}
-
+	cfg.InsecureSkipVerify = true


Would it be possible to get an example where the endpoint advertised by the cluster isn't included in the SAN list for the etcd certs? kubeadm should be generating the certificates to include both localhost and the host IP in the listed SANs.

fabriziopandini · 2020-05-03T08:57:17Z

controlplane/kubeadm/internal/etcd_client_generator.go

 	}
 	dialer, err := proxy.NewDialer(p)
 	if err != nil {
 		return nil, err
 	}
-	etcdclient, err := etcd.NewEtcdClient("127.0.0.1", dialer.DialContextWithAddr, c.tlsConfig)
+	etcdclient, err := etcd.NewEtcdClient(endpoints, dialer.DialContextWithAddr, c.tlsConfig)


In kubeadm, after connecting to etcd, we call `client.Sync` to get rid of discrepancies between the list of etcd endpoints (pods in this case) and the list of etcd members actually running.
Should we do the same here as well?

Probably not, because the real IPs aren't accessible from the management cluster. We always have to go through the API server to reach the pods.

gab-satchi · 2020-05-25T19:02:30Z

/retitle 🏃 Use all available endpoints for etcd

gab-satchi · 2020-05-25T19:03:19Z

/unhold

sedefsavas · 2020-05-25T21:06:45Z

@gab-satchi I think you need to rebase.

k8s-ci-robot · 2020-05-25T22:09:50Z

@gab-satchi: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-cluster-api-capd-e2e | `efc73dd` | link | `/test pull-cluster-api-capd-e2e` |

sedefsavas · 2020-05-25T21:40:41Z

controlplane/kubeadm/internal/etcd_client_generator.go

 	"github.com/pkg/errors"
+	kerrors "k8s.io/apimachinery/pkg/util/errors"
+


nit. extra space.

sedefsavas · 2020-05-25T21:52:09Z

controlplane/kubeadm/internal/etcd/etcd.go

 	etcdClient, err := clientv3.New(clientv3.Config{
-		Endpoints:   []string{endpoint},
+		Endpoints:   endpoints,
 		DialTimeout: etcdTimeout,
 		DialOptions: []grpc.DialOption{
 			grpc.WithBlock(), // block until the underlying connection is up


A couple of lines below, do we need the `etcdClient.Endpoints()` call?
Maybe after making that call we should check whether any endpoints were returned, and error if the list is empty?

`clientv3.New` will error if it's given an empty slice, so that should already be covered. Will remove the `etcdClient.Endpoints()` call.

gab-satchi · 2020-05-26T14:31:48Z

Putting it back on hold. Found a bug in leadership forwarding

gab-satchi · 2020-05-26T14:31:55Z

/hold

vincepri · 2020-06-02T19:40:21Z

/retitle 🌱 Use all available endpoints for etcd

- exclude node being removed from nodelist for etcd client

gab-satchi · 2020-06-08T17:27:09Z

/unhold

sedefsavas · 2020-06-08T17:50:48Z

lgtm

vincepri

/approve
/assign @detiber

k8s-ci-robot · 2020-06-10T20:07:27Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gab-satchi, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [vincepri]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

benmoss · 2020-06-10T20:27:14Z

/lgtm

k8s-ci-robot added the do-not-merge/hold, cncf-cla: yes, and size/M labels Apr 9, 2020

k8s-ci-robot requested review from detiber and JoelSpeed April 9, 2020 20:50

sedefsavas reviewed Apr 10, 2020

This was referenced Apr 10, 2020

🏃[KCP] Recover from a manual machine deletion #2841

Merged

[KCP] Get etcd client using all nodes in ReconcileEtcdMembers #2896

Closed

k8s-ci-robot added this to the v0.3.x milestone Apr 13, 2020

k8s-ci-robot assigned detiber and randomvariable Apr 16, 2020

detiber reviewed Apr 16, 2020

randomvariable reviewed Apr 21, 2020

randomvariable reviewed Apr 21, 2020

gab-satchi force-pushed the 2844-kcp-etcd branch from 0f180bc to efc73dd Compare May 1, 2020 20:01

fabriziopandini reviewed May 3, 2020

k8s-ci-robot changed the title ~~🏃 WIP - Use all available endpoints for etcd~~ 🏃 Use all available endpoints for etcd May 25, 2020

k8s-ci-robot removed the do-not-merge/hold label May 25, 2020

gab-satchi force-pushed the 2844-kcp-etcd branch from efc73dd to 4a4261f Compare May 25, 2020 21:29

sedefsavas reviewed May 25, 2020

k8s-ci-robot added the do-not-merge/hold label May 26, 2020

k8s-ci-robot changed the title ~~🏃 Use all available endpoints for etcd~~ 🌱 Use all available endpoints for etcd Jun 2, 2020

Adds forNodes that will use all available etcd endpoints

2708f8e

- exclude node being removed from nodelist for etcd client

gab-satchi force-pushed the 2844-kcp-etcd branch from 1bb6561 to 2708f8e Compare June 8, 2020 13:13

k8s-ci-robot removed the do-not-merge/hold label Jun 8, 2020

vincepri approved these changes Jun 10, 2020

k8s-ci-robot added the approved label Jun 10, 2020

k8s-ci-robot assigned benmoss Jun 10, 2020

k8s-ci-robot added the lgtm label Jun 10, 2020

k8s-ci-robot merged commit a9fc88e into kubernetes-sigs:master Jun 10, 2020
