🌱 Use all available endpoints for etcd #2888

Merged
merged 1 commit into from
Jun 10, 2020

Member

@gab-satchi gab-satchi commented Apr 9, 2020

What this PR does / why we need it:
This PR is an experiment to use all available etcd endpoints when creating an etcd client. KCP currently creates a client in three places:

  • EtcdIsHealthy, which polls each node's etcd to confirm it is healthy before a scale up or scale down. This needs to keep connecting to a single node at a time.
  • ForwardEtcdLeadership, which moves the leader role to another node if the leader is about to be scaled down. This call must be made against the leader's endpoint, so forLeader is used here.
  • RemoveEtcdMemberForMachine, which removes an etcd member during a scale down. This call can be made through any node but still requires consensus.
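
The three cases above differ only in which nodes the client is pointed at. A minimal sketch of that selection, assuming a flat list of control plane node names, might look like the following. The type `etcdOperation`, the function `endpointsFor`, and the node names are hypothetical and are not taken from the PR.

```go
package main

import "fmt"

// etcdOperation is a hypothetical label for the three KCP call sites.
type etcdOperation int

const (
	opHealthCheck       etcdOperation = iota // EtcdIsHealthy: one node at a time
	opForwardLeadership                      // ForwardEtcdLeadership: leader only
	opRemoveMember                           // RemoveEtcdMemberForMachine: any node
)

// endpointsFor returns the node names a client for op should be given.
// target is the node being health-checked; leader is the current etcd leader.
func endpointsFor(op etcdOperation, nodes []string, target, leader string) []string {
	switch op {
	case opHealthCheck:
		return []string{target}
	case opForwardLeadership:
		return []string{leader}
	default:
		// Any node can serve the request, so hand over a copy of the full list.
		return append([]string(nil), nodes...)
	}
}

func main() {
	nodes := []string{"cp-0", "cp-1", "cp-2"}
	fmt.Println(endpointsFor(opHealthCheck, nodes, "cp-1", "cp-0"))       // prints [cp-1]
	fmt.Println(endpointsFor(opForwardLeadership, nodes, "cp-1", "cp-0")) // prints [cp-0]
	fmt.Println(endpointsFor(opRemoveMember, nodes, "cp-1", "cp-0"))      // prints [cp-0 cp-1 cp-2]
}
```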

#2844
/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 9, 2020
@@ -107,7 +107,7 @@ func (m *Management) GetWorkloadCluster(ctx context.Context, clusterKey client.O
RootCAs: caPool,
Certificates: []tls.Certificate{clientCert},
}

cfg.InsecureSkipVerify = true
Member Author

This is needed because the endpoint used to be statically set to "127.0.0.1", which is in the SAN list for the etcd certs. With these changes we use a dynamic set of endpoints, and TLS verification was failing silently.

Member

Would it be possible to get an example where the endpoint advertised by the cluster isn't included in the SAN list for the etcd certs? kubeadm should be generating the certificates to include both localhost and the host IP in the listed SANs.

Member Author

Kubeadm is doing the right thing, but because we have our own dialer, we don't use endpoints the way they were meant to be used. Right now we send the etcd pod names as endpoints, and those will of course not be in the SAN list. Our dialer then uses the pod name to figure out which pod to set up port forwarding to.

Member

Yup, what @gab-satchi said. The address being verified is pod.namespace. The API server itself provides some guarantee of routing us to the correct place, and mutual TLS authentication means we don't really need to verify the identity further.

@vincepri
Member

/milestone v0.3.x

Setting the milestone to after v0.3.4 for now, until we have better e2e tests in place for this change to go in.

@k8s-ci-robot k8s-ci-robot added this to the v0.3.x milestone Apr 13, 2020
@vincepri
Member

/assign @detiber @randomvariable

for reviews

controlplane/kubeadm/internal/workload_cluster_etcd.go (outdated)
  }
  dialer, err := proxy.NewDialer(p)
  if err != nil {
      return nil, err
  }
- etcdclient, err := etcd.NewEtcdClient("127.0.0.1", dialer.DialContextWithAddr, c.tlsConfig)
+ etcdclient, err := etcd.NewEtcdClient(endpoints, dialer.DialContextWithAddr, c.tlsConfig)
Member

@fabriziopandini fabriziopandini May 3, 2020

In kubeadm, after connecting to etcd, we call client.Sync to get rid of discrepancies between the list of etcd endpoints (pods in this case) and the list of etcd members actually running.
Should we do the same here as well?

Member

Probably not, because the real IPs aren't accessible from the management cluster. We always have to go through the API server to reach the pods.

@gab-satchi
Member Author

/retitle 🏃 Use all available endpoints for etcd

@k8s-ci-robot k8s-ci-robot changed the title 🏃 WIP - Use all available endpoints for etcd 🏃 Use all available endpoints for etcd May 25, 2020
@gab-satchi
Member Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 25, 2020
@sedefsavas

@gab-satchi I think you need to rebase.

@k8s-ci-robot
Contributor

k8s-ci-robot commented May 25, 2020

@gab-satchi: The following test failed, say /retest to rerun all failed tests:

Test name                  Commit   Details  Rerun command
pull-cluster-api-capd-e2e  efc73dd  link     /test pull-cluster-api-capd-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

"github.com/pkg/errors"
kerrors "k8s.io/apimachinery/pkg/util/errors"

nit. extra space.

  etcdClient, err := clientv3.New(clientv3.Config{
-     Endpoints: []string{endpoint},
+     Endpoints: endpoints,
      DialTimeout: etcdTimeout,
      DialOptions: []grpc.DialOption{
          grpc.WithBlock(), // block until the underlying connection is up

A couple of lines below, do we need this call: etcdClient.Endpoints()?
Maybe after making this call we want to check whether any endpoints are returned, and error if the list is empty?

Member Author

clientv3.New will error if it's given an empty slice, so that should already be covered. Will remove the etcdClient.Endpoints() call.

@gab-satchi
Member Author

Putting it back on hold. Found a bug in leadership forwarding.

@gab-satchi
Member Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 26, 2020
@vincepri
Member

vincepri commented Jun 2, 2020

/retitle 🌱 Use all available endpoints for etcd

@k8s-ci-robot k8s-ci-robot changed the title 🏃 Use all available endpoints for etcd 🌱 Use all available endpoints for etcd Jun 2, 2020
- exclude node being removed from nodelist for etcd client
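
The commit note above can be pictured as a simple filter over the node list before the client is created. This is only a sketch of the idea: `excludeNode` and the node names are hypothetical, and the actual change in the PR may be structured differently.

```go
package main

import "fmt"

// excludeNode returns a copy of nodes without the node that is being removed,
// so the etcd client is never pointed at a member that is about to go away.
func excludeNode(nodes []string, removed string) []string {
	kept := make([]string, 0, len(nodes))
	for _, n := range nodes {
		if n != removed {
			kept = append(kept, n)
		}
	}
	return kept
}

func main() {
	nodes := []string{"cp-0", "cp-1", "cp-2"}
	fmt.Println(excludeNode(nodes, "cp-1")) // prints [cp-0 cp-2]
}
```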
@gab-satchi
Member Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 8, 2020
@sedefsavas

lgtm

Member

@vincepri vincepri left a comment

/approve
/assign @detiber

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gab-satchi, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 10, 2020
@benmoss

benmoss commented Jun 10, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 10, 2020
@k8s-ci-robot k8s-ci-robot merged commit a9fc88e into kubernetes-sigs:master Jun 10, 2020
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
8 participants