Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kubeadm stacked etcd #69486

Merged
merged 2 commits into from
Oct 27, 2018

Conversation

fabriziopandini
Copy link
Member

@fabriziopandini fabriziopandini commented Oct 6, 2018

What this PR does / why we need it:
kubeadm now automatically creates a new stacked etcd member when joining a new control plane node (does not applies to external etcd)

Which issue(s) this PR fixes:
Fixes # kubernetes/kubeadm#1123

Special notes for your reviewer:
IMO two points deserve more attention:

  • The fact the local/stacked etcd client now uses the IP defined by API server advertise address
  • How the joining node discovers the list of endpoints for the existing etcd members

I will keep this in WIP until there is agreement on the above points

Release note:

kubeadm now automatically creates a new stacked etcd member when joining a new control plane node (does not applies to external etcd)

/sig cluster-lifecycle
/kind feature

/cc @timothysc
/cc @chuckha
/cc @detiber
@kubernetes/sig-cluster-lifecycle-pr-reviews

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Oct 6, 2018
@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubeadm labels Oct 6, 2018
Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR @fabriziopandini . added a couple of comments.

// advertise address
advertiseAddress := net.ParseIP(cfg.APIEndpoint.AdvertiseAddress)
if advertiseAddress == nil {
return nil, fmt.Errorf("error parsing APIEndpoint AdvertiseAddress %v: is not a valid textual representation of an IP address", cfg.APIEndpoint.AdvertiseAddress)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably better %v: to be %q?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

// notifies the other members of the etcd cluster about the joining member
etcdPeerAddress := fmt.Sprintf("https://%s:2380", cfg.APIEndpoint.AdvertiseAddress)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, did we investigate if this approach is safe?
what happens if the port is already taken on that endpoint?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the default peer port and "shouldn't collide" but we should definitely add a check here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added check

// notifies the other members of the etcd cluster about the joining member
etcdPeerAddress := fmt.Sprintf("https://%s:2380", cfg.APIEndpoint.AdvertiseAddress)

glog.V(1).Infof("Adding etcd Member %s", etcdPeerAddress)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Member -> member:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
glog.V(1).Infof("Updated etcd member list %v", initialCluster)

glog.V(1).Infoln("creating local etcd static pod manifest file")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly creating should be uppercase for consistency?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a recent issue about golang 1.11 as well needing to use Infof vs. Infoln.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using Infoln

}

if len(initialCluster) == 0 {
defaultArguments["initial-cluster"] = fmt.Sprintf("%s=https://%s:2380", cfg.GetNodeName(), cfg.APIEndpoint.AdvertiseAddress)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we make such ports into constants?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we should add them to the default consts file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in a separated PR to keep the scope of this PR as small as possible


ret := map[string]string{}
for _, m := range resp.Members {
// fixes the entry for of the joining member (that doesn't have a name set in the initialCluster returned by etcd)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for of -> for or of only.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

func (c Client) AddMember(name string, peerAddrs string) (map[string]string, error) {
cli, err := clientv3.New(clientv3.Config{
Endpoints: c.Endpoints,
DialTimeout: 5 * time.Second,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder if the connection times increase with the number of etcd pods.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would up the timeout for new connection. Default client inside the api-server is 20 seconds.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

increased timeout to 20

Copy link
Contributor

@chuckha chuckha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the expectation that we've already copied over the etcd ca key/cert and generated the correct certificates?

@@ -36,10 +39,47 @@ const (
)

// CreateLocalEtcdStaticPodManifestFile will write local etcd static pod manifest file.
// This function is used by init (when there the etcd cluster is empty) or by kubeadm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"(when there the etcd cluster is empty)" => "(when the etcd cluster is empty)"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

// AddMember notifies an existing etcd cluster that a new member is joining
func (c Client) AddMember(name string, peerAddrs string) (map[string]string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost want this map[string]string to be a struct, something like etcdClusterConfiguration or similar. Do you think that would add anything here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly here b/c we're re-wrapping an etcd member struct. I would change it to [string]IP at a minimum though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also testability of this function will require using some of the etcd utils I built eons ago under apiserver/storage

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done using a struct

return nil, err
}

// Note for reviewers: I'm not sure this is the best method for getting the endpoint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good approach with one minor concern: there is a race condition that the PodIP may change after the PodIP lookup and before our client access, but I'm ok living with that possibility.

I wonder if there is a clean way to solve this by managing etcd as a statefulset. The other thought I had was adding a Service object to the static manifest and managing that by hand (by kubeadm). The goal of these two ideas is to provide a stable DNS name instead of a PodIP lookup. Either way, those types of changes would be way out of scope for this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the ClusterStatus should have the endpoints.
Or as discussed on the call put the meta-data in an annotation for the pod to make it easier to extract.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to using ClusterStatus or a pre-set annotation to use as a starting point.

After we have the initial endpoint list, we should use that to query the etcd cluster itself for the list of endpoints.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially, we should also raise an error if the queried endpoints do not match the expected endpoints, or if the cluster is not in a healthy state to start.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done reading from cluster status + calling sync to get the real list of endpoints from etcd

Copy link
Member

@timothysc timothysc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bunch of comments plus we should highlight that folks will need todo this on odd numbers for stacked deploys.

@@ -334,8 +338,8 @@ func GetEtcdPeerAltNames(cfg *kubeadmapi.InitConfiguration) (*certutil.AltNames,

// create AltNames with defaults DNSNames/IPs
altNames := &certutil.AltNames{
DNSNames: []string{cfg.NodeRegistration.Name, "localhost"},
IPs: []net.IP{advertiseAddress, net.IPv4(127, 0, 0, 1), net.IPv6loopback},
DNSNames: []string{cfg.NodeRegistration.Name},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you removing loopback from the SAN?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have to dig through history but I remember us adding it on purpose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sound strange, but restored

}

// notifies the other members of the etcd cluster about the joining member
etcdPeerAddress := fmt.Sprintf("https://%s:2380", cfg.APIEndpoint.AdvertiseAddress)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the default peer port and "shouldn't collide" but we should definitely add a check here.

if err != nil {
return err
}
glog.V(1).Infof("Updated etcd member list %v", initialCluster)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should do a local client member check, to verify it's correct, and to list the current members in the log line.

}
glog.V(1).Infof("Updated etcd member list %v", initialCluster)

glog.V(1).Infoln("creating local etcd static pod manifest file")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a recent issue about golang 1.11 as well needing to use Infof vs. Infoln.

}

if len(initialCluster) == 0 {
defaultArguments["initial-cluster"] = fmt.Sprintf("%s=https://%s:2380", cfg.GetNodeName(), cfg.APIEndpoint.AdvertiseAddress)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we should add them to the default consts file.

return nil, err
}

// Note for reviewers: I'm not sure this is the best method for getting the endpoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the ClusterStatus should have the endpoints.
Or as discussed on the call put the meta-data in an annotation for the pod to make it easier to extract.

func (c Client) AddMember(name string, peerAddrs string) (map[string]string, error) {
cli, err := clientv3.New(clientv3.Config{
Endpoints: c.Endpoints,
DialTimeout: 5 * time.Second,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would up the timeout for new connection. Default client inside the api-server is 20 seconds.

}

// AddMember notifies an existing etcd cluster that a new member is joining
func (c Client) AddMember(name string, peerAddrs string) (map[string]string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly here b/c we're re-wrapping an etcd member struct. I would change it to [string]IP at a minimum though.

}

// AddMember notifies an existing etcd cluster that a new member is joining
func (c Client) AddMember(name string, peerAddrs string) (map[string]string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also testability of this function will require using some of the etcd utils I built eons ago under apiserver/storage

@timothysc
Copy link
Member

/assign @timothysc @detiber

etcdPeerAddress := fmt.Sprintf("https://%s:2380", cfg.APIEndpoint.AdvertiseAddress)

glog.V(1).Infof("Adding etcd Member %s", etcdPeerAddress)
initialCluster, err := etcdClient.AddMember(cfg.NodeRegistration.Name, etcdPeerAddress)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method used during upgrade? If so, it seems odd to call 'AddMember()' for an instance that is already a member of the etcd cluster.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is used only when adding a new control plane instance

return nil, err
}

// Note for reviewers: I'm not sure this is the best method for getting the endpoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to using ClusterStatus or a pre-set annotation to use as a starting point.

After we have the initial endpoint list, we should use that to query the etcd cluster itself for the list of endpoints.

return nil, err
}

// Note for reviewers: I'm not sure this is the best method for getting the endpoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially, we should also raise an error if the queried endpoints do not match the expected endpoints, or if the cluster is not in a healthy state to start.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 16, 2018
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 21, 2018
@fabriziopandini
Copy link
Member Author

@neolit123 @chuckha @timothysc @detiber Thanks for the valuable feedback!

Now

  1. The list of etcd members is initially discovered from cluster status, and then I query etcd for getting the real list of members.
  2. Now also upgrades works with stacked etcd

The latest open point to be addressed before removing WIP is the usage of API server advertise an address, that can lead to problems in case the user choose an advertise address that doesn't correspond to any IP address on the machine

@timothysc
Copy link
Member

@fabriziopandini Is this ready to go? If so can you remove the WIPs from this and other PRs.

@fabriziopandini
Copy link
Member Author

@timothysc last point pending is the discussion about usage of API server advertise for etcd

@timothysc
Copy link
Member

@fabriziopandini ^ want to update now?

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 24, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini, timothysc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 26, 2018
@fabriziopandini fabriziopandini changed the title [WIP] - kubeadm stacked etcd kubeadm stacked etcd Oct 26, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 26, 2018
@timothysc timothysc added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 26, 2018
@fabriziopandini
Copy link
Member Author

@timothysc this is ready to go for me now
If we can have some help on testing this before the end of the cycle it will be great...

@fabriziopandini
Copy link
Member Author

/test pull-kubernetes-e2e-kops-aws

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@fabriziopandini
Copy link
Member Author

/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 26, 2018
@neolit123
Copy link
Member

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubeadm cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants