Concurrent CP join race condition for etcd join #2005

Closed
echu23 opened this issue Jan 17, 2020 · 21 comments · Fixed by kubernetes/kubernetes#87505
Assignees
Labels
area/etcd area/HA kind/bug Categorizes issue or PR as related to a bug. kind/design Categorizes issue or PR as related to design. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Milestone

Comments

@echu23

echu23 commented Jan 17, 2020

What keywords did you search in kubeadm issues before filing this one?

etcd join race

I did find #1319 #2001 #1793

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version: &version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:18:19Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:19:15Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:16:41Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
    NAME="VMware Photon OS"
    VERSION="3.0"
    ID=photon
    VERSION_ID=3.0
    PRETTY_NAME="VMware Photon OS/Linux"
  • Kernel (e.g. uname -a):
    Linux 42332700031c0a2cfd7ef78ebc8af387 4.19.84-1.ph3-esx #1-photon SMP Tue Nov 19 00:39:50 UTC 2019 x86_64 GNU/Linux
  • Others:

This should happen in any env if our analysis is valid.

What happened?

We are setting up a 3-node (or multi-node) cluster. We understand that the etcd learner is coming, but we can't refactor our code to adopt it for now, and we really need to do the joins concurrently. So we are wondering if any workaround or suggestion can be offered.

We use kubeadm to bootstrap a 3-control-plane cluster; however, we need to inject some customization along the way, so we are calling the join phases one by one instead of simply running kubeadm join.
This issue only happens when adding the 3rd member.

When we call this command on the 3rd node

kubeadm join phase control-plane-join etcd

Very rarely, we observed that the generated etcd manifest (/etc/kubernetes/manifests/etcd.yaml) has an incorrect --initial-cluster value.

Assuming etcd-0 is the first member, etcd-1 is the second and etcd-2 is the third. A correct --initial-cluster value for etcd-2 might look like this

--initial-cluster=etcd-0=https://192.168.0.1:2380,etcd-1=https://192.168.0.2:2380,etcd-2=https://192.168.0.3:2380

However, in this rare case, we are getting something like this

--initial-cluster=etcd-0=https://192.168.0.1:2380,etcd-2=https://192.168.0.2:2380,etcd-2=https://192.168.0.3:2380

Basically, the name of etcd-1 was incorrectly configured as etcd-2. This incorrect manifest results in the etcd container failing to start and complaining:

etcdmain: error validating peerURLs {"ClusterID":"31c63dd3d7c3da6a","Members":[{"ID":"1b98ed58f9be3e7d","RaftAttributes":{"PeerURLs":["https://192.168.0.2:2380"]},"Attributes":{"Name":"etcd-1","ClientURLs":["https://192.168.0.2:2379"]}},{"ID":"6d631ff1c84da117","RaftAttributes":{"PeerURLs":["https://192.168.0.3:2380"]},"Attributes":{"Name":"","ClientURLs":[]}},{"ID":"f0c11b3401371571","RaftAttributes":{"PeerURLs":["https://192.168.0.1:2380"]},"Attributes":{"Name":"etcd-0","ClientURLs":["https://192.168.0.1:2379"]}}],"RemovedMemberIDs":[]}: member count is unequal\n","stream":"stderr","time":"2020-01-08T17:27:52.63704563Z"}

We think this error occurs because the manifest's --initial-cluster has only 2 unique names while the etcd cluster actually has 3 members.
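
To make that concrete, here is a small standalone sketch (not kubeadm code; the flag value is copied from the bad manifest above) showing how the duplicated name collapses the member count etcd derives from --initial-cluster:

package main

import (
	"fmt"
	"strings"
)

func main() {
	// The incorrect flag value from the generated manifest above.
	initialCluster := "etcd-0=https://192.168.0.1:2380,etcd-2=https://192.168.0.2:2380,etcd-2=https://192.168.0.3:2380"

	// etcd bootstraps from name=peerURL pairs, so entries that share a name
	// collapse into a single member.
	names := map[string]bool{}
	for _, pair := range strings.Split(initialCluster, ",") {
		names[strings.SplitN(pair, "=", 2)[0]] = true
	}

	// Prints "unique names: 2", while the existing cluster already has 3
	// members, which is what the "member count is unequal" check trips on.
	fmt.Println("unique names:", len(names))
}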

We spent some time tracking the code to see what could be the issue and we have a theory here.

  1. Calling kubeadm join phase control-plane-join etcd

  2. Then the above command calls this

  3. It then calls etcdClient.AddMember()

  4. func (c *Client) AddMember(name string, peerAddrs string): here name is the current master's Name and peerAddrs is the current master's peerURL.

Then in L290: resp, err = cli.MemberAdd(ctx, []string{peerAddrs})

it calls the real MemberAdd, which returns a []Member that includes the currently-being-added one.

So the response of this MemberAdd() contains all previous members plus the current member.

Once AddMember() receives the response

 for _, m := range resp.Members {
  // fixes the entry for the joining member (that doesn't have a name set in the initialCluster returned by etcd)
  if m.Name == "" {
   ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
  } else {
   ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
  }
 }

Here resp is the response from MemberAdd() as described above, and this section inserts the given name for members that do not have a Name. The expectation seems to be that the currently-being-added member is the only one without a Name: the loop walks resp.Members, finds the member without a Name, and sets name as that member's Name.

But in this case resp.Members returned 3 members (because this happens on the 3rd member), and somehow 2 of them did not have a Name: the 2nd member had just joined the etcd cluster, but its etcd container was still coming up. In this situation, etcdctl member list would return something like:

cat ../../../../commands/etcdctl_member-list.txt
1b98ed58f9be3e7d, started, etcd-0, https://20.20.0.37:2380, https://192.168.0.1:2379
6d631ff1c84da117, unstarted, , https://192.168.0.2:2380, <-- this is etcd-1, but not started yet so no Name

In this case, there are 2 out of 3 Members that do not have a Name, hence the above for loop inserted the 3rd name (etcd-2) into both the 2nd and 3rd Member.

We concluded that this issue only happens if the 3rd member runs MemberAdd while the 2nd member is not yet started, which is a race.
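
To illustrate the theory, here is a self-contained sketch (simplified types and hard-coded data from the example above; not the real etcd client) of what the loop does when two members come back without a Name:

package main

import "fmt"

// Member mirrors, in simplified form, the entry kubeadm builds for --initial-cluster.
type Member struct {
	Name    string
	PeerURL string
}

// respMember stands in for the member type returned by MemberAdd.
type respMember struct {
	Name     string
	PeerURLs []string
}

func main() {
	name := "etcd-2" // the name passed to AddMember by the joining (3rd) node

	// Simulated MemberAdd response: etcd-1 has been added but not started, so it has no Name yet.
	respMembers := []respMember{
		{Name: "etcd-0", PeerURLs: []string{"https://192.168.0.1:2380"}},
		{Name: "", PeerURLs: []string{"https://192.168.0.2:2380"}}, // etcd-1, unstarted
		{Name: "", PeerURLs: []string{"https://192.168.0.3:2380"}}, // the member being added right now
	}

	// The same logic as the loop quoted above.
	ret := []Member{}
	for _, m := range respMembers {
		if m.Name == "" {
			ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
		} else {
			ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
		}
	}

	// Both nameless members end up named "etcd-2", reproducing the bad
	// --initial-cluster value shown earlier.
	fmt.Println(ret)
}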

For this ticket, we want to understand that:

  1. Is this analysis valid? Could this really happen?
  2. What could be the fix here? We can fix it in our private repo and wait for adoption of etcd learner.

What you expected to happen?

The generated etcd manifest should have the correct --initial-cluster value.

How to reproduce it (as minimally and precisely as possible)?

  1. Create a single-node k8s cluster
  2. Prepare 2 other nodes, not joining them yet
  3. Join them concurrently

Note that this happens really rarely; the frequency is probably 1/1000.

Anything else we need to know?

@echu23
Author

echu23 commented Jan 17, 2020

It's worth noting that the following is speculation:

In this case, there are 2 out of 3 Members that do not have a Name, hence the above for loop inserted the 3rd name (etcd-2) into both the 2nd and 3rd Member.

It's also worth noting that we're not sure what the contract is with etcd for the names of the active peers returned from the MemberAdd call.
I don't think we can perform the 3rd member add before the second member has started and sync'd, as the cluster wouldn't have quorum. I'd assume that means it has, by the time the 3rd member is added, updated the 2nd member's name. So either (assuming we're correct about the cause):

  • kubeadm needs to handle this edge case, or
  • etcd is expected to know the names of all quorum members before the add can succeed and shouldn't be returning the 2nd member with an empty name.

Finally, it's worth noting that we do not yet have verbose logging for the issue, and that the logging in kubeadm currently will not allow us to directly confirm this hypothesis as we never log the actual values returned by etcd at any level.

@neolit123 neolit123 added this to the v1.18 milestone Jan 17, 2020
@neolit123
Member

thanks for logging the issue.
/kind bug
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Jan 17, 2020
@neolit123 neolit123 self-assigned this Jan 17, 2020
@fabriziopandini
Member

/assign

@echu23 thanks for the detailed analysis
I don't have a simple way to validate your analysis, but your thesis seems reasonable to me.

I see a possible fix for this by changing the loop on resp.Members:

	for _, m := range resp.Members {
		// fixes the entry for the joining member (that doesn't have a name set in the initialCluster returned by etcd)
		if m.Name == "" {
			if m.PeerURLs[0] == peerAddrs {
				ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
			}
		} else {
			ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
		}
	}

With this change we are going to discard any member that has an empty name and a PeerURL different from the peer URL of the joining node (i.e. the 192.168.0.2:2380 entry in your example is dropped).

If the resulting list has two members instead of three it should not be a problem for the new etcd node, because it will be informed about the third member by the raft protocol after joining.

If instead the resulting list has one member this will be a problem, but according to your analysis this can't happen (and it seems reasonable to me, because join starts only after the first control plane/etcd is up and running).

@echu23 @neolit123 opinions?

If we agree on the solution I can send a patch so we can check that this does not introduce a regression, but for the real parallel join test we should rely on @echu23's infrastructure

@neolit123
Member

can also be:

for _, m := range resp.Members {
	// fixes the entry for the joining member (that doesn't have a name set in the initialCluster returned by etcd)
	if m.Name == "" && m.PeerURLs[0] == peerAddrs {
		ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
		continue
	}
	ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
}

In this case, there are 2 out of 3 Members that do not have a Name, hence the above for loop inserted the 3rd name (etcd-2) into both the 2nd and 3rd Member.

from my understanding @fabriziopandini's proposal can fix this problem.
it will set one of the names (instead of 2, which is the bug) and then hopefully the other call to etcdClient.AddMember() from the 3rd member will set its name too.

after this change if the code is still flaky for other reasons, we have the option to fully serialize the etcd member add with a lock managed by kubeadm.
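
for reference, one possible shape of such a lock (purely illustrative, not something kubeadm does today; the key name is made up) would be to reuse etcd's own clientv3 concurrency primitives against the already-running members:

package etcdutil

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/concurrency"
)

// withMemberAddLock serializes a critical section (for example MemberAdd plus
// writing the etcd manifest) across concurrently joining control-plane nodes.
func withMemberAddLock(cli *clientv3.Client, do func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// The session keeps a lease alive so the lock is released if the node dies.
	session, err := concurrency.NewSession(cli)
	if err != nil {
		return err
	}
	defer session.Close()

	// Hypothetical key prefix; any agreed-upon value would do.
	mutex := concurrency.NewMutex(session, "/kubeadm/etcd-member-add")
	if err := mutex.Lock(ctx); err != nil {
		return err
	}
	defer mutex.Unlock(ctx)

	return do(ctx)
}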

the alternative is to just wait until we move to etcd 3.5 and see how stable the learners are.

@neolit123
Member

but something to note about etcd learners is that they are currently limited to a maximum of 1, and it is not clear when multiple learners will be supported:
etcd-io/etcd#11401

@neolit123
Member

etcdadm is managing this slightly differently:
https://github.com/kubernetes-sigs/etcdadm/blob/master/cmd/join.go#L106-L135

post "add", it obtains the current member and other members from the response.
then appends to the initial cluster based on ID, which seems preferable.
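
roughly, that approach identifies the joining member by the ID returned in the MemberAdd response instead of by an empty name. a sketch of what that could look like in kubeadm's loop (a fragment in the style of the snippets above, not the actual etcdadm code from the link):

// resp.Member is the member just created by the MemberAdd call; its ID is unique.
added := resp.Member

ret := []Member{}
for _, m := range resp.Members {
	if m.ID == added.ID {
		// the member we are adding right now: use the provided name
		ret = append(ret, Member{Name: name, PeerURL: peerAddrs})
		continue
	}
	// note: any *other* member that is still unstarted would have an empty
	// Name here and would need a fallback, e.g. its hex ID (see below).
	ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
}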

@echu23
Author

echu23 commented Jan 18, 2020

Just another quick question to confirm the intention of the code. From this:

	ret := []Member{}
	for _, m := range resp.Members {
		// fixes the entry for the joining member (that doesn't have a name set in the initialCluster returned by etcd)
		if m.Name == "" {
			ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
		} else {
			ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
		}
	}

Does it expect that resp.Members must have 1 and ONLY 1 member that does not have a Name (which should be the currently-being-added member)? Is this a true statement?
Is it even possible to have a resp.Members that contains more than 1 Member that does not have a Name?

I am asking because I am trying to reproduce this issue using kind, and I was not able to reproduce it.
Of course as I mentioned, this only happens maybe 1/1000. I will keep trying.

My reproduction flow using kind is this:

  1. Use kind with custom config file to deploy a 10-CP cluster.
  2. Since kind configures the cluster sequentially, after the 3rd node is successfully joined and the 4th is in progress, perform:
    2.1 docker exec into kind-control-plane, which is always the first member
    2.2 do etcdctl member list and note down kind-control-plane2, or 3, or whatever has been joined.
    2.3 docker exec into the selected member from #2.2, remove /etc/kubernetes/manifests/etcd.yaml, which in turn will shut down the etcd container on this member
    2.4 docker exec into the first member again, do etcdctl member remove and etcdctl member add again; this will put that new member in unstarted, and observe if the issue reproduces.

So 2 questions here:

  1. Does it expect that resp.Members must have 1 and ONLY 1 member that does not have a Name (which should be the currently-being-added member)? Is this a true statement?
    Is it even possible to have a resp.Members that contains more than 1 Member that does not have a Name?
  2. Do you have any suggestion on how to reproduce this issue?
    Thanks.

@neolit123
Member

neolit123 commented Jan 18, 2020

Does it expect that resp.Members must have 1 and ONLY 1 member that does not have a Name (which should be the currently-being-added member)? Is this a true statement?
Is it even possible to have a resp.Members that contains more than 1 Member that does not have a Name?

as you pointed out with the discoveries in the OP, it seems like more than one member without a name can occur in resp.Members.

Do you have any suggestion on how to reproduce this issue?

might be easier with kinder:
https://github.com/kubernetes/kubeadm/tree/master/kinder

build kinder using go > 1.13

go build

and call:

./kinder test workflow ./ci/workflows/regular-1.15.yaml --dry-run

this will give you a set of commands.

  • execute all commands until (not including) the kinder do join command, but note that kinder create cluster allows a custom number of CP and worker nodes.
  • write down the join command that "kubeadm init" gives you (with token and ca-hash).
  • write a script that concurrently docker execs kubeadm join on all remaining CP nodes (a rough sketch follows below).
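
a hypothetical helper along these lines could drive the concurrent joins (the node names and the join command are placeholders to be replaced with the real values from "kubeadm init" and the kinder dry-run output):

package main

import (
	"log"
	"os/exec"
	"sync"
)

func main() {
	// Placeholder: paste the exact control-plane join command printed by "kubeadm init".
	joinCmd := "kubeadm join <ENDPOINT> --token <TOKEN> --discovery-token-ca-cert-hash <HASH> --control-plane --certificate-key <KEY>"

	// Placeholder node names: the remaining CP nodes created by kinder.
	nodes := []string{"kinder-regular-control-plane-2", "kinder-regular-control-plane-3"}

	var wg sync.WaitGroup
	for _, node := range nodes {
		wg.Add(1)
		go func(node string) {
			defer wg.Done()
			// Runs "kubeadm join" inside the node container; all joins start at once.
			out, err := exec.Command("docker", "exec", node, "sh", "-c", joinCmd).CombinedOutput()
			log.Printf("%s: err=%v\n%s", node, err, out)
		}(node)
	}
	wg.Wait()
}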

Use kind with custom config file to deploy a 10-CP cluster.

you seem to have a setup to envy.
but IMO it is better to have an ODD number of etcd/CP nodes.

@fabriziopandini
Member

@neolit123 @echu23 if I got this thread right we agree on the proposed change? Can I send the patch (with the @neolit123 variant)?

@echu23
Author

echu23 commented Jan 18, 2020

So the proposed change is to discard the member if the member has no Name and its peerURLs do not match the current one, right? This means that if this issue happens, the --initial-cluster in etcd.yaml would have only 2 members: the first member and the current member. If this will still configure a valid etcd cluster, I'm totally fine with it as long as it fixes the issue and forms a healthy etcd cluster.


@neolit123
Member

@fabriziopandini

if I got this thread right we agree on the proposed change?

the way etcdadm does the matching with member IDs is overall better:
https://github.com/kubernetes-sigs/etcdadm/blob/master/cmd/join.go#L106-L135

but i don't see why matching peerAddrs shouldn't be fine too.

@echu23

So the proposed change is to discard the member if the member has no Name and its peerURLs do not match the current one, right?

yes.

This means that if this issue happens, the --initial-cluster in etcd.yaml would have only 2 members: the first member and the current member. If this will still configure a valid etcd cluster, I'm totally fine with it as long as it fixes the issue and forms a healthy etcd cluster.

if the 2nd and 3rd members join at the same time, MemberAdd on the second CP might return 3 members - 2 of which don't have a name (the 2nd and 3rd).

with the proposed patch the initial cluster on the second CP will end up only with the first and 2nd member.

but because this is racy, i think the same can happen for the MemberAdd on the 3rd CP node. if it also obtains 3 members, 2 of which don't have a name it will also end up with an initial cluster with 2 members.

so the etcd bootstrap can look like this:

  • CP1: etcd-member1
  • CP2: etcd-member1, etcd-member2
  • CP3: etcd-member1, etcd-member3

@echu23
Author

echu23 commented Jan 19, 2020

Ok so this

  • CP1: etcd-member1
  • CP2: etcd-member1, etcd-member2
  • CP3: etcd-member1, etcd-member3

Will form a healthy cluster, right? I'm totally fine with this.


@neolit123
Member

i think it has side effects. i can try simulating this tomorrow.
@fabriziopandini have you experimented with this?

if the initial cluster does not matter, we can always do this in kubeadm:

CP1: etcd-member1
CP2: etcd-member1, etcd-member2
CP3: etcd-member1, etcd-member3
CPN: etcd-member1, etcd-memberN

yet something is telling me it's not right.

also i would like to bring up something from the etcd docs:

If adding multiple members the best practice is to configure a single member at a time and verify it starts correctly before adding more new members.
https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/runtime-configuration.md

i.e. concurrent join is by design not supported...

@echu23
Author

echu23 commented Jan 19, 2020

Yeah, we are aware that concurrent etcd join is not recommended. However, if this issue is legit, this can also happen for sequential join.
Imagine we are configuring a 3-node etcd cluster: the 1st and 2nd joins were successful, but when joining the 3rd, somehow the 2nd etcd is unstarted. In this case, should the 3rd join pass or fail?
To make an existing member unstarted, we can first shut down its etcd container, then remove that member and add it back.
I am still not fully convinced that my theory is true because I can't reproduce it. But I also cannot find another theory...


@neolit123
Member

Imagine we are configuring a 3-node etcd cluster: the 1st and 2nd joins were successful, but when joining the 3rd, somehow the 2nd etcd is unstarted. In this case, should the 3rd join pass or fail?

as the docs say:

verify it starts correctly before adding more new members

so if the 2nd fails the 3rd should not be added.

I am still not fully convinced that my theory is true because I can't reproduce it. But I also cannot find another theory...

i think it's valid.

will double check this tomorrow:
#2005 (comment)

@neolit123
Member

neolit123 commented Jan 19, 2020

i experimented and observed the following:

joined sequentially two kubeadm CP nodes to an existing CP node, with a modified version of kubeadm.

the modified version simulated the case of two members not having names:

CP1: etcd-member1
CP2: etcd-member1, etcd-member2
CP3: etcd-member1, etcd-member3

it makes it so that CP3 only includes the 1st and 3rd member.
but the etcd Pod on CP3 fails with the following:

2020-01-19 18:13:45.785514 C | etcdmain: error validating peerURLs {ClusterID:5af0857ece1ce0e5 Members:[&{ID:2ef094c895837ef3 RaftAttributes:{PeerURLs:[https://172.17.0.3:2380]} Attributes:{Name: ClientURLs:[]}} &{ID:95047bc387b5a81a RaftAttributes:{PeerURLs:[https://172.17.0.4:2380]} Attributes:{Name:kinder-regular-control-plane-2 ClientURLs:[https://172.17.0.4:2379]}} &{ID:952f31ff200093ba RaftAttributes:{PeerURLs:[https://172.17.0.5:2380]} Attributes:{Name:kinder-regular-control-plane-1 ClientURLs:[https://172.17.0.5:2379]}}] RemovedMemberIDs:[]}: member count is unequal

so it has to include all members:

CP3: etcd-member1, etcd-member2, etcd-member3

i tried investigating what is considered a valid --initial-cluster:

  • the current member has to be added as part of the list
  • the names of the other members do not matter!

thus i ended up with the following change:

	ret := []Member{}
	for _, m := range resp.Members {
		if peerAddrs == m.PeerURLs[0] {
			ret = append(ret, Member{Name: name, PeerURL: peerAddrs})
			continue
		}
		ret = append(ret, Member{Name: strconv.FormatUint(m.ID, 16), PeerURL: m.PeerURLs[0]})
	}

this ends up doing the following:

$ docker exec kinder-regular-control-plane-1 cat /etc/kubernetes/manifests/etcd.yaml | grep initial-cluster
    - --initial-cluster=kinder-regular-control-plane-1=https://172.17.0.5:2380
$ docker exec kinder-regular-control-plane-2 cat /etc/kubernetes/manifests/etcd.yaml | grep initial-cluster
    - --initial-cluster=kinder-regular-control-plane-2=https://172.17.0.4:2380,952f31ff200093ba=https://172.17.0.5:2380
$ docker exec kinder-regular-control-plane-3 cat /etc/kubernetes/manifests/etcd.yaml | grep initial-cluster
    - --initial-cluster=kinder-regular-control-plane-3=https://172.17.0.3:2380,7db52fe1fd991808=https://172.17.0.4:2380,952f31ff200093ba=https://172.17.0.5:2380

the list of members still respects the --name passed when each etcd instance is started:

etcdctl --endpoints https://127.0.0.1:2379 --ca-file=/etc/kubernetes/pki/etcd/ca.crt --cert-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key-file=/etc/kubernetes/pki/etcd/healthcheck-client.key member list
4216c54c1f9153b4: name=kinder-regular-control-plane-3 peerURLs=https://172.17.0.3:2380 clientURLs=https://172.17.0.3:2379 isLeader=false
7db52fe1fd991808: name=kinder-regular-control-plane-2 peerURLs=https://172.17.0.4:2380 clientURLs=https://172.17.0.4:2379 isLeader=false
952f31ff200093ba: name=kinder-regular-control-plane-1 peerURLs=https://172.17.0.5:2380 clientURLs=https://172.17.0.5:2379 isLeader=true

so this is my proposal to work around the problem, but it introduces a minor change in the way kubeadm writes the etcd.yaml.

also i've only tested this with a total of 5 etcd members. 4 joining concurrently.

however on my setup i'm seeing other problems, which happen quite often:

"level":"warn","ts":"2020-01-19T20:20:33.300Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-875a285e-36a5-4221-a4f8-88d9a56ec293/172.17.0.5:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2020-01-19T20:20:36.533Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-875a285e-36a5-4221-a4f8-88d9a56ec293/172.17.0.5:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
etcdserver: re-configuration failed due to not enough started members
error creating local etcd static pod manifest file
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runEtcdPhase
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/controlplanejoin.go:144
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:236
[download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
rpc error: code = Unavailable desc = etcdserver: leader changed
error downloading the secret
k8s.io/kubernetes/cmd/kubeadm/app/phases/copycerts.DownloadCerts
	/home/lubo-it/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/cmd/kubeadm/app/phases/copycerts/copycerts.go:227
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runControlPlanePrepareDownloadCertsPhaseLocal

@neolit123
Member

neolit123 commented Jan 19, 2020

the following patch makes the 4 member concurrent join more reliable:

diff --git a/cmd/kubeadm/app/util/etcd/etcd.go b/cmd/kubeadm/app/util/etcd/etcd.go
index 9d3c6be046b..0e5ad4434e8 100644
--- a/cmd/kubeadm/app/util/etcd/etcd.go
+++ b/cmd/kubeadm/app/util/etcd/etcd.go
@@ -38,11 +38,11 @@ import (
        "k8s.io/kubernetes/cmd/kubeadm/app/util/config"
 )
 
-const etcdTimeout = 2 * time.Second
+const etcdTimeout = 20 * time.Second
 
 // Exponential backoff for etcd operations
 var etcdBackoff = wait.Backoff{
-       Steps:    9,
+       Steps:    16,
        Duration: 50 * time.Millisecond,
        Factor:   2.0,
        Jitter:   0.1,
@@ -130,7 +130,7 @@ func NewFromCluster(client clientset.Interface, certificatesDir string) (*Client
 // dialTimeout is the timeout for failing to establish a connection.
 // It is set to 20 seconds as times shorter than that will cause TLS connections to fail
 // on heavily loaded arm64 CPUs (issue #64649)
-const dialTimeout = 20 * time.Second
+const dialTimeout = 40 * time.Second
 
 // Sync synchronizes client's endpoints with the known endpoints from the etcd membership.
 func (c *Client) Sync() error {
@@ -303,12 +303,11 @@ func (c *Client) AddMember(name string, peerAddrs string) ([]Member, error) {
        // Returns the updated list of etcd members
        ret := []Member{}
        for _, m := range resp.Members {
-               // fixes the entry for the joining member (that doesn't have a name set in the initialCluster returned by etcd)
-               if m.Name == "" {
-                       ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
-               } else {
-                       ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
+               if peerAddrs == m.PeerURLs[0] {
+                       ret = append(ret, Member{Name: name, PeerURL: peerAddrs})
+                       continue
                }
+               ret = append(ret, Member{Name: strconv.FormatUint(m.ID, 16), PeerURL: m.PeerURLs[0]})
        }
 
        // Add the new member client address to the list of endpoints

@echu23
Author

echu23 commented Jan 19, 2020

Yes, this proposal is what we had thought of as well: there only need to be 3 unique Names. We are OK with this change if no regression is introduced. Thanks for the quick turnaround.


@neolit123 neolit123 added the kind/design Categorizes issue or PR as related to design. label Jan 20, 2020
@fabriziopandini
Member

@neolit123 I'm OK as well, but if possible I would apply the random name only if the name is empty (so the created manifest remains unchanged for all cases except the race condition discussed in this thread)
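
Combined with the earlier diff, that suggestion would make the loop look roughly like this (a sketch of the described behavior, not necessarily the merged PR verbatim):

ret := []Member{}
for _, m := range resp.Members {
	if peerAddrs == m.PeerURLs[0] {
		// the member being added right now: use the provided name
		ret = append(ret, Member{Name: name, PeerURL: peerAddrs})
		continue
	}
	memberName := m.Name
	// only fall back to the hex ID when the member has not started yet and
	// therefore has no name; all other entries stay exactly as before
	if memberName == "" {
		memberName = strconv.FormatUint(m.ID, 16)
	}
	ret = append(ret, Member{Name: memberName, PeerURL: m.PeerURLs[0]})
}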

@neolit123
Member

@fabriziopandini
this seems reasonable, given the potential rarity of the case, even if the "ID-names for all other members" approach is more deterministic.

it's unfortunate that given the nature of etcd bootstrap our etcd.yaml is overall non-deterministic WRT the --initial-cluster flag.

@neolit123
Member

/lifecycle active

PR is here kubernetes/kubernetes#87505

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Jan 24, 2020
hickeng pushed a commit to hickeng/kubernetes that referenced this issue Apr 17, 2020
This issue has been reported to upstream kubeadm: kubernetes/kubeadm#2005
And this patch implements the suggested workaround.

As mentioned in PR2487240, this issue happens very rarely, in a very small window where the 3rd member
is performing the etcd join while the 2nd member has joined but not yet fully started.

The symptom is that only the 3rd member's etcd container fails to come up, complaining:

member count is unequal

The root cause is that when kubeadm performs the etcd member join, it calls the etcd client's
MemberAdd function with the local peerURL, and this function returns a list of etcd Members.

kubeadm expects that there should be 1 and ONLY 1 Member in this list without the Name field set,
which is the currently-being-added member. It then inserts the provided local name into the
Name field of the Members that don't have a Name.

However, this issue happens because somehow there is more than 1 Member without a Name.
In this case, the kubeadm code incorrectly inserted the local name into all those Members without a Name.
Therefore, the resulting etcd manifest contains an --initial-cluster value with duplicate Names:
in a 3-node etcd cluster, the incorrect manifest will have 2 unique Names and 3 unique peerURLs.

When the local etcd starts, it looks at --initial-cluster and thinks that there are only 2
members, due to the duplicate Names. But there are actually 3 members, hence the "member count is
unequal" error.

The fix here is to assign the etcd member ID as the Name if the member has no Name and its peerURL
does not match the joining member's.