
etcdserver: avoid deadlock caused by adding members with wrong peer URLs #3479

Merged: 3 commits merged into etcd-io:master on Sep 13, 2015

Conversation

@mitake (Contributor) commented Sep 10, 2015

The current membership changing functionality of etcd seems to have a
problem that can cause deadlock.

How to reproduce:

  1. construct an N-node cluster
  2. add N new nodes with etcdctl member add, without starting the new members

What happens:
After the N nodes are added, the total cluster size becomes 2 * N and
the quorum becomes N + 1. This means a membership change requires at
least N + 1 nodes, because Raft treats membership information in its
log like any other ordinary log append request.
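
For example, with N = 3 the cluster grows from 3 members (quorum 2) to 6 members (quorum 4), while only the original 3 members are actually running.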

Assume the peer URLs of the added nodes are wrong because of a
mis-operation or a bug in a wrapper program that launches etcd. In that
case, both adding and removing members become impossible because the
quorum can no longer be reached. Of course, ordinary requests cannot be
served either. The cluster is effectively deadlocked.

Of course, the best practice for adding new nodes is to add and start
them one by one. However, the effect of this problem is serious enough
that I think forcibly preventing it would be valuable.

Solution:
This patch makes etcd forbid adding a new node if the operation would
change the quorum and the new quorum would be larger than the number of
running nodes.

Fixes #3477
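
A minimal sketch in Go of the check described above (the function name and signature here are illustrative assumptions, not the exact code in this PR):

```go
package main

import "fmt"

// isReadyToAddNewMember is a sketch of the proposed check: after adding one
// more member, the members that are actually started must still reach the
// quorum of the enlarged cluster.
func isReadyToAddNewMember(started, total int) bool {
	nmembers := total + 1     // cluster size after the add
	nquorum := nmembers/2 + 1 // quorum of the enlarged cluster
	return started >= nquorum
}

func main() {
	// A 3-member cluster that has already added 3 unstarted members:
	// one more add would need a quorum of 4 with only 3 started members.
	fmt.Println(isReadyToAddNewMember(3, 6)) // false
}
```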

@xiang90 (Contributor) commented Sep 10, 2015

@mitake This is what we have wanted to do for a while. We will review and get this merged after the 2.2 release today. Thanks!

@mitake (Contributor, Author) commented Sep 11, 2015

@xiang90, thanks, I'm looking forward to your review :)

BTW, what do you think about an offline tool (etcdck?) for handling this sort of problem, e.g. forcibly checking and manipulating membership, data integrity checks, etc.? If the idea is acceptable, I'd like to work on it.

@adohe-zz commented

@mitake @xiang90 this seems a little bit violent.

@yichengq (Contributor) commented

@adohe

this seems a little bit violent.

What are your thoughts about it? I think the idea is reasonable: no one wants a cluster that loses quorum and is unrecoverable.

@xiang90 (Contributor) commented Sep 11, 2015

@adohe This is actually pretty conservative. These members are not even started, so it is safe.

@@ -253,6 +253,23 @@ func (c *cluster) ValidateConfigurationChange(cc raftpb.ConfChange) error {
return ErrPeerURLexists
}
}

nrRunningNodes := 0

Contributor:

Can we move this code block into a func called checkLiveness (or maybe a better func name)?

Contributor:

We do not use "node"; we use "member" instead.

Contributor (Author):

How about countRunningMembers()?

Contributor (Author):

countStartedMembers() would be suitable, I think.

@adohe-zz commented

@xiang90 @yichengq I agree with what @xiang90 commented; we should move this code block into a checkLiveness func, which makes more sense. Operations that change the quorum should be forbidden if liveness cannot be maintained.

@mitake (Contributor, Author) commented Sep 11, 2015

@xiang90 , thanks for your review. I updated this PR based on your comments.

@@ -253,6 +265,16 @@ func (c *cluster) ValidateConfigurationChange(cc raftpb.ConfChange) error {
return ErrPeerURLexists
}
}

nmembers := len(c.members) + 1

Contributor:

I should be more clear: I hope we can move all of the new code into a new func.

@xiang90 (Contributor) commented Sep 11, 2015

@mitake After discussing with @yichengq, we think this has to be done before the command is submitted into raft. This is for backward compatibility reasons. Otherwise, when replaying the same set of commands, etcd 2.1 and etcd 2.2 might behave differently after this PR is merged into 2.2.

@mitake (Contributor, Author) commented Sep 11, 2015

@xiang90 @yichengq Should this check be done in etcdctl?

BTW, is a rolling upgrade from 2.1 -> 2.2 supported? If so, I agree with not doing the check on the server side. However, if rolling upgrades aren't supported, the check wouldn't be harmful.

In addition, it seems that v2.2.0 has already been released. Do you plan to backport this patch to the stable releases? That also affects the compatibility problem.

@xiang90 (Contributor) commented Sep 11, 2015

@mitake We do support rolling upgrades. I need to think through this more.

@mitake (Contributor, Author) commented Sep 11, 2015

@xiang90 How about checking in both the server and the clients (etcdctl and the client/ Go library)? The effect of this problem is critical, so checking in all components would be useful for defensive purposes. In addition, the client-side change could be backported to the stable releases to prevent the problem without breaking rolling upgrades.

@xiang90 (Contributor) commented Sep 11, 2015

@mitake The problem is that we do not control all clients/ctls. We only control the Go client. We may add this check on the server side. However, the check has to be done before the reconfiguration is sent into raft.

@mitake (Contributor, Author) commented Sep 11, 2015

@xiang90 So do you agree with checking in both the server and the official etcd clients?

@xiang90 (Contributor) commented Sep 11, 2015

@mitake Let's do it step by step. I think we can do it on the server side, before sending the command into raft.

@mitake (Contributor, Author) commented Sep 11, 2015

@xiang90 ok, I agree with doing it step by step. So what I need to do in this PR is clean up the code that calculates nmembers, right?

@mitake (Contributor, Author) commented Sep 11, 2015

@xiang90 @yichengq updated this PR to unify the changes into a single function.

@mitake (Contributor, Author) commented Sep 11, 2015

@xiang90 ah, sorry, it seems that I misunderstood your intention. Should I insert the check outside of the raft logic, e.g. in EtcdServer.AddMember()?

@yichengq (Contributor) commented

@mitake I think so. This helps avoid a raft-log replay mismatch between 2.1 and 2.2. If you add the check when applying the log, it may happen that 2.2 rejects some add command while 2.1 executes it.

Another problem is that etcd needs to support adding 1 member to 1-member cluster because this is a necessary step in disaster recovery today.

@mitake (Contributor, Author) commented Sep 11, 2015

@yichengq I understand. However, I think adding the check on the raft side would be better for handling the member removal case. The deadlock problem can happen even on removal, like this:

  1. a cluster has N + 1 started nodes and N unstarted nodes with wrong peer URLs; the quorum is N + 1
  2. the user removes one started node: the quorum remains N + 1
  3. the deadlock can happen because the number of started nodes is less than the quorum (worked numbers below)
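
Concretely, with N = 3: the cluster has 4 started and 3 unstarted members (7 in total), so the quorum is 4. Removing one started member leaves 6 members with the quorum still at 4, but only 3 started members, which is below the quorum.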

I think serializing the check in the raft state machine can prevent this problem. If we do it on the requester's side, two concurrent removal requests could potentially break the quorum.

Am I too pessimistic? :) Of course I'd like to respect the maintainers' opinion.

BTW,

adding 1 member to 1-member cluster

what does this mean? I don't understand "1-member".

@mitake (Contributor, Author) commented Sep 11, 2015

@yichengq BTW, would it be possible to simply avoid backporting this change to the stable releases?
I know nothing about etcd's release process, sorry :(

@yichengq (Contributor) commented

@mitake We have to support rolling upgrades, which means that an etcd cluster should ALWAYS be able to upgrade from 2.2 to 2.3. Assuming this PR is merged in 2.3, adding the check in the apply process may break the upgrade path, and that is bad. That is why avoiding the backport doesn't solve this problem.

The idea about removing members is another thing, and it is not related to this PR. Let us do it step by step.

By "1-member cluster" I mean a cluster that has only 1 member. Check the cluster restore procedure (https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#restoring-the-cluster), and you will find that it needs to support changing the cluster size from 1 to 2.
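
(In numbers: a restored 1-member cluster that adds a second member ends up with 2 members and a quorum of 2 while only 1 member is started, so a check that naively compares started members against the new quorum would reject this required recovery step unless the 1-member cluster is special-cased.)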

@mitake (Contributor, Author) commented Sep 12, 2015

@yichengq updated again, just to remove the log statement at etcdserver/server.go line 631

@yichengq (Contributor) commented

@mitake How do you like the idea of making isReadyToAddNewMember a function of the cluster struct? It only uses information in cluster.

@@ -177,6 +178,7 @@ func NewConfig() *config {
// Should never happen.
plog.Panicf("unexpected error setting up clusterStateFlag: %v", err)
}
fs.BoolVar(&cfg.strictReconfigCheck, "strict-reconfig-check", false, "Reject reconfiguration if it can cause potential deadlock")

Contributor:

I guess we need to improve the message "Reject reconfiguration if it can cause potential deadlock".

A potential deadlock is not that obvious to a lot of users.

"Reject reconfiguration that might cause quorum loss"

@mitake (Contributor, Author) commented Sep 13, 2015

@yichengq isReadyToAddNewMember depends on EtcdServer.cfg. I think the function should belong to EtcdServer for now.

@mitake (Contributor, Author) commented Sep 13, 2015

updated the messages based on @xiang90's comments

@xiang90 (Contributor) commented Sep 13, 2015

LGTM. Defer to @yichengq

@xiang90 (Contributor) commented Sep 13, 2015

@mitake Thanks so much for your patience and contribution!

@@ -31,6 +31,7 @@ var (
ErrTimeout = errors.New("etcdserver: request timed out")
ErrTimeoutDueToLeaderFail = errors.New("etcdserver: request timed out, possibly due to previous leader failure")
ErrTimeoutDueToConnectionLost = errors.New("etcdserver: request timed out, possibly due to connection lost")
ErrNotEnoughStartedMembers = errors.New("etcdserver: re-configuration faild due to not enough started members")

Contributor:

failed

Contributor (Author):

thanks, I'll fix it.

@mitake (Contributor, Author) commented Sep 13, 2015

@yichengq on second thought, moving isReadyToAddNewMember to the cluster struct would be a good idea. If we call the func from EtcdServer.AddMember and put the condition on cfg in AddMember, it only depends on cluster. I'll fix it.
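
A rough sketch of that arrangement, with the flag kept on the server and the quorum arithmetic on the cluster (all type, field, and method names below are illustrative stand-ins, not the exact etcdserver code):

```go
package main

import (
	"errors"
	"fmt"
)

var errNotEnoughStartedMembers = errors.New("etcdserver: re-configuration failed due to not enough started members")

// cluster only needs membership information, so the readiness check lives here.
type cluster struct{ started, total int }

func (c *cluster) isReadyToAddNewMember() bool {
	return c.started >= (c.total+1)/2+1 // started members vs. quorum after the add
}

// server owns the configuration, so the strict-reconfig-check gate lives here.
type server struct {
	strictReconfigCheck bool
	cluster             *cluster
}

func (s *server) addMember() error {
	if s.strictReconfigCheck && !s.cluster.isReadyToAddNewMember() {
		return errNotEnoughStartedMembers
	}
	// ... otherwise propose the ConfChange to raft as before ...
	return nil
}

func main() {
	s := &server{strictReconfigCheck: true, cluster: &cluster{started: 3, total: 3}}
	fmt.Println(s.addMember()) // <nil>: a fully started 3-member cluster may add a 4th member
}
```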

The current membership changing functionality of etcd seems to have a
problem that can cause deadlock.

How to reproduce:
 1. construct an N-node cluster
 2. add N new nodes with etcdctl member add, without starting the new members

What happens:
After the N nodes are added, the total cluster size becomes 2 * N and
the quorum becomes N + 1. This means a membership change requires at
least N + 1 nodes, because Raft treats membership information in its
log like any other ordinary log append request.

Assume the peer URLs of the added nodes are wrong because of a
mis-operation or a bug in a wrapper program that launches etcd. In that
case, both adding and removing members become impossible because the
quorum can no longer be reached. Of course, ordinary requests cannot be
served either. The cluster is effectively deadlocked.

Of course, the best practice for adding new nodes is to add and start
them one by one. However, the effect of this problem is serious enough
that forcibly preventing it would be valuable.

Solution:
This patch makes etcd forbid adding a new node if the operation would
change the quorum and the new quorum would be larger than the number of
running nodes. If etcd is launched with the newly added option
-strict-reconfig-check, the checking logic is activated. If the option
isn't passed, the default reconfiguration behavior is kept.

Fixes etcd-io#3477

@mitake (Contributor, Author) commented Sep 13, 2015

updated; just moved isReadyToAddNewMember to the cluster struct

@mitake (Contributor, Author) commented Sep 13, 2015

@jonboulle should I add a unit test and update this PR soon? If so, it will take a while because I'm new to testing in Go, sorry :(

@jonboulle (Contributor) commented

@mitake I think this should have a test before it is merged. To help you out I've started a test table here for you: jonboulle@9c3c665

While writing the tests I also discovered that https://github.com/coreos/etcd/pull/3479/files#diff-62824440a438e1c7efbb9cf76299624eR382 should also be checking the cluster size, so as you can see I updated that line too.

Can you cherry-pick my commit onto your branch and then perhaps add some more cases as you see fit? (I would suggest some even-numbered clusters, edge cases like empty clusters, and so on.) Thanks!

@yichengq (Contributor) commented

The code is totally awesome now.

@jonboulle Thanks for helping with the unit test. ;)

@mitake As Jon suggests, it would be cool if you could add some unit tests. Because we have separated the check logic into a single function, it is pretty simple to add a unit test for the new logic. Table testing is a common test method in Go, and it can cover multiple test cases in an easy-to-read way.
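
For illustration, a table-driven test against the standalone readiness sketch shown earlier in this thread might look roughly like this (this is not the test actually added in jonboulle@9c3c665):

```go
package main

import "testing"

// Table-driven test for the readiness sketch shown earlier in this thread.
func TestIsReadyToAddNewMember(t *testing.T) {
	tests := []struct {
		started, total int
		want           bool
	}{
		{started: 3, total: 3, want: true},  // healthy 3-member cluster gains a 4th member
		{started: 3, total: 4, want: true},  // even-sized cluster, still enough started members
		{started: 3, total: 6, want: false}, // too many unstarted members already added
		{started: 0, total: 0, want: false}, // empty cluster
	}
	for i, tt := range tests {
		if got := isReadyToAddNewMember(tt.started, tt.total); got != tt.want {
			t.Errorf("#%d: isReadyToAddNewMember(%d, %d) = %v, want %v",
				i, tt.started, tt.total, got, tt.want)
		}
	}
}
```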

@mitake (Contributor, Author) commented Sep 13, 2015

@jonboulle thanks a lot for your help! I'll enhance your test soon.

@yichengq I'll update this PR after enhancing tests.

Also fixed check for special case of one-member cluster

@mitake (Contributor, Author) commented Sep 13, 2015

@jonboulle @yichengq enhanced the test with cases for an even number of members (expected to succeed) and an empty cluster

 - a case of a cluster with an even number of members
 - a case of an empty cluster

@mitake (Contributor, Author) commented Sep 13, 2015

updated to fix a gofmt style problem

@yichengq (Contributor) commented

LGTM.

Thanks for the contribution! The PR is quite fantastic now.

yichengq added a commit that referenced this pull request Sep 13, 2015
etcdserver: avoid deadlock caused by adding members with wrong peer URLs
@yichengq merged commit 0ca800f into etcd-io:master Sep 13, 2015

@mitake (Contributor, Author) commented Sep 13, 2015

Thanks a lot for your review and help, @xiang90 @yichengq @jonboulle !

@mitake deleted the membership branch on December 30, 2015, 17:22