
etcdserver: avoid deadlock caused by adding members with wrong peer URLs #3479

Merged: 3 commits merged into etcd-io:master on Sep 13, 2015

Conversation

@mitake (Contributor) commented Sep 10, 2015

The current membership changing functionality of etcd seems to have a
problem that can cause deadlock.

How to reproduce:

  1. construct an N-node cluster
  2. add N new nodes with etcdctl member add, without starting the new members

What happens:
After the N nodes are added, the total cluster size becomes 2 * N and
the quorum becomes N + 1. This means a membership change requires at
least N + 1 nodes, because Raft treats membership information in its
log like any other ordinary log append request.
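
For example, with N = 3 the cluster grows from 3 members (quorum 2) to 6 members (quorum 4), while only the original 3 members are actually running.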

Assume the peer URLs of the added nodes are wrong because of a
mis-operation or a bug in a wrapper program that launches etcd. In that
case, both adding and removing members become impossible because the
quorum can no longer be reached. Of course, ordinary requests cannot be
served either. The cluster is effectively deadlocked.

Of course, the best practice for adding new nodes is to add and start
them one by one. However, the effect of this problem is serious enough
that I think forcibly preventing it would be valuable.

Solution:
This patch makes etcd forbid adding a new node if the operation would
change the quorum and the new quorum would be larger than the number of
running nodes.

Fixes #3477
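
A minimal sketch in Go of the check described above (the function name and signature here are illustrative assumptions, not the exact code in this PR):

```go
package main

import "fmt"

// isReadyToAddNewMember is a sketch of the proposed check: after adding one
// more member, the members that are actually started must still reach the
// quorum of the enlarged cluster.
func isReadyToAddNewMember(started, total int) bool {
	nmembers := total + 1     // cluster size after the add
	nquorum := nmembers/2 + 1 // quorum of the enlarged cluster
	return started >= nquorum
}

func main() {
	// A 3-member cluster that has already added 3 unstarted members:
	// one more add would need a quorum of 4 with only 3 started members.
	fmt.Println(isReadyToAddNewMember(3, 6)) // false
}
```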

@xiang90 (Contributor) commented Sep 10, 2015

@mitake This is what we have wanted to do for a while. We will review and get this merged after the 2.2 release today. Thanks!

@mitake (Contributor, Author) commented Sep 11, 2015

@xiang90, thanks, I'm looking forward to your review :)

BTW, what do you think about an offline tool (etcdck?) for handling this sort of problem, e.g. forcibly checking and manipulating membership, data integrity checks, etc.? If the idea is acceptable, I'd like to work on it.

@adohe-zz commented

@mitake @xiang90 this seems a little bit violent.

@yichengq (Contributor) commented

@adohe

this seems a little bit violent.

What are your thoughts about it? I think the idea is reasonable: no one wants a cluster that loses quorum and is unrecoverable.

@xiang90 (Contributor) commented Sep 11, 2015

@adohe This is actually pretty conservative. These members are not even started, so it is safe.

@@ -253,6 +253,23 @@ func (c *cluster) ValidateConfigurationChange(cc raftpb.ConfChange) error {
return ErrPeerURLexists
}
}

nrRunningNodes := 0

Contributor:

Can we move this code block into a func called checkLiveness (or maybe a better func name)?

Contributor:

We do not use "node"; we use "member" instead.

Contributor (Author):

How about countRunningMembers()?

Contributor (Author):

countStartedMembers() would be suitable, I think.

@adohe-zz commented

@xiang90 @yichengq I agree with what @xiang90 commented; we should move this code block into a checkLiveness func, which makes more sense. Operations that change the quorum should be forbidden if liveness cannot be maintained.

@mitake (Contributor, Author) commented Sep 11, 2015

@xiang90 , thanks for your review. I updated this PR based on your comments.

@@ -253,6 +265,16 @@ func (c *cluster) ValidateConfigurationChange(cc raftpb.ConfChange) error {
return ErrPeerURLexists
}
}

nmembers := len(c.members) + 1

Contributor:

I should be more clear: I hope we can move all of the new code into a new func.

@xiang90 (Contributor) commented Sep 11, 2015

@mitake After discussing with @yichengq, we think this has to be done before the command is submitted into raft. This is for backward compatibility reasons. Otherwise, when replaying the same set of commands, etcd 2.1 and etcd 2.2 might behave differently after this PR is merged into 2.2.

@mitake (Contributor, Author) commented Sep 11, 2015

@xiang90 @yichengq Should this check be done in etcdctl?

BTW, is a rolling upgrade from 2.1 -> 2.2 supported? If so, I agree with not doing the check on the server side. However, if rolling upgrades aren't supported, the check wouldn't be harmful.

In addition, it seems that v2.2.0 has already been released. Do you plan to backport this patch to the stable releases? That also affects the compatibility problem.

@xiang90 (Contributor) commented Sep 11, 2015

@mitake We do support rolling upgrades. I need to think through this more.

@mitake (Contributor, Author) commented Sep 11, 2015

@xiang90 How about checking in both the server and the clients (etcdctl and the client/ Go library)? The effect of this problem is critical, so checking in all components would be useful for defensive purposes. In addition, the client-side change could be backported to the stable releases to prevent the problem without breaking rolling upgrades.

@xiang90 (Contributor) commented Sep 11, 2015

@mitake The problem is that we do not control all clients/ctls. We only control the Go client. We may add this check on the server side. However, the check has to be done before the reconfiguration is sent into raft.

@mitake (Contributor, Author) commented Sep 11, 2015

@xiang90 So do you agree with checking in both the server and the official etcd clients?

@xiang90 (Contributor) commented Sep 11, 2015

@mitake Let's do it step by step. I think we can do it on the server side, before sending the command into raft.

@mitake (Contributor, Author) commented Sep 11, 2015

@xiang90 ok, I agree with doing it step by step. So what I need to do in this PR is clean up the code that calculates nmembers, right?

@mitake (Contributor, Author) commented Sep 11, 2015

@xiang90 @yichengq updated this PR to unify the changes into a single function.

@mitake (Contributor, Author) commented Sep 11, 2015

@xiang90 ah, sorry, it seems that I misunderstood your intention. Should I insert the check outside of the raft logic, e.g. in EtcdServer.AddMember()?

@yichengq (Contributor) commented

@mitake I think so. This helps avoid a raft-log replay mismatch between 2.1 and 2.2. If you add the check when applying the log, it may happen that 2.2 rejects some add command while 2.1 executes it.

Another problem is that etcd needs to support adding 1 member to 1-member cluster because this is a necessary step in disaster recovery today.

@mitake (Contributor, Author) commented Sep 11, 2015

@yichengq I understand. However, I think adding the check on the raft side would be better for handling the member removal case. The deadlock problem can happen even on removal, like this:

  1. a cluster has N + 1 started nodes and N unstarted nodes with wrong peer URLs; the quorum is N + 1
  2. the user removes one started node: the quorum remains N + 1
  3. the deadlock can happen because the number of started nodes is less than the quorum (worked numbers below)
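
Concretely, with N = 3: the cluster has 4 started and 3 unstarted members (7 in total), so the quorum is 4. Removing one started member leaves 6 members with the quorum still at 4, but only 3 started members, which is below the quorum.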

I think serializing the check in the raft state machine can prevent this problem. If we do it on the requester's side, two concurrent removal requests could potentially break the quorum.

Am I too pessimistic? :) Of course I'd like to respect the maintainers' opinion.

BTW,

adding 1 member to 1-member cluster

what does this mean? I don't understand "1-member".

@mitake (Contributor, Author) commented Sep 11, 2015

@yichengq BTW, would it be possible to simply avoid backporting this change to the stable releases?
I know nothing about etcd's release process, sorry :(

@yichengq (Contributor) commented

@mitake We have to support rolling upgrades, which means that an etcd cluster should ALWAYS be able to upgrade from 2.2 to 2.3. Assuming this PR is merged in 2.3, adding the check in the apply process may break the upgrade path, and that is bad. That is why avoiding the backport doesn't solve this problem.

The idea about removing members is another thing, and it is not related to this PR. Let us do it step by step.

By "1-member cluster" I mean a cluster that has only 1 member. Check the cluster restore procedure (https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#restoring-the-cluster), and you will find that it needs to support changing the cluster size from 1 to 2.
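
(In numbers: a restored 1-member cluster that adds a second member ends up with 2 members and a quorum of 2 while only 1 member is started, so a check that naively compares started members against the new quorum would reject this required recovery step unless the 1-member cluster is special-cased.)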

@mitake (Contributor, Author) commented Sep 12, 2015

@yichengq updated again, just to remove the log statement at etcdserver/server.go line 631

@yichengq (Contributor) commented

@mitake How do you like the idea of making isReadyToAddNewMember a function of the cluster struct? It only uses information in cluster.

@@ -177,6 +178,7 @@ func NewConfig() *config {
// Should never happen.
plog.Panicf("unexpected error setting up clusterStateFlag: %v", err)
}
fs.BoolVar(&cfg.strictReconfigCheck, "strict-reconfig-check", false, "Reject reconfiguration if it can cause potential deadlock")

Contributor:

I guess we need to improve the message "Reject reconfiguration if it can cause potential deadlock".

A potential deadlock is not that obvious to a lot of users.

"Reject reconfiguration that might cause quorum loss"

@mitake (Contributor, Author) commented Sep 13, 2015

@yichengq isReadyToAddNewMember depends on EtcdServer.cfg. I think the function should belong to EtcdServer for now.

@mitake (Contributor, Author) commented Sep 13, 2015

updated the messages based on @xiang90's comments

@xiang90 (Contributor) commented Sep 13, 2015

LGTM. Defer to @yichengq

@xiang90 (Contributor) commented Sep 13, 2015

@mitake Thanks so much for your patience and contribution!

@@ -31,6 +31,7 @@ var (
ErrTimeout = errors.New("etcdserver: request timed out")
ErrTimeoutDueToLeaderFail = errors.New("etcdserver: request timed out, possibly due to previous leader failure")
ErrTimeoutDueToConnectionLost = errors.New("etcdserver: request timed out, possibly due to connection lost")
ErrNotEnoughStartedMembers = errors.New("etcdserver: re-configuration faild due to not enough started members")

Contributor:

failed

Contributor (Author):

thanks, I'll fix it.

@mitake (Contributor, Author) commented Sep 13, 2015

@yichengq on second thought, moving isReadyToAddNewMember to the cluster struct would be a good idea. If we call the func from EtcdServer.AddMember and put the condition on cfg in AddMember, it only depends on cluster. I'll fix it.
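
A rough sketch of that arrangement, with the flag kept on the server and the quorum arithmetic on the cluster (all type, field, and method names below are illustrative stand-ins, not the exact etcdserver code):

```go
package main

import (
	"errors"
	"fmt"
)

var errNotEnoughStartedMembers = errors.New("etcdserver: re-configuration failed due to not enough started members")

// cluster only needs membership information, so the readiness check lives here.
type cluster struct{ started, total int }

func (c *cluster) isReadyToAddNewMember() bool {
	return c.started >= (c.total+1)/2+1 // started members vs. quorum after the add
}

// server owns the configuration, so the strict-reconfig-check gate lives here.
type server struct {
	strictReconfigCheck bool
	cluster             *cluster
}

func (s *server) addMember() error {
	if s.strictReconfigCheck && !s.cluster.isReadyToAddNewMember() {
		return errNotEnoughStartedMembers
	}
	// ... otherwise propose the ConfChange to raft as before ...
	return nil
}

func main() {
	s := &server{strictReconfigCheck: true, cluster: &cluster{started: 3, total: 3}}
	fmt.Println(s.addMember()) // <nil>: a fully started 3-member cluster may add a 4th member
}
```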

The current membership changing functionality of etcd seems to have a
problem that can cause deadlock.

How to reproduce:
 1. construct an N-node cluster
 2. add N new nodes with etcdctl member add, without starting the new members

What happens:
After the N nodes are added, the total cluster size becomes 2 * N and
the quorum becomes N + 1. This means a membership change requires at
least N + 1 nodes, because Raft treats membership information in its
log like any other ordinary log append request.

Assume the peer URLs of the added nodes are wrong because of a
mis-operation or a bug in a wrapper program that launches etcd. In that
case, both adding and removing members become impossible because the
quorum can no longer be reached. Of course, ordinary requests cannot be
served either. The cluster is effectively deadlocked.

Of course, the best practice for adding new nodes is to add and start
them one by one. However, the effect of this problem is serious enough
that forcibly preventing it would be valuable.

Solution:
This patch makes etcd forbid adding a new node if the operation would
change the quorum and the new quorum would be larger than the number of
running nodes. If etcd is launched with the newly added option
-strict-reconfig-check, the checking logic is activated. If the option
isn't passed, the default reconfiguration behavior is kept.

Fixes etcd-io#3477

@mitake (Contributor, Author) commented Sep 13, 2015

updated; just moved isReadyToAddNewMember to the cluster struct

@mitake (Contributor, Author) commented Sep 13, 2015

@jonboulle should I add a unit test and update this PR soon? If so, it will take a while because I'm new to testing in Go, sorry :(

@jonboulle (Contributor) commented

@mitake I think this should have a test before it is merged. To help you out I've started a test table here for you: jonboulle@9c3c665

While writing the tests I also discovered that https://github.com/coreos/etcd/pull/3479/files#diff-62824440a438e1c7efbb9cf76299624eR382 should also be checking the cluster size, so as you can see I updated that line too.

Can you cherry-pick my commit onto your branch and then perhaps add some more cases as you see fit? (I would suggest some even-numbered clusters, edge cases like empty clusters, and so on.) Thanks!

@yichengq (Contributor) commented

The code is totally awesome now.

@jonboulle Thanks for helping with the unit test. ;)

@mitake As Jon suggests, it would be cool if you could add some unit tests. Because we have separated the check logic into a single function, it is pretty simple to add a unit test for the new logic. Table testing is a common test method in Go, and it can cover multiple test cases in an easy-to-read way.
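
For illustration, a table-driven test against the standalone readiness sketch shown earlier in this thread might look roughly like this (this is not the test actually added in jonboulle@9c3c665):

```go
package main

import "testing"

// Table-driven test for the readiness sketch shown earlier in this thread.
func TestIsReadyToAddNewMember(t *testing.T) {
	tests := []struct {
		started, total int
		want           bool
	}{
		{started: 3, total: 3, want: true},  // healthy 3-member cluster gains a 4th member
		{started: 3, total: 4, want: true},  // even-sized cluster, still enough started members
		{started: 3, total: 6, want: false}, // too many unstarted members already added
		{started: 0, total: 0, want: false}, // empty cluster
	}
	for i, tt := range tests {
		if got := isReadyToAddNewMember(tt.started, tt.total); got != tt.want {
			t.Errorf("#%d: isReadyToAddNewMember(%d, %d) = %v, want %v",
				i, tt.started, tt.total, got, tt.want)
		}
	}
}
```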

@mitake (Contributor, Author) commented Sep 13, 2015

@jonboulle thanks a lot for your help! I'll enhance your test soon.

@yichengq I'll update this PR after enhancing tests.

Also fixed check for special case of one-member cluster

@mitake (Contributor, Author) commented Sep 13, 2015

@jonboulle @yichengq enhanced the test with cases for an even number of members (expected to succeed) and an empty cluster

 - a case of a cluster with an even number of members
 - a case of an empty cluster

@mitake (Contributor, Author) commented Sep 13, 2015

updated to fix a gofmt style problem

@yichengq (Contributor) commented

LGTM.

Thanks for the contribution! The PR is quite fantastic now.

yichengq added a commit that referenced this pull request Sep 13, 2015
etcdserver: avoid deadlock caused by adding members with wrong peer URLs
@yichengq merged commit 0ca800f into etcd-io:master Sep 13, 2015

@mitake (Contributor, Author) commented Sep 13, 2015

Thanks a lot for your review and help, @xiang90 @yichengq @jonboulle !

@mitake deleted the membership branch on December 30, 2015, 17:22