standby_info interferes with cluster recovery #810

Closed
wereHamster opened this issue May 26, 2014 · 3 comments · Fixed by #818

Comments

@wereHamster
Contributor

I got a three-node cluster into a state where node A has standby_info with Running:true, so it always starts in standby mode, while node B thinks A and B are the only surviving cluster nodes. B starts in peer mode and waits for the other nodes to join so that it can elect a leader. But A never joins because it remains in standby mode: it waits until the cluster has a leader and logs WARNING: fail getting leader from cluster (nodeA,nodeB). If I delete standby_info on node A, it starts in peer mode and the cluster recovers.

@yichengq
Contributor

@wereHamster The fact that A cannot join the cluster is expected behavior for now, because A needs to get some metadata when switching to peer mode.
Your workaround of deleting standby_info is a good way to make the cluster recover for now.
I wonder how this state comes about. Do you have any logs, or do you know which operations lead to it?

@wereHamster
Contributor Author

I don't know the exact sequence of actions, but I was able to reproduce it reliably on my machine by running three etcd nodes and killing and restarting them at random.

Shouldn't B first try to start in peer mode before falling back to standby? Otherwise you have a classic deadlock: A is waiting for B and B is waiting for A.
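
To make the deadlock concrete, here is a minimal sketch of the two startup policies being contrasted. It is not etcd's code: the names Mode, standbyFirstMode, and peerFirstMode are hypothetical, and the two booleans are simplifying assumptions standing in for "standby_info on disk says Running:true" and "a leader could be fetched from the known cluster". It also ignores the metadata a node needs when switching to peer mode, which was mentioned earlier in the thread.

```go
package main

import "fmt"

// Mode is the role a node starts in. Illustrative names, not etcd's types.
type Mode string

const (
	ModePeer    Mode = "peer"
	ModeStandby Mode = "standby"
)

// standbyFirstMode models the behavior described in this issue (assumption):
// a node whose standby_info says Running:true always starts in standby,
// whether or not the cluster currently has a leader.
func standbyFirstMode(standbyOnDisk, leaderReachable bool) Mode {
	if standbyOnDisk {
		return ModeStandby
	}
	return ModePeer
}

// peerFirstMode models the suggested alternative (assumption): the standby
// flag is honored only while a leader is reachable. With no leader, the node
// starts as a peer so that an election can take place.
func peerFirstMode(standbyOnDisk, leaderReachable bool) Mode {
	if standbyOnDisk && leaderReachable {
		return ModeStandby
	}
	return ModePeer
}

func main() {
	// Node A from the report: standby_info present, no leader in the cluster.
	fmt.Println("standby-first:", standbyFirstMode(true, false))
	fmt.Println("peer-first:   ", peerFirstMode(true, false))
}
```

Under the first policy node A stays in standby while B waits for it, which is the reported deadlock. Under the second, A would start as a peer whenever no leader can be found.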

@yichengq
Contributor

@wereHamster I started three machines and experimented with them a little:

  1. kill and restart a node
  2. kill and restart two nodes
  3. kill and restart all nodes

All three cases work well, and the nodes should not be in standby mode at any time. Could you give me some clue as to how it happens?

In theory, a node should fall into standby mode only when the cluster asks it to, or when it was already in standby mode before it was killed.
