standby_info interferes with cluster recovery #810

wereHamster · 2014-05-26T06:33:21Z

I got a three-node cluster into a state where node A has standby_info with Running:true, so it always starts in standby mode, while node B thinks A and B are the only surviving cluster nodes. B starts in peer mode and waits for the other nodes to join so that it can elect a leader. But A never joins, because it stays in standby mode: it waits until the cluster has a leader (WARNING: fail getting leader from cluster (nodeA,nodeB)). If I delete standby_info on node A, it starts in peer mode and the cluster recovers.

yichengq · 2014-05-27T16:44:01Z

@wereHamster For now it is expected behavior that A cannot join the cluster, because A needs to get some metadata when switching to peer mode.
Your workaround is a good way to make it recover for now.
I wonder how this case happens. Do you have any logs, or do you know which operations lead to it?

wereHamster · 2014-05-27T16:53:01Z

I don't know the exact sequence of actions, but I was able to reliably reproduce it locally by running three etcd nodes and killing/restarting them randomly.

Shouldn't B first try to start in peer mode before falling back to standby? Otherwise you have a classic deadlock: A is waiting for B and B is waiting for A.
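The deadlock described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not etcd's actual code: `Node`, `start_mode`, and `can_elect_leader` are hypothetical names, `standby_flag` stands in for the Running:true entry in standby_info, and the only rule borrowed from Raft is that electing a leader needs a majority of the believed cluster members to be in peer mode.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    # Hypothetical stand-in for Running:true in standby_info.
    standby_flag: bool


def start_mode(node: Node) -> str:
    """Mode the node picks at startup in this toy model."""
    return "standby" if node.standby_flag else "peer"


def can_elect_leader(cluster: set, nodes: list) -> bool:
    """A leader needs a majority of the believed cluster in peer mode."""
    peers = [n for n in nodes if n.name in cluster and start_mode(n) == "peer"]
    return len(peers) >= len(cluster) // 2 + 1


# B believes the cluster is {nodeA, nodeB}, so the majority is 2.
cluster = {"nodeA", "nodeB"}
node_a = Node("nodeA", standby_flag=True)   # always starts in standby
node_b = Node("nodeB", standby_flag=False)  # starts in peer mode

# A waits for a leader, B waits for A: no progress is possible.
deadlocked = not can_elect_leader(cluster, [node_a, node_b])

# Workaround from the report: drop the standby state on A.
node_a.standby_flag = False
recovered = can_elect_leader(cluster, [node_a, node_b])

print(deadlocked, recovered)
```

In this model, a node that tried peer mode first when it finds itself in the member list would break the cycle without any manual cleanup, which is the behavior the question above is asking for.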

yichengq · 2014-05-27T18:53:11Z

@wereHamster I started three machines and played with them a little:

- kill and restart one node
- kill and restart two nodes
- kill and restart all nodes

All of these work well, and nodes should not be in standby mode at any time. Could you give me some clues?

Theoretically, a node should fall into standby mode only when the cluster asks it to do so, or when it was already in standby mode before it was killed.

yichengq mentioned this issue May 30, 2014

fix(standby_server): able to join the cluster containing itself #818

Merged

yichengq closed this as completed in #818 May 31, 2014
