Cluster unable to elect a leader after restarting a follower and stopping the leader while the follower is down #15940
I re-ran these steps with Consul version 1.14.3 and was unable to reproduce the problem reliably. With the following changes to the steps, it should be reliably reproducible:
I seem to encounter the same issue, and I agree that it "seems that this is an important bug", but I also want to confirm that this is not a configuration issue. I checked the behaviour against the Raft consensus simulator at https://observablehq.com/@stwind/raft-consensus-simulator.
Steps to reproduce: I have three VMs running.
I then stopped the cluster, and started it without bootstrapping. The cluster forms correctly.
I now have members M1, M2, and M3.
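For context, a server configuration along these lines would produce the setup described above (a minimal sketch; the addresses, the `bootstrap_expect` value, and the file layout are my assumptions, not taken from the report):

```hcl
# Hypothetical server stanza used on each of M1..M3; addresses are placeholders.
server           = true
bootstrap_expect = 3   # dropped on the later restart ("without bootstrapping")
retry_join       = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
```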
The most relevant error seems to be:
It seems that this bug is related to Autopilot being enabled by default.
The idea of pruning is great, but can you explain why it cannot support a server rejoining the cluster?
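If Autopilot's dead-server pruning is indeed the trigger, it can be switched off. A sketch of the relevant agent configuration fragment, assuming the default `cleanup_dead_servers = true` is what removes the briefly-down server:

```hcl
# Disable Autopilot's automatic removal of failed servers, so a
# server that is down only briefly is not pruned from the Raft
# membership configuration while it is offline.
autopilot {
  cleanup_dead_servers = false
}
```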
Related issue: hashicorp/raft#524
Hi! The issue describes a scenario where the cluster is allowed to lose quorum, i.e. only 1 of the 3 nodes is both available and part of the current consensus configuration, so it is expected, per Raft's guarantees, that the cluster would be unable to recover without manual intervention. One way this can be mitigated in Consul is through configuration.
I'm going to close this because I don't think there is a way Consul can behave differently, and there is already the config mentioned above to make it never remove more servers than you intend.
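The specific setting referred to does not survive in the text above; my assumption is that it is Autopilot's `min_quorum`, which would look roughly like this:

```hcl
# Assumption: min_quorum keeps Autopilot from pruning the cluster
# below this many servers, so quorum cannot be silently given up.
autopilot {
  min_quorum = 3
}
```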
In a three-node Consul cluster, with server nodes 0, 1, and 2, if I run the following test, the cluster cannot elect a leader:
Now the cluster can never elect a leader, even though it has a quorum: node 2's configuration contains only itself and the old leader, node 0, so it will neither grant votes to nor send vote requests to node 1, and node 0 is down.
I think this happens because only the leader can update the other followers' configurations, and that cannot happen while there is no leader.
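The failure mode above can be illustrated with a toy model. This is a sketch of the voting rule as I understand it, not hashicorp/raft code; the node IDs and membership configurations mirror the scenario described:

```python
# Toy model (a sketch, not hashicorp/raft code): a server only grants
# its vote to a candidate that appears in its own membership
# configuration, and a candidate needs a quorum of votes from *its
# own* configuration to win an election.

def can_elect_leader(nodes_up, configs):
    """Return True if any live node could win an election."""
    for candidate in sorted(nodes_up):
        voters = configs[candidate]
        quorum = len(voters) // 2 + 1
        votes = sum(
            1
            for voter in voters
            if voter in nodes_up and candidate in configs[voter]
        )
        if votes >= quorum:
            return True
    return False

# Scenario from the issue: node 0 (the old leader) is down, node 2's
# configuration was pruned to {0, 2}, node 1 still holds {0, 1, 2}.
configs = {0: {0, 1, 2}, 1: {0, 1, 2}, 2: {0, 2}}
print(can_elect_leader({1, 2}, configs))  # False: no candidate reaches quorum

# If node 2 still had node 1 in its configuration, node 1 could win.
configs[2] = {0, 1, 2}
print(can_elect_leader({1, 2}, configs))  # True
```

Under this model, node 1's votes are rejected by node 2 (node 1 is not in node 2's configuration), and node 2 cannot reach a quorum of its own two-member configuration with node 0 down, matching the deadlock described above.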
To me it seems that this is an important bug, but I need someone to confirm that.
I caused this behaviour using the latest Consul Docker image. Here are the commands that should reproduce the issue:
After running this, you should see messages similar to the following on node 2:
I also opened a bug in hashicorp/raft, as I think the problem here is with the Raft implementation rather than Consul: hashicorp/raft#535