questions about the behavior of version 0.7.0 #2368
Comments
Left instances do not appear to count toward the number of required instances for consensus. Otherwise it would be impossible to replace an entire cluster by replacing peers one by one (and it is certainly possible, because I did it myself a couple of days ago, on an older version too). Given your configuration (…
So the key is the "skip_leave_on_interrupt": false configuration, and it makes my previous testing a scaling scenario, not an outage scenario, right? I'm wondering how to simulate an outage scenario without hard/soft restarting the machine. If this is the case, I have to differentiate these two cases (whether instances left gracefully or not) in my automated recovery script. Is there any way to achieve that in the script?
Is it possible to recover from "Failure of All Servers in a Multi-Server Cluster"? I tried with and without peers.json; neither worked.
That's my understanding, yes.
If their IPs (or their number) change, or you need to remove the failed server without ever bringing it up again. If you just bring the failed peer back up with the same IP, putting it into the cluster where it was, you wouldn't need to edit peers.json.
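One rough way to make the graceful-leave vs. failure distinction asked about above from a recovery script is to query the standard status endpoints. This is only a sketch: it assumes the script runs on a node whose local agent serves the HTTP API on localhost:8500.
# If servers left gracefully, the Raft peer set shrinks and a leader is still reported.
# If servers failed, they stay in the peer set and the leader query comes back empty.
LEADER=$(curl -s http://localhost:8500/v1/status/leader)
PEERS=$(curl -s http://localhost:8500/v1/status/peers)
if [ -z "$LEADER" ] || [ "$LEADER" = '""' ]; then
  echo "no leader reported - treat as an outage; expected peers: $PEERS"
else
  echo "leader is $LEADER - remaining servers left gracefully or are still healthy"
fi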
The latest findings:
Updated: peers.json was ["192.167.13.1:8500","192.167.13.3:8500","192.167.13.4:8500"]; after changing it to ["192.167.13.1:8300","192.167.13.3:8300","192.167.13.4:8300"], it now works.
Sorry @hehailong5 - I just committed a change that puts the right port numbers into peers.json.
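For reference, the file in question lives under the agent's data directory, so with the -data-dir /data used elsewhere in this thread it would sit at /data/raft/peers.json. The entries use the server RPC (Raft) port, which defaults to 8300, not the HTTP API port 8500; the addresses below are simply the ones from this thread:
# /data/raft/peers.json (path assumes -data-dir /data; adjust for your setup)
["192.167.13.1:8300","192.167.13.3:8300","192.167.13.4:8300"]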
Hi, I have two questions regarding the latest 0.7.0 release.
1.
I have bootstrapped a cluster with 3 instances, all configured with the options below:
{
"leave_on_terminate": true,
"skip_leave_on_interrupt": false
}
I then use Ctrl+C to make one instance leave at a time. When there is only one instance left, it can still elect itself as the leader, which gives the 3-instance cluster a failure tolerance of 2. Is this expected? (See the first sketch after these questions.)
2.
As for the outage recovery guide, it still does not state what ought to be done when all the servers in the cluster are down.
In my testing, I used Ctrl+C to make all 3 instances leave the cluster, and then simply ran the command "consul agent -server -config-dir /config -data-dir /data -bind=xx.xx.xx.xx -client=0.0.0.0" on any one node with the same IP, and this instance came up with itself as the leader. It looks to me like in this case I can recover the whole cluster without touching the peers.json file.
I am wondering when I actually need to provide the peers.json file, as stated in the guide, to recover a complete cluster. Only in the case where all the instances have different IPs from the old ones? (See the second sketch after these questions.)
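Regarding question 1, the answer given above (left instances do not count toward the quorum) can be checked directly: with "skip_leave_on_interrupt": false, each Ctrl+C performs a graceful leave that removes the server from the Raft peer set, so the quorum requirement shrinks along with it. A sketch, assuming a still-running server with the HTTP API on its default port:
# run between each Ctrl+C
curl -s http://localhost:8500/v1/status/peers   # the Raft peer set should shrink 3 -> 2 -> 1
curl -s http://localhost:8500/v1/status/leader  # a leader is still reported, because quorum shrinks with the peer set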
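Regarding question 2, a sketch of the recovery flow from the outage guide for the case where the old servers are gone and the replacements have different IPs; the paths and the agent command below simply reuse the ones from this thread and are illustrative only:
# 1. Stop the Consul agent on every replacement server.
# 2. On each server, write the full list of server addresses (IP:8300) to the raft peers file,
#    e.g. /data/raft/peers.json for the -data-dir used here.
# 3. Restart each server with the same command as before:
consul agent -server -config-dir /config -data-dir /data -bind=xx.xx.xx.xx -client=0.0.0.0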
looking forward to your reply.
Thanks,
Allen