questions about the behavior of version 0.7.0 #2368

Closed
hehailong5 opened this issue Sep 28, 2016 · 7 comments

Comments

@hehailong5

Hi, I have two questions regarding the latest 0.7.0 release.
1.
I have bootstrapped a cluster with 3 instances, all configured with the options below:
{
"leave_on_terminate": true,
"skip_leave_on_interrupt": false
}
I then use Ctrl+C to make the instances leave, one at a time. When there is only one instance left, it can still elect itself as the leader, which gives the 3-instance cluster a failure tolerance of 2. Is this expected?

2.
As for the outage recovery guide, it still does not state what ought to be done when all the servers in the cluster are down.

In my testing, I used Ctrl+C to make all 3 instances leave the cluster, then simply ran the command "consul agent -server -config-dir /config -data-dir /data -bind=xx.xx.xx.xx -client=0.0.0.0" on any one node with the same IP, and that instance came up with itself as the leader. It looks like in this case I can recover the whole cluster without touching the peers.json file.

I am wondering: when do I need to provide the peers.json file, as described in the guide, to recover a complete cluster? Only in the case where all the instances have different IPs from the old ones?

Looking forward to your reply.

Thanks,
Allen

@weirdan

weirdan commented Sep 28, 2016

Instances that have left do not appear to count toward the number of instances required for consensus. Otherwise it would be impossible to replace an entire cluster by replacing its peers one by one (and that's certainly possible, because I did it myself a couple of days ago, on an older version too).

Given your configuration ("skip_leave_on_interrupt": false), when you Ctrl+C (which sends SIGINT) the server, it gracefully leaves the cluster, effectively scaling the cluster down.
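
For reference, here is the quorum arithmetic under the standard Raft majority rule (quorum is a majority of the current voting configuration), which is why graceful leaves shrink the quorum along with the cluster:

    quorum(N) = floor(N/2) + 1

    3 voters -> quorum 2   (tolerates 1 failed server)
    2 voters -> quorum 2   (after one graceful leave; tolerates 0 failures)
    1 voter  -> quorum 1   (after two graceful leaves; the last server elects itself)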

@hehailong5
Author

So the key is the "skip_leave_on_interrupt": false setting, and it makes my previous test a scaling scenario rather than an outage scenario, right?

I'm wondering how to simulate an outage scenario without hard/soft restarting the machine.
Would that be the case if I replace it with "skip_leave_on_interrupt": true and use Ctrl+C to shut down all the instances? And to recover the entire cluster I would then need to prepare the peers.json file instead?

If this is the case, I have to differentiate these two cases (whether the instances left gracefully or not) in my automated recovery script. Is there any way to achieve that in the script?

@hehailong5
Author

Is it possible to recover from "Failure of All Servers in a Multi-Server Cluster"? I tried both with and without peers.json; neither worked.

@weirdan

weirdan commented Sep 29, 2016

So the key is the "skip_leave_on_interrupt": false setting, and it makes my previous test a scaling scenario rather than an outage scenario, right?

That's my understanding, yes.

I'm wondering how to simulate an outage scenario without hard/soft restarting the machine.
Would that be the case if I replace it with "skip_leave_on_interrupt": true and use Ctrl+C to shut down all the instances?

"skip_leave_on_interrupt": true plus "leave_on_terminate": false should give you a server that never voluntarily leaves the cluster (see the sketch below). I also think this is the default behavior of a server node in Consul 0.7, so you can just remove those settings from the config. Then you can Ctrl+C it, or kill -9 it, and it will appear as failed to the other peers/nodes.
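
For concreteness, a minimal sketch of that server configuration, using the same option names as above (whether these match the 0.7 defaults is an assumption on my part):

{
"server": true,
"leave_on_terminate": false,
"skip_leave_on_interrupt": true
}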

And to recover the entire cluster I would then need to prepare the peers.json file instead?

Only if their IPs (or their number) have changed, or if you need to remove a failed server without ever bringing it back up. If you just bring the failed peer back up with the same IP, into the cluster it was in before, you won't need to edit peers.json.
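
For reference, a minimal sketch of the peers.json format as I understand it for Consul 0.7: a JSON array of the servers' Raft/RPC addresses (the server port, 8300 by default, not the HTTP port), placed under the data dir at e.g. /data/raft/peers.json before restarting the agents. The path and the placeholder addresses here are assumptions for illustration:

["10.0.0.1:8300","10.0.0.2:8300","10.0.0.3:8300"]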

@hehailong5
Author

The latest findings:

  1. EnableSingleNode is enabled by default in 0.7.0; this is what lets the last instance still elect itself as the leader.

  2. With all the IPs unchanged, I can recover the entire cluster easily by just running "consul agent -server ..." on each node, one by one.

  3. With all the IPs changed, after providing a peers.json with the new IPs, I am not able to recover the cluster. I get the following logs:

    2016/09/29 11:31:46 [INFO] consul: found peers.json file, recovering Raft configuration...
    2016/09/29 11:31:46 [INFO] consul.fsm: snapshot created in 21.858s
    2016/09/29 11:31:46 [INFO] snapshot: Creating new snapshot at /data/raft/snapshots/55-552-1475148706347.tmp
    2016/09/29 11:31:46 [INFO] consul: deleted peers.json file after successful recovery
    2016/09/29 11:31:46 [INFO] raft: Restored from snapshot 55-552-1475148706347
    2016/09/29 11:31:46 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:192.167.13.1:8500 Address:192.167.13.1:8500} {Suffrage:Voter ID:192.167.13.3:8500 Address:192.167.13.3:8500} {Suffrage:Voter ID:192.167.13.4:8500 Address:192.167.13.4:8500}]
    2016/09/29 11:31:46 [INFO] raft: Node at 192.167.13.1:8300 [Follower] entering Follower state (Leader: "")
    2016/09/29 11:31:46 [WARN] memberlist: Binding to public address without encryption!
    2016/09/29 11:31:46 [INFO] serf: EventMemberJoin: ha-1 192.167.13.1
    2016/09/29 11:31:46 [INFO] serf: Attempting re-join to previously known node: ha-2: 192.168.13.3:8301
    2016/09/29 11:31:46 [INFO] consul: Adding LAN server ha-1 (Addr: tcp/192.167.13.1:8300) (DC: dc1)
    2016/09/29 11:31:46 [WARN] serf: Failed to re-join any previously known node
    2016/09/29 11:31:46 [WARN] memberlist: Binding to public address without encryption!
    2016/09/29 11:31:46 [INFO] serf: EventMemberJoin: ha-1.dc1 192.167.13.1
    2016/09/29 11:31:46 [WARN] serf: Failed to re-join any previously known node
    2016/09/29 11:31:46 [INFO] consul: Adding WAN server ha-1.dc1 (Addr: tcp/192.167.13.1:8300) (DC: dc1)
    2016/09/29 11:31:53 [ERR] agent: failed to sync remote state: No cluster leader
    2016/09/29 11:31:54 [WARN] raft: not part of stable configuration, aborting election
    2016/09/29 11:31:59 [INFO] serf: EventMemberJoin: ha-2 192.167.13.3
    2016/09/29 11:31:59 [INFO] consul: Adding LAN server ha-2 (Addr: tcp/192.167.13.3:8300) (DC: dc1)
    ==> Failed to check for updates: Get https://checkpoint-api.hashicorp.com/v1/check/consul?arch=amd64&os=linux&signature=b69062b5-dd7b-2d32-c06e-0b67391549c1&version=0.7.0: dial tcp: lookup checkpoint-api.hashicorp.com on 192.167.13.2:53: server misbehaving
    2016/09/29 11:32:13 [ERR] agent: coordinate update error: No cluster leader
    2016/09/29 11:32:21 [ERR] agent: failed to sync remote state: No cluster leader
    2016/09/29 11:32:38 [ERR] agent: coordinate update error: No cluster leader
    2016/09/29 11:32:54 [ERR] agent: failed to sync remote state: No cluster leader
    2016/09/29 11:33:02 [ERR] agent: coordinate update error: No cluster leader
    2016/09/29 11:33:11 [INFO] serf: EventMemberJoin: ha-3 192.167.13.4
    2016/09/29 11:33:11 [INFO] consul: Adding LAN server ha-3 (Addr: tcp/192.167.13.4:8300) (DC: dc1)
    2016/09/29 11:33:16 [ERR] agent: failed to sync remote state: No cluster leader
    2016/09/29 11:33:35 [ERR] agent: coordinate update error: No cluster leader
    2016/09/29 11:33:45 [ERR] agent: failed to sync remote state: No cluster leader
    2016/09/29 11:34:00 [INFO] memberlist: Suspect ha-3 has failed, no acks received

@hehailong5
Author

Updated:

  1. I was putting the wrong data in peers.json (based on peers.info):

["192.167.13.1:8500","192.167.13.3:8500","192.167.13.4:8500"]

After changing it to

["192.167.13.1:8300","192.167.13.3:8300","192.167.13.4:8300"]

it now works.
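
To tie this together, a sketch of the recovery sequence this corresponds to, using the -data-dir /data and agent flags from earlier in this thread (the peers.json path and the availability of the operator subcommand on 0.7.0 are my assumptions):

    # stop all server agents first, then on each server write the recovery file:
    cat > /data/raft/peers.json <<'EOF'
    ["192.167.13.1:8300","192.167.13.3:8300","192.167.13.4:8300"]
    EOF

    # restart each agent with its usual flags
    consul agent -server -config-dir /config -data-dir /data -bind=xx.xx.xx.xx -client=0.0.0.0

    # once a leader is elected, verify the Raft peer set
    consul operator raft -list-peers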

@slackpad
Contributor

slackpad commented Oct 6, 2016

Sorry @hehailong5 - I just committed a change that puts the right port numbers into peers.info.
