questions about the behavior of version 0.7.0 #2368

Closed
hehailong5 opened this issue Sep 28, 2016 · 7 comments

Comments

@hehailong5

Hi, I have two questions regarding the latest 0.7.0 release.
1.
I have bootstrapped a cluster with 3 instances, all configured with the options below:
{
"leave_on_terminate": true,
"skip_leave_on_interrupt": false
}
I then use Ctrl+C to make the instances leave, one at a time. When there is only one instance left, it can still elect itself as the leader, which gives the 3-instance cluster a failure tolerance of 2. Is this expected?

2.
As for the outage recovery guide, it still does not state what ought to be done when all the servers in the cluster are down.

In my testing, I used Ctrl+C to make all 3 instances leave the cluster, then simply ran the command "consul agent -server -config-dir /config -data-dir /data -bind=xx.xx.xx.xx -client=0.0.0.0" on any one node with the same IP, and that instance came up with itself as the leader. It looks like in this case I can recover the whole cluster without touching the peers.json file.

I am wondering: when do I need to provide the peers.json file, as described in the guide, to recover a complete cluster? Only in the case where all the instances have different IPs from the old ones?

Looking forward to your reply.

Thanks,
Allen

@weirdan

weirdan commented Sep 28, 2016

Instances that have left do not appear to count toward the number of instances required for consensus. Otherwise it would be impossible to replace an entire cluster by replacing its peers one by one (and that's certainly possible, because I did it myself a couple of days ago, on an older version too).

Given your configuration ("skip_leave_on_interrupt": false), when you Ctrl+C (which sends SIGINT) the server, it gracefully leaves the cluster, effectively scaling the cluster down.
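
For reference, here is the quorum arithmetic under the standard Raft majority rule (quorum is a majority of the current voting configuration), which is why graceful leaves shrink the quorum along with the cluster:

    quorum(N) = floor(N/2) + 1

    3 voters -> quorum 2   (tolerates 1 failed server)
    2 voters -> quorum 2   (after one graceful leave; tolerates 0 failures)
    1 voter  -> quorum 1   (after two graceful leaves; the last server elects itself)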

@hehailong5
Author

So the key is the "skip_leave_on_interrupt": false setting, and it makes my previous test a scaling scenario rather than an outage scenario, right?

I'm wondering how to simulate an outage scenario without hard/soft restarting the machine.
Would that be the case if I replace it with "skip_leave_on_interrupt": true and use Ctrl+C to shut down all the instances? And to recover the entire cluster I would then need to prepare the peers.json file instead?

If this is the case, I have to differentiate these two cases (whether the instances left gracefully or not) in my automated recovery script. Is there any way to achieve that in the script?

@hehailong5
Author

Is it possible to recover from "Failure of All Servers in a Multi-Server Cluster"? I tried both with and without peers.json; neither worked.

@weirdan

weirdan commented Sep 29, 2016

So the key is the "skip_leave_on_interrupt": false setting, and it makes my previous test a scaling scenario rather than an outage scenario, right?

That's my understanding, yes.

I'm wondering how to simulate an outage scenario without hard/soft restarting the machine.
Would that be the case if I replace it with "skip_leave_on_interrupt": true and use Ctrl+C to shut down all the instances?

"skip_leave_on_interrupt": true plus "leave_on_terminate": false should give you a server that never voluntarily leaves the cluster (see the sketch below). I also think this is the default behavior of a server node in Consul 0.7, so you can just remove those settings from the config. Then you can Ctrl+C it, or kill -9 it, and it will appear as failed to the other peers/nodes.
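
For concreteness, a minimal sketch of that server configuration, using the same option names as above (whether these match the 0.7 defaults is an assumption on my part):

{
"server": true,
"leave_on_terminate": false,
"skip_leave_on_interrupt": true
}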

And to recover the entire cluster I would then need to prepare the peers.json file instead?

Only if their IPs (or their number) have changed, or if you need to remove a failed server without ever bringing it back up. If you just bring the failed peer back up with the same IP, into the cluster it was in before, you won't need to edit peers.json.
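
For reference, a minimal sketch of the peers.json format as I understand it for Consul 0.7: a JSON array of the servers' Raft/RPC addresses (the server port, 8300 by default, not the HTTP port), placed under the data dir at e.g. /data/raft/peers.json before restarting the agents. The path and the placeholder addresses here are assumptions for illustration:

["10.0.0.1:8300","10.0.0.2:8300","10.0.0.3:8300"]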

@hehailong5
Author

The latest findings:

  1. EnableSingleNode is enabled by default in 0.7.0; this is what lets the last instance still elect itself as the leader.

  2. With all the IPs unchanged, I can recover the entire cluster easily by just running "consul agent -server ..." on each node, one by one.

  3. With all the IPs changed, after providing a peers.json with the new IPs, I am not able to recover the cluster. I get the following logs:

    2016/09/29 11:31:46 [INFO] consul: found peers.json file, recovering Raft configuration...
    2016/09/29 11:31:46 [INFO] consul.fsm: snapshot created in 21.858s
    2016/09/29 11:31:46 [INFO] snapshot: Creating new snapshot at /data/raft/snapshots/55-552-1475148706347.tmp
    2016/09/29 11:31:46 [INFO] consul: deleted peers.json file after successful recovery
    2016/09/29 11:31:46 [INFO] raft: Restored from snapshot 55-552-1475148706347
    2016/09/29 11:31:46 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:192.167.13.1:8500 Address:192.167.13.1:8500} {Suffrage:Voter ID:192.167.13.3:8500 Address:192.167.13.3:8500} {Suffrage:Voter ID:192.167.13.4:8500 Address:192.167.13.4:8500}]
    2016/09/29 11:31:46 [INFO] raft: Node at 192.167.13.1:8300 [Follower] entering Follower state (Leader: "")
    2016/09/29 11:31:46 [WARN] memberlist: Binding to public address without encryption!
    2016/09/29 11:31:46 [INFO] serf: EventMemberJoin: ha-1 192.167.13.1
    2016/09/29 11:31:46 [INFO] serf: Attempting re-join to previously known node: ha-2: 192.168.13.3:8301
    2016/09/29 11:31:46 [INFO] consul: Adding LAN server ha-1 (Addr: tcp/192.167.13.1:8300) (DC: dc1)
    2016/09/29 11:31:46 [WARN] serf: Failed to re-join any previously known node
    2016/09/29 11:31:46 [WARN] memberlist: Binding to public address without encryption!
    2016/09/29 11:31:46 [INFO] serf: EventMemberJoin: ha-1.dc1 192.167.13.1
    2016/09/29 11:31:46 [WARN] serf: Failed to re-join any previously known node
    2016/09/29 11:31:46 [INFO] consul: Adding WAN server ha-1.dc1 (Addr: tcp/192.167.13.1:8300) (DC: dc1)
    2016/09/29 11:31:53 [ERR] agent: failed to sync remote state: No cluster leader
    2016/09/29 11:31:54 [WARN] raft: not part of stable configuration, aborting election
    2016/09/29 11:31:59 [INFO] serf: EventMemberJoin: ha-2 192.167.13.3
    2016/09/29 11:31:59 [INFO] consul: Adding LAN server ha-2 (Addr: tcp/192.167.13.3:8300) (DC: dc1)
    ==> Failed to check for updates: Get https://checkpoint-api.hashicorp.com/v1/check/consul?arch=amd64&os=linux&signature=b69062b5-dd7b-2d32-c06e-0b67391549c1&version=0.7.0: dial tcp: lookup checkpoint-api.hashicorp.com on 192.167.13.2:53: server misbehaving
    2016/09/29 11:32:13 [ERR] agent: coordinate update error: No cluster leader
    2016/09/29 11:32:21 [ERR] agent: failed to sync remote state: No cluster leader
    2016/09/29 11:32:38 [ERR] agent: coordinate update error: No cluster leader
    2016/09/29 11:32:54 [ERR] agent: failed to sync remote state: No cluster leader
    2016/09/29 11:33:02 [ERR] agent: coordinate update error: No cluster leader
    2016/09/29 11:33:11 [INFO] serf: EventMemberJoin: ha-3 192.167.13.4
    2016/09/29 11:33:11 [INFO] consul: Adding LAN server ha-3 (Addr: tcp/192.167.13.4:8300) (DC: dc1)
    2016/09/29 11:33:16 [ERR] agent: failed to sync remote state: No cluster leader
    2016/09/29 11:33:35 [ERR] agent: coordinate update error: No cluster leader
    2016/09/29 11:33:45 [ERR] agent: failed to sync remote state: No cluster leader
    2016/09/29 11:34:00 [INFO] memberlist: Suspect ha-3 has failed, no acks received

@hehailong5
Author

Updated:

  1. I was putting the wrong data in peers.json (based on peers.info):

["192.167.13.1:8500","192.167.13.3:8500","192.167.13.4:8500"]

After changing it to

["192.167.13.1:8300","192.167.13.3:8300","192.167.13.4:8300"]

it now works.
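
To tie this together, a sketch of the recovery sequence this corresponds to, using the -data-dir /data and agent flags from earlier in this thread (the peers.json path and the availability of the operator subcommand on 0.7.0 are my assumptions):

    # stop all server agents first, then on each server write the recovery file:
    cat > /data/raft/peers.json <<'EOF'
    ["192.167.13.1:8300","192.167.13.3:8300","192.167.13.4:8300"]
    EOF

    # restart each agent with its usual flags
    consul agent -server -config-dir /config -data-dir /data -bind=xx.xx.xx.xx -client=0.0.0.0

    # once a leader is elected, verify the Raft peer set
    consul operator raft -list-peers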

@slackpad
Contributor

slackpad commented Oct 6, 2016

Sorry @hehailong5 - I just committed a change that puts the right port numbers into peers.info.
