Auto Pilot breaks Quorum Invariant #5922

Closed
nahratzah opened this issue Jun 5, 2019 · 4 comments · Fixed by #6654

@nahratzah

Overview of the Issue

Consul Autopilot's Dead Server Cleanup allows Consul to serve reads, writes, etc. without quorum.

Reproduction Steps

We are using a cluster of 5 nodes.
A cluster with 5 nodes requires at least 3 nodes to maintain quorum.
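
(In general, a Raft cluster of N voters needs floor(N/2) + 1 of them for quorum, so floor(5/2) + 1 = 3 here, giving a failure tolerance of 2.)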

Create a Docker Compose file `docker-compose.yml` with the following contents:
version: "2"
services:
  consul_dc1:
    image: consul
    command: "agent -log-level debug -ui -server -client 0.0.0.0 -datacenter=dc1 -bootstrap-expect=5 -join consul_dc1"
  consul_nodes_dc1:
    image: consul
    command: "agent -log-level debug -server -bootstrap-expect=5 -datacenter=dc1 -join consul_dc1"
    depends_on:
      - consul_dc1

Start the containers:

docker-compose -p quorum up -d --scale consul_nodes_dc1=4

Wait until the cluster is fully operational and healthy.

docker-compose -p quorum exec consul_dc1 consul info
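
In the output, the consul section should show a leader_addr on every server, and the leader's raft section should show num_peers = 4 (see the Server info dump further down for the full set of fields).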

Remove 1 server by scaling down the followers to 3.

docker-compose -p quorum scale consul_nodes_dc1=3

The removed server will enter the failed status, but Autopilot Dead Server Cleanup will change this to the left status.
This can be observed by looking at the members:

docker-compose -p quorum exec consul_dc1 consul members

Once the server has entered the left status, remove another server by scaling the followers to 2.

docker-compose -p quorum scale consul_nodes_dc1=2

Once again, this removed server will enter the failed status, but Autopilot Dead Server Cleanup will change this to the left status.
This can be observed by looking at the members:

docker-compose -p quorum exec consul_dc1 consul members

Once again, we wait until the server has entered the left status.
At this moment, we're running with 3 alive servers and 2 servers in the left state.
The consul members output will look something like this:

Node          Address          Status  Type    Build  Protocol  DC   Segment
0b6017669513  172.21.0.5:8301  left    server  1.4.4  2         dc1  <all>
2602b8156afc  172.21.0.2:8301  alive   server  1.4.4  2         dc1  <all>
4feed0909559  172.21.0.6:8301  left    server  1.4.4  2         dc1  <all>
bedf4c26482a  172.21.0.4:8301  alive   server  1.4.4  2         dc1  <all>
f9dde50de3b7  172.21.0.3:8301  alive   server  1.4.4  2         dc1  <all>

According to quorum rules, our 5-node cluster now has a failure tolerance of 0, and we can no longer lose any servers without losing quorum.
So let's remove another one. :)

docker-compose -p quorum scale consul_nodes_dc1=1

Our membership will look like this:

  • 2 alive servers
  • 2 left servers
  • 1 failed server
Node          Address          Status  Type    Build  Protocol  DC   Segment
0b6017669513  172.21.0.5:8301  left    server  1.4.4  2         dc1  <all>
2602b8156afc  172.21.0.2:8301  alive   server  1.4.4  2         dc1  <all>
4feed0909559  172.21.0.6:8301  left    server  1.4.4  2         dc1  <all>
bedf4c26482a  172.21.0.4:8301  alive   server  1.4.4  2         dc1  <all>
f9dde50de3b7  172.21.0.3:8301  failed  server  1.4.4  2         dc1  <all>

If we test setting and getting a value, this succeeds:

[avandersteldt] ~/docker_img (1)# docker-compose -p quorum exec consul_dc1 consul kv put redis/config/connections 5
Success! Data written to: redis/config/connections
[avandersteldt] ~/docker_img (0)# docker-compose -p quorum exec consul_dc1 consul kv get redis/config/connections
5
To clean up the cluster afterwards:
docker-compose -p quorum stop
docker-compose -p quorum kill
docker-compose -p quorum rm -sf

Expected Behaviour

The last server loss should have caused loss of quorum, the leader should have stepped down, and the write operation should have failed.

Consul info for both Client and Server

Server info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease = 
        revision = ea5210a3
        version = 1.4.4
consul:
        acl = disabled
        bootstrap = false
        known_datacenters = 1
        leader = true
        leader_addr = 172.21.0.2:8300
        server = true
raft:
        applied_index = 107
        commit_index = 107
        fsm_pending = 0
        last_contact = 0
        last_log_index = 107
        last_log_term = 2
        last_snapshot_index = 0
        last_snapshot_term = 0
        latest_configuration = [{Suffrage:Voter ID:a4ac8ce3-7f6f-0406-6ca8-1ccb3de9ea35 Address:172.21.0.2:8300} {Suffrage:Voter ID:aa7c826a-caa2-f091-4e40-6a4b6bcf060a Address:172.21.0.4:8300} {Suffrage:Voter ID:3f6bcd8b-f9b9-30eb-328e-f09c70556f04 Address:172.21.0.3:8300}]
        latest_configuration_index = 23
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 2
runtime:
        arch = amd64
        cpu_count = 4
        goroutines = 96
        max_procs = 4
        os = linux
        version = go1.11.4
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 2
        failed = 1
        health_score = 0
        intent_queue = 0
        left = 2
        member_time = 9
        members = 5
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 3
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 12
        members = 5
        query_queue = 0
        query_time = 1

Operating system and Environment details

uname -a says: Linux 4.15.0-50-generic #54-Ubuntu SMP Mon May 6 18:46:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Log Fragments

2019/06/05 15:32:37 [INFO] autopilot: Attempting removal of failed server node "4feed0909559"
2019/06/05 15:32:57 [INFO] autopilot: Attempting removal of failed server node "0b6017669513"

Notes

The scenario requires servers to go away slowly enough for Autopilot to transition one from the failed to the left state before the next one fails.

We discovered this during testing of an upgrade script that checked both the healthy and failure_tolerance telemetry to ascertain whether a cluster was healthy. Both of these health indicators report that everything is fine with servers in the left state.

A work-around is to turn off dead server removal and only enable it during brief windows when an upgrade happens.
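
For example, something like this with the operator CLI:

consul operator autopilot set-config -cleanup-dead-servers=false
# ... perform the upgrade ...
consul operator autopilot set-config -cleanup-dead-servers=true
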
I think the long-term fix in code would be for dead server removal to only occur when a failed server can be replaced by a newly joining server, whereas a failed server that is not replaced would remain in the failed state indefinitely, or until a human operator has removed it using force-leave.

I speculate the same scenario can occur with Dead Server Removal disabled, if the servers stay in the failed state for 3 days and the automatic cleanup kicks in to remove them.
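
(If I read the docs correctly, that 3-day window corresponds to the agent's reconnect_timeout option, which defaults to 72h and controls when failed members are reaped from the member list.)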

@mkeeler
Member

mkeeler commented Jun 5, 2019

@nahratzah Thank you for the thorough bug report.

This is complicated but I am sure there is something we could do. Raft decides its quorum size based on the number of voting peers. The cleaning up of dead servers is only allowed if less than half of the number of peers would be affected by the removals. The halving of the peers is rounded down so with 5 peers autopilot will allow removing 1. At 4 peers it allows removing 1 as well. However it would not allow you to drop from 5 to 3 all at once.
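
For illustration, that check boils down to roughly this (a simplified sketch of the behavior described above, not the actual Consul source):

package main

import "fmt"

// canRemoveDeadServers paraphrases the dead server cleanup guard: removal is
// only allowed when fewer than half of the voting peers (integer division,
// i.e. rounded down) would be affected.
func canRemoveDeadServers(peers, dead int) bool {
	return dead < peers/2
}

func main() {
	fmt.Println(canRemoveDeadServers(5, 1)) // true:  5 -> 4 is allowed
	fmt.Println(canRemoveDeadServers(5, 2)) // false: 5 -> 3 in one step is refused
	fmt.Println(canRemoveDeadServers(4, 1)) // true:  4 -> 3 is allowed
	fmt.Println(canRemoveDeadServers(3, 1)) // false: the last failed server stays failed
}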

So what is happening is that when you get rid of 2 servers, the calculated quorum size becomes 2, because there are only 3 voting servers still around. You can check this with the consul operator raft list-peers command. So when you kill one of the remaining 3, technically Raft still thinks it meets the quorum size of 2 nodes.
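
In the reproduction setup above, for example:

docker-compose -p quorum exec consul_dc1 consul operator raft list-peers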

Obviously this is not the behavior you are looking for, but I have yet to determine whether it is totally incorrect or what the fix should look like. If we didn't allow this behavior, then if you actually wanted to transition from 5 servers to 3 for whatever reason, you would have to use the force-leave command to get rid of the failed node (or gracefully have the server leave the cluster). That is probably fine, as reducing your cluster size probably isn't too common.

What seems really wrong is that the code won't auto-clean up the servers if you go from 5 servers to 3 at once, but it will let you drop from 5 to 4 and then from 4 to 3. That alone makes me think that this behavior was unintended and should be fixed.

@mkeeler mkeeler added the type/bug Feature does not function as expected label Jun 5, 2019
@mkeeler mkeeler modified the milestone: 1.5.2 Jun 5, 2019
@slackpad
Contributor

slackpad commented Jun 5, 2019

If we didn't allow this behavior, then if you actually wanted to transition from 5 servers to 3 for whatever reason, you would have to use the force-leave command to get rid of the failed node (or gracefully have the server leave the cluster). That is probably fine, as reducing your cluster size probably isn't too common.

IIRC that was the original rationale for allowing this; folks would get into a lot of trouble trying to size clusters down, so autopilot makes that safer to do by reducing the quorum size when it's safe to do so, since adding new servers back will increase it automatically. One thing we thought about but didn't do was have a way to configure autopilot with the desired target cluster size, which could maybe give it some context about failed servers vs. assuming you are shrinking your cluster.

@mkeeler
Member

mkeeler commented Jun 5, 2019

@slackpad Yeah, doing some more digging and talking with the rest of the team brought up some similar scenarios, like adding new servers before reaping the old ones, which could temporarily increase the required quorum size when really it shouldn't stay that way.

I think the sanest way to prevent too many servers from being cleaned up, while still allowing the cluster size to temporarily grow and then shrink, would be to have the user specify a minimum quorum size within the autopilot configuration. We could automatically populate this from the bootstrap_expect configuration parameter if set, or have it set explicitly in the configuration. Either way it should fix the problem without breaking things for users who might be relying on the current behavior.
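
As a rough sketch, that could end up looking something like this in the server configuration (the field name is illustrative only, nothing final):

{
    "autopilot": {
        "cleanup_dead_servers": true,
        "min_quorum": 3
    }
}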

@keynslug

keynslug commented Jul 13, 2019

Not strictly related, but I guess I'm experiencing another kind of possibly unsafe behaviour caused by the overly aggressive dead server cleanup logic.

I have a system-under-test setup upon which I periodically wreak havoc with a jepsen test. The SUT consists of 5 nodes (n1, ..., n5), with a Consul server and my app on each node. Every app instance heavily uses the sessions and locks APIs of its local Consul instance to make sure there's at most one process in the cluster with a given unique ID doing its job.

consul.json
{
    "server": true,
    "bind_addr": "0.0.0.0",
    "bootstrap_expect": 5,
    "retry_join": ["n1"],
    "performance": {
        "raft_multiplier": 1
    }
}

One time I observed the following sequence of events. Please note that I picked out just the events relevant to the issue from the logs I have at hand.

  • At 17:08:59.855 the app instance on n3 created a session with TTL = 30 seconds and kept renewing it every 10 seconds.
  • At 17:09:43.952 the app instance on n3 acquired The Lock with its session.
  • Then a lot of bad things happened in the cluster, caused by jepsen randomly partitioning it every 50 seconds or so, but The Lock was held by the app instance on n3 all this time.
  • At 17:11:45 every Consul instance reported that a new leader was elected: n1.
  • At 17:12:19.928 the app instance on n3 successfully renewed its session for another 30 seconds, with a lock delay of 10 seconds.
  • At 17:12:24.311 the jepsen test runner isolated nodes (n1 n3) from (n2 n4 n5).
  • At 17:12:26 Consul on n2 righteously proclaimed itself the leader after winning an election.
  • At 17:12:34 Consul on n2 marked n3 as failed.
  • At 17:12:36 Consul's autopilot on n2 removed the failed server node n3.
  • From then on every Consul instance rejected lock acquisition attempts, noting that the lock delay would expire at 17:12:46.579.
  • At 17:12:44.311 the jepsen runner healed the cluster partitions.
  • At 17:12:47.648 the app instance on n2 successfully acquired The Lock with its session and started making progress.
  • Up until 17:12:49.944 the app instance on n3 thought that it still held The Lock and was also making progress.
  • Finally, at 17:12:49.944 the app instance on n3 failed to renew its session one more time and then died, which stopped it from making progress.

So there were ~2 seconds during which two processes with the same ID were up and running. I was quite bewildered that autopilot decided to consider n3 as left a mere 2 seconds after it was deemed failed, even though it still had a live session at that moment. I'm also kinda curious about what exactly kicked autopilot's logic in.
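
For context, the sessions-and-locks usage above boils down to roughly the standard HTTP API flow (the key name and session ID below are placeholders):

# create a session with a 30 second TTL and a 10 second lock delay
curl -X PUT -d '{"TTL": "30s", "LockDelay": "10s"}' http://127.0.0.1:8500/v1/session/create
# acquire The Lock with that session
curl -X PUT "http://127.0.0.1:8500/v1/kv/the-lock?acquire=<session-id>"
# renew the session every 10 seconds
curl -X PUT http://127.0.0.1:8500/v1/session/renew/<session-id>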

Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 1
	services = 1
build:
	prerelease = 
	revision = 40cec984
	version = 1.5.1
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 172.25.0.4:8300
	server = true
raft:
	applied_index = 134
	commit_index = 134
	fsm_pending = 0
	last_contact = 46.196847ms
	last_log_index = 134
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:02b3980b-0827-ecfa-b0c5-72d75a05f34e Address:172.25.0.4:8300} {Suffrage:Voter ID:19e1c36d-b066-77b6-2e67-5198da4b8126 Address:172.25.0.9:8300} {Suffrage:Voter ID:ddb325a6-d6f3-38bb-207f-0d4fd3da256c Address:172.25.0.7:8300} {Suffrage:Voter ID:6d00f848-9f02-9601-2ec1-53d2dd917640 Address:172.25.0.11:8300} {Suffrage:Voter ID:a2134945-006d-d90c-55b9-1b22c27edd8c Address:172.25.0.5:8300}]
	latest_configuration_index = 1
	num_peers = 4
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 2
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 94
	max_procs = 8
	os = linux
	version = go1.12.1
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 6
	members = 5
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 10
	members = 5
	query_queue = 0
	query_time = 1

If someone considers this a separate issue, I would happily make it into one and provide more details if needed.
