Inconsistent data between quorum nodes #11737

Closed
jcrowthe opened this issue Mar 31, 2020 · 3 comments

jcrowthe commented Mar 31, 2020

While launching new Kubernetes clusters, we occasionally get into a state where leader election for kube-scheduler and kube-controller-manager deadlocks. After investigating, we believe the source of this problem is data inconsistency between the etcd members, akin to "split brain".

The following shows the k8s leader election key inside the etcd db. Notice the renewTime value, which is inconsistent on only one of the three members of the quorum -- the etcd leader.

$ etcdctl get /registry/services/endpoints/kube-system/kube-scheduler --endpoints (done per node)
core@vm-etcd-aaacrowtherkube81-0 ~ $ sudo -E etcdctl get /registry/services/endpoints/kube-system/kube-scheduler --prefix --endpoints https://10.0.0.11:2379

(control-plane.alpha.kubernetes.io/leader�{"holderIdentity":"vmss-master-aaacrowtherkube81-gmbvm000000_1abe8a46-b460-4ac1-89d9-731a2730dfd4","leaseDurationSeconds":15,"acquireTime":"2020-03-31T23:10:17Z","renewTime":"2020-03-31T23:18:16Z","leaderTransitions":3}z"


core@vm-etcd-aaacrowtherkube81-1 ~ $ sudo -E etcdctl get /registry/services/endpoints/kube-system/kube-scheduler --prefix --endpoints https://10.0.0.12:2379

(control-plane.alpha.kubernetes.io/leader�{"holderIdentity":"vmss-master-aaacrowtherkube81-gmbvm000001_1266503f-cc40-4233-966e-38e28a8b1bc3","leaseDurationSeconds":15,"acquireTime":"2020-03-31T22:15:05Z","renewTime":"2020-03-31T22:15:05Z","leaderTransitions":0}z"

core@vm-etcd-aaacrowtherkube81-2 ~ $ sudo -E etcdctl get /registry/services/endpoints/kube-system/kube-scheduler --prefix --endpoints https://10.0.0.13:2379

(control-plane.alpha.kubernetes.io/leader�{"holderIdentity":"vmss-master-aaacrowtherkube81-gmbvm000000_1abe8a46-b460-4ac1-89d9-731a2730dfd4","leaseDurationSeconds":15,"acquireTime":"2020-03-31T23:10:17Z","renewTime":"2020-03-31T23:18:16Z","leaderTransitions":3}z"

The leader node, vm-etcd-aaacrowtherkube81-1 in this case, has the value "renewTime":"2020-03-31T22:15:05Z", which is stuck in the past. For context, the kube-scheduler instance currently holding the lease updates this etcd key every 15 seconds to prove that it is still healthy and alive. While the renewTime value is updated every 15 seconds on both the etcd-0 and etcd-2 nodes, on etcd-1 it appears stuck in the past (about an hour ago in this case, which is when this cluster was launched).

Apparent problem: even though writes are committed to etcd, the leader node fails to apply them to its own key-value store, while the other two etcd nodes update theirs correctly.
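
One way to double-check this from the client side (a rough sketch, reusing the same sudo -E TLS setup as the commands above) would be to watch the key against each member individually. Watch events are served from the member's local store, so a member that applies the write should emit a new event roughly every 15 seconds as kube-scheduler renews the lease, while a member that silently drops the apply would presumably show nothing new:

# Run each watch in its own shell and compare how often events arrive.
$ sudo -E etcdctl watch /registry/services/endpoints/kube-system/kube-scheduler --endpoints https://10.0.0.11:2379
$ sudo -E etcdctl watch /registry/services/endpoints/kube-system/kube-scheduler --endpoints https://10.0.0.12:2379
$ sudo -E etcdctl watch /registry/services/endpoints/kube-system/kube-scheduler --endpoints https://10.0.0.13:2379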

Context:

Full endpoint status output:

$ etcdctl endpoint status --write-out="json"
core@vm-etcd-aaacrowtherkube81-2 ~ $ sudo -E etcdctl endpoint status --write-out="json"  | jq -r .
[
{
  "Endpoint": "https://10.0.0.11:2379",
  "Status": {
    "header": {
      "cluster_id": 2550552932188152300,
      "member_id": 17326583157074640000,
      "revision": 2995,
      "raft_term": 5
    },
    "version": "3.4.3",
    "dbSize": 4628480,
    "leader": 10750337588698340000,
    "raftIndex": 4484,
    "raftTerm": 5
  }
},
{
  "Endpoint": "https://10.0.0.12:2379",
  "Status": {
    "header": {
      "cluster_id": 2550552932188152300,
      "member_id": 10750337588698340000,
      "revision": 1398,
      "raft_term": 5
    },
    "version": "3.4.3",
    "dbSize": 3444736,
    "leader": 10750337588698340000,
    "raftIndex": 4484,
    "raftTerm": 5
  }
},
{
  "Endpoint": "https://10.0.0.13:2379",
  "Status": {
    "header": {
      "cluster_id": 2550552932188152300,
      "member_id": 16351775401754712000,
      "revision": 2995,
      "raft_term": 5
    },
    "version": "3.4.3",
    "dbSize": 4612096,
    "leader": 10750337588698340000,
    "raftIndex": 4484,
    "raftTerm": 5
  }
}
]

Inconsistent revision count:

$ etcdctl endpoint status --write-out="json" | jq -r '.[].Status.header.revision'
core@vm-etcd-aaacrowtherkube81-1 ~ $ sudo -E etcdctl endpoint status --write-out="json"  | jq -r '.[].Status.header.revision'
3062
1422
3062

core@vm-etcd-aaacrowtherkube81-1 ~ $ sudo -E etcdctl endpoint status --write-out="table"
+------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.0.0.11:2379 | f0746394c8b33e13 |   3.4.3 |  4.6 MB |     false |         5 |       4816 |
| https://10.0.0.12:2379 | 9530df1445e4d591 |   3.4.3 |  3.4 MB |      true |         5 |       4816 |
| https://10.0.0.13:2379 | e2ed2d17ac8f5a0c |   3.4.3 |  4.6 MB |     false |         5 |       4816 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
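
For completeness, comparing per-member KV-store hashes at a fixed revision should expose the divergence more directly. A sketch we have not yet run against this cluster (the revision 1398 below is simply the lowest revision reported in the status output above, so it should exist on every member unless it has been compacted):

# Each member hashes its own key space up to the given revision; if the
# members had applied the same sequence of writes, the hashes would match.
$ sudo -E etcdctl endpoint hashkv --rev=1398 --write-out="table" --endpoints https://10.0.0.11:2379,https://10.0.0.12:2379,https://10.0.0.13:2379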

Environment:

  • Cloud: Azure
  • 3 node etcd cluster
  • Reproducibility: Unfortunately this is very difficult to reproduce. We only notice this error in our CI environment, which launches a test cluster and runs automated tests against it on every new PR. Roughly 1 out of every 20 cluster launches exhibits this failure. We were finally able to reproduce it on a developer's test cluster, hence opening this ticket with details now. We will keep this cluster around for at least a week to aid in debugging, so feel free to request debugging info.
  • Etcd version: v3.4.3, though this has been noticed intermittently for months on v3.3.15
  • Journalctl logs (note etcd-setup.sh simply manages quorum membership automatically):
    vm-etcd-aaacrowtherkube81-0.log
    vm-etcd-aaacrowtherkube81-1.log
    vm-etcd-aaacrowtherkube81-2.log
  • One oddity that we hope is not related: the leader node, which exhibits the issue, booted 15 seconds before the other two etcd nodes.

tangcong commented Apr 1, 2020

There are three known issues (#11651, #11689, #11613) that can cause data inconsistency. Do you have auth enabled? Has your etcd cluster version been consistent the whole time? Did you upgrade your etcd cluster? Did you run the defrag command? If none of these apply, you can compare which keys are missing from each node. The etcd log has very little useful information when data is inconsistent, so you may have to replace the binary with a special debug version.
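
A rough sketch of that key comparison (key names only, reusing the TLS environment from your commands above):

# Dump the key names from each member separately, then diff the lists to see
# which keys are present on the followers but missing on the leader.
$ sudo -E etcdctl get "" --prefix --keys-only --endpoints https://10.0.0.11:2379 | sort > keys-0.txt
$ sudo -E etcdctl get "" --prefix --keys-only --endpoints https://10.0.0.12:2379 | sort > keys-1.txt
$ sudo -E etcdctl get "" --prefix --keys-only --endpoints https://10.0.0.13:2379 | sort > keys-2.txt
$ diff keys-1.txt keys-0.txt

Note that this only catches missing keys; a key that exists everywhere but holds a stale value (like the renewTime above) would need a value-level comparison as well.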

tangcong commented Apr 1, 2020

How do you reproduce this inconsistency? Can you provide detailed information about the operations involved? As you said, it is really difficult to reproduce, but we could try running a chaos monkey in a test environment to reproduce it. We have to find the first command that caused the inconsistency. Thanks. @jcrowthe

stale bot commented Jun 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jun 30, 2020
stale bot closed this as completed Jul 21, 2020