Inconsistent data between quorum nodes #11737

Closed
jcrowthe opened this issue Mar 31, 2020 · 3 comments

jcrowthe commented Mar 31, 2020

While launching new Kubernetes clusters, we occasionally get into a state where leader election for kube-scheduler and kube-controller-manager deadlocks. After investigating, we believe the source of this problem is data inconsistency between the etcd members, akin to "split brain".

The following shows the k8s leader election key inside the etcd db. Notice the renewTime value, which is inconsistent on only one of the three members of the quorum -- the etcd leader.

$ etcdctl get /registry/services/endpoints/kube-system/kube-scheduler --endpoints (done per node)
core@vm-etcd-aaacrowtherkube81-0 ~ $ sudo -E etcdctl get /registry/services/endpoints/kube-system/kube-scheduler --prefix --endpoints https://10.0.0.11:2379

(control-plane.alpha.kubernetes.io/leader�{"holderIdentity":"vmss-master-aaacrowtherkube81-gmbvm000000_1abe8a46-b460-4ac1-89d9-731a2730dfd4","leaseDurationSeconds":15,"acquireTime":"2020-03-31T23:10:17Z","renewTime":"2020-03-31T23:18:16Z","leaderTransitions":3}z"


core@vm-etcd-aaacrowtherkube81-1 ~ $ sudo -E etcdctl get /registry/services/endpoints/kube-system/kube-scheduler --prefix --endpoints https://10.0.0.12:2379

(control-plane.alpha.kubernetes.io/leader�{"holderIdentity":"vmss-master-aaacrowtherkube81-gmbvm000001_1266503f-cc40-4233-966e-38e28a8b1bc3","leaseDurationSeconds":15,"acquireTime":"2020-03-31T22:15:05Z","renewTime":"2020-03-31T22:15:05Z","leaderTransitions":0}z"

core@vm-etcd-aaacrowtherkube81-2 ~ $ sudo -E etcdctl get /registry/services/endpoints/kube-system/kube-scheduler --prefix --endpoints https://10.0.0.13:2379

(control-plane.alpha.kubernetes.io/leader�{"holderIdentity":"vmss-master-aaacrowtherkube81-gmbvm000000_1abe8a46-b460-4ac1-89d9-731a2730dfd4","leaseDurationSeconds":15,"acquireTime":"2020-03-31T23:10:17Z","renewTime":"2020-03-31T23:18:16Z","leaderTransitions":3}z"

The leader node, vm-etcd-aaacrowtherkube81-1 in this case, has the value "renewTime":"2020-03-31T22:15:05Z", which is stuck in the past. For context, the kube-scheduler instance currently holding the lease updates this etcd key every 15 seconds to prove that it is still healthy and alive. While the renewTime value is updated every 15 seconds on both the etcd-0 and etcd-2 nodes, on etcd-1 it appears stuck in the past (about an hour ago in this case, which is when this cluster was launched).

Apparent problem: even though writes are committed to etcd, the leader node fails to apply them to its own key-value store, while the other two etcd nodes update theirs correctly.
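
One way to double-check this from the client side (a rough sketch, reusing the same sudo -E TLS setup as the commands above) would be to watch the key against each member individually. Watch events are served from the member's local store, so a member that applies the write should emit a new event roughly every 15 seconds as kube-scheduler renews the lease, while a member that silently drops the apply would presumably show nothing new:

# Run each watch in its own shell and compare how often events arrive.
$ sudo -E etcdctl watch /registry/services/endpoints/kube-system/kube-scheduler --endpoints https://10.0.0.11:2379
$ sudo -E etcdctl watch /registry/services/endpoints/kube-system/kube-scheduler --endpoints https://10.0.0.12:2379
$ sudo -E etcdctl watch /registry/services/endpoints/kube-system/kube-scheduler --endpoints https://10.0.0.13:2379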

Context:

Full endpoint status output:

$ etcdctl endpoint status --write-out="json"
core@vm-etcd-aaacrowtherkube81-2 ~ $ sudo -E etcdctl endpoint status --write-out="json"  | jq -r .
[
{
  "Endpoint": "https://10.0.0.11:2379",
  "Status": {
    "header": {
      "cluster_id": 2550552932188152300,
      "member_id": 17326583157074640000,
      "revision": 2995,
      "raft_term": 5
    },
    "version": "3.4.3",
    "dbSize": 4628480,
    "leader": 10750337588698340000,
    "raftIndex": 4484,
    "raftTerm": 5
  }
},
{
  "Endpoint": "https://10.0.0.12:2379",
  "Status": {
    "header": {
      "cluster_id": 2550552932188152300,
      "member_id": 10750337588698340000,
      "revision": 1398,
      "raft_term": 5
    },
    "version": "3.4.3",
    "dbSize": 3444736,
    "leader": 10750337588698340000,
    "raftIndex": 4484,
    "raftTerm": 5
  }
},
{
  "Endpoint": "https://10.0.0.13:2379",
  "Status": {
    "header": {
      "cluster_id": 2550552932188152300,
      "member_id": 16351775401754712000,
      "revision": 2995,
      "raft_term": 5
    },
    "version": "3.4.3",
    "dbSize": 4612096,
    "leader": 10750337588698340000,
    "raftIndex": 4484,
    "raftTerm": 5
  }
}
]

Inconsistent revision count:

$ etcdctl endpoint status --write-out="json" | jq -r '.[].Status.header.revision'
core@vm-etcd-aaacrowtherkube81-1 ~ $ sudo -E etcdctl endpoint status --write-out="json"  | jq -r '.[].Status.header.revision'
3062
1422
3062

core@vm-etcd-aaacrowtherkube81-1 ~ $ sudo -E etcdctl endpoint status --write-out="table"
+------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.0.0.11:2379 | f0746394c8b33e13 |   3.4.3 |  4.6 MB |     false |         5 |       4816 |
| https://10.0.0.12:2379 | 9530df1445e4d591 |   3.4.3 |  3.4 MB |      true |         5 |       4816 |
| https://10.0.0.13:2379 | e2ed2d17ac8f5a0c |   3.4.3 |  4.6 MB |     false |         5 |       4816 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
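
For completeness, comparing per-member KV-store hashes at a fixed revision should expose the divergence more directly. A sketch we have not yet run against this cluster (the revision 1398 below is simply the lowest revision reported in the status output above, so it should exist on every member unless it has been compacted):

# Each member hashes its own key space up to the given revision; if the
# members had applied the same sequence of writes, the hashes would match.
$ sudo -E etcdctl endpoint hashkv --rev=1398 --write-out="table" --endpoints https://10.0.0.11:2379,https://10.0.0.12:2379,https://10.0.0.13:2379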

Environment:

  • Cloud: Azure
  • 3 node etcd cluster
  • Reproducibility: Unfortunately this is very difficult to reproduce. We only notice this error in our CI environment, which launches a test cluster and runs automated tests against it on every new PR. Roughly 1 out of every 20 cluster launches exhibits this failure. We were finally able to reproduce it on a developer's test cluster, hence opening this ticket with details now. We will keep this cluster around for at least a week to aid in debugging, so feel free to request debugging info.
  • Etcd version: v3.4.3, though this has been noticed intermittently for months on v3.3.15
  • Journalctl logs (note etcd-setup.sh simply manages quorum membership automatically):
    vm-etcd-aaacrowtherkube81-0.log
    vm-etcd-aaacrowtherkube81-1.log
    vm-etcd-aaacrowtherkube81-2.log
  • One oddity that we hope is not related: the leader node, which exhibits the issue, booted 15 seconds before the other two etcd nodes.

tangcong commented Apr 1, 2020

There are three known issues (#11651, #11689, #11613) that can cause data inconsistency. Do you have auth enabled? Has your etcd cluster version been consistent the whole time? Did you upgrade your etcd cluster? Did you run the defrag command? If none of these apply, you can compare which keys are missing from each node. The etcd log has very little useful information when data is inconsistent, so you may have to replace the binary with a special debug version.
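
A rough sketch of that key comparison (key names only, reusing the TLS environment from your commands above):

# Dump the key names from each member separately, then diff the lists to see
# which keys are present on the followers but missing on the leader.
$ sudo -E etcdctl get "" --prefix --keys-only --endpoints https://10.0.0.11:2379 | sort > keys-0.txt
$ sudo -E etcdctl get "" --prefix --keys-only --endpoints https://10.0.0.12:2379 | sort > keys-1.txt
$ sudo -E etcdctl get "" --prefix --keys-only --endpoints https://10.0.0.13:2379 | sort > keys-2.txt
$ diff keys-1.txt keys-0.txt

Note that this only catches missing keys; a key that exists everywhere but holds a stale value (like the renewTime above) would need a value-level comparison as well.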

tangcong commented Apr 1, 2020

How do you reproduce this inconsistency? Can you provide detailed information about the operations involved? As you said, it is really difficult to reproduce, but we could try running a chaos monkey in a test environment to reproduce it. We have to find the first command that caused the inconsistency. Thanks. @jcrowthe

stale bot commented Jun 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jun 30, 2020
stale bot closed this as completed Jul 21, 2020