
A data corruption bug in all etcd3 versions when authentication is enabled #11651

Closed
tangcong opened this issue Feb 24, 2020 · 18 comments

tangcong commented Feb 24, 2020

What happened:

Recently, our team (the TencentCloud Kubernetes team) encountered a serious etcd data inconsistency bug. Kubernetes resources such as nodes, pods, services, and deployments could not be found when fetched with kubectl, and the cluster stopped working when deploying or updating workloads.

How we troubleshot it:

The cluster status information is as follows. You can see that node-1, node-2, and node-3 have the same raftIndex, but node-2's revision differs from the others, and the number of keys per node is also inconsistent; for example, some keys exist on the leader but not on a follower node. After adding simple debug logs, we found that the follower node failed to apply commands because its auth revision was smaller than the leader's.

[
  {
    "Endpoint": "https://node-1:2379",
    "Status": {
      "header": {
        "cluster_id": 4202817850163301400,
        "member_id": 8591302216124782000,
        "revision": 11,
        "raft_term": 5
      },
      "version": "3.4.3",
      "dbSize": 45056,
      "leader": 8717530244098866000,
      "raftIndex": 392,
      "raftTerm": 5,
      "raftAppliedIndex": 392,
      "dbSizeInUse": 32768
    }
  },
  {
    "Endpoint": "https://node-2:2379",
    "Status": {
      "header": {
        "cluster_id": 4202817850163301400,
        "member_id": 12692962784632463000,
        "revision": 1,
        "raft_term": 5
      },
      "version": "3.4.3",
      "dbSize": 49152,
      "leader": 8717530244098866000,
      "raftIndex": 392,
      "raftTerm": 5,
      "raftAppliedIndex": 392,
      "dbSizeInUse": 45056
    }
  },
  {
    "Endpoint": "https://node-3:2379",
    "Status": {
      "header": {
        "cluster_id": 4202817850163301400,
        "member_id": 8717530244098866000,
        "revision": 11,
        "raft_term": 5
      },
      "version": "3.4.3",
      "dbSize": 45056,
      "leader": 8717530244098866000,
      "raftIndex": 392,
      "raftTerm": 5,
      "raftAppliedIndex": 392,
      "dbSizeInUse": 32768
    }
  }
]

Since the followers could receive the leader's commands, we ruled out cluster split-brain and bugs in the raft implementation. So why is a follower node's auth revision smaller than the leader's?

We added debug logs and developed a simple chaos-monkey tool to reproduce it. After running for a few days, we successfully reproduced the issue.
In our debug logs, we can see that the consistentIndex is repeated and some commands are applied again when etcd is restarted:
[screenshot: debug log showing the repeated consistentIndex and re-applied commands after restart]

We found that when executing auth commands, the consistent index is not persisted. If some commands (for example, GrantRolePermission) are applied again, they also increase the auth revision.
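
To make this concrete, here is a minimal, self-contained Go sketch of the replay behaviour described above (illustrative only, not etcd code): an auth-only command bumps the auth revision but, before the fix, did not persist the consistent index, so a restart replays it and the revision drifts ahead of the other members.

package main

import "fmt"

// node models just the two counters involved in this bug.
type node struct {
	persistedIndex uint64 // consistentIndex saved in the backend (what survives a restart)
	authRevision   uint64 // auth store revision
}

// apply replays a raft entry; entries at or below the persisted index are skipped.
func (n *node) apply(entryIndex uint64, persistIndex bool) {
	if entryIndex <= n.persistedIndex {
		return // already applied before the restart
	}
	n.authRevision++ // e.g. GrantRolePermission increments the auth revision
	if persistIndex {
		n.persistedIndex = entryIndex // kv txns saved the index; auth-only commands did not
	}
}

func main() {
	n := &node{}
	n.apply(392, false) // auth command applied, index not persisted
	fmt.Println("auth revision before restart:", n.authRevision) // 1

	// After a restart the entry is replayed from the WAL; since the persisted
	// index is still 0, the same command is applied again and the revision
	// moves ahead of the members that were not restarted.
	n.apply(392, false)
	fmt.Println("auth revision after restart:", n.authRevision) // 2
}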

How to reproduce it (as minimally and precisely as possible):

start etcd
// enable auth
etcdctl user add root:xxx
etcdctl role add root
etcdctl user grant-role root root
etcdctl auth enable

etcdctl user add hello:world
etcdctl role add hello
etcdctl role grant-permission --prefix hello readwrite /   (etcdctl role del hello or etcdctl user del can also reproduce the issue)
stop etcd
get auth_revision using the etcd-dump-db tool (we improved the print format of the auth bucket in this PR; you can use it to get a better display)
cp -Rf /data/etcd.data etcd.bak; 
./etcd-dump-db iterate-bucket --decode etcd.bak auth
start etcd
get auth_revision again using the etcd-dump-db tool

Now you can see that the auth revision of this node has increased after restarting etcd, even though we did not perform any auth operation, while on other nodes that were not restarted the auth revision is unchanged.
[screenshot: etcd-dump-db output showing the auth revision increased after the restart]

After that, if the leader's auth revision is smaller than a follower's auth revision, the follower will fail to apply commands, and there won't be any error message in the etcd log. The nodes then have inconsistent data and different revisions, and a read may return the data on one node but not on another, like this:

[screenshot: the same key returned by one node but missing on another]

How to fix it:

We will submit a PR to address this serious bug. It will persist the consistentIndex into the backend store when executing auth commands.
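
To illustrate the direction of the fix (a minimal sketch using bbolt directly, not the actual #11652 patch; the bucket and key names below are modeled on etcd's on-disk layout but should be treated as illustrative), the idea is to write the applied index in the same backend transaction as the auth mutation, so a replay after a restart is recognized as already applied:

package main

import (
	"encoding/binary"
	"log"

	bolt "go.etcd.io/bbolt"
)

func putUint64(b *bolt.Bucket, key string, v uint64) error {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, v)
	return b.Put([]byte(key), buf)
}

func main() {
	db, err := bolt.Open("demo.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Persist the auth mutation and the applied raft index in ONE transaction,
	// so that after a crash/restart the entry is not replayed onto the auth store.
	err = db.Update(func(tx *bolt.Tx) error {
		auth, err := tx.CreateBucketIfNotExists([]byte("auth"))
		if err != nil {
			return err
		}
		meta, err := tx.CreateBucketIfNotExists([]byte("meta"))
		if err != nil {
			return err
		}
		if err := putUint64(auth, "authRevision", 2); err != nil {
			return err
		}
		return putUint64(meta, "consistent_index", 392)
	})
	if err != nil {
		log.Fatal(err)
	}
}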

Impact:

All etcd3 versions may encounter data inconsistency/loss when auth is enabled.


The above description is a little unclear, so let me add the following:
Whether a write request can be applied successfully on a given member depends on which node the client is connected to. It has nothing to do with which node is the leader.

For example, there are three nodes (A, B, C); A's auth revision is 1, B's is 2, and C's is 3.

client connected node \ apply result on member   A      B      C
A (write request auth revision: 1)               SUCC   FAIL   FAIL
B (write request auth revision: 2)               SUCC   SUCC   FAIL
C (write request auth revision: 3)               SUCC   SUCC   SUCC

If a client connected to node A sends a write request, the request entry's auth revision is 1, so nodes B and C fail to apply the entry.

func (as *authStore) isOpPermitted(userName string, revision uint64, key, rangeEnd []byte, permTyp authpb.Permission_Type) error {
	// TODO(mitake): this function would be costly so we need a caching mechanism
	if !as.IsAuthEnabled() {
		return nil
	}

	// only gets rev == 0 when passed AuthInfo{}; no user given
	if revision == 0 {
		return ErrUserEmpty
	}

	if revision < as.Revision() {
		return ErrAuthOldRevision
	}

	// ... remainder of the permission check omitted ...
}
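
The table above can be reproduced with a tiny model of this check (illustrative only, not etcd code): the request carries the auth revision of the node that served the client, and each member compares it against its own auth store revision when applying the entry.

package main

import "fmt"

func main() {
	// Hypothetical per-member auth store revisions from the example above.
	members := []string{"A", "B", "C"}
	authRev := map[string]uint64{"A": 1, "B": 2, "C": 3}

	for _, serving := range members {
		reqRev := authRev[serving] // revision stamped into the request by the serving node
		for _, m := range members {
			// Mirrors the isOpPermitted check: reject if the request's revision is old.
			result := "SUCC"
			if reqRev < authRev[m] {
				result = "FAIL" // ErrAuthOldRevision on this member
			}
			fmt.Printf("client on %s -> apply on %s: %s\n", serving, m, result)
		}
	}
}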

jingyih commented Feb 24, 2020

cc @mitake


wswcfan commented Feb 24, 2020

#11643
#9630
#10407
These may be caused by the same bug.


jingyih commented Mar 1, 2020

Fixed by #11652.

jingyih closed this as completed Mar 1, 2020

jingyih commented Mar 1, 2020

@tangcong Thanks again for fixing this! Could you expand on the following statement? I am trying to understand if/how etcd fails to apply a command but reports no error. Does the etcd client receive an error when this happens?

After that, if the leader's auth revision is smaller than a follower's auth revision, the follower will fail to apply commands, and there won't be any error message in the etcd log.


tangcong commented Mar 1, 2020

func (a *applierV3backend) Apply(r *pb.InternalRaftRequest) *applyResult {
	ar := &applyResult{}
	defer func(start time.Time) {
		warnOfExpensiveRequest(a.s.getLogger(), start, &pb.InternalRaftStringer{Request: r}, ar.resp, ar.err)
	}(time.Now())

The ar.err message is "auth revision is old" when the auth revision is inconsistent. I think it is worth improving this by adding an error log and an etcd_server_proposals_fail_applied_total metric. What do you think? I can submit another PR to add it.
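
As an illustration only, here is a rough sketch of what such a warning log and failure counter could look like (the metric name and the recordApplyError helper below are hypothetical, not what etcd ultimately adopted):

package main

import (
	"errors"
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// errAuthOldRevision stands in for etcd's auth.ErrAuthOldRevision.
var errAuthOldRevision = errors.New("auth: revision in header is old")

// proposalsFailApplied is a hypothetical counter for apply errors.
var proposalsFailApplied = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "etcd",
	Subsystem: "server",
	Name:      "proposals_fail_applied_total",
	Help:      "Total number of raft entries whose apply returned an error.",
})

func init() { prometheus.MustRegister(proposalsFailApplied) }

// recordApplyError would be called from the apply path whenever ar.err is non-nil.
func recordApplyError(err error) {
	if err == nil {
		return
	}
	proposalsFailApplied.Inc()
	if errors.Is(err, errAuthOldRevision) {
		log.Printf("warning: failed to apply request: %v (auth store may be inconsistent across members)", err)
	}
}

func main() {
	recordApplyError(errAuthOldRevision) // example: a member rejecting a stale-auth-revision entry
}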

The client did not receive an error because applying the command succeeded on the node the client was connected to.

@jingyih


jingyih commented Mar 1, 2020

I see. So the client may or may not see the error "auth revision is old", depending on which etcd member in the cluster serves it. It is the same raft log entry replicated to all 3 etcd members, but they apply it differently because their auth store revisions are inconsistent. It feels like the inconsistency issue is amplified, in the sense that it starts with re-applying one or a few auth-related raft log entries, but this later leads to growing inconsistency in the mvcc store.

Adding a log warning when "auth revision is old" happens sounds good to me.

Not sure about the etcd_server_proposals_fail_applied_total metric. Technically, the raft entry is successfully applied (but differently on each server). I checked the existing metrics; there is etcd_mvcc_put_total. Maybe we could have something like etcd_debugging_mvcc_failed_put_total? debugging just means the metric is new and not stable yet. Later we can choose to make it stable or remove it, depending on how useful it actually is. (EDIT: I just realized this is not failing in mvcc)


jingyih commented Mar 1, 2020

When auth store inconsistency leads to inconsistency in the mvcc store, should we be able to tell from mvcc metrics such as etcd_mvcc_put_total? @tangcong


tangcong commented Mar 1, 2020

Yes, the etcd_mvcc_put_total metric is useful; it differs across etcd members when the auth store is inconsistent. However, it also differs when there are many in-flight write requests, so it is a little difficult for us to choose a reasonable alarm threshold when configuring alert rules. @jingyih


jingyih commented Mar 2, 2020

Yes, the etcd_mvcc_put_total metric is useful; it differs across etcd members when the auth store is inconsistent. However, it also differs when there are many in-flight write requests, so it is a little difficult for us to choose a reasonable alarm threshold when configuring alert rules.

Right, I understand that etcd_mvcc_put_total differs on each member, depending on the progress of each member's apply. When the auth store revision is corrupted on a member, will all subsequent key-value requests on that member be rejected with "auth revision is old"? If so, can we expect to see that mvcc metric diverge very fast?


jingyih commented Mar 2, 2020

@tangcong Maybe also consider trying the alpha feature --experimental-corrupt-check-time if you upgrade to v3.4.4 in the future. It does periodic corruption checks among the servers in a cluster.


tangcong commented Mar 3, 2020

[screenshot: mvcc metrics diverging across members]
@jingyih Yes, we can expect to see that mvcc metric diverge fast.
[screenshot: monitoring alert comparing key counts across nodes]
At that time, our etcd monitoring system issued an alert by comparing the number of keys on each node.


tangcong commented Mar 3, 2020

When will this bugfix be backported to release-3.3 / release-3.4? @jingyih


tangcong commented Mar 3, 2020

@jingyih I tried enabling the corruption check and found that it is very slow and often times out when there are a lot of keys (for example, 1 million keys). The timeout is not modifiable; can the timeout parameter be made configurable?
The incremental corruption check proposed in #10893 sounds good.


jingyih commented Mar 3, 2020

I tried enabling the corruption check and found that it is very slow and often times out when there are a lot of keys (for example, 1 million keys). The timeout is not modifiable; can the timeout parameter be made configurable?

Unfortunately the timeout is not configurable for now. There should be 5+ seconds for each remote API call that fetches the hash from peers; is that not enough? Could you try etcdctl endpoint hashkv and see how long it takes?

The incremental corruption check proposed in #10893 sounds good.

Yes, all ideas on making the corruption check better in etcd are welcome.


tangcong commented Mar 3, 2020

The etcd version is 3.4.3, with three nodes; the initial corruption check takes 30 seconds.

14730644:Mar  4 00:46:39 localhost etcd[7346]: 2a111ffbb45ec018 starting initial corruption check with timeout 15s...
14730689:Mar  4 00:47:09 localhost etcd[7346]: 2a111ffbb45ec018 succeeded on initial corruption checking: no corruption
14730691:Mar  4 00:47:09 localhost etcd[7346]: enabled corruption checking with 3m0s interval

The periodic corruption check also produces error logs:

Mar 04 00:47:09 VM-0-105-ubuntu etcd[7346]: 2a111ffbb45ec018 hash-kv error "context deadline exceeded" on peer "https://x.x.x.149:2380" with revision 395483904
Mar 04 00:46:54 VM-0-105-ubuntu etcd[7346]: 2a111ffbb45ec018 hash-kv error "context deadline exceeded" on peer "https://x.x.x.53:2380" with revision 395483904

However, etcdctl endpoint hashkv is very fast:

root@VM-0-105-ubuntu:~/# time ETCDCTL_API=3 etcdctl  endpoint  hashkv --endpoints https://x.x.x.53:2379 --cluster
https://x.105:2379, 1938650550
https://x.53:2379, 885147713
https://x.149:2379, 1965074925
real    0m0.228s
user    0m0.032s
sys     0m0.004s

@jingyih


tangcong commented Mar 4, 2020


I see. The leader failed to connect to the other etcd members in the getPeerHashKVs function; the corruption check itself is not expensive when the cluster has 1 million keys. You fixed this in PR #11636 (v3.4.4). Most of our clusters are now on version 3.3.17; in the future we will try this feature after upgrading, depending on the actual situation. Thanks.


jingyih commented Mar 4, 2020

@tangcong Good to hear. Thanks for sharing the info.


liukun commented Apr 20, 2021

Found that the fix landed in v3.4.8, but v3.4.9 is needed to successfully run a cluster.
