A data corruption bug in all etcd3 versions when authentication is enabled #11651
cc @mitake
Fixed by #11652.
@tangcong Thanks again for fixing this! Could you expand on the following statement? I am trying to understand if / how etcd fails to apply a command but reports no error. Does the etcd client receive an error when this happens?
The ar.err message is "auth revision is old" when the auth revision is inconsistent. I think it is worth improving this by adding an error log and an etcd_server_proposals_fail_applied_total metric. What do you think? I can submit another PR for it. The client did not receive an error because the apply of the command succeeded on the node the client was connected to.
I see. So the client may or may not see the "auth revision is old" error, depending on which etcd member in the cluster serves it. It is the same raft log entry replicated to all 3 etcd members, but they apply it differently because their auth store revisions are inconsistent. It feels like the inconsistency is amplified, in the sense that it starts with re-applying one or a few auth-related raft log entries, but later this leads to growing inconsistency in the mvcc store. Adding a warning log when "auth revision is old" happens sounds good to me. Not sure about the proposed metric.
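A minimal sketch of the warning-plus-counter idea discussed above, assuming a hypothetical apply helper; this is not etcd's real apply code, and the metric name is simply the one proposed in this thread.

```go
package main

import (
	"errors"
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// errAuthOldRevision and applyEntry are illustrative stand-ins, not etcd internals.
var errAuthOldRevision = errors.New("auth revision is old")

// Counter using the metric name proposed in this thread.
var proposalsFailApplied = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "etcd_server_proposals_fail_applied_total",
	Help: "Total number of raft entries that failed to apply on this member.",
})

func init() { prometheus.MustRegister(proposalsFailApplied) }

// applyEntry models the member-side apply step: instead of dropping the entry
// silently, it logs a warning and bumps the failure counter.
func applyEntry(entryAuthRev, storeAuthRev uint64) error {
	if entryAuthRev < storeAuthRev {
		log.Printf("warning: failed to apply entry: %v (entry auth rev %d < store auth rev %d)",
			errAuthOldRevision, entryAuthRev, storeAuthRev)
		proposalsFailApplied.Inc()
		return errAuthOldRevision
	}
	return nil
}

func main() {
	_ = applyEntry(1, 2) // logs a warning and increments the counter
}
```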
When auth store inconsistency leads to inconsistency in the mvcc store, we should be able to tell from mvcc metrics such as etcd_mvcc_put_total.
Yes, the etcd_mvcc_put_total metric is useful: it differs between etcd members when the auth store is inconsistent. However, it also differs momentarily between members when there are many write requests, so it is a little difficult for us to choose a reasonable alarm threshold when we configure alarm rules. @jingyih
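As a rough illustration of that comparison, here is a sketch that scrapes etcd_mvcc_put_total from each member's Prometheus /metrics endpoint and prints the values side by side; the member URLs are placeholders. A gap that keeps growing between members, rather than a small transient difference under write load, would be the signal discussed above.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// putTotal fetches a member's /metrics page and returns the
// etcd_mvcc_put_total line, if present.
func putTotal(metricsURL string) (string, error) {
	resp, err := http.Get(metricsURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "etcd_mvcc_put_total") {
			return sc.Text(), nil
		}
	}
	return "", fmt.Errorf("etcd_mvcc_put_total not found at %s", metricsURL)
}

func main() {
	// Placeholder member metrics endpoints.
	members := []string{
		"http://node-1:2379/metrics",
		"http://node-2:2379/metrics",
		"http://node-3:2379/metrics",
	}
	for _, m := range members {
		line, err := putTotal(m)
		if err != nil {
			fmt.Println(m, "error:", err)
			continue
		}
		fmt.Println(m, "->", line)
	}
}
```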
Right, I understand.
@tangcong Maybe also consider trying the alpha corruption check feature (the --experimental-initial-corrupt-check and --experimental-corrupt-check-time flags).
When will this bugfix be backported to release 3.3 / release 3.4? @jingyih
Unfortunately the timeout is not configurable for now. There should be 5+ seconds for each remote API call for fetching the hash from peers; is that not enough? Could you try
Yes, all ideas on making the corruption check better in etcd are welcome.
etcd version is 3.4.3, three nodes; the initial corruption check takes 30 seconds.
The periodic corruption check also produces error logs.
However, etcdctl endpoint hashkv is very fast.
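For reference, a sketch of the programmatic equivalent of etcdctl endpoint hashkv, using the clientv3 Maintenance API (import path as in etcd 3.4; endpoints are placeholders): it asks each member for its KV hash so the hashes can be compared across members.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Placeholder endpoints for the three members.
	endpoints := []string{"node-1:2379", "node-2:2379", "node-3:2379"}
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	for _, ep := range endpoints {
		// rev=0 hashes up to the member's latest revision; pass a fixed
		// revision instead to compare all members at the same point.
		resp, err := cli.HashKV(ctx, ep, 0)
		if err != nil {
			fmt.Println(ep, "error:", err)
			continue
		}
		fmt.Printf("%s hash=%d revision=%d\n", ep, resp.Hash, resp.Header.Revision)
	}
}
```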
I see. The leader failed to connect to the other etcd members in the getPeerHashKVs function; the corruption check itself is not expensive when the cluster has 1 million keys. You have fixed it in PR #11636 (v3.4.4). Most of our clusters are currently on version 3.3.17, so we will try this feature after upgrading, depending on the actual situation. Thanks.
@tangcong Good to hear. Thanks for sharing the info.
Found that the fix landed in v3.4.8, but v3.4.9 is needed to successfully run a cluster.
What happened:
Recently, our team (the TencentCloud Kubernetes team) encountered a serious etcd data inconsistency bug. Kubernetes resources such as nodes, pods, services, and deployments could not be found when using kubectl to get resources, and the cluster did not work when deploying or updating workloads.
How we troubleshot it:
The cluster status information is as follows. You can see that node-1, node-2, and node-3 have the same raftIndex, but node-2's revision is different from the others. The number of keys per node is also inconsistent; for example, some keys exist on the leader but not on a follower node. After adding a simple debug log, we found that the reason the follower node failed to apply commands is that its auth revision is smaller than the leader's.
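The same per-member comparison can be done with the clientv3 Maintenance API; the sketch below (endpoints are placeholders, import path as in etcd 3.4) prints each member's raftIndex and store revision, which is the divergence signature described above.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Placeholder endpoints for the three members.
	endpoints := []string{"node-1:2379", "node-2:2379", "node-3:2379"}
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	for _, ep := range endpoints {
		st, err := cli.Status(ctx, ep)
		if err != nil {
			fmt.Println(ep, "error:", err)
			continue
		}
		// The same raftIndex with a different revision across members means
		// some members silently dropped entries while applying.
		fmt.Printf("%s raftIndex=%d revision=%d leader=%x\n",
			ep, st.RaftIndex, st.Header.Revision, st.Leader)
	}
}
```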
Since the followers did receive the leader's entries, we ruled out cluster split brain and bugs in the raft implementation. So why was a follower node's auth revision smaller than the leader's?
We added debugging logs and developed a simple chaos monkey tool to reproduce it. After running for a few days, we successfully reproduced the issue.
In our debugging log, we can see that the consistentIndex is repeated and some commands are applied again when etcd restarts.
We found that when executing auth commands, the consistent index is not persisted, so some commands (for example, GrantRolePermission) are applied again after a restart, and each re-apply increases the auth revision again.
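To make the mechanism concrete, here is a self-contained toy model (the types and functions are illustrative stand-ins, not etcd's real internals): without persisting the consistent index alongside the auth data, a replayed entry bumps the auth revision a second time; with it persisted, the replay is detected and skipped.

```go
package main

import "fmt"

// authStore is a toy stand-in for the member-local auth store.
type authStore struct {
	authRevision    uint64
	consistentIndex uint64 // index of the last applied raft entry
}

// applyAuthCommand applies an auth command (e.g. GrantRolePermission) carried
// by the raft entry at entryIndex. persistIndex toggles whether the consistent
// index is recorded together with the auth change.
func (s *authStore) applyAuthCommand(entryIndex uint64, persistIndex bool) {
	if persistIndex && entryIndex <= s.consistentIndex {
		return // already applied before the restart; skip the replay
	}
	s.authRevision++
	if persistIndex {
		s.consistentIndex = entryIndex
	}
}

func main() {
	// Without persisting the index, replaying the entry after a restart
	// bumps the auth revision again.
	broken := &authStore{}
	broken.applyAuthCommand(7, false)
	broken.applyAuthCommand(7, false) // replay after restart
	fmt.Println("without persisted index, auth revision:", broken.authRevision) // 2

	// With the index persisted, the replay is recognized and skipped.
	fixed := &authStore{}
	fixed.applyAuthCommand(7, true)
	fixed.applyAuthCommand(7, true) // replay after restart
	fmt.Println("with persisted index, auth revision:", fixed.authRevision) // 1
}
```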
How to reproduce it (as minimally and precisely as possible):
Now you can see that the auth revision of this node has increased after restarting etcd, even though we did not perform any auth operation, while on the other nodes, which were not restarted, the auth revision is unchanged.
After that, if the leader's auth revision is smaller than a follower's auth revision, the follower will fail to apply commands, and there won't be any error message in the etcd log. The nodes will then have inconsistent data and different revisions, and getting the same data may succeed on one node but return nothing on another.
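One way to observe this divergence from the client side is to read the same key from each member individually with a serializable (member-local) read; the sketch below assumes placeholder endpoints and key, and the etcd 3.4 client import path.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Placeholder endpoints and key.
	endpoints := []string{"node-1:2379", "node-2:2379", "node-3:2379"}
	const key = "/registry/pods/default/example"

	for _, ep := range endpoints {
		// One client per endpoint so the serializable read is served locally.
		cli, err := clientv3.New(clientv3.Config{Endpoints: []string{ep}, DialTimeout: 5 * time.Second})
		if err != nil {
			fmt.Println(ep, "error:", err)
			continue
		}
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		resp, err := cli.Get(ctx, key, clientv3.WithSerializable())
		cancel()
		cli.Close()
		if err != nil {
			fmt.Println(ep, "error:", err)
			continue
		}
		// On a diverged cluster, the key may exist on some members only.
		fmt.Printf("%s: %d result(s) at revision %d\n", ep, len(resp.Kvs), resp.Header.Revision)
	}
}
```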
How to fix it:
We will submit a PR to address this serious bug; it will persist the consistentIndex into the backend store when executing auth commands.
Impact:
It is possible to encounter data inconsistency/loss in all etcd3 versions when auth is enabled.
The above description is a little unclear, so here is some additional detail:
Whether a write request can be applied successfully on a given member depends on which node the client is connected to; it has nothing to do with which node is the leader.
For example, there are three nodes (A, B, C); A's auth revision is 1, B's is 2, and C's is 3.
If the client sends the write request through node A, the request entry carries auth revision 1, and nodes B and C fail to apply the entry.
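A toy model of that example (names are illustrative, not etcd code): the member serving the client stamps the proposal with its own auth revision, and every member checks that stamp against its local auth store when applying the replicated entry.

```go
package main

import "fmt"

// applies reports whether a member with memberAuthRev accepts an entry stamped
// with entryAuthRev; an older stamp is rejected ("auth revision is old").
func applies(entryAuthRev, memberAuthRev uint64) bool {
	return entryAuthRev >= memberAuthRev
}

func main() {
	members := map[string]uint64{"A": 1, "B": 2, "C": 3}

	// The client is connected to A, so the entry is stamped with A's auth revision (1).
	entryAuthRev := members["A"]
	for name, rev := range members {
		fmt.Printf("member %s (auth revision %d) applies entry stamped %d: %v\n",
			name, rev, entryAuthRev, applies(entryAuthRev, rev))
	}
	// A applies the entry, while B and C silently drop it, so the members'
	// key-value stores diverge even though they share the same raft log.
}
```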