a data corruption bug in revoking lease when upgrading cluster from v3.2 to v3.3/v3.4+ #11689

Closed
tangcong opened this issue Mar 11, 2020 · 8 comments

@tangcong
Contributor

tangcong commented Mar 11, 2020

What happened:

Recently, our team (the TencentCloud k8s team) encountered another serious data inconsistency bug while upgrading a cluster from 3.2 to 3.3. The number of keys differs from node to node, and the cluster stops working when you deploy or update a workload.

How we troubleshot it:

We added debug logging and used a simple chaos monkey tool, and we successfully reproduced the problem. Data inconsistency in etcd is very hard to troubleshoot because of the lack of logging.
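
As a rough illustration of the first symptom (members disagreeing on the number of keys), the sketch below flags members whose revision and key count differ from the most common pair. The `memberStatus` type and the sample numbers are hypothetical and not taken from etcd; how the values are collected from each member is left out. The comparison is only meaningful when the snapshots are taken while writes are quiesced, since a healthy follower can briefly lag.

```go
package main

import (
	"fmt"
	"sort"
)

// memberStatus is a hypothetical per-member snapshot; it is not an etcd type.
type memberStatus struct {
	Name     string
	Revision int64
	KeyCount int64
}

// outliers returns the names of members whose (Revision, KeyCount) pair
// differs from the most common pair. Ties are broken arbitrarily.
func outliers(members []memberStatus) []string {
	type state struct{ rev, keys int64 }
	counts := map[state]int{}
	for _, m := range members {
		counts[state{m.Revision, m.KeyCount}]++
	}
	var common state
	best := 0
	for s, n := range counts {
		if n > best {
			common, best = s, n
		}
	}
	var out []string
	for _, m := range members {
		if (state{m.Revision, m.KeyCount}) != common {
			out = append(out, m.Name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Made-up values for illustration only.
	members := []memberStatus{
		{"A", 120, 42},
		{"B", 120, 42},
		{"C", 118, 43},
	}
	fmt.Println("members that disagree:", outliers(members))
}
```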

node A(3.2+,Leader)

637802:Mar 11 21:52:23 localhost etcd[28065]: LeaseRevoke:{LeaseRevoke 24 0  ID:2229124347589447546 } {entry-index 17 25  <nil>} {entry-term 17 2  <nil>} {rev 11 2014102077  <nil>}
637803:Mar 11 21:52:23 localhost etcd[28065]: revoke lessor,lessor id:2229124347589447546
637804:Mar 11 21:52:23 localhost etcd[28065]: revoke lessor,lessor id:2229124347589447546,key:/masterleases/A
637805:Mar 11 21:52:23 localhost etcd[28065]: apply request:header:<ID:16363671783383239993 > lease_revoke:<ID:2229124347589447546 > ,response:&{0xc000d545e0 <nil> <nil>}

node B(3.2+,Follower)

637802:Mar 11 21:52:23 localhost etcd[28065]: LeaseRevoke:{LeaseRevoke 24 0  ID:2229124347589447546 } {entry-index 17 25  <nil>} {entry-term 17 2  <nil>} {rev 11 2014102077  <nil>}
637803:Mar 11 21:52:23 localhost etcd[28065]: revoke lessor,lessor id:2229124347589447546
637804:Mar 11 21:52:23 localhost etcd[28065]: revoke lessor,lessor id:2229124347589447546,key:/masterleases/A
637805:Mar 11 21:52:23 localhost etcd[28065]: apply request:header:<ID:16363671783383239993 > lease_revoke:<ID:2229124347589447546 > ,response:&{0xc000d545e0 <nil> <nil>}

node C(3.3+,Follower)

1392010:Mar 11 21:56:42 localhost etcd[26312]: applyEntryNormal:check index, entry-index: 25, consistent-index: 24
1392011:Mar 11 21:56:42 localhost etcd[26312]: shouldApplyV3, entry-index: 25,  consistent-index: 25
1392012:Mar 11 21:56:42 localhost etcd[26312]: LeaseRevoke: ID:2229124347589447546 , entry-index: 25, entry-term: 2, rev: 2014102070
1392014:Mar 11 21:56:42 localhost etcd[26312]: request "header:<ID:16363671783383239993 > lease_revoke:<id:1eef70c8a23e4f7a>" with result "error:auth: user name is empty" took too long (1.464µs) to execute

Node C (3.3+) failed to apply the lease_revoke command (error: "auth: user name is empty"), while nodes A and B applied it and deleted the attached key. From that point the divergence keeps growing: the mvcc revisions drift apart very quickly, txn commands fail, and the data ends up corrupted.
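
To make the amplification concrete, here is a toy state machine (not etcd code; every type and name in it is hypothetical) that replays the same log on two replicas. One replica rejects the revoke the way node C did. After that, the revisions differ, and a later compare-based txn succeeds on one replica and fails on the other.

```go
package main

import (
	"errors"
	"fmt"
)

// entry is a hypothetical stand-in for a replicated log entry.
type entry struct {
	kind    string // "put", "revoke", or "txn"
	key     string
	leaseID int64
	wantRev int64 // "txn" only: write if the key's mod revision equals wantRev
}

// replica is a toy store: key -> mod revision, plus lease bookkeeping.
type replica struct {
	name        string
	rejectEmpty bool // models a member that rejects revokes without auth info
	rev         int64
	modRev      map[string]int64
	leaseKeys   map[int64][]string
}

func newReplica(name string, rejectEmpty bool) *replica {
	return &replica{name: name, rejectEmpty: rejectEmpty, rev: 1,
		modRev: map[string]int64{}, leaseKeys: map[int64][]string{}}
}

func (r *replica) apply(e entry) error {
	switch e.kind {
	case "put":
		r.rev++
		r.modRev[e.key] = r.rev
		if e.leaseID != 0 {
			r.leaseKeys[e.leaseID] = append(r.leaseKeys[e.leaseID], e.key)
		}
	case "revoke":
		if r.rejectEmpty {
			return errors.New("auth: user name is empty")
		}
		if keys := r.leaseKeys[e.leaseID]; len(keys) > 0 {
			r.rev++ // deleting the attached keys counts as one revision
			for _, k := range keys {
				delete(r.modRev, k)
			}
		}
		delete(r.leaseKeys, e.leaseID)
	case "txn":
		if r.modRev[e.key] != e.wantRev {
			return errors.New("txn compare failed")
		}
		r.rev++
		r.modRev[e.key] = r.rev
	}
	return nil
}

// replay applies the whole log and returns how many entries failed.
func replay(r *replica, log []entry) (failed int) {
	for _, e := range log {
		if err := r.apply(e); err != nil {
			failed++
		}
	}
	return failed
}

func demoLog() []entry {
	return []entry{
		{kind: "put", key: "/masterleases/A", leaseID: 7},
		{kind: "revoke", leaseID: 7},
		{kind: "txn", key: "/masterleases/A", wantRev: 0}, // create if absent
		{kind: "put", key: "/registry/pods/x"},
	}
}

func main() {
	for _, r := range []*replica{newReplica("A", false), newReplica("C", true)} {
		failed := replay(r, demoLog())
		fmt.Printf("replica %s: rev=%d failed=%d\n", r.name, r.rev, failed)
	}
}
```

In this sketch a single rejected entry is enough to make every later revision differ between the two replicas, which is why the problem does not heal on its own.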

How to fix it:

The upgrade documentation should describe this bug. Users must back up their data and be careful with the cluster upgrade; it is a high-risk operation.
We have submitted PR #11691 against release-3.2 to ensure that the auth info is not nil.
Users who want to upgrade a cluster from 3.2 to 3.3/3.4 can first upgrade the cluster to the latest 3.2 version.
Do you have any better suggestions?
@jingyih @mitake

How this bug was introduced:

PR #8031 (protecting lease revoking with auth) limits the users who can revoke leases. If the user has not been granted write permission on the keys attached to the lease, the revoke request is denied.
3.0+/3.1+/3.2+ do not limit the users who can revoke leases, so the user name in the revoke requests they propose is empty.
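
The mismatch can be sketched as follows. This is a simplified model under stated assumptions, not etcd's implementation: `requestHeader`, `revokeRequest`, `proposeOld`, `proposePatched`, and `applyNew` are hypothetical names, and the user name "root" is only an example. How PR #11691 actually populates the auth info is not shown in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// requestHeader and revokeRequest are hypothetical simplifications of the
// replicated request; they are not etcd's protobuf types.
type requestHeader struct {
	Username string
}

type revokeRequest struct {
	Header  *requestHeader
	LeaseID int64
}

// proposeOld models an older member: lease revokes are not tied to a user,
// so the proposed entry carries no auth info at all.
func proposeOld(leaseID int64, user string) revokeRequest {
	return revokeRequest{LeaseID: leaseID}
}

// proposePatched models the idea of the fix: auth info is never nil.
func proposePatched(leaseID int64, user string) revokeRequest {
	return revokeRequest{Header: &requestHeader{Username: user}, LeaseID: leaseID}
}

// applyNew models a newer member that checks the user at apply time.
// A real check would also verify write permission on the attached keys.
func applyNew(authEnabled bool, req revokeRequest) error {
	if !authEnabled {
		return nil
	}
	if req.Header == nil || req.Header.Username == "" {
		return errors.New("auth: user name is empty")
	}
	return nil
}

func main() {
	fmt.Println("old proposer:    ", applyNew(true, proposeOld(7, "root")))
	fmt.Println("patched proposer:", applyNew(true, proposePatched(7, "root")))
	fmt.Println("auth disabled:   ", applyNew(false, proposeOld(7, "root")))
}
```

The point of the model is that the same log entry is accepted by members that do not perform the check and rejected by members that do, so a mixed-version cluster with auth enabled can apply it inconsistently.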

Impact:

This bug can occur when authentication is enabled and the cluster is being upgraded from v3.0/v3.1/v3.2 to v3.3/v3.4.

wswcfan added a commit to wswcfan/etcd that referenced this issue Mar 11, 2020
LeaseRevoke may fail to apply when authentication is enabled and upgrading cluster from etcd-3.2 go etcd-3.3
wswcfan added a commit to wswcfan/etcd that referenced this issue Mar 11, 2020
LeaseRevoke may fail to apply when authentication is enabled and upgrading cluster from etcd-3.2 to etcd-3.3
@jpbetz
Contributor

jpbetz commented Mar 12, 2020

Since this only impacts clusters using auth and leases, Kubernetes is not impacted, right?

@tangcong
Contributor Author

@jpbetz it affects our Kubernetes cluster because our etcd cluster enables auth.

@jpbetz
Contributor

jpbetz commented Mar 12, 2020

@jpbetz it affects our Kubernetes cluster because our etcd cluster enables auth.

Ah

@tangcong
Contributor Author

tangcong commented Mar 12, 2020

If authentication is enabled, there is a certain probability of encountering this issue in a k8s cluster.

@tangcong
Contributor Author

tangcong commented Mar 12, 2020

Recently we encountered two data inconsistency issues (#11651, #11689) caused by authentication. Is GKE not using etcd authentication? @jpbetz @jingyih

wswcfan added a commit to wswcfan/etcd that referenced this issue Mar 12, 2020
LeaseRevoke may fail to apply when authentication is enabled and upgrading cluster from etcd-3.2 to etcd-3.3
tangcong added a commit to tangcong/etcd that referenced this issue Apr 6, 2020
xiang90 pushed a commit that referenced this issue Apr 6, 2020
tangcong added a commit to tangcong/etcd that referenced this issue Apr 10, 2020
@jingyih
Contributor

jingyih commented May 21, 2020

1392014:Mar 11 21:56:42 localhost etcd[26312]: request "header:<ID:16363671783383239993 > lease_revoke:id:1eef70c8a23e4f7a" with result "error:auth: user name is empty" took too long (1.464µs) to execute

Lease revoke may come from v3rpc user request [1] or internally in etcdserver [2]. Does this happen in both cases?

[1]

func (ls *LeaseServer) LeaseRevoke(ctx context.Context, rr *pb.LeaseRevokeRequest) (*pb.LeaseRevokeResponse, error) {

[2]

_, lerr := s.LeaseRevoke(ctx, &pb.LeaseRevokeRequest{ID: int64(lid)})

@wswcfan
Contributor

wswcfan commented May 21, 2020

@jingyih Yes, this happens in both cases. However, the internal lease revoke in etcdserver [2] may trigger this bug more easily: once a lease expires, the lease revoke call is triggered, and it doesn't carry any authentication information.

@tangcong
Contributor Author

PR #11691 fixed this issue.
