
etcd server DB out of sync undetected #12535

Closed
frans42 opened this issue Dec 9, 2020 · 28 comments · Fixed by #13676

Comments

@frans42

frans42 commented Dec 9, 2020

I run 3 etcd server instances and I frequently observe that the databases get out of sync without the etcd cluster detecting a cluster health issue. Repeated get-requests for a key will return different (versions of) values depending on which server the local proxy queries.

The only way to detect this problem is to compare the hashkv of all endpoints. All other health checks return "healthy". A typical status check looks like this (using v3 api):

# etcdctl endpoint health --cluster:

http://ceph-03:2379 is healthy: successfully committed proposal: took = 10.594707ms
http://ceph-02:2379 is healthy: successfully committed proposal: took = 942.709µs
http://ceph-01:2379 is healthy: successfully committed proposal: took = 857.871µs

# etcdctl endpoint status --cluster -w table:
+---------------------+------------------+---------+---------+-----------+-----------+------------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+---------------------+------------------+---------+---------+-----------+-----------+------------+
| http://ceph-03:2379 | e01b8ec12d1d3b22 |  3.3.11 |  635 kB |      true |       305 |      16942 |
| http://ceph-01:2379 | e13cf5b0920c4769 |  3.3.11 |  627 kB |     false |       305 |      16943 |
| http://ceph-02:2379 | fd3d871bd684ee85 |  3.3.11 |  631 kB |     false |       305 |      16944 |
+---------------------+------------------+---------+---------+-----------+-----------+------------+

# etcdctl endpoint hashkv --cluster:
http://ceph-03:2379, 2236472732
http://ceph-01:2379, 1950304164
http://ceph-02:2379, 494595250

I couldn't figure out how to check the actual cluster health with etcdctl. All I seem to be able to do is check endpoint health, but the implication "all endpoints healthy" -> "cluster is healthy" clearly does not hold.

The cluster is running etcd 3.3.11 on CentOS 7.7 with the stock packages. I attached a number of files:

etcd.info.txt - some info collected according to instructions.
etcd-ceph-01.conf.txt, etcd-ceph-02.conf.txt, etcd-ceph-03.conf.txt - the 3 config files of the etcd members
etcd-gnosis.conf.txt - a config file for a client using the etcd proxy service
etcd.log - the result of grep -e "etcd:" /var/log/messages, the etcd log goes to syslog; this log should cover at least one occasion of loss of DB consistency

I cannot attach the databases, because they contain credentials. However, I can - for some time - run commands on the databases and post results. I have a little bit of time before I need to synchronize the servers again.

In the meantime, I could use some help with recovering. This is the first time all 3 instances are different. Usually, I have only 1 out-of-sync server and can get back to normal by removing and re-adding it. However, I now have 3 different instances and it is no longer trivial to decide which copy is the latest. Therefore, if you could help me with these questions, I would be grateful:

  1. How can I print all keys with current revision number that a member holds?
  2. Is there a way to force the members to sync?

Thanks for your help.
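For question 1, a minimal sketch of how the keys and their revision metadata could be dumped per member with the Go clientv3 API (not part of the original report; the endpoint name is taken from the status output above, and the import path differs between etcd releases):

package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3" // v3.5+ uses go.etcd.io/etcd/client/v3
)

func main() {
	// Query a single member so the dump reflects only that member's view.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://ceph-01:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// An empty key with WithPrefix() ranges over the entire keyspace
	// (fine for a small store; a large one would need pagination).
	resp, err := cli.Get(ctx, "", clientv3.WithPrefix())
	if err != nil {
		panic(err)
	}

	fmt.Printf("store revision on this member: %d\n", resp.Header.Revision)
	for _, kv := range resp.Kvs {
		// ModRevision is the revision of the last write to the key;
		// Version counts how often the key has been modified.
		fmt.Printf("%s create=%d mod=%d version=%d\n",
			kv.Key, kv.CreateRevision, kv.ModRevision, kv.Version)
	}
}

Running the same dump against ceph-02 and ceph-03 and diffing the output would show which keys and revisions actually diverge.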

@agargi
Contributor

agargi commented Dec 31, 2020

@frans42 There are a number of failures in the logs alluding to the fact that the nodes have a problem communicating with each other. Can you check your network? I understand I'm not answering your questions directly, but I would suggest looking at fixing the root cause.

@frans42
Author

frans42 commented Jan 4, 2021

Hi @agargi, thanks for looking at this. The ups and downs of individual members are expected and the result of maintenance on the hosts that run the members. Maybe a bit more background is useful.

The three member-hosts also run ceph monitor and manager daemons. If there was a networking problem, ceph would tell me. The network outages you see are the result of firmware+OS updates that involved a shut-down of each host. In fact, I would assume that a noisy network should never be a root cause, as the whole point of a distributed database (KV store) is to survive noisy connections.

The problem I report here is something I also observed without any maintenance. It happens quite regularly, and I do not even have many IO requests on the etcd cluster. It is that the etcd members seem not to compare checksums of their respective databases as part of the health check. It seems that a member can lose a commit but move to the latest raft index without noticing, leading to silent corruption of the database.

I don't have a test etcd cluster at hand to reproduce this for debugging, unfortunately, and was hoping that some clue about the de-synchronisation is in the log or current state of the DB.

I need to get the etcd cluster synchronised very soon. If there are any diagnostic commands I can run on the individual DBs or members, please let me know as soon as possible.

I would like to mention that I have found many reports of the same issue, a silent corruption of the combined data (members out of sync), which seems to be a long-standing issue with etcd. I'm somewhat surprised that a simple checksum test seems not to be part of the internal health check. I can detect the issue with the output of "etcdctl endpoint hashkv --cluster", so why can't the cluster?

Best regards, Frank

@frans42
Author

frans42 commented Jan 5, 2021

I would like to add some context to my previous comment. What I'm referring to when I say that the networking is not relevant here is a mismatch between observed and expected behaviour. I understand that networking problems could be the reason for the de-synchronisation of etcd, but this is actually not the point of this ticket. The point is that an etcd cluster in this broken state should not serve requests.

What is the expected behaviour? I will use the example of 3 member instances and only consider the case where all members are up but the DBs are increasingly inconsistent.

Case A: All DBs identical: all members serve read-write requests.

Case B: One DB out of sync: the remaining 2 members in quorum serve read-write requests; the out-of-sync member goes into health-err state and stops serving any requests. (In fact, I would expect a periodic re-sync attempt to be able to join quorum again.)

Any operation with etcdctl should print a warning message and return with a special error code indicating that the operation was OK, but the cluster is not. This can be done by OR-ing a bit into the usual return codes to allow returning 2 pieces of information with one byte value.

Case C: all three DBs out of sync: all members go into health-err state and every member stops serving requests.

My etcd cluster clearly is in case C and should stop serving requests. This is completely independent of why it got into this state. In fact, if etcd is not built for self-healing in case B, I would actually expect that it stops serving requests already when 1 member is down/1 DB copy is out of sync, to protect the copies in quorum. Then I would now not be in a situation with 3 different copies, where nobody knows any more which one is authoritative.

If what I describe in cases A-C is the expected behaviour, then it is not implemented and I would fix it. If cases A-C are not expected behaviour, then please let me know what is and how I can check for and ensure consistent copies of the DB instances.

Thanks and best regards, Frank

@stale

stale bot commented Apr 5, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 5, 2021
@frans42
Author

frans42 commented Apr 6, 2021

Hi, I added additional information as requested. Could someone please look at this?

@stale stale bot removed the stale label Apr 6, 2021
@ptabor
Contributor

ptabor commented Apr 7, 2021

As long as you are using 'linearizable' reads, they should be served only if you have an active quorum.
Serializable reads might return 'stale' data if you are querying a lagging member.
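To make the distinction concrete, here is a sketch with the Go client (the key and endpoint are placeholders; by default Get is linearizable, and clientv3.WithSerializable() switches to a local, possibly stale read):

package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3" // v3.5+ uses go.etcd.io/etcd/client/v3
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://endpoint2:2379"}, // the suspect member
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Linearizable read (the default): the member confirms with the
	// quorum before answering, so it cannot silently serve lagging data.
	lin, err := cli.Get(ctx, "somekey")
	if err != nil {
		panic(err)
	}

	// Serializable read: answered purely from this member's local store;
	// cheaper, but possibly stale if the member has fallen behind.
	ser, err := cli.Get(ctx, "somekey", clientv3.WithSerializable())
	if err != nil {
		panic(err)
	}

	fmt.Printf("linearizable rev=%d, serializable rev=%d\n",
		lin.Header.Revision, ser.Header.Revision)
}

If the serializable revision stays behind the linearizable one over repeated calls, the member is not merely catching up for a moment.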

@jonaz

jonaz commented Apr 14, 2021

I think we are running into a similar issue. One or more of our members lose sync.

Create an election
etcdctl --endpoints http://endpoint1:2379 elect test asdf

Listen to election on same endpoint
etcdctl --endpoints http://endpoint1:2379 elect -l test
and we get asdf back as the value

etcdctl --endpoints http://endpoint2:2379 elect -l test
here we don't get asdf but an old value.

This is a graph from etcd_debugging_mvcc_current_revision where we can clearly see that the revision is not in sync anymore.

[screenshot: 2021-04-14-115303_1425x678_scrot]

The only logs we can see around that time are these on endpoint1, which seems to be when our leader-elector client tries to query this endpoint but it does not contain the same leases as the other 2 nodes.

Apr 14 09:40:33 endpoint1 etcd[1226]: failed to apply request "header:<ID:7179433300884323974 > lease_revoke:<id:-2d53f2a26ed98729>" with response "size:26" took (60.907µs) to execute, err is lease not found
Apr 14 09:40:33 endpoint1 etcd[1226]: failed to apply request "header:<ID:7179433300884323975 > lease_revoke:<id:-1fb3372f24800bcc>" with response "size:26" took (62.104µs) to execute, err is lease not found

@frans42
Author

frans42 commented May 3, 2021

Hi @jonaz, thanks for adding this data. This is exactly what I'm observing too. The members get out of sync and stay out of sync; it is not a temporary "catching up" as mentioned by @ptabor. Future commits can lead to members serving the same data again, but apart from the commit history being inconsistent, I also observe that the members tend to get more and more out of sync (the frequency of failed uniform commits increases).

@xonvanetta

I was debugging this now and trying some config flags and such to find something broken...

And lo and behold, it started and was synced again; the difference was that it purged a snap file when it started. The node was online, but we had turned off etcd for about a week...
May 14 14:43:56 etcd1 etcd[264300]: purged file /var/lib/etcd/member/snap/0000000000000005-0000000000c4d725.snap successfully

We also found out that the service file could be tuned a little bit like this
https://github.com/etcd-io/etcd/blob/main/contrib/systemd/etcd.service

We will try this service file and hope it works.

@jonaz

jonaz commented May 25, 2021

The above did not help. We have 3 clusters with this problem. They are only used for leases and have very little traffic. All 3 get out of sync periodically and we have to stop the entire cluster, scrap the data folder, and start them again to make them work. The symptom is that a lease is instant on one node but the other two lag 30-60 seconds, sometimes more. We are running 3.4.14.

@frans42
Author

frans42 commented May 25, 2021

As long as you are using 'linearizable' reads, they should be served only if you have an active quorum.
Serializable reads might return 'stale' data if you are querying a lagging member.

Hi @ptabor, I don't know what a serializable read would be and whether our access pattern qualifies. However, the members were not "lagging" in the usual sense of being in the process of synchronization. The members stayed out of sync for days. The only way to recover is to wipe the out-of-sync members as @jonaz writes above and re-deploy fresh instances.

My impression is that the members actually believe they are in sync. That's why they don't even make an attempt to sync. However, the condition of being out of sync is extremely easy to detect by comparing the hashkv values, and I'm surprised that (1) this is not part of the internal cluster health check and (2) there is no etcdctl command to force a re-sync from a specific source instance. This would help a lot with fixing a broken cluster.
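The cross-endpoint comparison described above can also be scripted against the maintenance API rather than etcdctl; a rough sketch with the Go client, using the hostnames from the original report (a transient mismatch can occur while a member catches up, so only a persistent difference indicates divergence):

package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3" // v3.5+ uses go.etcd.io/etcd/client/v3
)

func main() {
	endpoints := []string{ // hostnames taken from the report above
		"http://ceph-01:2379",
		"http://ceph-02:2379",
		"http://ceph-03:2379",
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// HashKV with rev=0 hashes each member's key-value store up to its
	// latest revision, the same number "etcdctl endpoint hashkv" prints.
	first := uint32(0)
	mismatch := false
	for i, ep := range endpoints {
		resp, err := cli.HashKV(ctx, ep, 0)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s hash=%d\n", ep, resp.Hash)
		if i == 0 {
			first = resp.Hash
		} else if resp.Hash != first {
			mismatch = true
		}
	}
	if mismatch {
		fmt.Println("WARNING: hashkv mismatch, members may be out of sync")
	} else {
		fmt.Println("all members report the same hashkv")
	}
}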

@jonaz

jonaz commented Jun 16, 2021

It happened again now at 14:22 in one of our production clusters. Nothing happened to the VM. No vMotion at that time. No logs in etcd except for this, exactly when the revision started to get out of sync:

failed to apply request "header:<ID:9133152155410395282 > lease_revoke:<id:-6f4bd77510edab61>" with response "size:27" took (69.202µs) to execute, err is lease not found

screenshot from prometheus:
[screenshot: 2021-06-16-154746_701x669_scrot]

We restarted the out-of-sync node with ETCD_EXPERIMENTAL_INITIAL_CORRUPT_CHECK=true
and the node won't start anymore due to data inconsistency:

Jun 16 15:50:35 etcd-prod001-db001 systemd[1]: Starting Etcd Server...
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://10.10.16.30:2379
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: recognized and used environment variable ETCD_ELECTION_TIMEOUT=5000
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: recognized and used environment variable ETCD_EXPERIMENTAL_INITIAL_CORRUPT_CHECK=true
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: recognized and used environment variable ETCD_HEARTBEAT_INTERVAL=500
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.10.16.30:2380
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: recognized and used environment variable ETCD_INITIAL_CLUSTER=etcd-prod001-db001=https://10.10.16.30:2380,etcd-prod001-db001=https://10.12.16.30:2380,etcd-prod001-db001=https://10.13.16.30:2380
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=new
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=d8bf8cc6-5158-11e6-8f13-3b32f4935bde
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://10.10.16.30:2379,http://127.0.0.1:2379
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: recognized and used environment variable ETCD_LISTEN_PEER_URLS=https://0.0.0.0:2380
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: recognized and used environment variable ETCD_NAME=etcd-prod001-db001
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: recognized and used environment variable ETCD_PEER_CERT_FILE=/etc/ssl/host.crt
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: recognized and used environment variable ETCD_PEER_CLIENT_CERT_AUTH=true
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: recognized and used environment variable ETCD_PEER_KEY_FILE=/etc/ssl/host.key
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: recognized and used environment variable ETCD_PEER_TRUSTED_CA_FILE=/etc/ipa/ca.crt
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: etcd Version: 3.4.14
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: Git SHA: 8a03d2e96
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: Go Version: go1.12.17
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: Go OS/Arch: linux/amd64
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: setting maximum number of CPUs to 2, total number of available CPUs is 2
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: the server is already initialized as member before, starting as etcd member...
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: peerTLS: cert = /etc/ssl/host.crt, key = /etc/ssl/host.key, trusted-ca = /etc/ipa/ca.crt, client-cert-auth = true, crl-file = 
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: name = etcd-prod001-db001
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: data dir = /var/lib/etcd
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: member dir = /var/lib/etcd/member
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: heartbeat = 500ms
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: election = 5000ms
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: snapshot count = 100000
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: advertise client URLs = http://10.10.16.30:2379
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: initial advertise peer URLs = https://10.10.16.30:2380
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: initial cluster = 
Jun 16 15:50:35 etcd-prod001-db001 etcd[2646000]: check file permission: directory "/var/lib/etcd" exist, but the permission is "drwxr-xr-x". The recommended permission is "-rwx------" to prevent possible unprivileged access to the data.
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: recovered store from snapshot at index 24475647
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: restarting member bd342e357e6a7f1 in cluster 6ced47e36fa4abed at commit index 24547718
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: raft2021/06/16 15:50:36 INFO: bd342e357e6a7f1 switched to configuration voters=(538209710598815423 852098298702899185 17576977890215617442)
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: raft2021/06/16 15:50:36 INFO: bd342e357e6a7f1 became follower at term 28
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: raft2021/06/16 15:50:36 INFO: newRaft bd342e357e6a7f1 [peers: [7781ae156e87ebf,bd342e357e6a7f1,f3edf843ab3263a2], term: 28, commit: 24547718, applied: 24475647, lastindex: 24547718, lastterm: 28]
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: enabled capabilities for version 3.4
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: added member 7781ae156e87ebf [https://10.13.16.30:2380] to cluster 6ced47e36fa4abed from store
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: added member bd342e357e6a7f1 [https://10.10.16.30:2380] to cluster 6ced47e36fa4abed from store
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: added member f3edf843ab3263a2 [https://10.12.16.30:2380] to cluster 6ced47e36fa4abed from store
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: set the cluster version to 3.4 from store
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: simple token is not cryptographically signed
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: unexpected Attach error: lease not found
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: unexpected Attach error: lease not found
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: unexpected Attach error: lease not found
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: unexpected Attach error: lease not found
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: unexpected Attach error: lease not found
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: unexpected Attach error: lease not found
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: unexpected Attach error: lease not found
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: unexpected Attach error: lease not found
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: unexpected Attach error: lease not found
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: unexpected Attach error: lease not found
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: unexpected Attach error: lease not found
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: unexpected Attach error: lease not found
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: unexpected Attach error: lease not found
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: unexpected Attach error: lease not found
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: unexpected Attach error: lease not found
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: starting peer 7781ae156e87ebf...
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: started HTTP pipelining with peer 7781ae156e87ebf
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: started peer 7781ae156e87ebf
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: added peer 7781ae156e87ebf
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: starting peer f3edf843ab3263a2...
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: started HTTP pipelining with peer f3edf843ab3263a2
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: started streaming with peer 7781ae156e87ebf (stream MsgApp v2 reader)
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: started streaming with peer 7781ae156e87ebf (stream Message reader)
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: started streaming with peer 7781ae156e87ebf (writer)
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: started streaming with peer 7781ae156e87ebf (writer)
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: started streaming with peer f3edf843ab3263a2 (writer)
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: started streaming with peer f3edf843ab3263a2 (writer)
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: started peer f3edf843ab3263a2
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: added peer f3edf843ab3263a2
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: bd342e357e6a7f1 starting initial corruption check with timeout 15s...
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: started streaming with peer f3edf843ab3263a2 (stream Message reader)
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: started streaming with peer f3edf843ab3263a2 (stream MsgApp v2 reader)
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: peer 7781ae156e87ebf became active
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: established a TCP streaming connection with peer 7781ae156e87ebf (stream MsgApp v2 reader)
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: established a TCP streaming connection with peer 7781ae156e87ebf (stream Message reader)
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: peer f3edf843ab3263a2 became active
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: established a TCP streaming connection with peer f3edf843ab3263a2 (stream Message reader)
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: established a TCP streaming connection with peer f3edf843ab3263a2 (stream MsgApp v2 reader)
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: bd342e357e6a7f1's hash 202913300 != 0's hash 3146108001 (revision 246, peer revision 247, compact revision -1)
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: bd342e357e6a7f1's hash 202913300 != 0's hash 3146108001 (revision 246, peer revision 247, compact revision -1)
Jun 16 15:50:36 etcd-prod001-db001 etcd[2646000]: bd342e357e6a7f1 found data inconsistency with peers
Jun 16 15:50:36 etcd-prod001-db001 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE

A thing to note is that this cluster is only ever used for leader elections and nothing else.

@jonaz

jonaz commented Jun 17, 2021

@frans42 are you also using leader election? Do you have custom software built for it like we do? We suspect that something we do in the client is causing this.

@frans42
Author

frans42 commented Jul 8, 2021

Hi @jonaz, we use a standard set-up and don't do anything special. Leaders are elected by whatever etcd has implemented.

It looks like the devs are not going to address this. I plan to move away from etcd; it is simply too unreliable with this simple health check and self-healing missing. Even if I purposefully corrupted the DB, it should detect and, if possible, fix it. However, it doesn't even detect accidental corruption, and with the lack of interest in fixing this, I cannot rely on it.

The same applies to your situation. It is completely irrelevant how you manage to throw an etcd instance off. The etcd cluster should detect and fix the problem or err out. However, it simply continues to answer requests and happily distributes inconsistent data. That's not acceptable for a distributed database.

If the idea of the devs is that a user adds a cron job for health checks herself, then the least that should be available is a command to force a re-sync of an instance against the others without the pain of removing and re-adding an instance.
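For reference, the remove-and-re-add workaround mentioned throughout this thread looks roughly like the following with the Go client. This is only a sketch: the member name is hypothetical, and stopping the process, wiping the data directory, and restarting with --initial-cluster-state=existing still happen outside the API.

package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3" // v3.5+ uses go.etcd.io/etcd/client/v3
)

func main() {
	// Talk only to the members that are still considered good.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://ceph-01:2379", "http://ceph-02:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	// 1. Look up the ID of the diverged member (matched by name here).
	members, err := cli.MemberList(ctx)
	if err != nil {
		panic(err)
	}
	var badID uint64
	for _, m := range members.Members {
		if m.Name == "ceph-03" { // hypothetical: the out-of-sync member
			badID = m.ID
		}
	}
	if badID == 0 {
		panic("member not found")
	}

	// 2. Remove it from the cluster.
	if _, err := cli.MemberRemove(ctx, badID); err != nil {
		panic(err)
	}

	// 3. Out of band: stop etcd on ceph-03 and delete its data directory.

	// 4. Re-add it. The member rejoins empty and receives a snapshot
	//    from the leader, which amounts to a forced full re-sync.
	resp, err := cli.MemberAdd(ctx, []string{"http://ceph-03:2380"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("re-added member %x; restart etcd on ceph-03 with --initial-cluster-state=existing\n", resp.Member.ID)
}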

@wilsonwang371
Contributor

@ptabor we saw a similar issue today in our company's internally deployed cluster. We may need to do some further investigation on this.

@stale

stale bot commented Nov 27, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 27, 2021
@jonaz

jonaz commented Nov 30, 2021

We have not seen this lately. But we have also not seen major VM problems, which we think were the initial cause. We have also stopped calculating our own leaseID and now let concurrency.NewSession do it for us.
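For context, the concurrency helper mentioned here grants and keeps alive the lease itself, so the application never has to compute lease IDs; a minimal sketch (endpoint and election prefix are made up):

package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3" // v3.5+ uses go.etcd.io/etcd/client/v3
	"go.etcd.io/etcd/clientv3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://endpoint1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// NewSession asks the server to grant a lease and keeps it alive in
	// the background, so the application never computes lease IDs itself.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(15))
	if err != nil {
		panic(err)
	}
	defer sess.Close()

	// Campaign blocks until this candidate becomes leader for the prefix.
	e := concurrency.NewElection(sess, "/my-election/") // hypothetical prefix
	if err := e.Campaign(context.Background(), "endpoint1"); err != nil {
		panic(err)
	}
	fmt.Println("became leader with lease", sess.Lease())
}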

@stale stale bot removed the stale label Nov 30, 2021
@chaochn47
Member

chaochn47 commented Feb 8, 2022

@jonaz The AWS team faced the same data inconsistency problem as well. After log diving, it appeared the lease_revoke apply failed because the leaseID was not found on one node. That node's mvcc current revision was the smallest compared with the other two nodes. The other two nodes' logs did not have the lease_revoke failure for this leaseID.

Under this condition, the lessor Revoke will fail to delete the associated key-values because it errors out at line 314. So the revision will start to diverge at this point.

etcd/lease/lessor.go

Lines 308 to 332 in a905430

func (le *lessor) Revoke(id LeaseID) error {
	le.mu.Lock()
	l := le.leaseMap[id]
	if l == nil {
		le.mu.Unlock()
		return ErrLeaseNotFound
	}
	defer close(l.revokec)
	// unlock before doing external work
	le.mu.Unlock()

	if le.rd == nil {
		return nil
	}

	txn := le.rd()

	// sort keys so deletes are in same order among all members,
	// otherwise the backened hashes will be different
	keys := l.Keys()
	sort.StringSlice(keys).Sort()
	for _, key := range keys {
		txn.DeleteRange([]byte(key), nil)
	}

Due to the kube-apiserver usage of etcd Txn (specifically the optimistic lock on the ResourceVersion) like the following, the revision difference will be amplified and cause more serious cascading failures, e.g. kube-scheduler and kube-controller-manager cannot acquire or update their endpoint leases.

txnResp, err := s.client.KV.Txn(ctx).If(
	clientv3.Compare(clientv3.ModRevision(key), "=", origState.rev),
).Then(
	clientv3.OpPut(key, string(newData), opts...),
).Else(
	clientv3.OpGet(key),
).Commit()
metrics.RecordEtcdRequestLatency("update", getTypeName(out), startTime)
if err != nil {
	return err
}
trace.Step("Transaction committed")
if !txnResp.Succeeded {
	getResp := (*clientv3.GetResponse)(txnResp.Responses[0].GetResponseRange())
	klog.V(4).Infof("GuaranteedUpdate of %s failed because of a conflict, going to retry", key)
	origState, err = s.getState(getResp, key, v, ignoreNotFound)
	if err != nil {
		return err
	}
	trace.Step("Retry value restored")
	mustCheckData = false
	continue
}

https://github.com/kubernetes/kubernetes/blob/release-1.22/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L422-L436


From the log you posted in a previous comment:

failed to apply request "header:<ID:9133152155410395282 > lease_revoke:<id:-6f4bd77510edab61>" with response "size:27" took (69.202µs) to execute, err is lease not found

Technically, the lease ID -6f4bd77510edab61 should never be negative, because the etcd server guarantees the leaseID is positive when processing LeaseGrant requests.

func (s *EtcdServer) LeaseGrant(ctx context.Context, r *pb.LeaseGrantRequest) (*pb.LeaseGrantResponse, error) {
	// no id given? choose one
	for r.ID == int64(lease.NoLease) {
		// only use positive int64 id's
		r.ID = int64(s.reqIDGen.Next() & ((1 << 63) - 1))
	}
	resp, err := s.raftRequestOnce(ctx, pb.InternalRaftRequest{LeaseGrant: r})
	if err != nil {
		return nil, err
	}
	return resp.(*pb.LeaseGrantResponse), nil
}
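A standalone illustration (not etcd code) of why that mask can only produce non-negative IDs: AND-ing with (1 << 63) - 1 clears the top bit, so the int64 conversion never yields a negative value.

package main

import (
	"fmt"
	"math"
)

func main() {
	// Even the worst-case generator output (all 64 bits set) becomes
	// non-negative once the sign bit is masked off.
	raw := uint64(math.MaxUint64)
	id := int64(raw & ((1 << 63) - 1))
	fmt.Println(id, id >= 0) // 9223372036854775807 true
}

So a negative ID in the lease bucket implies the value was written through some other path, for example an explicit negative ID passed to the grant API (as in the simulation later in this thread) or corruption of the stored bytes.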


For this "bad" etcd cluster, we also dumped the db file, inspected the lease bucket and found out there were multiple corrupted leases.

etcd-dump-db iterate-bucket /mnt/xvdb/datadir/member/snap/db lease --limit 3 --decode=true
lease ID=6eae7bcbcf28f962, TTL=15s
lease ID=6eae7bcbcf1cbf04, TTL=15s
panic: proto: illegal wireType 6

goroutine 1 [running]:
main.leaseDecoder(0x7f088db0caae, 0x8, 0x8, 0x7f088db0cab6, 0xd, 0xd)
	/home/ec2-user/go.etcd.io/etcd/tools/etcd-dump-db/backend.go:95 +0x195
main.iterateBucket.func1(0xc0000cc2a0, 0xc000000100, 0xc0000cc2a0)
	/home/ec2-user/go.etcd.io/etcd/tools/etcd-dump-db/backend.go:147 +0x373
go.etcd.io/bbolt.(*DB).View(0xc000118200, 0xc000121c10, 0x0, 0x0)
	/root/go/pkg/mod/go.etcd.io/bbolt@v1.3.3/db.go:725 +0x96
main.iterateBucket(0x7ffc733fc757, 0x20, 0x7ffc733fc778, 0x5, 0x1, 0x1, 0x0, 0x0)
	/home/ec2-user/go.etcd.io/etcd/tools/etcd-dump-db/backend.go:134 +0x1e5
main.iterateBucketCommandFunc(0xdd62a0, 0xc000080e10, 0x2, 0x5)
	/home/ec2-user/go.etcd.io/etcd/tools/etcd-dump-db/main.go:104 +0x11a
github.com/spf13/cobra.(*Command).execute(0xdd62a0, 0xc000080dc0, 0x5, 0x5, 0xdd62a0, 0xc000080dc0)
	/root/go/pkg/mod/github.com/spf13/cobra@v0.0.3/command.go:766 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xdd5de0, 0xa37d84, 0xc000121f78, 0x407d65)
	/root/go/pkg/mod/github.com/spf13/cobra@v0.0.3/command.go:852 +0x2fe
github.com/spf13/cobra.(*Command).Execute(...)
	/root/go/pkg/mod/github.com/spf13/cobra@v0.0.3/command.go:800
main.main()
	/home/ec2-user/go.etcd.io/etcd/tools/etcd-dump-db/main.go:65 +0x31

After editing etcd-dump-db to print the error instead of panicking right away, it shows that the lease ID (bolt key) is a negative integer and the value is not compatible with the lease proto.

etcd-dump-db iterate-bucket i-0185616eef78879ae-2022-01-30T09-05-00.etcd/member/snap/db lease --decode=true
lease ID=6eae7bcbcf28f962, TTL=15s, RemainingTTL=0s
lease ID=6eae7bcbcf1cbf04, TTL=15s, RemainingTTL=0s
proto: illegal wireType 6
lease ID=-28912891f76d7011, TTL=0s, RemainingTTL=0s
lease ID=-7430661f711f712, TTL=0s, RemainingTTL=0s
unexpected EOF
lease ID=7bcbcf1bc75308d3, TTL=0s, RemainingTTL=0s
lease ID=6eae7bcbd230ab72, TTL=15s, RemainingTTL=0s

Interestingly, the corrupted leases (lease ID < 0) will be ignored when recovering the lessor from the db file. So the key-values associated with those corrupted leases will never be deleted.

etcd/lease/lessor.go

Lines 770 to 776 in a905430

func (le *lessor) initAndRecover() {
	tx := le.b.BatchTx()
	tx.Lock()

	tx.UnsafeCreateBucket(leaseBucketName)
	_, vs := tx.UnsafeRange(leaseBucketName, int64ToBytes(0), int64ToBytes(math.MaxInt64), 0)
	// TODO: copy vs and do decoding outside tx lock if lock contention becomes an issue.

/cc @wilsonwang371 @ptabor Does the above explanation make sense to you?

@serathius
Member

@chaochn47 Thanks for digging into the issue. I would be interested in looking into this more deeply. It looks like this bug was first detected on v3.3, so it affects the last 3 releases. To make sure we fix it correctly, we should look into adding a test that can verify the fix across multiple branches.

Could you prepare minimal reproduction steps or a script that would trigger the issue?

@chaochn47
Member

chaochn47 commented Feb 8, 2022

@serathius Thanks for looking into this issue!

Even though the existence of the negative leaseID may be a result of bitflip corruption in the network (peer-to-peer raft entries replication) or disk IO (lessor-to-backend persistence), I think we could still edit the clientv3 LeaseGrant implementation to provide a negative leaseID explicitly, to mimic the corruption in a live Kubernetes cluster.

etcd/clientv3/lease.go

Lines 214 to 227 in a905430

func (l *lessor) Grant(ctx context.Context, ttl int64) (*LeaseGrantResponse, error) {
	r := &pb.LeaseGrantRequest{TTL: ttl}
	resp, err := l.remote.LeaseGrant(ctx, r, l.callOpts...)
	if err == nil {
		gresp := &LeaseGrantResponse{
			ResponseHeader: resp.GetHeader(),
			ID:             LeaseID(resp.ID),
			TTL:            resp.TTL,
			Error:          resp.Error,
		}
		return gresp, nil
	}
	return nil, toErr(ctx, err)
}

Could you prepare minimal reproduction steps or a script that would trigger the issue?

Yeah, we will update on this issue very soon.

@chaochn47
Member

chaochn47 commented Feb 8, 2022

Update: Just attached PR chaochn47#5 in my forked etcd repository with a reproduction integration test. PTAL, thanks!

The fix can be as simple as changing

_, vs := tx.UnsafeRange(leaseBucketName, int64ToBytes(0), int64ToBytes(math.MaxInt64), 0)

to

_, vs := tx.UnsafeRange(leaseBucketName, int64ToBytes(math.MinInt64), int64ToBytes(math.MaxInt64), 0)

and then letting the corrupted lease value fail Unmarshal when the lessor recovers from the db file.

For a runtime data consistency check, we might need to start creating a hashkv-like function (for example, hashlease) and raise a corruption alarm (introduced in #8420).

@serathius

@chaochn47
Member

chaochn47 commented Feb 9, 2022

I ran a simulation script in a 1.21 Kubernetes cluster.

curl -X POST -sL http://localhost:2379/v3beta/lease/grant -d '{"TTL": 360, "ID": -1}'
echo ""
etcdctl put foo bar --lease="-1" -w json
etcdctl get foo -w json

curl -X POST -sL http://localhost:2379/v3beta/lease/grant -d '{"TTL": 360, "ID": -2}'
echo ""
etcdctl put bar foo --lease="-2" -w json
etcdctl get bar -w json

curl -X POST -sL http://localhost:2379/v3beta/lease/grant -d '{"TTL": 360, "ID": -3}'
echo ""
etcdctl put hello world --lease="-3" -w json
etcdctl get hello -w json

curl -X POST -sL http://localhost:2379/v3beta/lease/grant -d '{"TTL": 360, "ID": -4}'
echo ""
etcdctl put world hello --lease="-4" -w json
etcdctl get world -w json

This was followed by the remove-member and add-member runtime reconfiguration process, one member at a time, 3 times.

The revision gap across 3 etcd nodes became larger and larger.

[screenshot: Screen Shot 2022-02-08 at 5 04 30 PM]

The kube-controller-manager on the Kubernetes control-plane node cannot acquire and renew its lease. The state of the world and the spec drift out of sync over time.

E0209 00:43:37.322749      10 leaderelection.go:361] Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io "kube-controller-manager": the object has been modified; please apply your changes to the latest version and try again

@wilsonwang371
Contributor

@chaochn47 Hi Chao, which version of etcd did you use? Did you use the latest etcd? There was a patch fixing this issue: #13505

@chaochn47
Member

chaochn47 commented Feb 9, 2022

@wilsonwang371 Hi wilson, we are using etcd version 3.4.18. release-3.4 doesn't have the issue #13505 was trying to fix.

// UnsafeRange returns keys and values. And in boltdb, keys are revisions.
// values are actual key-value pairs in backend.
tx := s.store.b.ReadTx()
tx.RLock()
revs, vs := tx.UnsafeRange(keyBucketName, minBytes, maxBytes, 0)
var evs []mvccpb.Event
if s.store != nil && s.store.lg != nil {
	evs = kvsToEvents(s.store.lg, wg, revs, vs)
} else {
	// TODO: remove this in v3.5
	evs = kvsToEvents(nil, wg, revs, vs)
}
tx.RUnlock()

xref. https://github.com/etcd-io/etcd/blob/v3.4.18/mvcc/watchable_store.go#L351-L363

For my education, is the lease bucket data corruption a victim of deep copying during boltdb re-mmap? How would a read impact the write path?

@wilsonwang371
Contributor

@wilsonwang371 Hi wilson, we are using etcd version 3.4.18. release-3.4 doesn't have the issue #13505 was trying to fix.

For my education, is the lease bucket data corruption a victim of deep copying during boltdb re-mmap? How would a read impact the write path?

No, actually I am not sure about this yet. The root cause of our internal observation of the DB being out of sync is #13505.

@serathius
Member

Seems like the root cause for etcd diverging is a lease with a negative ID. As such leases are not loaded during init, restarted members fail to revoke those leases.

Questions still remain: how were such leases created? Is this a bug in server lease ID generation, lease persisting, the client, or the application? I don't think it's disk data corruption, as flipping the sign bit is too specific. I will try to look through the etcd and Kubernetes lease ID handling code.

@chaochn47
Member

how were such leases created?

We will report back to the community once we know how it happened.

Is this a bug in the client or application?

From my understanding, Kubernetes just uses the client v3 Grant API to cache the last used leaseID generated from the etcd server.

etcd/clientv3/lease.go

Lines 214 to 227 in a905430

func (l *lessor) Grant(ctx context.Context, ttl int64) (*LeaseGrantResponse, error) {
	r := &pb.LeaseGrantRequest{TTL: ttl}
	resp, err := l.remote.LeaseGrant(ctx, r, l.callOpts...)
	if err == nil {
		gresp := &LeaseGrantResponse{
			ResponseHeader: resp.GetHeader(),
			ID:             LeaseID(resp.ID),
			TTL:            resp.TTL,
			Error:          resp.Error,
		}
		return gresp, nil
	}
	return nil, toErr(ctx, err)
}

https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/lease_manager.go#L49-L66

// leaseManager is used to manage leases requested from etcd. If a new write
// needs a lease that has similar expiration time to the previous one, the old
// lease will be reused to reduce the overhead of etcd, since lease operations
// are expensive. In the implementation, we only store one previous lease,
// since all the events have the same ttl.
type leaseManager struct {
	client                  *clientv3.Client // etcd client used to grant leases
	leaseMu                 sync.Mutex
	prevLeaseID             clientv3.LeaseID
	prevLeaseExpirationTime time.Time
	// The period of time in seconds and percent of TTL that each lease is
	// reused. The minimum of them is used to avoid unreasonably large
	// numbers.
	leaseReuseDurationSeconds   int64
	leaseReuseDurationPercent   float64
	leaseMaxAttachedObjectCount int64
	leaseAttachedObjectCount    int64
}

@ahrtr
Member

ahrtr commented Feb 15, 2022

I just got time to take a look at this issue. I see some lease not found error messages in the previous comments. Note that I submitted PR 13690 recently, and the issue mentioned in that PR could be one of the reasons that lead to "lease not found".
