Data inconsistency in etcd version 3.5.0 (3.5.x rollback -> 3.4 upgrade -> 3.5) story #13514

Closed
moonovo opened this issue Dec 1, 2021 · 26 comments · Fixed by #13903
Labels: help wanted, priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.)

Comments

moonovo commented Dec 1, 2021

When I enter the following command:

ETCDCTL_API=3  etcdctl --endpoints https://192.168.2.2:4003 get /Canal/can/locks/health-host-192-168-2-2 -w json; ETCDCTL_API=3  etcdctl --endpoints https://192.168.2.3:4003 get /Canal/can/locks/health-host-192-168-2-2 -w json; ETCDCTL_API=3  etcdctl --endpoints https://192.168.2.4:4003 get /Canal/can/locks/health-host-192-168-2-2 -w json; 

I get the response:

{"header":{"cluster_id":5771341481381694818,"member_id":8096206227661897536,"revision":973575,"raft_term":16},"kvs":[{"key":"L0NhbmFsL2Nhbi9sb2Nrcy9oZWFsdGgtaG9zdC0xOTItMTY4LTItMg==","create_revision":2,"mod_revision":973561,"version":122303,"value":"eyJraW5kIjoiTG9jayIsImFwaVZlcnNpb24iOiJjYW4vdjFhbHBoYTEiLCJtZXRhZGF0YSI6eyJuYW1lIjoiaGVhbHRoLWhvc3QtMTkyLTE2OC0yLTIiLCJjcmVhdGlvblRpbWVzdGFtcCI6IjIwMjEtMTItMDFUMjE6NTg6NTdaIn0sImxvY2tpZCI6ImhlYWx0aC1ob3N0LTE5Mi0xNjgtMi0yIn0K"}],"count":1}
{"header":{"cluster_id":5771341481381694818,"member_id":14749687755696706107,"revision":973575,"raft_term":16},"kvs":[{"key":"L0NhbmFsL2Nhbi9sb2Nrcy9oZWFsdGgtaG9zdC0xOTItMTY4LTItMg==","create_revision":2,"mod_revision":973561,"version":122303,"value":"eyJraW5kIjoiTG9jayIsImFwaVZlcnNpb24iOiJjYW4vdjFhbHBoYTEiLCJtZXRhZGF0YSI6eyJuYW1lIjoiaGVhbHRoLWhvc3QtMTkyLTE2OC0yLTIiLCJjcmVhdGlvblRpbWVzdGFtcCI6IjIwMjEtMTItMDFUMjE6NTg6NTdaIn0sImxvY2tpZCI6ImhlYWx0aC1ob3N0LTE5Mi0xNjgtMi0yIn0K"}],"count":1}
{"header":{"cluster_id":5771341481381694818,"member_id":9436515569532730235,"revision":922759,"raft_term":16},"kvs":[{"key":"L0NhbmFsL2Nhbi9sb2Nrcy9oZWFsdGgtaG9zdC0xOTItMTY4LTItMg==","create_revision":2,"mod_revision":922630,"version":120692,"value":"eyJraW5kIjoiTG9jayIsImFwaVZlcnNpb24iOiJjYW4vdjFhbHBoYTEiLCJtZXRhZGF0YSI6eyJuYW1lIjoiaGVhbHRoLWhvc3QtMTkyLTE2OC0yLTIiLCJjcmVhdGlvblRpbWVzdGFtcCI6IjIwMjEtMTItMDFUMjE6NTQ6MzFaIn0sImxvY2tpZCI6ImhlYWx0aC1ob3N0LTE5Mi0xNjgtMi0yIn0K"}],"count":1}

Note the different revision, mod_revision, and version on the 192.168.2.4 node.

Reproduction procedure:

  1. Reboot the three nodes.
  2. When the 192.168.2.4 node starts up, I suspect it may have restored old etcd data into the etcd data directory.

If the second point is true, would that cause data inconsistency? Why can a node with broken data be added to the etcd cluster?

ptabor (Contributor) commented Dec 7, 2021

Interesting. What's the exact etcd version? [edit: see 3.5.0 in the title]

In general, the behavior you are describing might happen in --consistency="s" (serializable) mode, which is NOT the default for etcdctl get, but should NOT happen in --consistency="l" (linearizable) mode, which IS the default: there the reader first obtains the index of the 'head revision' from the leader and answers the request only once the replica has caught up to that revision.
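
For reference, a minimal clientv3 sketch of the two read modes (the endpoint and key are taken from the report above; this is an illustration added for clarity, not part of the original reproduction):

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://192.168.2.4:4003"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Default read: linearizable, confirmed against the leader's read index, so it should never lag.
	lin, err := cli.Get(ctx, "/Canal/can/locks/health-host-192-168-2-2")
	if err != nil {
		panic(err)
	}
	// Serializable read: served locally by the contacted member, so it may return stale data.
	ser, err := cli.Get(ctx, "/Canal/can/locks/health-host-192-168-2-2", clientv3.WithSerializable())
	if err != nil {
		panic(err)
	}
	fmt.Println("linearizable revision:", lin.Header.Revision, "serializable revision:", ser.Header.Revision)
}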

Last changes around the logic I'm aware of were in: #12762 (@wpedrak FYI)

Things to verify:

  • Is this a difference between 3.4.x and 3.5.0?
  • Is --consistency=linearizable correctly passed from etcdctl to the server?

ptabor added the priority/important-soon label Dec 7, 2021
ahrtr (Member) commented Dec 8, 2021

I am also wondering whether the 3 members belong to the same cluster at all. Could you execute the following commands to double-check?

etcdctl --endpoints https://192.168.2.4:4003 member list -w table
etcdctl --endpoints https://192.168.2.3:4003 member list -w table
etcdctl --endpoints https://192.168.2.2:4003 member list -w table

Please also confirm the etcd and etcdctl versions using the commands etcd --version and etcdctl version.

moonovo (Author) commented Dec 9, 2021

The fault in the environment has been fixed. My way of avoiding it: the client still uses clientv3 from etcd 3.4. After I downgraded the etcd server version to 3.4.15, the problem no longer occurs.
Therefore, I think clientv3 from etcd 3.4 may be incompatible with the etcd 3.5 server.

moonovo (Author) commented Dec 9, 2021

  1. WAL log replication is normal: when I put a key/value into etcd, I can find the WAL entries on all three nodes with etcd-dump-logs.
  2. I ran bbolt against the db file on all three nodes to read the consistent_index value from the meta bucket. The value remains unchanged (see the bbolt sketch after this list).
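
For anyone repeating that check, here is a rough go.etcd.io/bbolt sketch (the db path is a placeholder; run it against a copy of member/snap/db, since a running etcd holds a lock on the live file):

package main

import (
	"encoding/binary"
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Placeholder path: point this at a copy of <data-dir>/member/snap/db.
	db, err := bolt.Open("/path/to/member/snap/db", 0400, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		meta := tx.Bucket([]byte("meta"))
		if meta == nil {
			return fmt.Errorf("no meta bucket")
		}
		if ci := meta.Get([]byte("consistent_index")); ci != nil {
			fmt.Println("consistent_index:", binary.BigEndian.Uint64(ci))
		}
		// The term key only exists once the member has run etcd 3.5.x.
		if term := meta.Get([]byte("term")); term != nil {
			fmt.Println("term:", binary.BigEndian.Uint64(term))
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}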

moonovo (Author) commented Dec 9, 2021

I can't restart etcd, because when I do I get the panic "failed to recover v3 backend from snapshot"; the underlying error is "failed to find database snapshot file (snap: snapshot file doesn't exist)".

bigfudan commented:
Wow, I have hit the same problem. Have you fixed it?

serathius (Member) commented:
I'm unable to reproduce this with etcd v3.5.1 and etcdctl v3.4 and v3.5. I ran a 3-member cluster where each member restarts every 10 seconds, with a periodic revision get (every 1 second) and etcdctl check perf to generate load. I have never seen an inconsistent read.

Can you provide more detailed reproduction steps?
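
Not the exact harness used above, just a rough clientv3 sketch of that kind of periodic monotonic-revision check (the endpoints are placeholders):

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		// Placeholder endpoints for a local 3-member cluster.
		Endpoints: []string{"http://127.0.0.1:2379", "http://127.0.0.1:22379", "http://127.0.0.1:32379"},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	var last int64
	for range time.Tick(time.Second) {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		resp, err := cli.Get(ctx, "foo") // linearizable by default
		cancel()
		if err != nil {
			log.Println("get failed (a member may be restarting):", err)
			continue
		}
		if resp.Header.Revision < last {
			log.Fatalf("revision went backwards: %d < %d", resp.Header.Revision, last)
		}
		last = resp.Header.Revision
	}
}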

serathius self-assigned this Jan 10, 2022
serathius (Member) commented:
This is a blocker for v3.5.2 so I will focus on this. Help would be appreciated!

ptabor (Contributor) commented Jan 11, 2022

@moonovo

when the 192.168.2.4 node starts up, I suspect it may have restored old etcd data into the etcd data directory.

  1. What do you mean by 'replaced'? Is there a process that touches the files, or just an etcd put operation that hits the server around shutdown?

  2. Did this problem happen once, or are you reliably able to reproduce it?
    How frequently does it happen (per how many tries)? Can you please share a script to reproduce it?

  3. Does the problem disappear some time after the restart (i.e. it is a flake during restart and the lagging replica eventually catches up with the leader), or does the server end up persistently out of sync?

I can't restart etcd, because when I do I get the panic "failed to recover v3 backend from snapshot"; the underlying error is "failed to find database snapshot file (snap: snapshot file doesn't exist)".

  1. That's even more scary. Does it mean that every time you reproduce the problem, the lagging node / the non-lagging nodes / all nodes cannot be restarted? How do you restart the nodes (kill signal? Ctrl-C?)?

  2. Is it the same state you keep working on, or is it reproducible from an 'empty' initial state?

serathius (Member) commented Jan 11, 2022

I wrote a test that concurrently writes, reads, and injects failures (partitions or restarts) and ensures that reads observe a monotonic revision. I didn't see any inconsistencies on either v3.5.0 or the main branch. #13584

Without more details from @moonovo I will not be able to reproduce it.

serathius (Member) commented Jan 12, 2022

With a test specific to this problem on the way and no reproduction, I'm inclined to remove this as a blocker for the v3.5.2 release.

moonovo (Author) commented Jan 18, 2022

This issue can be closed. I found the reason why consistent_index was not updated, but I did not find the reason why the revision was inconsistent.
The reason consistent_index is not updated: a term field is added to the meta bucket by the upgrade from 3.4.15 to 3.5.0.
We then rolled back from 3.5.0 to 3.4.15 (via backup and restore, so the raft term starts again from 1), and the term field is not updated by 3.4.15.
After the cluster is upgraded to 3.5.0 again, consistent_index is not updated because the term stored in the db is greater than the raft term.
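
A tiny runnable illustration of that guard (the function name and the numbers are made up for illustration; the real check is the term comparison quoted later in this thread):

package main

import "fmt"

// Illustration only, not etcd's real function: 3.5 skips persisting the consistent
// index when the raft term of the entry is lower than the term already stored in the
// meta bucket.
func shouldPersistConsistentIndex(raftTerm, storedTerm uint64) bool {
	return raftTerm >= storedTerm
}

func main() {
	// After a rollback via backup/restore the raft term restarts near 1, while the meta
	// bucket still holds the term written by 3.5 before the rollback (values are made up).
	storedTerm := uint64(16)
	for _, raftTerm := range []uint64{1, 2, 3} {
		fmt.Printf("raft term %d, stored term %d -> persist consistent_index: %v\n",
			raftTerm, storedTerm, shouldPersistConsistentIndex(raftTerm, storedTerm))
	}
}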

moonovo closed this as completed Jan 18, 2022
ptabor reopened this Jan 18, 2022
ptabor (Contributor) commented Jan 18, 2022

Thank you @moonovo. That's a good story for rollback -> roll-forward safety, which @serathius was looking into. Starting from 3.6, with storage versioning, the rollback should be explicit and drop the 3.5 content.

I think in 3.5 the server should, during bootstrap, reconcile the term between the WAL logs and bbolt, considering bbolt to be the authoritative source of truth.

ptabor changed the title to "Data inconsistency in etcd version 3.5.0 (3.5.x rollback -> 3.4 upgrade -> 3.5) story" Mar 25, 2022
chaochn47 (Member) commented Mar 28, 2022

Thanks @ptabor for calling this out in the "3.5 inconsistency issue" email thread!

This issue looks worrying: it makes a 3.5.x rollback a risky operation. I've got a reproduction (manual testing), reproduction.log, after following the description provided by @moonovo.

etcd_endpoints=http://127.0.0.1:2379,http://127.0.0.1:22379,http://127.0.0.1:32379
 
echo $etcd_endpoints | tr ',' '\n' | xargs -I '{}' etcdctl --endpoints '{}' get foo -w json
{"header":{"cluster_id":9108655428580501207,"member_id":16513504468151473617,"revision":5,"raft_term":3},"kvs":[{"key":"Zm9v","create_revision":2,"mod_revision":5,"version":4,"value":"ZXJyb3I="}],"count":1}
{"header":{"cluster_id":9108655428580501207,"member_id":17404356270211841594,"revision":5,"raft_term":3},"kvs":[{"key":"Zm9v","create_revision":2,"mod_revision":5,"version":4,"value":"ZXJyb3I="}],"count":1}
{"header":{"cluster_id":9108655428580501207,"member_id":463034352321715000,"revision":9,"raft_term":3},"kvs":[{"key":"Zm9v","create_revision":2,"mod_revision":9,"version":8,"value":"ZXJyb3I="}],"count":1}

To answer your question, @moonovo:

but I did not find the reason why the revision was inconsistent

The node with the lagging-behind consistent_index will re-apply some raft entries when it restarts (the consistent_index is reloaded from disk, and the following code then unnecessarily allows old raft entries to be applied again):

func (s *EtcdServer) applyEntryNormal(e *raftpb.Entry) {
	shouldApplyV3 := membership.ApplyV2storeOnly
	index := s.consistIndex.ConsistentIndex()
	if e.Index > index {
		// set the consistent index of current executing entry
		s.consistIndex.SetConsistentIndex(e.Index, e.Term)
		shouldApplyV3 = membership.ApplyBoth
	}

Also, worryingly, the initial-corruption-check feature was broken in this scenario, as the logs below show. /cc @serathius

{"level":"warn","ts":"2022-03-28T06:05:40.650Z","caller":"etcdserver/corrupt.go:289","msg":"failed hash kv request","local-member-id":"66d074a48f05b38","requested-revision":7,"remote-peer-endpoint":"http://127.0.0.1:12382","error":"Get \"http://127.0.0.1:12382/members/hashkv\": EOF"}
{"level":"warn","ts":"2022-03-28T06:05:40.650Z","caller":"etcdserver/corrupt.go:289","msg":"failed hash kv request","local-member-id":"66d074a48f05b38","requested-revision":7,"remote-peer-endpoint":"http://127.0.0.1:22382","error":"Get \"http://127.0.0.1:22382/members/hashkv\": EOF"}
{"level":"info","ts":"2022-03-28T06:05:40.650Z","caller":"etcdserver/corrupt.go:116","msg":"initial corruption checking passed; no corruption","local-member-id":"66d074a48f05b38"}

I will work on the fix (and test) soon.

michaljasionowski (Contributor) commented:
Thanks @chaochn47, are you able to repeat this result? I've been trying to reproduce it by following your reproduction log, with no success. At the end I get the same revisions and versions across all members.

Here is my failed_repro.log.

chaochn47 (Member) commented Mar 29, 2022

Hi @michaljasionowski, thanks for the input!

are you able to repeat this result?

Yes, it is repeatable. I will convert it into code, which will make it easy for someone else to reproduce.

I've been trying to reproduce it with your reproduction log with no success

I think I missed pasting something in the repro execution logs.

After this step,

## put the data again
~/etcd-binaries/v3.5.2/etcdctl --endpoints http://127.0.0.1:32379 put foo error

infra10 won't be able to persist consistent_index 12 to disk, because its raft term (3) is lower than the term stored in the db (10).

You need to kill infra10 and restart it, which will reload its lagging-behind consistent_index from disk, reload the raft storage from disk into memory, replay the log, and apply entries to the backend if necessary. As a result, infra10 ends up applying some raft entries a second time, compared with when it originally received the mutation requests.

Also, please run git checkout main && git pull --rebase before running:

gvm use go1.17.8
PASSES="build" ./scripts/test.sh -v 
cp ./bin/tools/etcd-dump-db ~/etcd-binaries/etcd-dump-db

This will help visualize the consistent_index in a human-readable format instead of raw encoded bytes.

Let me know if it still does not work for you, thanks.

ahrtr (Member) commented Mar 29, 2022

Please refer to the discussion in pull/13844 and issues/13766.

michaljasionowski (Contributor) commented:
Thanks @chaochn47, that was very helpful. I managed to reproduce it as well:

{"header":{"cluster_id":9108655428580501207,"member_id":16513504468151473617,"revision":4,"raft_term":5},"kvs":[{"key":"Zm9v","create_revision":2,"mod_revision":4,"version":3,"value":"ZXJyb3I="}],"count":1}
{"header":{"cluster_id":9108655428580501207,"member_id":17404356270211841594,"revision":4,"raft_term":5},"kvs":[{"key":"Zm9v","create_revision":2,"mod_revision":4,"version":3,"value":"ZXJyb3I="}],"count":1}
{"header":{"cluster_id":9108655428580501207,"member_id":17604630397935920175,"revision":5,"raft_term":5},"kvs":[{"key":"Zm9v","create_revision":2,"mod_revision":5,"version":4,"value":"ZXJyb3I="}],"count":1}

and the initial corruption check also passed:

{"level":"info","ts":"2022-03-30T09:54:41.238Z","caller":"etcdserver/corrupt.go:116","msg":"initial corruption checking passed; no corruption","local-member-id":"f4503612d703a42f"}

michaljasionowski (Contributor) commented Apr 1, 2022

This is the script I used for reproduction: upgradedowngrade.log. It sets 3 variables at the beginning that have to be updated.

serathius mentioned this issue Apr 6, 2022
serathius (Member) commented Apr 6, 2022

I have been looking into this issue and found strange behavior: it looks like etcd v3.5.2 re-applies entries upon restart. Each restart results in etcd re-applying the last entry. I have simplified the script provided by @michaljasionowski:
script.txt (yes, it has the extension .txt because for some reason GitHub doesn't support uploading .sh files)

If we look at the data dir:

  • The WAL is identical between all instances:
./bin/tools/etcd-dump-logs step5
Snapshot:
term=1 index=3 nodes=[a23bb980081aa72d e52bc0d24cfe95d1 f188b1d0c651be3a] confstate={"voters":[11690141217313171245,16513504468151473617,17404356270211841594],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=a23bb980081aa72d clusterID=7e6871da41f966d7 term=2 commitIndex=12 vote=e52bc0d24cfe95d1
WAL entries:
lastIndex=12
term	     index	type	data
   2	         4	norm	
   2	         5	norm	method=PUT path="/0/members/e52bc0d24cfe95d1/attributes" val="{\"name\":\"infra1\",\"clientURLs\":[\"http://127.0.0.1:2379\"]}"
   2	         6	norm	method=PUT path="/0/members/a23bb980081aa72d/attributes" val="{\"name\":\"infra3\",\"clientURLs\":[\"http://127.0.0.1:32379\"]}"
   2	         7	norm	method=PUT path="/0/members/f188b1d0c651be3a/attributes" val="{\"name\":\"infra2\",\"clientURLs\":[\"http://127.0.0.1:22379\"]}"
   2	         8	norm	method=PUT path="/0/version" val="3.4.0"
   2	         9	norm	header:<ID:12046425309767173124 > put:<key:"foo" value:"bar1" > 
   2	        10	norm	method=PUT path="/0/members/a23bb980081aa72d/attributes" val="{\"name\":\"infra3\",\"clientURLs\":[\"http://127.0.0.1:32379\"]}"
   2	        11	norm	header:<ID:12046425309769246980 > put:<key:"foo" value:"bar2" > 
   2	        12	norm	method=PUT path="/0/members/a23bb980081aa72d/attributes" val="{\"name\":\"infra3\",\"clientURLs\":[\"http://127.0.0.1:32379\"]}"

Entry types (Normal,ConfigChange) count is : 9
  • The corrupted etcd v3.5 member has a duplicated apply of the last put:
./bin/tools/etcd-dump-db iterate-bucket step5 key --decode
rev={main:4 sub:0}, value=[key "foo" | val "bar2" | created 2 | mod 4 | ver 3]
rev={main:3 sub:0}, value=[key "foo" | val "bar2" | created 2 | mod 3 | ver 2]
rev={main:2 sub:0}, value=[key "foo" | val "bar1" | created 2 | mod 2 | ver 1]

My current guess is that this is caused by an outdated term value resulting from the downgrade. cc @ptabor

serathius (Member) commented:
Looks like the corruption happens before the restart; restarting just causes etcd to re-apply the entries. consistent_index in the db equals 9, which points to a WAL entry before the put of foo=bar2, yet it looks like that apply was already executed: foo=bar2 is already present in the key bucket.

etcd $ ./bin/tools/etcd-dump-logs step4
Snapshot:
term=1 index=3 nodes=[a23bb980081aa72d e52bc0d24cfe95d1 f188b1d0c651be3a] confstate={"voters":[11690141217313171245,16513504468151473617,17404356270211841594],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=a23bb980081aa72d clusterID=7e6871da41f966d7 term=2 commitIndex=10 vote=e52bc0d24cfe95d1
WAL entries:
lastIndex=11
term	     index	type	data
   2	         4	norm	
   2	         5	norm	method=PUT path="/0/members/e52bc0d24cfe95d1/attributes" val="{\"name\":\"infra1\",\"clientURLs\":[\"http://127.0.0.1:2379\"]}"
   2	         6	norm	method=PUT path="/0/members/a23bb980081aa72d/attributes" val="{\"name\":\"infra3\",\"clientURLs\":[\"http://127.0.0.1:32379\"]}"
   2	         7	norm	method=PUT path="/0/members/f188b1d0c651be3a/attributes" val="{\"name\":\"infra2\",\"clientURLs\":[\"http://127.0.0.1:22379\"]}"
   2	         8	norm	method=PUT path="/0/version" val="3.4.0"
   2	         9	norm	header:<ID:12046425313434587396 > put:<key:"foo" value:"bar1" > 
   2	        10	norm	method=PUT path="/0/members/a23bb980081aa72d/attributes" val="{\"name\":\"infra3\",\"clientURLs\":[\"http://127.0.0.1:32379\"]}"
   2	        11	norm	header:<ID:12046425313437170180 > put:<key:"foo" value:"bar2" > 

Entry types (Normal,ConfigChange) count is : 8
etcd $ ./bin/tools/etcd-dump-db iterate-bucket step4 key --decode
rev={main:3 sub:0}, value=[key "foo" | val "bar2" | created 2 | mod 3 | ver 2]
rev={main:2 sub:0}, value=[key "foo" | val "bar1" | created 2 | mod 2 | ver 1]
etcd $ ./bin/tools/etcd-dump-db iterate-bucket step4 meta --decode
key="term", value=3
key="consistent_index", value=9
key="confState", value="{\"voters\":[15985099378392235387],\"auto_leave\":false}"

serathius (Member) commented:
cc @ahrtr

serathius (Member) commented Apr 7, 2022

The cause of the corruption is:

oldi, oldTerm := UnsafeReadConsistentIndex(tx)
if term < oldTerm {
	return
}

The outdated term field (caused by the downgrade) results in etcd applying the record without updating the CI. The case is artificial, because downgrades are not officially supported, but it surfaced two issues that are not specific to downgrades. I'm still thinking about what exactly the problem is that should be fixed:

  • etcd v3.5 trusting the term stored in the DB rather than the one in the WAL. This doesn't seem safe in v3.5; possibly it should be delayed to v3.6?
  • etcd applying the entry without advancing the CI.

Possibly both cases would be worth fixing. cc @spzala @ahrtr @ptabor for opinions.
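
To make the second bullet concrete, here is a toy, runnable model of the difference (none of these names exist in etcd, and the 'fix' variant is only one possible direction, not the actual patch):

package main

import "fmt"

// Toy model of the meta bucket state; the field names are illustrative, not etcd's.
type meta struct {
	consistentIndex uint64
	term            uint64
}

// buggySave mirrors the guard quoted above: it bails out entirely when the stored term
// looks newer, so the entry gets applied to the backend but the consistent index never advances.
func buggySave(m *meta, index, term uint64) {
	if term < m.term {
		return
	}
	m.consistentIndex, m.term = index, term
}

// sketchedFix is one possible direction (not the actual etcd patch): always record the
// index of what was applied, and only leave the term alone when it would move backwards.
func sketchedFix(m *meta, index, term uint64) {
	m.consistentIndex = index
	if term > m.term {
		m.term = term
	}
}

func main() {
	a := &meta{consistentIndex: 9, term: 10}
	b := &meta{consistentIndex: 9, term: 10}
	buggySave(a, 11, 3) // raft term 3 after the rollback
	sketchedFix(b, 11, 3)
	fmt.Printf("buggy:  %+v\n", *a) // index stays at 9 -> the entry is re-applied after restart
	fmt.Printf("sketch: %+v\n", *b) // index advances to 11
}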

ahrtr (Member) commented Apr 7, 2022

I will do a deep dive into this one in the next couple of days. I have been busy with 13854 over the past week.

serathius (Member) commented Apr 7, 2022

Looks like both issues were introduced in #12855 (comment) when we merged the migrator's consistent-index update logic with the WAL entry apply logic.

ptabor (Contributor) commented Apr 7, 2022

Intuitively, applyEntries should never get executed if we are in the wrong term.
Applies should happen only up to the last 'HardState' found in the WAL log, and that should determine whether an entry is subject to apply or not.
So I think this is harmful defense in depth, but I will need a deeper dive into the original pre-refactoring logic to confirm.
