Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MemberDowngrade failpoint #19038

Merged
merged 1 commit into from
Dec 31, 2024
Merged

Conversation

siyuanfoundation
Copy link
Contributor

@k8s-ci-robot
Copy link

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@siyuanfoundation
Copy link
Contributor Author

siyuanfoundation commented Dec 10, 2024

I am getting the following error sometimes. @serathius do you know what this error usually comes from?

Error:      	Received unexpected error:
        	            	failed to read WAL, cannot be repaired, err: wal: slice bounds out of range, snapshot[Index: 0, Term: 0], current entry[Index: 8193, Term: 4], len(ents): 7458

Copy link

codecov bot commented Dec 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.81%. Comparing base (e0bbea9) to head (747ef5f).
Report is 39 commits behind head on main.

Additional details and impacted files

see 31 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19038      +/-   ##
==========================================
+ Coverage   68.72%   68.81%   +0.08%     
==========================================
  Files         420      420              
  Lines       35627    35640      +13     
==========================================
+ Hits        24483    24524      +41     
+ Misses       9713     9692      -21     
+ Partials     1431     1424       -7     

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e0bbea9...747ef5f. Read the comment docs.

return nil, err
}
targetVersion := semver.Version{Major: v.Major, Minor: v.Minor - 1}
numberOfMembersToDowngrade := rand.Int()%len(clus.Procs) + 1
Copy link
Member

@serathius serathius Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not downgrade all members? Trying to think if there are any benefits on testing partial downgrade. Technically all partial upgrades are just subset of procedure of full downgrade.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the benefit is to verify that the correctness should never be broken no matter how many members are downgraded.

I am thinking that we should explicitly verify the two cases: full downgrade and partial downgrade.

@ahrtr
Copy link
Member

ahrtr commented Dec 16, 2024

I am getting the following error sometimes. @serathius do you know what this error usually comes from?

Error:      	Received unexpected error:
        	            	failed to read WAL, cannot be repaired, err: wal: slice bounds out of range

Most likely the WAL record is somehow corrupted again. Added more debug info in #19067.

@ahrtr
Copy link
Member

ahrtr commented Dec 16, 2024

Overall looks good to me, please mark this PR as ready to review when you feel comfortable.

@siyuanfoundation
Copy link
Contributor Author

siyuanfoundation commented Dec 16, 2024

I am getting the following error sometimes. @serathius do you know what this error usually comes from?

Error:      	Received unexpected error:
        	            	failed to read WAL, cannot be repaired, err: wal: slice bounds out of range

Most likely the WAL record is somehow corrupted again. Added more debug info in #19067.

Updated the error message.
I get this error almost always when I run the robustness test for 20 times.
Does that mean the downgrade would likely corrupt WAL?

I tried member.Stop() instead of member.Kill() and did not change the results.

@serathius @ahrtr

@ahrtr
Copy link
Member

ahrtr commented Dec 16, 2024

Updated the error message. I get this error almost always when I run the robustness test for 20 times. Does that mean the downgrade would likely corrupt WAL?

I tried member.Stop() instead of member.Kill() and did not change the results.

Did not get time to dig into the robustness test's use case of reading WAL file, but my immediate feeling is the reason might be due to an inappropriate snapshot {Index: 0, Term: 0} is always set for the WAL.

w, err := wal.OpenForRead(lg, walDir, walpb.Snapshot{Index: 0})

If the v2 snapshot files have already been rotated, in other words, the very first snapshot files have been purged, then it means there are data loss (you are not reading WAL records right following the v2 snapshot Index) from v2store perspective (although there isn't data loss from v3store perspective), then you will definitely see this error.

offset := e.Index - w.start.Index - 1
if offset > uint64(len(ents)) {
// return error before append call causes runtime panic
return nil, state, nil, fmt.Errorf("%w, snapshot[Index: %d, Term: %d], current entry[Index: %d, Term: %d], len(ents): %d",
ErrSliceOutOfRange, w.start.Index, w.start.Term, e.Index, e.Term, len(ents))
}

A thought to double check this... set a big value for both --max-snapshots and --max-wals to ensure no snapshot files and WAL files rotation happens.

@siyuanfoundation
Copy link
Contributor Author

@ahrtr Thanks for the suggestion.
I tried "--max-wals=1000", "--max-snapshots=1000", still seeing the same failure. So it is probably not due to rotation?

@ahrtr
Copy link
Member

ahrtr commented Dec 18, 2024

@ahrtr Thanks for the suggestion. I tried "--max-wals=1000", "--max-snapshots=1000", still seeing the same failure. So it is probably not due to rotation?

The WAL files should haven't rotated, otherwise it will fail to find the WAL files which match the snap index.

nameIndex, ok := searchIndex(lg, names, snap.Index)
if !ok {
return nil, -1, fmt.Errorf("wal: file not found which matches the snapshot index '%d'", snap.Index)
}

Please use the etcd-dump-logs to analyse the WAL files, to double check whether the WAL records are incremental by 1. Run command below,

./etcd-dump-logs --start-index=0 path/2/data-dir

Also I tried to test it locally, but couldn't reproduce it. Please provide detailed step, I may try it when I get bandwidth.

go test -run TestRobustnessExploratory -v --count 100 --failfast --timeout 1h

@ahrtr
Copy link
Member

ahrtr commented Dec 18, 2024

Please use the etcd-dump-logs to analyse the WAL files, to double check whether the WAL records are incremental by 1. Run command below,

./etcd-dump-logs --start-index=0 path/2/data-dir

Refer to
#19082

@siyuanfoundation
Copy link
Contributor Author

siyuanfoundation commented Dec 18, 2024

I added my local patch to force test this failpoint only.
To reproduce the failure:

# download 3.5 release with 
make test-e2e-release

make gofail-enable && make build
cd tests/robustness
go test -run TestRobustnessExploratory -v --count 100 --failfast --timeout 1h

@siyuanfoundation
Copy link
Contributor Author

The error msg is:

wal: slice bounds out of range, snapshot[Index: 0, Term: 0], current entry[Index: 12487, Term: 2], len(ents): 11243

and the log dump is

# member-2
Snapshot:
term=2 index=12486 nodes=[ac4ec652f10e5b49 bf19ae4419db00dc eabdbb777cf498cb] confstate={"voters":[12416079282240904009,13770228943176794332,16914881897345358027],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=bf19ae4419db00dc clusterID=b3bc0c1919fe5d7e term=3 commitIndex=36822 vote=eabdbb777cf498cb
WAL entries: 24337
lastIndex=36823
term         index      type    data
   2         12487      norm    header:<ID:11010056292433083151 > txn:<compare:<target:MOD key:"/registry/pods/default/hyG1G" mod_revision:8853 > success:<request_put:<key:"/registry/pods/default/hyG1G" value:"11523" > > failure:<request_range:<key:"/registry/pods/default/hyG1G" > > > 
   2         12488      norm    header:<ID:6577951309147099754 > txn:<compare:<target:MOD key:"/registry/pods/default/VIap8" mod_revision:8868 > success:<request_put:<key:"/registry/pods/default/VIap8" value:"11525" > > failure:<request_range:<key:"/registry/pods/default/VIap8" > > > 

# member-0
Snapshot:
term=2 index=8003 nodes=[ac4ec652f10e5b49 bf19ae4419db00dc eabdbb777cf498cb] confstate={"voters":[12416079282240904009,13770228943176794332,16914881897345358027],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=eabdbb777cf498cb clusterID=b3bc0c1919fe5d7e term=3 commitIndex=36822 vote=eabdbb777cf498cb
WAL entries: 28820
lastIndex=36823
term         index      type    data
   2          8004      norm    header:<ID:11010056292433081493 > txn:<compare:<target:MOD key:"/registry/pods/default/bRqG1" mod_revision:5736 > success:<request_put:<key:"/registry/pods/default/bRqG1" value:"7367" > > failure:<request_range:<key:"/registry/pods/default/bRqG1" > > > 
   2          8005      norm    header:<ID:11010056292433081494 > txn:<compare:<target:MOD key:"/registry/pods/default/3ApAH" mod_revision:5698 > success:<request_delete_range:<key:"/registry/pods/default/3ApAH" > > failure:<request_range:<key:"/registry/pods/default/3ApAH" > > > 

... # all the entries in between
2         12484      norm    header:<ID:11010056292433083149 > txn:<compare:<target:MOD key:"/registry/pods/default/7i4E8" mod_revision:8852 > success:<request_put:
<key:"/registry/pods/default/7i4E8" value:"11521" > > failure:<request_range:<key:"/registry/pods/default/7i4E8" > > > 
   2         12485      norm    header:<ID:11010056292433083150 > txn:<compare:<target:MOD key:"/registry/pods/default/UonNv" mod_revision:8829 > success:<request_put:
<key:"/registry/pods/default/UonNv" value:"11522" > > failure:<request_range:<key:"/registry/pods/default/UonNv" > > > 
   2         12486      norm    header:<ID:6577951309147099753 > txn:<compare:<target:MOD key:"/registry/pods/default/iDLqp" mod_revision:8839 > success:<request_delet
e_range:<key:"/registry/pods/default/iDLqp" > > failure:<request_range:<key:"/registry/pods/default/iDLqp" > > > 
   2         12487      norm    header:<ID:11010056292433083151 > txn:<compare:<target:MOD key:"/registry/pods/default/hyG1G" mod_revision:8853 > success:<request_put:<key:"/registry/pods/default/hyG1G" value:"11523" > > failure:<request_range:<key:"/registry/pods/default/hyG1G" > > > 
   2         12488      norm    header:<ID:6577951309147099754 > txn:<compare:<target:MOD key:"/registry/pods/default/VIap8" mod_revision:8868 > success:<request_put:<key:"/registry/pods/default/VIap8" value:"11525" > > failure:<request_range:<key:"/registry/pods/default/VIap8" > > > 
   2         12489      norm    header:<ID:6577951309147099755 > txn:<compare:<target:MOD key:"/registry/pods/default/ZYXYI" mod_revision:8865 > success:<request_put:<key:"/registry/pods/default/ZYXYI" value:"11526" > > failure:<request_range:<key:"/registry/pods/default/ZYXYI" > > > 
   2         12490      norm    method=PUT path="/0/members/bf19ae4419db00dc/attributes" val="{\"name\":\"TestRobustnessExploratoryKubernetesHighTrafficClusterOfSize3-test-2\",\"clientURLs\":[\"http://localhost:20010\"]}"

@ahrtr
Copy link
Member

ahrtr commented Dec 19, 2024

I reproduced this locally.

failed to read WAL, cannot be repaired, err: wal: slice bounds out of range, snapshot[Index: 0, Term: 0], current entry[Index: 11282, Term: 2], len(ents): 10522

Initial analysis

The direct reason is that some WAL entries were somehow missing right before the snapshot. The snapshot index in this case is 11281. The WAL records during 10523 ~ 11281 were missing. Confirmed that the WAL entries in this range were not missing in other two nodes.

   2	     10520	norm	header:<ID:6577951317820541023 > txn:<compare:<target:MOD key:"/registry/pods/default/w9Kjw" mod_revision:6857 > success:<request_put:<key:"/registry/pods/default/w9Kjw" value:"9686" > > failure:<request_range:<key:"/registry/pods/default/w9Kjw" > > > 
   2	     10521	norm	header:<ID:62087081945565313 > txn:<compare:<target:MOD key:"/registry/pods/default/pGSSX" mod_revision:6868 > success:<request_put:<key:"/registry/pods/default/pGSSX" value:"9683" > > failure:<request_range:<key:"/registry/pods/default/pGSSX" > > > 
   2	     10522	norm	header:<ID:62087081945565314 > txn:<compare:<target:MOD key:"/registry/pods/default/gP0lD" mod_revision:6867 > success:<request_put:<key:"/registry/pods/default/gP0lD" value:"9684" > > failure:<request_range:<key:"/registry/pods/default/gP0lD" > > > 
   0	         0	norm	
   0	         0	norm	
   0	         0	norm
   ......
   0	         0	norm	
   2	     11282	norm	header:<ID:11010056301106530314 > txn:<compare:<target:MOD key:"/registry/pods/default/rr6ak" mod_revision:7420 > success:<request_put:<key:"/registry/pods/default/rr6ak" value:"10401" > > failure:<request_range:<key:"/registry/pods/default/rr6ak" > > > 
   2	     11283	norm	method=PUT path="/0/members/ac4ec652f10e5b49/attributes" val="{\"name\":\"TestRobustnessExploratoryKubernetesHighTrafficClusterOfSize3-test-1\",\"clientURLs\":[\"http://localhost:20005\"]}"
   2	     11284	norm	header:<ID:62087081945565542 > txn:<compare:<target:MOD key:"/registry/pods/default/zloLS" mod_revision:7438 > success:<request_delete_range:<key:"/registry/pods/default/zloLS" > > failure:<request_range:<key:"/registry/pods/default/zloLS" > > > 

The good news is that the data across the three members are still consistent,

   ubuntu@vm2:~/go/src/github.com/ahrtr/etcd$ ./bin/etcdutl hashkv /tmp/TestRobustnessExploratory_KubernetesHighTraffic_ClusterOfSize3_MemberDowngrade/1734617398669080611/server-TestRobustnessExploratoryKubernetesHighTrafficClusterOfSize3-test-0/member/snap/db -w table
+------------+---------------+------------------+
|    HASH    | HASH REVISION | COMPACT REVISION |
+------------+---------------+------------------+
| 3080225026 |         17303 |            17041 |
+------------+---------------+------------------+
ubuntu@vm2:~/go/src/github.com/ahrtr/etcd$ 
ubuntu@vm2:~/go/src/github.com/ahrtr/etcd$ ./bin/etcdutl hashkv /tmp/TestRobustnessExploratory_KubernetesHighTraffic_ClusterOfSize3_MemberDowngrade/1734617398669080611/server-TestRobustnessExploratoryKubernetesHighTrafficClusterOfSize3-test-1/member/snap/db -w table
+------------+---------------+------------------+
|    HASH    | HASH REVISION | COMPACT REVISION |
+------------+---------------+------------------+
| 3080225026 |         17303 |            17041 |
+------------+---------------+------------------+
ubuntu@vm2:~/go/src/github.com/ahrtr/etcd$ 
ubuntu@vm2:~/go/src/github.com/ahrtr/etcd$ 
ubuntu@vm2:~/go/src/github.com/ahrtr/etcd$ ./bin/etcdutl hashkv /tmp/TestRobustnessExploratory_KubernetesHighTraffic_ClusterOfSize3_MemberDowngrade/1734617398669080611/server-TestRobustnessExploratoryKubernetesHighTrafficClusterOfSize3-test-2/member/snap/db -w table
+------------+---------------+------------------+
|    HASH    | HASH REVISION | COMPACT REVISION |
+------------+---------------+------------------+
| 3080225026 |         17303 |            17041 |
+------------+---------------+------------------+

Next step of the investigation

  • Add verification check in the raft loop to ensure any raft Entry will never have 0 term or Index.

for {
select {
case <-r.ticker.C:
r.tick()
case rd := <-r.Ready():

  • Turn off lazyfs
    Is the lazyfs enabled? If yes, then the data loss might be caused by lazyfs, but did not get time to verify it.

cc @siyuanfoundation @fuweid @serathius

Proposal on how to replay WAL in robustness test

  • You need to ensure the WAL files aren't rotated if you want to read the whole history, i.e. setting a big value for --max-wals;
    • Based on my test, the default value (5 files, and each 64MB) is good enough.
  • Currently robustness test just reads all WAL records based on snap {Index:0, term:0}. Suggest to read WAL files based on snapshots. For example, assuming you have 2 snapshot files, so read WAL records based on snap 1 firstly, then continue to read WAL records based on snap 2 (assuming snap2.Index > snap1.Index).
    • So you also need to ensure the snapshot files aren't rotated.

@ahrtr
Copy link
Member

ahrtr commented Dec 20, 2024

Confirmed that It has nothing to do with downgrade. Even just stopping & starting multiple members one by one can reproduce this "issue". I think the reason should be the WAL records do not get flushed to disk when it is being stopped in the high traffic scenario.

Note just stopping & starting one member can't reproduce it, because it allows 1 failure for a 3 member cluster. See below,

allowedFailures := len(dataDirs) / 2

Note I do not see any issue from end users perspective. When some WAL entries are lost due to not being flushed to disk, when the member gets started again, it will automatically re-sync the data from the leader, or receives a snapshot (confirmed).

/home/ubuntu/go/src/github.com/siyuanfoundation/etcd/bin/etcd-last-release (TestRobustnessExploratoryKubernetesHighTrafficClusterOfSize3-test-1) (148010): {"level":"info","ts":"2024-12-20T13:30:42.055831Z","caller":"membership/downgrade.go:50","msg":"cluster is downgrading to target version","target-cluster-version":"3.5.0","determined-cluster-version":"3.5","current-server-version":"3.5.17"}
......
/home/ubuntu/go/src/github.com/siyuanfoundation/etcd/bin/etcd-last-release (TestRobustnessExploratoryKubernetesHighTrafficClusterOfSize3-test-1) (148010): {"level":"info","ts":"2024-12-20T13:30:42.121622Z","caller":"rafthttp/http.go:257","msg":"receiving database snapshot","local-member-id":"ac4ec652f10e5b49","remote-snapshot-sender-id":"bf19ae4419db00dc","incoming-snapshot-index":16317,"incoming-snapshot-message-size-bytes":9416,"incoming-snapshot-message-size":"9.4 kB"}

On top of my previous comment,

  • You need to ensure the WAL files are not rotated if you always set snapshot {Term: 0, Index: 0}. Usually it isn't a problem because the default values is good enough.
  • Suggest that we just skip reading WAL entries if there are some missing records. It's also OK to use the continuous records before the gap. For example, assuming we have WAL records, 1, 2, 3, 6, 7, then we can only use WAL records 1,2,3.

@siyuanfoundation
Copy link
Contributor Author

Thanks @ahrtr for the investigation!
This is not due to lazyfs, because lazyfs is not enabled in cluster size 3.

@serathius do we need all the records for robustness test? Or the continuous records before the gap is enough?

@siyuanfoundation siyuanfoundation force-pushed the downgrade-robust-2 branch 2 times, most recently from 97a88f1 to 09d08ae Compare December 20, 2024 20:56
@siyuanfoundation siyuanfoundation marked this pull request as ready for review December 20, 2024 21:25
@siyuanfoundation siyuanfoundation force-pushed the downgrade-robust-2 branch 2 times, most recently from c3604e9 to 47540ce Compare December 20, 2024 21:52
@serathius
Copy link
Member

@serathius do we need all the records for robustness test? Or the continuous records before the gap is enough?

Yes, currently robustness tests assume all records are persisted. Sorry, I assumed that the error came from etcd, not from robustness tests.

return nil, err
}
member.Config().ExecPath = e2e.BinPath.EtcdLastRelease
err = patchArgs(member.Config().Args, "initial-cluster-state", "existing")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't necessary to patch the flag --initial-cluster-state, because the data-dir and wal-dir have already been created & initialised. It won't do any harm to do it, but suggest to remove it for simplicity.

Signed-off-by: Siyuan Zhang <sizhang@google.com>
Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @siyuanfoundation for the nice work!

I'd like to get #19095 merged firstly to avoid the failed to read WAL flaky.

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, siyuanfoundation

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ahrtr ahrtr merged commit f307354 into etcd-io:main Dec 31, 2024
35 checks passed
@ahrtr
Copy link
Member

ahrtr commented Dec 31, 2024

@siyuanfoundation Please continue to work on #17118 when you get bandwidth, thx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants