Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reproduce SIGKILL data inconsistency. #13838

Closed
wants to merge 1 commit into from

Conversation

serathius
Copy link
Member

@serathius serathius commented Mar 25, 2022

Reproduces #13766

Run

GO_BUILD_FLAGS='-v -mod=readonly' ./scripts/build.sh && GOARCH=amd64 PASSES='functional' ./scripts/test.sh

@codecov-commenter
Copy link

codecov-commenter commented Mar 25, 2022

Codecov Report

Merging #13838 (9d4260b) into main (4019c59) will decrease coverage by 0.17%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main   #13838      +/-   ##
==========================================
- Coverage   72.43%   72.25%   -0.18%     
==========================================
  Files         468      468              
  Lines       38222    38222              
==========================================
- Hits        27685    27618      -67     
- Misses       8749     8800      +51     
- Partials     1788     1804      +16     
Flag Coverage Δ
all 72.25% <ø> (-0.18%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
client/pkg/v3/fileutil/lock_linux.go 72.22% <0.00%> (-8.34%) ⬇️
client/pkg/v3/tlsutil/tlsutil.go 83.33% <0.00%> (-8.34%) ⬇️
server/etcdserver/api/v3rpc/watch.go 83.55% <0.00%> (-5.04%) ⬇️
server/storage/wal/file_pipeline.go 90.69% <0.00%> (-4.66%) ⬇️
api/etcdserverpb/raft_internal_stringer.go 76.78% <0.00%> (-3.58%) ⬇️
server/etcdserver/api/rafthttp/msgappv2_codec.go 67.82% <0.00%> (-3.48%) ⬇️
server/etcdserver/util.go 85.71% <0.00%> (-3.18%) ⬇️
server/etcdserver/cluster_util.go 70.35% <0.00%> (-3.17%) ⬇️
client/v3/experimental/recipes/key.go 75.34% <0.00%> (-2.74%) ⬇️
client/v3/retry_interceptor.go 63.63% <0.00%> (-2.73%) ⬇️
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4019c59...9d4260b. Read the comment docs.

@serathius
Copy link
Member Author

In my case one of the members will get corrupted and never start back up causing the health checks in functional tests to fail 60 times and fail the test. To confirm reproduction in one of the members logs will have message:

{"level":"error","ts":"2022-03-25T16:24:10.958+0100","caller":"embed/etcd.go:258","msg":"checkInitialHashKV failed","error":"f8870d0ead8d6a0d found data inconsistency with peers","stacktrace":"go.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/home/serathius/src/go.etcd.io/etcd/server/embed/etcd.go:258\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/home/serathius/src/go.etcd.io/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/home/serathius/src/go.etcd.io/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/home/serathius/src/go.etcd.io/etcd/server/etcdmain/main.go:40\nmain.main\n\t/home/serathius/src/go.etcd.io/etcd/server/main.go:32\nruntime.main\n\t/home/serathius/.gvm/gos/go1.17.6/src/runtime/proc.go:255"}
{"level":"info","ts":"2022-03-25T16:24:10.965+0100","caller":"embed/etcd.go:383","msg":"closing etcd server","name":"s3","data-dir":"/tmp/etcd-functional-3/etcd.data","advertise-peer-urls":["https://127.0.0.1:3381/"],"advertise-client-urls":["https://127.0.0.1:3379/"]}
{"level":"info","ts":"2022-03-25T16:24:10.965+0100","caller":"embed/etcd.go:385","msg":"closed etcd server","name":"s3","data-dir":"/tmp/etcd-functional-3/etcd.data","advertise-peer-urls":["https://127.0.0.1:3381/"],"advertise-client-urls":["https://127.0.0.1:3379/"]}
{"level":"fatal","ts":"2022-03-25T16:24:10.973+0100","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"f8870d0ead8d6a0d found data inconsistency with peers","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/home/serathius/src/go.etcd.io/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/home/serathius/src/go.etcd.io/etcd/server/etcdmain/main.go:40\nmain.main\n\t/home/serathius/src/go.etcd.io/etcd/server/main.go:32\nruntime.main\n\t/home/serathius/.gvm/gos/go1.17.6/src/runtime/proc.go:255"}

@serathius
Copy link
Member Author

serathius commented Mar 25, 2022

Important note, functional tests write both logs and data in /tmp/ . My workstation is weird and has ssd disk there instead of tmpfs. Not sure if it can be reproduced on tmpfs due to totally different performance characteristics writing to hdd/ssd.

I guess that changing the writing directory can be done by running sed on https://github.com/etcd-io/etcd/blob/main/tests/functional/functional.yaml to replace/tmp/etcd-functional-
Not tested

@deads2k
Copy link

deads2k commented Mar 25, 2022

I managed to reproduce on my system by doing find tests/ -type f | xargs sed -i 's|/tmp/|/home/deads/workspaces/etcd/src/etcd.io/test-temp/|g' to write to an SSD.

@chaochn47
Copy link
Member

chaochn47 commented Mar 25, 2022

Thanks @serathius, I think I've got the reproduction as well.

Put my hardware information here just in case someone else will find it useful.
AWS EC2 instance c5d.24xlarge while /tmp is mounted on a EBS volume.

[root@ip-172-31-45-226 etcd]# df /tmp
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/nvme0n1p1   8376300 3969444   4406856  48% /
{
  "level":"warn",
  "ts":"2022-03-25T17:46:12.592Z",
  "caller":"etcdserver/corrupt.go:75",
  "msg":"found different hash values from remote peer",
  "local-member-id":"edddfb2ea88ce1bd",
  "local-member-revision":233172,
  "local-member-compact-revision":-1,
  "local-member-hash":2564256760,
  "remote-peer-id":"0",
  "remote-peer-endpoints":["https://127.0.0.1:3381"],
  "remote-peer-revision":245403,
  "remote-peer-compact-revision":-1,
  "remote-peer-hash":2988221236
}

{
  "level":"error",
  "ts":"2022-03-25T17:46:12.593Z",
  "caller":"embed/etcd.go:258",
  "msg":"checkInitialHashKV failed",
  "error":"edddfb2ea88ce1bd found data inconsistency with peers",
  "stacktrace":"go.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/data/go.etcd.io/etcd/server/embed/etcd.go:258\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/data/go.etcd.io/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/data/go.etcd.io/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/data/go.etcd.io/etcd/server/etcdmain/main.go:40\nmain.main\n\t/data/go.etcd.io/etcd/server/main.go:32\nruntime.main\n\t/home/ec2-user/.gvm/gos/go1.17.6/src/runtime/proc.go:255"
}

etcd.log

@@ -58,8 +58,6 @@ var etcdFields = []string{
"LogOutputs",
"LogLevel",

"SocketReuseAddress",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity -> Why the options were problematic ?

Copy link
Member Author

@serathius serathius Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not supported in v3.4, so I removed it to have one commit to test all v3.4, v3.5, v3.6 releases.

@@ -174,6 +174,30 @@ func recover_SIGQUIT_ETCD_AND_REMOVE_DATA(clus *Cluster, idx1 int) error {
return err
}

func inject_SIGKILL(clus *Cluster, index int) error {
clus.lg.Info(
"disastrous machine failure START",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A like the drama of the comment... but maybe at least mention that technically it's SIGKILL

Copy link
Member Author

@serathius serathius Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's copied from similar SIGQUIT_ETCD_AND_REMOVE_DATA test, drama was not intentional, was just to lazy to change it :P

@serathius serathius mentioned this pull request Apr 6, 2022
28 tasks
@serathius serathius changed the title tests: Add a functional tests that sends sigkill under high stress Reproduce SIGKILL data inconsistency. Apr 11, 2022
@serathius
Copy link
Member Author

No plans on merging

@serathius serathius closed this May 6, 2022
@serathius serathius deleted the data-corruption branch June 15, 2023 20:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants