Reproduce SIGKILL data inconsistency. #13838
Codecov Report
@@ Coverage Diff @@
## main #13838 +/- ##
==========================================
- Coverage 72.43% 72.25% -0.18%
==========================================
Files 468 468
Lines 38222 38222
==========================================
- Hits 27685 27618 -67
- Misses 8749 8800 +51
- Partials 1788 1804 +16
In my case, one of the members gets corrupted and never starts back up, which causes the health checks in the functional tests to fail 60 times and fail the test. To confirm the reproduction, the logs of one of the members will contain the message:
|
Important note: the functional tests write both logs and data to /tmp/. My workstation is unusual in that /tmp/ sits on an SSD instead of tmpfs. I'm not sure whether this can be reproduced on tmpfs, since its performance characteristics are totally different from writing to an HDD/SSD. I guess the write directory can be changed by running sed on https://github.com/etcd-io/etcd/blob/main/tests/functional/functional.yaml to replace |
I managed to reproduce it on my system by doing |
Thanks @serathius, I think I've got the reproduction as well. I'm putting my hardware information here in case someone else finds it useful.
|
@@ -58,8 +58,6 @@ var etcdFields = []string{
	"LogOutputs",
	"LogLevel",

	"SocketReuseAddress",
Out of curiosity: why were these options problematic?
It's not supported in v3.4, so I removed it so that a single commit can be used to test all of the v3.4, v3.5, and v3.6 releases.
@@ -174,6 +174,30 @@ func recover_SIGQUIT_ETCD_AND_REMOVE_DATA(clus *Cluster, idx1 int) error {
	return err
}

func inject_SIGKILL(clus *Cluster, index int) error {
	clus.lg.Info(
		"disastrous machine failure START",
I like the drama of the comment... but maybe at least mention that technically it's SIGKILL.
That's copied from the similar SIGQUIT_ETCD_AND_REMOVE_DATA test. The drama was not intentional, I was just too lazy to change it :P
No plans on merging |
Reproduces #13766
Run