-
Notifications
You must be signed in to change notification settings - Fork 6.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a bug with auto recovery on WAL write error #12995
Conversation
4de9460
to
a30b22a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm with nit on comments thanks for fixing the bug!
db/error_handler.cc
Outdated
// TODO: remove parameter `wal_related` once we can automatically recover | ||
// from WAL write failure. | ||
// We should not try auto recover IOError from writing to WAL . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: once we fix the bug in automatically recovery from WAL write failure. Before that, we should not try auto recovery IOError ....
@@ -1956,6 +1958,7 @@ class DBImpl : public DB { | |||
void ReleaseFileNumberFromPendingOutputs( | |||
std::unique_ptr<std::list<uint64_t>::iterator>& v); | |||
|
|||
// Sets bg error if there is an error writing to WAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"// Sets bg error if there is an error writing to WAL." is this comment for SyncClosedWals or WALIOStatusCheck?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's meant for SyncClosedWals(). Just to emphasis that this function will set bg error.
Some CI job without letting me to see the reason so I reran them fyi @cbi42 |
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: A recent crash test failure shows that auto recovery from WAL write failure can cause CFs to be inconsistent. A unit test repro in P1569398553. The following is an example sequence of events:
This can happen if a users configures manual_wal_flush, uses more than one CF, and can hit retryable error for WAL writes. This PR is a short-term fix that upgrades WAL related errors to fatal and not trigger auto recovery.
A long-term fix may be not drop buffered WAL data by checking how much data is actually written, or require atomically flushing all column families during error recovery from this kind of errors.
Test plan: added unit test to check error severity and if recovery is triggered. A crash test repro command that fails in a few runs before this PR: