
storage: fix possible raft log panic after fsync error #37102

Merged
merged 1 commit into cockroachdb:master from fix-fsync-error-raft-panic on Apr 25, 2019

Conversation

ajkr
Contributor

@ajkr ajkr commented Apr 24, 2019

Detected with #36989 applied by running `./bin/roachtest run --local '^system-crash/sync-errors=true$'`.
With some slight modifications to that test's constants, it could repro
errors like this within a minute:

panic: tocommit(375) is out of range [lastIndex(374)]. Was the raft log corrupted, truncated, or lost?

Debugging showed `DBSyncWAL` can be called even after a sync failure.
Presumably, if it returns success at any point after it has failed, it will
ack writes that are not recoverable from the WAL. They aren't recoverable
because RocksDB stops recovery upon hitting the offset corresponding to the
lost write (typically there is a corruption at that offset). Meanwhile,
there are still successfully synced writes at later offsets in the
file.

The fix is simple: if `DBSyncWAL` returns an error once, keep track of
that error and return it for all future writes.
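
As an illustration only, the idea can be sketched in Go roughly as follows; the type and function names here are hypothetical, not the actual change to `rocksdb.go`:

```
package engine

import "sync"

// stickySyncer remembers the first WAL-sync failure. This is only a sketch
// of the idea behind the fix; the names are hypothetical and do not match
// the actual rocksdb.go change. The failure is treated as permanent
// because, after a failed fsync, the kernel may mark the dirty pages clean
// anyway, so a retried fsync can report success even though the data was
// lost.
type stickySyncer struct {
	mu      sync.Mutex
	syncErr error // first sync error; once set, it is returned forever
}

// syncWAL invokes doSync unless an earlier call already failed, in which
// case the remembered error is returned. A later "successful" fsync can
// therefore never ack writes that were already lost.
func (s *stickySyncer) syncWAL(doSync func() error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.syncErr == nil {
		s.syncErr = doSync()
	}
	return s.syncErr
}
```

With a wrapper like this, every commit attempted after the first failed sync gets an error back instead of a spurious ack.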

Release note: None

@ajkr ajkr requested a review from a team April 24, 2019 22:46
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@petermattis petermattis left a comment


:lgtm:

Excellent find!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajkr)


pkg/storage/engine/rocksdb.go, line 717 at r1 (raw file):

		s.Unlock()

		if r.cfg.Dir != "" && err == nil {

Please add some commentary around this, including mention of the kernel behavior around fsyncs.

@ajkr ajkr force-pushed the fix-fsync-error-raft-panic branch from 2ef0a60 to 13a54c0 on April 25, 2019 02:14
@ajkr
Contributor Author

ajkr commented Apr 25, 2019

bors r+

Will check whether Pebble needs the same fix, and backport this to 19.1. Also, some day I'd like to look into recoverable I/O errors like ENOSPC. For example, if the user clears up some space after hitting it, can cockroach continue?

craig bot pushed a commit that referenced this pull request Apr 25, 2019
37102: storage: fix possible raft log panic after fsync error r=ajkr a=ajkr

Co-authored-by: Andrew Kryczka <andrew.kryczka2@gmail.com>
@craig
Contributor

craig bot commented Apr 25, 2019

Build succeeded

@craig craig bot merged commit 13a54c0 into cockroachdb:master Apr 25, 2019
@tbg
Member

tbg commented Apr 25, 2019

That is an excellent find (our very own fsyncgate?)

Are #32927, #36522, #36567, #36592 all explained by this?

Also, this change definitely needs a release note. I'm also surprised that a unit test wasn't even discussed. If we fix something without adding a test, there ought to be a justification (such as: #37102 repros this and will be merged shortly; I personally would still add the unit test, though).

The ship for the release note on master has sailed (right @bdarnell) but please update #37109 to include a comprehensive release note that lists the kinds of errors potentially explained by this bug. (SEs and Support like to search for errors they see in release notes).

@bdarnell
Contributor

The ship for the release note on master has sailed (right @bdarnell)

Yes. We'll let this be documented in 19.1.1 when that is released and then it doesn't also need to be documented in the first 19.2 alpha.

@ajkr
Contributor Author

ajkr commented Apr 25, 2019

The ship for the release note on master has sailed (right @bdarnell)

Yes. We'll let this be documented in 19.1.1 when that is released and then it doesn't also need to be documented in the first 19.2 alpha.

Got it. It is a little bit confusing, though: for example, if I'm upgrading from 19.1.0 to a 19.2 alpha, it is unclear whether I'm getting this fix. I can try making an empty commit with the release note errata for master if that's desirable.

@ajkr
Contributor Author

ajkr commented Apr 25, 2019

That is an excellent find (our very own fsyncgate?)

Are #32927, #36522, #36567, #36592 all explained by this?

Do we have a debug zip for any of those? I couldn't find one. If the dmesg contains messages like "lost async page write" leading up to the crash, then it seems highly likely. If not, I am not sure.

Also, this change definitely needs a release note. I'm also surprised that a unit test wasn't even discussed. If we fix something without adding a test, there ought to be a justification (such as: #37102 repros this and will be merged shortly, I personally would still add the unit test though).

The ship for the release note on master has sailed (right @bdarnell) but please update #37109 to include a comprehensive release note that lists the kinds of errors potentially explained by this bug. (SEs and Support like to search for errors they see in release notes).

Right, will come up with a release note before landing the backports at least.

@tbg
Member

tbg commented Apr 25, 2019

Thanks!

Do we have debug zip for any of those?

No, Sentry issues don't provide us with much information (we'd need to implement a version of debug zip without PII... seems pretty hopeless). Our best bet is a user contacting support.
