release-2.0: storage: fix possible raft log panic after fsync error #37216

ajkr · 2019-04-30T19:07:50Z

Backport 1/1 commits from #37102.

/cc @cockroachdb/release

Detected with #36989 applied by running
`./bin/roachtest run --local '^system-crash/sync-errors=true$'`.
With some slight modification to that test's constants it could repro
errors like this within a minute:

panic: tocommit(375) is out of range [lastIndex(374)]. Was the raft log corrupted, truncated, or lost?

Debugging showed `DBSyncWAL` can be called even after a sync failure.
I believe that if it returns success at any point after it has failed,
it acks writes that aren't recoverable from the WAL. They aren't
recoverable because RocksDB stops recovery upon hitting the offset
corresponding to the lost write (typically there should be a corruption
there), even though successfully synced writes still exist at later
offsets in the file.
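
To illustrate the recovery behavior described above, here is a toy model in Go. It is only a sketch: `walRecord`, `replay`, and the `seq`/`intact` fields are hypothetical names made up for this example and do not correspond to RocksDB or CockroachDB code. It assumes replay stops at the first unreadable record, as described above.

```go
package main

// walRecord is a hypothetical stand-in for one write in the WAL.
type walRecord struct {
	seq    int  // position of the write in the log (illustrative)
	intact bool // false models the write lost to the failed sync
}

// replay mimics recovery that stops at the first unreadable record.
// Everything after that record is dropped, even if it was synced
// successfully and acknowledged.
func replay(records []walRecord) []int {
	var recovered []int
	for _, r := range records {
		if !r.intact {
			break // recovery ends here; later records are never read
		}
		recovered = append(recovered, r.seq)
	}
	return recovered
}
```

With records 1 (intact), 2 (lost), and 3 (intact and acked), `replay` returns only record 1, so the acked record 3 is gone after a restart.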

The fix is simple. If DBSyncWAL returns an error once, keep track of
that error and return it for all future writes.
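
A minimal sketch of the "sticky error" idea in Go, for readers who want to see the shape of the fix. This is not the code from the actual commit: `walSyncer`, `syncFn`, and `failOnCall` are hypothetical names, and `syncFn` merely stands in for the real `DBSyncWAL` call.

```go
package main

import (
	"errors"
	"sync"
)

// walSyncer is a hypothetical wrapper around a WAL sync call.
type walSyncer struct {
	mu      sync.Mutex
	syncFn  func() error // stand-in for the real DBSyncWAL call
	syncErr error        // first sync failure, remembered from then on
}

// Sync returns the first error ever seen by syncFn on every later call,
// so no write can be acked after a failed sync.
func (s *walSyncer) Sync() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.syncErr != nil {
		return s.syncErr // do not touch the WAL again
	}
	if err := s.syncFn(); err != nil {
		s.syncErr = err
	}
	return s.syncErr
}

// failOnCall returns a fake sync function that fails only on its n-th
// call, modeling a transient fsync error (for illustration only).
func failOnCall(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls == n {
			return errors.New("simulated fsync failure")
		}
		return nil
	}
}
```

With `s := &walSyncer{syncFn: failOnCall(2)}`, the first `s.Sync()` succeeds and every call from the second one onward returns the same error, even though the underlying fake would have succeeded again.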

Release note (bug fix): Fixed possible panic while recovering from a WAL
on which a sync operation failed.


cockroach-teamcity · 2019-04-30T19:07:57Z


ajkr requested a review from a team April 30, 2019 19:07

bdarnell approved these changes Apr 30, 2019


ajkr merged commit 5665a91 into cockroachdb:release-2.0 Apr 30, 2019
