Skip to content

Commit

Permalink
storage: fix possible raft log panic after fsync error
Browse files Browse the repository at this point in the history
Detected with cockroachdb#36989 applied by running `./bin/roachtest run --local '^system-crash/sync-errors=true$'`.
With some slight modification to that test's constants it could repro
errors like this within a minute:

```
panic: tocommit(375) is out of range [lastIndex(374)]. Was the raft log corrupted, truncated, or lost?
```

Debugging showed `DBSyncWAL` can be called even after a sync failure.
I guess if it returns success any time after it fails it will ack
writes that aren't recoverable in WAL. They aren't recoverable because
RocksDB stops recovery upon hitting the offset corresponding to the
lost write (typically there should be a corruption there). Meanwhile,
there are still successfully synced writes at later offsets in the
file.

The fix is simple. If `DBSyncWAL` returns an error once, keep track of
that error and return it for all future writes.

Release note: None
  • Loading branch information
ajkr committed Apr 24, 2019
1 parent df024f7 commit 2ef0a60
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions pkg/storage/engine/rocksdb.go
Original file line number Diff line number Diff line change
Expand Up @@ -688,6 +688,7 @@ func (r *RocksDB) syncLoop() {
s.Lock()

var lastSync time.Time
var err error

for {
for len(s.pending) == 0 && !s.closed {
Expand All @@ -713,8 +714,7 @@ func (r *RocksDB) syncLoop() {

s.Unlock()

var err error
if r.cfg.Dir != "" {
if r.cfg.Dir != "" && err == nil {
err = statusToError(C.DBSyncWAL(r.rdb))
lastSync = timeutil.Now()
}
Expand Down

0 comments on commit 2ef0a60

Please sign in to comment.