release-2.0: storage: fix possible raft log panic after fsync error #37216

Merged 1 commit on Apr 30, 2019

Commits on Apr 30, 2019

  1. storage: fix possible raft log panic after fsync error

    Detected with cockroachdb#36989 applied by running
    `./bin/roachtest run --local '^system-crash/sync-errors=true$'`.
    With some slight modification to that test's constants it could repro
    errors like this within a minute:
    
    ```
    panic: tocommit(375) is out of range [lastIndex(374)]. Was the raft log corrupted, truncated, or lost?
    ```
    
    Debugging showed that `DBSyncWAL` can be called even after a sync
    failure. If it returns success at any point after a failure, it
    presumably acks writes that are not recoverable from the WAL. They are
    not recoverable because RocksDB stops recovery upon reaching the offset
    of the lost write (typically there is a corruption there), even though
    successfully synced writes still exist at later offsets in the file.
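
    As a toy illustration of that recovery behavior (not RocksDB's actual
    code; the `record` type and `recoverLog` function are hypothetical), a
    replay loop that stops at the first damaged entry never sees the intact
    entries behind it:

    ```go
    package main

    import "fmt"

    // record is a toy WAL entry. ok is false when the entry was lost,
    // for example because the sync covering it failed.
    type record struct {
    	index int
    	ok    bool
    }

    // recoverLog replays records in order and stops at the first damaged
    // one. It returns the index of the last record recovered (0 if none).
    func recoverLog(wal []record) int {
    	last := 0
    	for _, r := range wal {
    		if !r.ok {
    			break // everything after this offset is ignored
    		}
    		last = r.index
    	}
    	return last
    }

    func main() {
    	// Entry 3 was lost, entry 4 was synced successfully afterwards.
    	wal := []record{{1, true}, {2, true}, {3, false}, {4, true}}
    	fmt.Println(recoverLog(wal)) // prints 2: entry 4 is never replayed
    }
    ```

    In this model, anything acked on the strength of the later successful
    sync is missing after a restart.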
    
    The fix is simple. If `DBSyncWAL` returns an error once, keep track of
    that error and return it for all future writes.
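
    A minimal sketch of the "sticky error" idea in Go, under stated
    assumptions: `stickySyncer`, `syncFn`, and `flakySync` are hypothetical
    names for illustration, with `syncFn` standing in for the underlying
    `DBSyncWAL` call. The actual change in this commit may look different.

    ```go
    package main

    import (
    	"errors"
    	"fmt"
    	"sync"
    )

    // stickySyncer wraps a sync function and latches its first error.
    // Hypothetical type, not the code from this commit.
    type stickySyncer struct {
    	mu     sync.Mutex
    	err    error        // first sync error observed, if any
    	syncFn func() error // stand-in for the underlying WAL sync call
    }

    // Sync calls the underlying sync function unless an earlier call has
    // already failed, in which case it returns that earlier error.
    func (s *stickySyncer) Sync() error {
    	s.mu.Lock()
    	defer s.mu.Unlock()
    	if s.err != nil {
    		return s.err
    	}
    	if err := s.syncFn(); err != nil {
    		s.err = err
    	}
    	return s.err
    }

    // flakySync returns a sync function that fails only on its failOn-th
    // call and succeeds on every other call.
    func flakySync(failOn int) func() error {
    	calls := 0
    	return func() error {
    		calls++
    		if calls == failOn {
    			return errors.New("fsync failed")
    		}
    		return nil
    	}
    }

    func main() {
    	s := &stickySyncer{syncFn: flakySync(2)}
    	for i := 1; i <= 4; i++ {
    		fmt.Printf("sync %d: %v\n", i, s.Sync())
    	}
    }
    ```

    Here the first call succeeds and every call from the second onward
    reports the same failure, so no later write can be acked after a sync
    error.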
    
    Release note (bug fix): Fixed possible panic while recovering from a WAL
    on which a sync operation failed.
    ajkr committed Apr 30, 2019
    fcc3c6a