roachtest: new test simulating system crash and sync failures #36989

Open
wants to merge 1 commit into
base: master

Commits on Apr 24, 2019

  1. Add roachtest that simulates system crash and sync failures

    There is an existing synctest that verifies the database is correct and
    usable after a crash triggered by an I/O error. The charybdefs
    dependency it uses does error injection by manipulating return values.
    When it injects an error into a sync operation, that sync does no work
    and returns an error, but unsynced writes still survive in the page
    cache. Then, after process crash-recovery, the DB's state is the same as if the
    failed sync had succeeded. This new test attempts to simulate the
    effects of a failed sync more completely, in particular by ensuring
    unsynced writes are dropped.
    
    The approach taken in this new test is to buffer unsynced writes in
    process memory. This is achieved by providing our own implementation of
    a few C syscall wrappers via `LD_PRELOAD`. By buffering in process
    memory instead of page cache, we can easily drop unsynced writes.
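
The buffering idea can be illustrated with a toy model. This is only a sketch in Python, not the test's `LD_PRELOAD` wrappers (which are C); the class `BufferedFile` and its method names are hypothetical and chosen for illustration. It assumes a single file and models "durable" storage as a second byte buffer.

```python
class BufferedFile:
    """Toy model: unsynced writes live only in process memory (hypothetical)."""

    def __init__(self):
        self.durable = bytearray()  # bytes that a successful sync made durable
        self.pending = bytearray()  # bytes written but not yet synced

    def write(self, data: bytes) -> int:
        # A write only appends to the in-process buffer.
        self.pending += data
        return len(data)

    def sync(self) -> None:
        # A successful sync moves everything buffered so far to durable storage.
        self.durable += self.pending
        self.pending.clear()

    def crash(self) -> None:
        # Killing the process discards its memory, and the unsynced writes with it.
        self.pending.clear()

    def read_after_recovery(self) -> bytes:
        # After a restart, only durable bytes are visible.
        return bytes(self.durable)


f = BufferedFile()
f.write(b"a")
f.sync()       # "a" is now durable
f.write(b"b")  # "b" is buffered only
f.crash()      # "b" is lost
recovered = f.read_after_recovery()
```

Because the pending bytes never leave process memory in this model, dropping them is a single buffer clear, which is the property the paragraph above relies on.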
    
    In this new test, sync failure injection
    (`system-crash/sync-errors=true`) involves both returning an error and
    deleting unsynced data. Assuming error handling is correct, the process
    will crash itself shortly afterwards. The failure injector also has
    logic to force a crash a little while later, in case a bug in RocksDB
    or Cockroach ever causes the failure to be ignored.
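
As a rough illustration of the injection behavior described above, here is a Python sketch. It is not the injector from this change: `SyncFailureInjector`, `ProcessCrashed`, and `grace_ticks` are hypothetical names, and a logical tick counter stands in for wall-clock time so the example is deterministic. The use of `EIO` as the injected error is also an assumption.

```python
import errno


class ProcessCrashed(Exception):
    """Stands in for the process being killed (hypothetical)."""


class SyncFailureInjector:
    def __init__(self, grace_ticks: int):
        self.grace_ticks = grace_ticks  # how long to wait before forcing a crash
        self.failed_at = None           # tick of the first injected failure
        self.now = 0                    # logical clock
        self.pending = bytearray()      # unsynced writes, held in process memory
        self.durable = bytearray()      # synced writes

    def write(self, data: bytes) -> None:
        self.pending += data

    def sync(self, inject_failure: bool = False) -> None:
        if inject_failure:
            # An injected failure both drops unsynced data and reports an error.
            self.pending.clear()
            if self.failed_at is None:
                self.failed_at = self.now
            raise OSError(errno.EIO, "injected sync failure")
        self.durable += self.pending
        self.pending.clear()

    def tick(self) -> None:
        # Fallback: if the caller kept running after a failure, force a crash.
        self.now += 1
        if self.failed_at is not None and self.now - self.failed_at >= self.grace_ticks:
            raise ProcessCrashed("forced crash: sync failure was ignored")


inj = SyncFailureInjector(grace_ticks=3)
inj.write(b"x")
inj.sync()
inj.write(b"y")
try:
    inj.sync(inject_failure=True)
except OSError:
    pass  # a buggy caller that ignores the error and keeps running

forced = False
for _ in range(3):
    try:
        inj.tick()
    except ProcessCrashed:
        forced = True
        break
```

In this sketch a correct caller would exit as soon as `sync` raises; the tick-based fallback only matters for the buggy caller shown, which is the case the forced crash is meant to catch.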
    
    We can also use this approach to simulate machine crash
    (`system-crash/sync-errors=false`). Simply killing the process will drop
    writes that aren't yet synced, which is the same as what would happen if
    a machine crashed.
    
    Right now the test relies on frequent consistency checks to find errors
    like missing writes. It hits the DB heavily with KV queries to try to
    trigger enough flushes/WAL changes/compactions in case there are bugs in
    those code paths. But I am open to suggestions for alternative
    workloads/verification mechanisms.
    
    Release note: None
    ajkr committed Apr 24, 2019
    1c54042