roachtest: new test simulating system crash and sync failures #36989

There is an existing synctest that verifies the database is correct and usable after a crash triggered by an I/O error. The charybdefs dependency it uses injects errors by manipulating return values. When it injects an error into a sync operation, that sync does no work and returns an error, but unsynced writes still survive in the page cache. After process crash-recovery, the DB's state is then the same as if the failed sync had succeeded.

This new test attempts to simulate the effects of a failed sync more completely, in particular by ensuring unsynced writes are dropped. The approach is to buffer unsynced writes in process memory, which is achieved by providing our own implementation of a few C syscall wrappers via `LD_PRELOAD`. By buffering in process memory instead of the page cache, we can easily drop unsynced writes.

In this new test, sync failure injection (`system-crash/sync-errors=true`) involves both returning an error and deleting unsynced data. Assuming error handling is correct, the process will crash itself shortly afterwards. There is also some logic in the failure injector to force a crash a little while later in case there's ever a bug in RocksDB or Cockroach where we ignore the failure.

We can also use this approach to simulate a machine crash (`system-crash/sync-errors=false`). Simply killing the process will drop writes that aren't yet synced, which is the same as what would happen if a machine crashed.

Right now the test relies on frequent consistency checks to find errors like missing writes. It hits the DB heavily with KV queries to try to trigger enough flushes/WAL changes/compactions in case there are bugs in those code paths. But I am open to suggestions for alternative workloads/verification mechanisms.

Release note: None
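For reviewers who want a mental model of the buffering semantics, here is a rough Python sketch. It is not the code in this PR (the real shim is a set of C syscall wrappers loaded via `LD_PRELOAD`); the `BufferedFile` class, its `fail_next_sync` switch, and the `simulate` helper are all hypothetical names made up for illustration. It only models the idea that unsynced writes live in process memory and are lost on a failed sync or a process kill.

```python
import errno


class BufferedFile:
    """Toy model of a file whose unsynced writes live in process memory.

    Hypothetical illustration only; the PR implements this idea in C
    syscall wrappers, not in Python.
    """

    def __init__(self):
        self.durable = bytearray()   # data that has reached "disk"
        self.pending = []            # unsynced writes buffered in memory
        self.fail_next_sync = False  # hypothetical failure-injection switch

    def write(self, data: bytes) -> int:
        # Writes are only buffered; nothing is durable until sync().
        self.pending.append(bytes(data))
        return len(data)

    def sync(self) -> None:
        if self.fail_next_sync:
            # Injected failure: drop the unsynced data AND report an error.
            self.fail_next_sync = False
            self.pending.clear()
            raise OSError(errno.EIO, "injected sync failure")
        # Successful sync: buffered writes become durable, in order.
        for chunk in self.pending:
            self.durable += chunk
        self.pending.clear()

    def crash(self) -> bytes:
        # Killing the process discards whatever was still buffered.
        self.pending.clear()
        return bytes(self.durable)


def simulate(sync_errors: bool) -> bytes:
    """Return the durable contents after a crash in either test mode."""
    f = BufferedFile()
    f.write(b"synced;")
    f.sync()
    f.write(b"unsynced;")
    if sync_errors:
        # Mode resembling sync-errors=true: the sync fails and drops data,
        # after which the process is expected to crash itself.
        f.fail_next_sync = True
        try:
            f.sync()
        except OSError:
            pass
    # Mode resembling sync-errors=false: the process is simply killed.
    return f.crash()
```

In this model both `simulate(True)` and `simulate(False)` leave only `b"synced;"` durable, which is the property the consistency checks are meant to exercise: data that was never successfully synced must not be assumed to exist after recovery.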

Commits on Apr 24, 2019