roachtest: new test simulating system crash and sync failures #36989

Open · wants to merge 1 commit into master

Conversation

@ajkr (Contributor) commented Apr 22, 2019

There is an existing synctest that verifies the database is correct and usable after a crash triggered by an I/O error. The charybdefs dependency it uses does error injection by manipulating return values. When it injects an error into a sync operation, that sync does no work and returns an error, but unsynced writes still survive in page cache. Then after process crash-recovery, the DB's state is the same as if the failed sync had succeeded. This new test attempts to simulate the effects of a failed sync more completely, in particular by ensuring unsynced writes are dropped.

The approach taken in this new test is to buffer unsynced writes in process memory. This is achieved by providing our own implementation of a few C syscall wrappers via LD_PRELOAD. By buffering in process memory instead of page cache, we can easily drop unsynced writes.
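As a rough illustration of the approach (a minimal sketch, not the actual iofailuresim.c: the environment-variable trigger and the dropping logic are stand-ins, and the real library also intercepts write), an LD_PRELOAD wrapper defines the libc entry points itself, resolves the real symbols with dlsym(RTLD_NEXT, ...), and can either forward the call or drop the buffered data and report a failure:

```cpp
// Hypothetical build: g++ -shared -fPIC -std=c++11 -o libiofailuresim.so iofailuresim.cc -ldl
// Use: LD_PRELOAD=./libiofailuresim.so cockroach start ...
// Note: RTLD_NEXT requires _GNU_SOURCE, which g++ defines by default on Linux.
#include <dlfcn.h>
#include <cerrno>
#include <cstdlib>

extern "C" int fsync(int fd) {
  // Resolve the real libc fsync once.
  static int (*real_fsync)(int) =
      reinterpret_cast<int (*)(int)>(dlsym(RTLD_NEXT, "fsync"));
  if (std::getenv("IOFAILURESIM_FAIL_SYNCS") != nullptr) {  // stand-in trigger
    // The real library would discard whatever it has buffered for `fd` here,
    // so the unsynced data is gone exactly as if the sync had never run.
    errno = EIO;
    return -1;
  }
  return real_fsync(fd);
}

extern "C" int fdatasync(int fd) {
  return fsync(fd);  // Same treatment for fdatasync in this sketch.
}
```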

In this new test, sync failure injection (system-crash/sync-errors=true) involves both returning an error and deleting unsynced data. Assuming error handling is correct, the process will crash itself shortly afterwards. There is also some logic in the failure injector to force a crash a little while later in case there's ever a bug in RocksDB or Cockroach where we ignore the failure.

We can also use this approach to simulate machine crash (system-crash/sync-errors=false). Simply killing the process will drop writes that aren't yet synced, which is the same as what would happen if a machine crashed.

Right now the test relies on frequent consistency checks to find errors like missing writes. It hits the DB heavily with KV queries to try to trigger enough flushes/WAL changes/compactions in case there are bugs in those code paths. But I am open to suggestions for alternative workloads/verification mechanisms.

Release note: None

@cockroach-teamcity (Member) commented: This change is Reviewable

@ajkr (Contributor, Author) commented Apr 22, 2019

Some unknowns:

  • Is there a better place for iofailuresim.c? I put it in a separate repo because there didn't seem to be any precedent for this kind of thing, plus I wanted to try using it with RocksDB and Pebble at a later time.
  • Even if the test should fail (e.g., a node enters an infinite panic loop), it does not. That's because the test can't yet distinguish intentional node crashes from unintentional ones.

@petermattis (Collaborator) left a comment

Is there a better place for iofailuresim.c? I put it in a separate repo because there didn't seem to be any precedent for this kind of thing, plus I wanted to try using it with RocksDB and Pebble at a later time.

See my question below about whether iofailuresim.c is needed.

Even if the test should fail (e.g., a node enters an infinite panic loop), it does not. That's because the test can't yet distinguish intentional node crashes from unintentional ones.

Yeah, I think we'll want to figure out how to distinguish the crashes.

Before diving into the details of the new test, I have some questions:

  • We always configure RocksDB with manual_wal_flush. Is there a significant difference between the IO interception LD_PRELOAD and the behavior we get with manual_wal_flush? Specifically, we always call FlushWAL and then immediately call SyncWAL.
  • Is there a difference between the LD_PRELOAD and providing an env wrapper which accomplishes the same?
  • iofailuresim.c is only intercepting write, fsync and fdatasync. I believe RocksDB also uses pwrite as well, though I don't know the context. Is it worth intercepting that call as well?



pkg/cmd/roachprod/install/cockroach.go, line 391 at r1 (raw file):

		if c.IsLocal() {
			// This is consistent with the working directory used by `roachprod run`.
			cmd += fmt.Sprintf("cd ${HOME}/local/%d ; ", nodes[i])

This looks ok to me, but is there something that changed that required this?


pkg/cmd/roachprod/vm/gce/gcloud.go, line 54 at r1 (raw file):

// init will inject the GCE provider into vm.Providers, but only if the gcloud tool is available on the local path.
func init() {
	var p vm.Provider = &Provider{}

What is this doing? Is passing a nil provider to flagstub.New problematic?

@ajkr (Contributor, Author) left a comment

We always configure RocksDB with manual_wal_flush. Is there a significant difference between the IO interception LD_PRELOAD and the behavior we get with manual_wal_flush? Specifically, we always call FlushWAL and then immediately call SyncWAL.

Yes, seems they'll be buffering similar data. In particular, the IO interception will be buffering the data written by FlushWAL but not yet synced by SyncWAL. I think it is still useful because it makes us hold the data in process memory longer, which allows us to simulate system crash between the two functions, or failure by the sync function.
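To make that window concrete, here is a rough sketch of the call pattern in question (illustrative only; the function and keys are made up, though `rocksdb::DB::FlushWAL` and `SyncWAL` are the real APIs):

```cpp
#include <cassert>
#include "rocksdb/db.h"

// With manual_wal_flush=true, a write sits in RocksDB's in-memory WAL buffer
// until FlushWAL pushes it to the file, and only SyncWAL makes it durable.
void WriteThenSync(rocksdb::DB* db) {
  rocksdb::WriteOptions wo;
  wo.sync = false;  // Durability comes from the explicit SyncWAL below.
  assert(db->Put(wo, "key", "value").ok());

  // Step 1: move buffered WAL bytes into the file (written, not yet synced).
  assert(db->FlushWAL(/*sync=*/false).ok());
  // A crash injected here loses the write even though FlushWAL returned OK.
  // Step 2: make the WAL durable; a failure injected here both returns an
  // error and drops the bytes written in step 1.
  assert(db->SyncWAL().ok());
}
```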

Is there a difference between the LD_PRELOAD and providing an env wrapper which accomplishes the same?

Good question, I hadn't thought of the alternative approach. If the env under test just provides thin wrappers around file I/O APIs, it should be the same, I think. So there's not much to justify the extra difficulty in the implementation and lack of portability. Will try migrating over to an EnvWrapper.
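For reference, a rough sketch of what an EnvWrapper-based injector might look like (a sketch under assumptions: the class names, the `inject_failure_` toggle, and the buffering policy are illustrative, not the code that ended up in this PR):

```cpp
#include <memory>
#include <string>
#include "rocksdb/env.h"

// Buffers unsynced Append()s in process memory; Sync() either flushes them
// through to the wrapped file or drops them and reports an injected error.
class BufferUntilSyncFile : public rocksdb::WritableFileWrapper {
 public:
  explicit BufferUntilSyncFile(std::unique_ptr<rocksdb::WritableFile> target)
      : rocksdb::WritableFileWrapper(target.get()), target_(std::move(target)) {}

  rocksdb::Status Append(const rocksdb::Slice& data) override {
    buffer_.append(data.data(), data.size());  // Hold in process memory only.
    return rocksdb::Status::OK();
  }

  rocksdb::Status Sync() override {
    if (inject_failure_) {
      buffer_.clear();  // Drop unsynced writes, like losing page cache.
      return rocksdb::Status::IOError("injected sync failure");
    }
    rocksdb::Status s = target_->Append(buffer_);
    if (s.ok()) {
      buffer_.clear();
      s = target_->Sync();
    }
    return s;
  }

 private:
  std::unique_ptr<rocksdb::WritableFile> target_;
  std::string buffer_;
  bool inject_failure_ = false;  // Toggled by the test's failure schedule.
};

class SyncFaultInjectionEnv : public rocksdb::EnvWrapper {
 public:
  explicit SyncFaultInjectionEnv(rocksdb::Env* base) : rocksdb::EnvWrapper(base) {}

  rocksdb::Status NewWritableFile(const std::string& fname,
                                  std::unique_ptr<rocksdb::WritableFile>* result,
                                  const rocksdb::EnvOptions& options) override {
    std::unique_ptr<rocksdb::WritableFile> file;
    rocksdb::Status s = target()->NewWritableFile(fname, &file, options);
    if (s.ok()) {
      result->reset(new BufferUntilSyncFile(std::move(file)));
    }
    return s;
  }
};
```

The test would then select such an env by name through the store spec, which is what the `rocksdb=env=sync-failure-injection-wrapping-default-env` argument later in the thread refers to.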

iofailuresim.c is only intercepting write, fsync and fdatasync. I believe RocksDB also uses pwrite as well, though I don't know the context. Is it worth intercepting that call as well?

It might've been only for that global seqno in ingested files. I can double check.



pkg/cmd/roachprod/install/cockroach.go, line 391 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This looks ok to me, but is there something that changed that required this?

It was convenient to use the same working directory so I could use relative paths (in particular LD_PRELOAD=./libiofailuresim.so) in the start command.


pkg/cmd/roachprod/vm/gce/gcloud.go, line 54 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What is this doing? Is passing a nil provider to flagstub.New problematic?

roachprod tries to use the pointer, so it was failing here when it was nil:

p.Flags().ConfigureCreateFlags(createCmd.Flags())

I copied this way of initialization from the AWS provider.

@ajkr (Contributor, Author) commented Apr 22, 2019

Yes, seems they'll be buffering similar data. In particular, the IO interception will be buffering the data written by FlushWAL but not yet synced by SyncWAL. I think it is still useful because it makes us hold the data in process memory longer, which allows us to simulate system crash between the two functions, or failure by the sync function.

For proof of this, the failure injection to the sync function is occasionally leading to raft log panics right now. I haven't figured out yet whether it's a test problem or a code problem, though.

@petermattis (Collaborator) left a comment

Good question, I hadn't thought of the alternative approach. If the env under test just provides thin wrappers around file I/O APIs, it should be the same, I think. So there's not much to justify the extra difficulty in the implementation and lack of portability. Will try migrating over to an EnvWrapper.

👍

For proof of this, the failure injection to the sync function is occasionally leading to raft log panics right now. I haven't figured out yet whether it's a test problem or a code problem, though.

Interesting. That sounds like a bug worth tracking down.



pkg/cmd/roachprod/install/cockroach.go, line 391 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

It was convenient to use the same working directory so I could use relative paths (in particular LD_PRELOAD=./libiofailuresim.so) in the start command.

Ack. Makes sense to do this, even if you move away from using LD_PRELOAD.


pkg/cmd/roachprod/vm/gce/gcloud.go, line 54 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

roachprod tries to use the pointer, so it was failing here when it was nil:

p.Flags().ConfigureCreateFlags(createCmd.Flags())

I copied this way of initialization from the AWS provider.

Got it.

Nit: p := &Provider{} should work here too. I don't think we need the type of p to be vm.Provider.

@ajkr (Contributor, Author) left a comment

Yeah, I think we'll want to figure out how to distinguish the crashes.

This is still unresolved in the latest rendition. Suggestions welcome :).



c-deps/libroach/db.cc, line 157 at r2 (raw file):

DBStatus DBOpen(DBEngine** db, DBSlice dir, DBOptions db_opts) {
  // Have a couple `SyncFaultInjectionEnv`s registered with RocksDB for our test

My preferred place to put this registration logic would be as static globals in "env_sync_fault_injection.cc". When trying that, the linker must've considered them unused variables there and eliminated them, as the registry was empty. This answer (https://stackoverflow.com/a/842770) suggests some workarounds to try, though I am not sure about changing build flags for this purpose...

@petermattis (Collaborator) left a comment

This is still unresolved in the latest rendition. Suggestions welcome :).

Let's work through my other review comments first.



c-deps/libroach/db.cc, line 157 at r2 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

My preferred place to put this registration logic would be as static globals in "env_sync_fault_injection.cc". When trying that, the linker must've considered them unused variables there and eliminated them, as the registry was empty. This answer (https://stackoverflow.com/a/842770) suggests some workarounds to try, though I am not sure about changing build flags for this purpose...

This is fine. I think you could move these to another function (e.g. DBRegisterTestingEnvs), which would at least encapsulate the purpose slightly better. Changing build flags feels fraught. I recall dealing with this problem in a past life, but I don't think it is worth doing anything sophisticated here.


c-deps/libroach/rocksdbutils/env_sync_fault_injection.cc, line 78 at r2 (raw file):

    if (num_syncs_until_crash_ == 0) {
      exit(0);
    }

I think it will be problematic to not return here. A sync failure has already occurred and you dropped the data. Allowing more data to be written to the file seems like a recipe for corruption.


c-deps/libroach/rocksdbutils/env_sync_fault_injection.cc, line 89 at r2 (raw file):

    }
    buffer_.clear();
    return rocksdb::Status::OK();

Shouldn't this return an error? I thought you were saying that the sync failure returned an error and didn't write the data. Claiming the sync succeeded and not writing the data will surely lead to assertion failures in CockroachDB.


pkg/cmd/roachtest/system_crash.go, line 178 at r2 (raw file):

			args += ",rocksdb=env=sync-failure-injection-wrapping-default-env"
		} else {
			args += ",rocksdb=env=crash-failure-injection-wrapping-default-env"

Neat!


pkg/cmd/roachtest/system_crash.go, line 203 at r2 (raw file):

func getWorkloadNode(c *cluster) nodeListOption {
	return c.Node(c.nodes)
}

Rather than small global functions, I think you should embed these functions inside the main test functions. This is a bit like namespacing in C++, and avoids polluting the global namespace. For an example of this, see allocator.go:runWideReplication and the functions such as setReplication and waitForReplication.

@ajkr (Contributor, Author) left a comment



c-deps/libroach/rocksdbutils/env_sync_fault_injection.cc, line 78 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think it will be problematic to not return here. A sync failure has already occurred and you dropped the data. Allowing more data to be written to the file seems like a recipe for corruption.

Modulo the mistake in not returning a non-OK status when the first failure happened, I believe this is the same behavior as Linux fsync, which is indeed a recipe for corruption! The kernel seems to report a writeback error once, and then you're free to keep writing. I didn't easily find docs on this, but here is someone else's investigation into this "clear-error-and-continue" behavior: https://lwn.net/Articles/752093/.

We should also inject errors into sync_file_range, as that is another syscall we're making where a writeback error can be reported, and presumably cleared, afterwards.


c-deps/libroach/rocksdbutils/env_sync_fault_injection.cc, line 89 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Shouldn't this return an error? I thought you were saying that the sync failure returned an error and didn't write the data. Claiming the sync succeeded and not writing the data will surely lead to assertion failures in CockroachDB.

Good catch. FWIW this mistake wasn't present in the original version where I saw raft log panics. So we aren't safe yet :).


pkg/cmd/roachprod/vm/gce/gcloud.go, line 54 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Got it.

Nit: p := &Provider{} should work here too. I don't think we need the type of p to be vm.Provider.

Done.

@petermattis (Collaborator) left a comment



c-deps/libroach/rocksdbutils/env_sync_fault_injection.cc, line 78 at r2 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Modulo the mistake in not returning a non-OK status when the first failure happened, I believe this is the same behavior as Linux fsync, which is indeed a recipe for corruption! The kernel seems to report a writeback error once, and then you're free to keep writing. I didn't easily find docs on this, but here is someone else's investigation into this "clear-error-and-continue" behavior: https://lwn.net/Articles/752093/.

We should also inject errors into sync_file_range, as that is another syscall we're making where a writeback error can be reported, and presumably cleared, afterwards.

Ok. Please add a comment explaining that letting the sync continue after a previous error is intentional.

One difference between this and the clear-error-and-continue behavior in the kernel is that the new data will be written to a different offset. Perhaps you should be tracking the offset of the unwritten data and using pwrite below instead of Append.


c-deps/libroach/rocksdbutils/env_sync_fault_injection.cc, line 89 at r2 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Good catch. FWIW this mistake wasn't present in the original version where I saw raft log panics. So we aren't safe yet :).

Well, I suppose that is good news. It would be nice to have a reproducible scenario that causes Raft panics in the face of sync errors.

@ajkr ajkr requested a review from a team April 24, 2019 18:14
@ajkr (Contributor, Author) left a comment



c-deps/libroach/rocksdbutils/env_sync_fault_injection.cc, line 78 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ok. Please add a comment explaining that letting the sync continue after a previous error is intentional.

One difference between this and the clear-error-and-continue behavior in the kernel is that the new data will be written to a different offset. Perhaps you should be tracking the offset of the unwritten data and using pwrite below instead of Append.

Done. Instead of tracking the offset, I overwrote the buffer with all zeros before returning an I/O error. That way future writes/syncs go to the right offset while we still simulate missing data.
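A tiny sketch of that zero-fill idea, with an illustrative helper name (not the PR's actual code):

```cpp
#include <string>
#include "rocksdb/status.h"

// On an injected sync failure, keep the unsynced bytes' length but replace
// their contents with zeros. A later successful sync then writes zeros at the
// correct offsets, so subsequent appends line up while the payload is lost.
rocksdb::Status InjectSyncFailure(std::string* unsynced_buffer) {
  unsynced_buffer->assign(unsynced_buffer->size(), '\0');
  return rocksdb::Status::IOError("injected sync failure");
}
```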


pkg/cmd/roachprod/vm/gce/gcloud.go, line 54 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Done.

I started seeing the error below, so I changed it back.

pkg/cmd/roachprod/vm/gce/gcloud.go:56:5: cannot use flagstub.New(p, "please install the gcloud CLI utilities (https://cloud.google.com/sdk/downloads)") (type vm.Provider) as type *Provider in assignment: need type assertion

pkg/cmd/roachtest/system_crash.go, line 203 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Rather than small global functions, I think you should embed these functions inside the main test functions. This is a bit like namespacing in C++, and avoids polluting the global namespace. For an example of this, see allocator.go:runWideReplication and the functions such as setReplication and waitForReplication.

Done.

@ajkr ajkr changed the title Add roachtest that simulates system crash and sync failures roachtest: new test simulating system crash and sync failures Apr 24, 2019
@ajkr (Contributor, Author) left a comment

Maybe a verification at the end could just be to restart all the cockroach nodes without the fault injecting env and run a consistency check. Not sure if there's a way to do the consistency check on-demand.


@petermattis (Collaborator) left a comment

Maybe a verification at the end could just be to restart all the cockroach nodes without the fault injecting env and run a consistency check. Not sure if there's a way to do the consistency check on-demand.

cluster.FailOnReplicaDivergence does exactly this. I think it is called automatically.



pkg/cmd/roachprod/vm/gce/gcloud.go, line 54 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

I started seeing the error below, so I changed it back.

pkg/cmd/roachprod/vm/gce/gcloud.go:56:5: cannot use flagstub.New(p, "please install the gcloud CLI utilities (https://cloud.google.com/sdk/downloads)") (type vm.Provider) as type *Provider in assignment: need type assertion

Huh? Ok, I'll look at this later.


pkg/cmd/roachtest/system_crash.go, line 74 at r3 (raw file):

	getWorkloadNode := func(c *cluster) nodeListOption {
		return c.Node(c.nodes)
	}

Nit: the above can be variables instead of methods:

cockroachNodes := c.Range(1, c.nodes-1)
reliableCockroachNode := c.Node(1)
unreliableCockroachNodes := c.Range(2, c.nodes-1)
workloadNode := c.Node(c.nodes)

pkg/cmd/roachtest/system_crash.go, line 144 at r3 (raw file):

					output, err := execCmdWithBuffer(
						ctx, t.l, roachprod, "monitor", c.name, "--oneshot", "--ignore-empty-nodes",
					)

There is something odd here. A monitor internally runs roachprod monitor in order to detect node deaths. So why do you need to invoke roachprod monitor separately? I suspect you might have a misunderstanding about monitor and don't actually need to use the monitor object itself.


pkg/cmd/roachtest/system_crash.go, line 181 at r3 (raw file):

								restartNodeIds = append(restartNodeIds, nodeId)
							}
						}

This is an unfortunate duplication of the work that is already done inside monitor.wait. Perhaps there should be a way to configure a callback to be invoked when a node dies.

ajkr added a commit to ajkr/cockroach that referenced this pull request Apr 24, 2019
Detected with cockroachdb#36989 applied by running `./bin/roachtest run --local '^system-crash/sync-errors=true$'`.
With some slight modification to that test's constants it could repro
errors like this within a minute:

```
panic: tocommit(375) is out of range [lastIndex(374)]. Was the raft log corrupted, truncated, or lost?
```

Debugging showed `DBSyncWAL` can be called even after a sync failure.
I guess if it returns success any time after it fails it will ack
writes that aren't recoverable in WAL. They aren't recoverable because
RocksDB stops recovery upon hitting the offset corresponding to the
lost write (typically there should be a corruption there). Meanwhile,
there are still successfully synced writes at later offsets in the
file.

The fix is simple. If `DBSyncWAL` returns an error once, keep track of
that error and return it for all future writes.

Release note: None
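To illustrate the fix described in this commit message, here is a hedged sketch of the sticky-error idea (the wrapper class is made up; the actual change lives in libroach's `DBSyncWAL`):

```cpp
#include <mutex>
#include "rocksdb/db.h"

// Once a WAL sync fails, remember the error and keep returning it, so a later
// "successful" sync can never acknowledge writes that were already lost.
class StickyWALSyncer {
 public:
  rocksdb::Status SyncWAL(rocksdb::DB* db) {
    std::lock_guard<std::mutex> lock(mu_);
    if (!prev_failure_.ok()) {
      return prev_failure_;  // Report the first failure forever after.
    }
    rocksdb::Status s = db->SyncWAL();
    if (!s.ok()) {
      prev_failure_ = s;
    }
    return s;
  }

 private:
  std::mutex mu_;
  rocksdb::Status prev_failure_;  // Defaults to OK.
};
```
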
craig bot pushed a commit that referenced this pull request Apr 25, 2019
37102: storage: fix possible raft log panic after fsync error r=ajkr a=ajkr

Detected with #36989 applied by running `./bin/roachtest run --local '^system-crash/sync-errors=true$'`.
With some slight modification to that test's constants it could repro
errors like this within a minute:

```
panic: tocommit(375) is out of range [lastIndex(374)]. Was the raft log corrupted, truncated, or lost?
```

Debugging showed `DBSyncWAL` can be called even after a sync failure.
I guess if it returns success any time after it fails it will ack
writes that aren't recoverable in WAL. They aren't recoverable because
RocksDB stops recovery upon hitting the offset corresponding to the
lost write (typically there should be a corruption there). Meanwhile,
there are still successfully synced writes at later offsets in the
file.

The fix is simple. If `DBSyncWAL` returns an error once, keep track of
that error and return it for all future writes.

Release note: None

Co-authored-by: Andrew Kryczka <andrew.kryczka2@gmail.com>
ajkr added a commit to ajkr/cockroach that referenced this pull request Apr 30, 2019
Detected with cockroachdb#36989 applied by running
`./bin/roachtest run --local '^system-crash/sync-errors=true$'`.
With some slight modification to that test's constants it could repro
errors like this within a minute:

```
panic: tocommit(375) is out of range [lastIndex(374)]. Was the raft log corrupted, truncated, or lost?
```

Debugging showed `DBSyncWAL` can be called even after a sync failure.
I guess if it returns success any time after it fails it will ack
writes that aren't recoverable in WAL. They aren't recoverable because
RocksDB stops recovery upon hitting the offset corresponding to the
lost write (typically there should be a corruption there). Meanwhile,
there are still successfully synced writes at later offsets in the
file.

The fix is simple. If `DBSyncWAL` returns an error once, keep track of
that error and return it for all future writes.

Release note (bug fix): Fixed possible panic while recovering from a WAL
on which a sync operation failed.
@tbg tbg added the X-noremind label May 6, 2021