roachtest: new test simulating system crash and sync failures #36989

Open · wants to merge 1 commit into master

Conversation

@ajkr (Contributor) commented Apr 22, 2019

There is an existing synctest that verifies the database is correct and usable after a crash triggered by an I/O error. The charybdefs dependency it uses does error injection by manipulating return values. When it injects an error into a sync operation, that sync does no work and returns an error, but unsynced writes still survive in page cache. Then after process crash-recovery, the DB's state is the same as if the failed sync had succeeded. This new test attempts to simulate the effects of a failed sync more completely, in particular by ensuring unsynced writes are dropped.

The approach taken in this new test is to buffer unsynced writes in process memory. This is achieved by providing our own implementation of a few C syscall wrappers via LD_PRELOAD. By buffering in process memory instead of page cache, we can easily drop unsynced writes.
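As a rough illustration of the approach (a minimal sketch, not the actual iofailuresim.c: the environment-variable trigger and the dropping logic are stand-ins, and the real library also intercepts write), an LD_PRELOAD wrapper defines the libc entry points itself, resolves the real symbols with dlsym(RTLD_NEXT, ...), and can either forward the call or drop the buffered data and report a failure:

```cpp
// Hypothetical build: g++ -shared -fPIC -std=c++11 -o libiofailuresim.so iofailuresim.cc -ldl
// Use: LD_PRELOAD=./libiofailuresim.so cockroach start ...
// Note: RTLD_NEXT requires _GNU_SOURCE, which g++ defines by default on Linux.
#include <dlfcn.h>
#include <cerrno>
#include <cstdlib>

extern "C" int fsync(int fd) {
  // Resolve the real libc fsync once.
  static int (*real_fsync)(int) =
      reinterpret_cast<int (*)(int)>(dlsym(RTLD_NEXT, "fsync"));
  if (std::getenv("IOFAILURESIM_FAIL_SYNCS") != nullptr) {  // stand-in trigger
    // The real library would discard whatever it has buffered for `fd` here,
    // so the unsynced data is gone exactly as if the sync had never run.
    errno = EIO;
    return -1;
  }
  return real_fsync(fd);
}

extern "C" int fdatasync(int fd) {
  return fsync(fd);  // Same treatment for fdatasync in this sketch.
}
```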

In this new test, sync failure injection (system-crash/sync-errors=true) involves both returning an error and deleting unsynced data. Assuming error handling is correct, the process will crash itself shortly afterwards. There is also some logic in the failure injector to force a crash a little while later in case there's ever a bug in RocksDB or Cockroach where we ignore the failure.

We can also use this approach to simulate machine crash (system-crash/sync-errors=false). Simply killing the process will drop writes that aren't yet synced, which is the same as what would happen if a machine crashed.

Right now the test relies on frequent consistency checks to find errors like missing writes. It hits the DB heavily with KV queries to try to trigger enough flushes/WAL changes/compactions in case there are bugs in those code paths. But I am open to suggestions for alternative workloads/verification mechanisms.

Release note: None

@cockroach-teamcity (Member) commented: This change is Reviewable

@ajkr (Contributor, Author) commented Apr 22, 2019

Some unknowns:

  • Is there a better place for iofailuresim.c? I put it in a separate repo because there didn't seem to be any precedent for this kind of thing, plus I wanted to try using it with RocksDB and Pebble at a later time.
  • Even if the test should fail (e.g., a node enters an infinite panic loop), it does not. That's because the test can't yet distinguish intentional node crashes from unintentional ones.

@petermattis (Collaborator) left a comment

Is there a better place for iofailuresim.c? I put it in a separate repo because there didn't seem to be any precedent for this kind of thing, plus I wanted to try using it with RocksDB and Pebble at a later time.

See my question below about whether iofailuresim.c is needed.

Even if the test should fail (e.g., a node enters an infinite panic loop), it does not. That's because the test can't yet distinguish intentional node crashes from unintentional ones.

Yeah, I think we'll want to figure out how to distinguish the crashes.

Before diving into the details of the new test, I have some questions:

  • We always configure RocksDB with manual_wal_flush. Is there a significant difference between the IO interception LD_PRELOAD and the behavior we get with manual_wal_flush? Specifically, we always call FlushWAL and then immediately call SyncWAL.
  • Is there a difference between the LD_PRELOAD and providing an env wrapper which accomplishes the same?
  • iofailuresim.c is only intercepting write, fsync and fdatasync. I believe RocksDB also uses pwrite as well, though I don't know the context. Is it worth intercepting that call as well?



pkg/cmd/roachprod/install/cockroach.go, line 391 at r1 (raw file):

		if c.IsLocal() {
			// This is consistent with the working directory used by `roachprod run`.
			cmd += fmt.Sprintf("cd ${HOME}/local/%d ; ", nodes[i])

This looks ok to me, but is there something that changed that required this?


pkg/cmd/roachprod/vm/gce/gcloud.go, line 54 at r1 (raw file):

// init will inject the GCE provider into vm.Providers, but only if the gcloud tool is available on the local path.
func init() {
	var p vm.Provider = &Provider{}

What is this doing? Is passing a nil provider to flagstub.New problematic?

@ajkr (Contributor, Author) left a comment

We always configure RocksDB with manual_wal_flush. Is there a significant difference between the IO interception LD_PRELOAD and the behavior we get with manual_wal_flush? Specifically, we always call FlushWAL and then immediately call SyncWAL.

Yes, seems they'll be buffering similar data. In particular, the IO interception will be buffering the data written by FlushWAL but not yet synced by SyncWAL. I think it is still useful because it makes us hold the data in process memory longer, which allows us to simulate system crash between the two functions, or failure by the sync function.
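To make that window concrete, here is a rough sketch of the call pattern in question (illustrative only; the function and keys are made up, though `rocksdb::DB::FlushWAL` and `SyncWAL` are the real APIs):

```cpp
#include <cassert>
#include "rocksdb/db.h"

// With manual_wal_flush=true, a write sits in RocksDB's in-memory WAL buffer
// until FlushWAL pushes it to the file, and only SyncWAL makes it durable.
void WriteThenSync(rocksdb::DB* db) {
  rocksdb::WriteOptions wo;
  wo.sync = false;  // Durability comes from the explicit SyncWAL below.
  assert(db->Put(wo, "key", "value").ok());

  // Step 1: move buffered WAL bytes into the file (written, not yet synced).
  assert(db->FlushWAL(/*sync=*/false).ok());
  // A crash injected here loses the write even though FlushWAL returned OK.
  // Step 2: make the WAL durable; a failure injected here both returns an
  // error and drops the bytes written in step 1.
  assert(db->SyncWAL().ok());
}
```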

Is there a difference between the LD_PRELOAD and providing an env wrapper which accomplishes the same?

Good question, I hadn't thought of the alternative approach. If the env under test just provides thin wrappers around file I/O APIs, it should be the same, I think. So there's not much to justify the extra difficulty in the implementation and lack of portability. Will try migrating over to an EnvWrapper.
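For reference, a rough sketch of what an EnvWrapper-based injector might look like (a sketch under assumptions: the class names, the `inject_failure_` toggle, and the buffering policy are illustrative, not the code that ended up in this PR):

```cpp
#include <memory>
#include <string>
#include "rocksdb/env.h"

// Buffers unsynced Append()s in process memory; Sync() either flushes them
// through to the wrapped file or drops them and reports an injected error.
class BufferUntilSyncFile : public rocksdb::WritableFileWrapper {
 public:
  explicit BufferUntilSyncFile(std::unique_ptr<rocksdb::WritableFile> target)
      : rocksdb::WritableFileWrapper(target.get()), target_(std::move(target)) {}

  rocksdb::Status Append(const rocksdb::Slice& data) override {
    buffer_.append(data.data(), data.size());  // Hold in process memory only.
    return rocksdb::Status::OK();
  }

  rocksdb::Status Sync() override {
    if (inject_failure_) {
      buffer_.clear();  // Drop unsynced writes, like losing page cache.
      return rocksdb::Status::IOError("injected sync failure");
    }
    rocksdb::Status s = target_->Append(buffer_);
    if (s.ok()) {
      buffer_.clear();
      s = target_->Sync();
    }
    return s;
  }

 private:
  std::unique_ptr<rocksdb::WritableFile> target_;
  std::string buffer_;
  bool inject_failure_ = false;  // Toggled by the test's failure schedule.
};

class SyncFaultInjectionEnv : public rocksdb::EnvWrapper {
 public:
  explicit SyncFaultInjectionEnv(rocksdb::Env* base) : rocksdb::EnvWrapper(base) {}

  rocksdb::Status NewWritableFile(const std::string& fname,
                                  std::unique_ptr<rocksdb::WritableFile>* result,
                                  const rocksdb::EnvOptions& options) override {
    std::unique_ptr<rocksdb::WritableFile> file;
    rocksdb::Status s = target()->NewWritableFile(fname, &file, options);
    if (s.ok()) {
      result->reset(new BufferUntilSyncFile(std::move(file)));
    }
    return s;
  }
};
```

The test would then select such an env by name through the store spec, which is what the `rocksdb=env=sync-failure-injection-wrapping-default-env` argument later in the thread refers to.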

iofailuresim.c is only intercepting write, fsync and fdatasync. I believe RocksDB also uses pwrite as well, though I don't know the context. Is it worth intercepting that call as well?

It might've been only for that global seqno in ingested files. I can double check.



pkg/cmd/roachprod/install/cockroach.go, line 391 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This looks ok to me, but is there something that changed that required this?

It was convenient to use the same working directory so I could use relative paths (in particular LD_PRELOAD=./libiofailuresim.so) in the start command.


pkg/cmd/roachprod/vm/gce/gcloud.go, line 54 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What is this doing? Is passing a nil provider to flagstub.New problematic?

roachprod tries to use the pointer, so it was failing here when it was nil:

p.Flags().ConfigureCreateFlags(createCmd.Flags())

I copied this way of initialization from the AWS provider.

@ajkr (Contributor, Author) commented Apr 22, 2019

Yes, seems they'll be buffering similar data. In particular, the IO interception will be buffering the data written by FlushWAL but not yet synced by SyncWAL. I think it is still useful because it makes us hold the data in process memory longer, which allows us to simulate system crash between the two functions, or failure by the sync function.

For proof of this, the failure injection to the sync function is occasionally leading to raft log panics right now. I haven't figured out yet whether it's a test problem or a code problem, though.

@petermattis (Collaborator) left a comment

Good question, I hadn't thought of the alternative approach. If the env under test just provides thin wrappers around file I/O APIs, it should be the same, I think. So there's not much to justify the extra difficulty in the implementation and lack of portability. Will try migrating over to an EnvWrapper.

👍

For proof of this, the failure injection to the sync function is occasionally leading to raft log panics right now. I haven't figured out yet whether it's a test problem or a code problem, though.

Interesting. That sounds like a bug worth tracking down.



pkg/cmd/roachprod/install/cockroach.go, line 391 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

It was convenient to use the same working directory so I could use relative paths (in particular LD_PRELOAD=./libiofailuresim.so) in the start command.

Ack. Makes sense to do this, even if you move away from using LD_PRELOAD.


pkg/cmd/roachprod/vm/gce/gcloud.go, line 54 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

roachprod tries to use the pointer, so it was failing here when it was nil:

p.Flags().ConfigureCreateFlags(createCmd.Flags())

I copied this way of initialization from the AWS provider.

Got it.

Nit: p := &Provider{} should work here too. I don't think we need the type of p to be vm.Provider.

@ajkr (Contributor, Author) left a comment

Yeah, I think we'll want to figure out how to distinguish the crashes.

This is still unresolved in the latest rendition. Suggestions welcome :).



c-deps/libroach/db.cc, line 157 at r2 (raw file):

DBStatus DBOpen(DBEngine** db, DBSlice dir, DBOptions db_opts) {
  // Have a couple `SyncFaultInjectionEnv`s registered with RocksDB for our test

My preferred place to put this registration logic would be as static globals in "env_sync_fault_injection.cc". When trying that, the linker must've considered them unused variables there and eliminated them, as the registry was empty. This answer (https://stackoverflow.com/a/842770) suggests some workarounds to try, though I am not sure about changing build flags for this purpose...

@petermattis (Collaborator) left a comment

This is still unresolved in the latest rendition. Suggestions welcome :).

Let's work through my other review comments first.



c-deps/libroach/db.cc, line 157 at r2 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

My preferred place to put this registration logic would be as static globals in "env_sync_fault_injection.cc". When trying that, the linker must've considered them unused variables there and eliminated them, as the registry was empty. This answer (https://stackoverflow.com/a/842770) suggests some workarounds to try, though I am not sure about changing build flags for this purpose...

This is fine. I think you could move these to another function (e.g. DBRegisterTestingEnvs), which would at least encapsulate the purpose slightly better. Changing build flags feels fraught. I recall dealing with this problem in a past life, but I don't think it is worth doing anything sophisticated here.


c-deps/libroach/rocksdbutils/env_sync_fault_injection.cc, line 78 at r2 (raw file):

    if (num_syncs_until_crash_ == 0) {
      exit(0);
    }

I think it will be problematic to not return here. A sync failure has already occurred and you dropped the data. Allowing more data to be written to the file seems like a recipe for corruption.


c-deps/libroach/rocksdbutils/env_sync_fault_injection.cc, line 89 at r2 (raw file):

    }
    buffer_.clear();
    return rocksdb::Status::OK();

Shouldn't this return an error? I thought you were saying that the sync failure returned an error and didn't write the data. Claiming the sync succeeded and not writing the data will surely lead to assertion failures in CockroachDB.


pkg/cmd/roachtest/system_crash.go, line 178 at r2 (raw file):

			args += ",rocksdb=env=sync-failure-injection-wrapping-default-env"
		} else {
			args += ",rocksdb=env=crash-failure-injection-wrapping-default-env"

Neat!


pkg/cmd/roachtest/system_crash.go, line 203 at r2 (raw file):

func getWorkloadNode(c *cluster) nodeListOption {
	return c.Node(c.nodes)
}

Rather than small global functions, I think you should embed these functions inside the main test functions. This is a bit like namespacing in C++, and avoids polluting the global namespace. For an example of this, see allocator.go:runWideReplication and the functions such as setReplication and waitForReplication.

@ajkr (Contributor, Author) left a comment



c-deps/libroach/rocksdbutils/env_sync_fault_injection.cc, line 78 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think it will be problematic to not return here. A sync failure has already occurred and you dropped the data. Allowing more data to be written to the file seems like a recipe for corruption.

Modulo the mistake in not returning a non-OK status when the first failure happened, I believe this is the same behavior as Linux fsync, which is indeed a recipe for corruption! The kernel seems to report a writeback error once, and then you're free to keep writing. I didn't easily find docs on this, but here is someone else's investigation into this "clear-error-and-continue" behavior: https://lwn.net/Articles/752093/.

We should also inject errors into sync_file_range, as that is another syscall we're making where a writeback error can be reported, and presumably cleared, afterwards.


c-deps/libroach/rocksdbutils/env_sync_fault_injection.cc, line 89 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Shouldn't this return an error? I thought you were saying that the sync failure returned an error and didn't write the data. Claiming the sync succeeded and not writing the data will surely lead to assertion failures in CockroachDB.

Good catch. FWIW this mistake wasn't present in the original version where I saw raft log panics. So we aren't safe yet :).


pkg/cmd/roachprod/vm/gce/gcloud.go, line 54 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Got it.

Nit: p := &Provider{} should work here too. I don't think we need the type of p to be vm.Provider.

Done.

@petermattis (Collaborator) left a comment



c-deps/libroach/rocksdbutils/env_sync_fault_injection.cc, line 78 at r2 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Modulo the mistake in not returning a non-OK status when the first failure happened, I believe this is the same behavior as Linux fsync, which is indeed a recipe for corruption! The kernel seems to report a writeback error once, and then you're free to keep writing. I didn't easily find docs on this, but here is someone else's investigation into this "clear-error-and-continue" behavior: https://lwn.net/Articles/752093/.

We should also inject errors into sync_file_range, as that is another syscall we're making where a writeback error can be reported, and presumably cleared, afterwards.

Ok. Please add a comment explaining that letting the sync continue after a previous error is intentional.

One difference between this and the clear-error-and-continue behavior in the kernel is that the new data will be written to a different offset. Perhaps you should be tracking the offset of the unwritten data and using pwrite below instead of Append.


c-deps/libroach/rocksdbutils/env_sync_fault_injection.cc, line 89 at r2 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Good catch. FWIW this mistake wasn't present in the original version where I saw raft log panics. So we aren't safe yet :).

Well, I suppose that is good news. It would be nice to have a reproducible scenario that causes Raft panics in the face of sync errors.

@ajkr ajkr requested a review from a team April 24, 2019 18:14
@ajkr (Contributor, Author) left a comment



c-deps/libroach/rocksdbutils/env_sync_fault_injection.cc, line 78 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ok. Please add a comment explaining that letting the sync continue after a previous error is intentional.

One difference between this and the clear-error-and-continue behavior in the kernel is that the new data will be written to a different offset. Perhaps you should be tracking the offset of the unwritten data and using pwrite below instead of Append.

Done. Instead of tracking the offset, I overwrote the buffer with all zeros before returning an I/O error. That way future writes/syncs go to the right offset while we still simulate missing data.
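A tiny sketch of that zero-fill idea, with an illustrative helper name (not the PR's actual code):

```cpp
#include <string>
#include "rocksdb/status.h"

// On an injected sync failure, keep the unsynced bytes' length but replace
// their contents with zeros. A later successful sync then writes zeros at the
// correct offsets, so subsequent appends line up while the payload is lost.
rocksdb::Status InjectSyncFailure(std::string* unsynced_buffer) {
  unsynced_buffer->assign(unsynced_buffer->size(), '\0');
  return rocksdb::Status::IOError("injected sync failure");
}
```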


pkg/cmd/roachprod/vm/gce/gcloud.go, line 54 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Done.

I started seeing the error below, so I changed it back.

pkg/cmd/roachprod/vm/gce/gcloud.go:56:5: cannot use flagstub.New(p, "please install the gcloud CLI utilities (https://cloud.google.com/sdk/downloads)") (type vm.Provider) as type *Provider in assignment: need type assertion

pkg/cmd/roachtest/system_crash.go, line 203 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Rather than small global functions, I think you should embed these functions inside the main test functions. This is a bit like namespacing in C++, and avoids polluting the global namespace. For an example of this, see allocator.go:runWideReplication and the functions such as setReplication and waitForReplication.

Done.

@ajkr ajkr changed the title Add roachtest that simulates system crash and sync failures roachtest: new test simulating system crash and sync failures Apr 24, 2019
@ajkr (Contributor, Author) left a comment

Maybe a verification at the end could just be to restart all the cockroach nodes without the fault injecting env and run a consistency check. Not sure if there's a way to do the consistency check on-demand.


@petermattis (Collaborator) left a comment

Maybe a verification at the end could just be to restart all the cockroach nodes without the fault injecting env and run a consistency check. Not sure if there's a way to do the consistency check on-demand.

cluster.FailOnReplicaDivergence does exactly this. I think it is called automatically.



pkg/cmd/roachprod/vm/gce/gcloud.go, line 54 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

I started seeing the error below, so I changed it back.

pkg/cmd/roachprod/vm/gce/gcloud.go:56:5: cannot use flagstub.New(p, "please install the gcloud CLI utilities (https://cloud.google.com/sdk/downloads)") (type vm.Provider) as type *Provider in assignment: need type assertion

Huh? Ok, I'll look at this later.


pkg/cmd/roachtest/system_crash.go, line 74 at r3 (raw file):

	getWorkloadNode := func(c *cluster) nodeListOption {
		return c.Node(c.nodes)
	}

Nit: the above can be variables instead of methods:

cockroachNodes := c.Range(1, c.nodes-1)
reliableCockroachNode := c.Node(1)
unreliableCockroachNodes := c.Range(2, c.nodes-1)
workloadNode := c.Node(c.nodes)

pkg/cmd/roachtest/system_crash.go, line 144 at r3 (raw file):

					output, err := execCmdWithBuffer(
						ctx, t.l, roachprod, "monitor", c.name, "--oneshot", "--ignore-empty-nodes",
					)

There is something odd here. A monitor internally runs roachprod monitor in order to detect node deaths. So why do you need to invoke roachprod monitor separately? I suspect you might have a misunderstanding about monitor and don't actually need to use the monitor object itself.


pkg/cmd/roachtest/system_crash.go, line 181 at r3 (raw file):

								restartNodeIds = append(restartNodeIds, nodeId)
							}
						}

This is an unfortunate duplication of the work that is already done inside monitor.wait. Perhaps there should be a way to configure a callback to be invoked when a node dies.

ajkr added a commit to ajkr/cockroach that referenced this pull request Apr 24, 2019
Detected with cockroachdb#36989 applied by running `./bin/roachtest run --local '^system-crash/sync-errors=true$'`.
With some slight modification to that test's constants it could repro
errors like this within a minute:

```
panic: tocommit(375) is out of range [lastIndex(374)]. Was the raft log corrupted, truncated, or lost?
```

Debugging showed `DBSyncWAL` can be called even after a sync failure.
I guess if it returns success any time after it fails it will ack
writes that aren't recoverable in WAL. They aren't recoverable because
RocksDB stops recovery upon hitting the offset corresponding to the
lost write (typically there should be a corruption there). Meanwhile,
there are still successfully synced writes at later offsets in the
file.

The fix is simple. If `DBSyncWAL` returns an error once, keep track of
that error and return it for all future writes.

Release note: None
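To illustrate the fix described in this commit message, here is a hedged sketch of the sticky-error idea (the wrapper class is made up; the actual change lives in libroach's `DBSyncWAL`):

```cpp
#include <mutex>
#include "rocksdb/db.h"

// Once a WAL sync fails, remember the error and keep returning it, so a later
// "successful" sync can never acknowledge writes that were already lost.
class StickyWALSyncer {
 public:
  rocksdb::Status SyncWAL(rocksdb::DB* db) {
    std::lock_guard<std::mutex> lock(mu_);
    if (!prev_failure_.ok()) {
      return prev_failure_;  // Report the first failure forever after.
    }
    rocksdb::Status s = db->SyncWAL();
    if (!s.ok()) {
      prev_failure_ = s;
    }
    return s;
  }

 private:
  std::mutex mu_;
  rocksdb::Status prev_failure_;  // Defaults to OK.
};
```
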
craig bot pushed a commit that referenced this pull request Apr 25, 2019
37102: storage: fix possible raft log panic after fsync error r=ajkr a=ajkr

Detected with #36989 applied by running `./bin/roachtest run --local '^system-crash/sync-errors=true$'`.
With some slight modification to that test's constants it could repro
errors like this within a minute:

```
panic: tocommit(375) is out of range [lastIndex(374)]. Was the raft log corrupted, truncated, or lost?
```

Debugging showed `DBSyncWAL` can be called even after a sync failure.
I guess if it returns success any time after it fails it will ack
writes that aren't recoverable in WAL. They aren't recoverable because
RocksDB stops recovery upon hitting the offset corresponding to the
lost write (typically there should be a corruption there). Meanwhile,
there are still successfully synced writes at later offsets in the
file.

The fix is simple. If `DBSyncWAL` returns an error once, keep track of
that error and return it for all future writes.

Release note: None

Co-authored-by: Andrew Kryczka <andrew.kryczka2@gmail.com>
ajkr added a commit to ajkr/cockroach that referenced this pull request Apr 30, 2019
Detected with cockroachdb#36989 applied by running
`./bin/roachtest run --local '^system-crash/sync-errors=true$'`.
With some slight modification to that test's constants it could repro
errors like this within a minute:

```
panic: tocommit(375) is out of range [lastIndex(374)]. Was the raft log corrupted, truncated, or lost?
```

Debugging showed `DBSyncWAL` can be called even after a sync failure.
I guess if it returns success any time after it fails it will ack
writes that aren't recoverable in WAL. They aren't recoverable because
RocksDB stops recovery upon hitting the offset corresponding to the
lost write (typically there should be a corruption there). Meanwhile,
there are still successfully synced writes at later offsets in the
file.

The fix is simple. If `DBSyncWAL` returns an error once, keep track of
that error and return it for all future writes.

Release note (bug fix): Fixed possible panic while recovering from a WAL
on which a sync operation failed.
@tbg tbg added the X-noremind label May 6, 2021