[DNM] util/fail: create new fail point package #38953

Open
wants to merge 2 commits into master

Conversation

nvanbenschoten

Package fail provides a fail point implementation for testing-only failure
injection.

Fail points are code instrumentations that allow errors and other behavior to
be injected dynamically at runtime, primarily for testing purposes. Fail
points are flexible and can be configured to exhibit a variety of behavior,
including returning errors, panicking, and sleeping.

Fail points have a few unique benefits when compared to the standard testing
hook pattern. To start, adding a new fail point requires significantly less
code than adding a new testing hook. To add a new fail point, simply create
a new Event in event.go and trigger the fail point where desired by calling
fail.Point. Unit tests can then activate this new failure point by calling
fail.Activate. This combines to make fail points lightweight and easy to use.
This compares favorably to the effort needed to install a new testing hook,
which often requires defining new types, threading testing knob dependencies
through production code, and avoiding package dependency cycles.
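
As a concrete illustration, a minimal sketch of this workflow might look like the following; the ReplicaApplySnapshot event, the Deactivate call, and the exact signatures are assumptions for illustration rather than the package's actual API:

// pkg/util/fail/event.go
const (
	_ Event = iota
	// ReplicaApplySnapshot fires before a Replica applies a snapshot (hypothetical event).
	ReplicaApplySnapshot
)

// production code (e.g. pkg/storage)
func (r *Replica) applySnapshot() error {
	if err := fail.Point(fail.ReplicaApplySnapshot); err != nil {
		return err // injected failure
	}
	// ... apply the snapshot ...
	return nil
}

// unit test
func TestApplySnapshotFailure(t *testing.T) {
	fail.Activate(fail.ReplicaApplySnapshot)
	defer fail.Deactivate(fail.ReplicaApplySnapshot) // assumed cleanup hook
	// ... run the code under test and assert that the injected failure is handled ...
}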

More importantly, fail points have no associated runtime cost in production
builds of the code. Fail points rely on conditional compilation via build tags
to compile away to nothing when not enabled. Only when enabled
do the calls to fail.Point have any effect. This means that users of the
library are free to scatter fail points in performance-critical code paths
without fear of having an effect on performance.
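
A rough sketch of how that build-tag split could be arranged, assuming a hypothetical failpoints tag, file layout, and activeCallback helper (the package's actual mechanism may differ):

// fail_on.go
// +build failpoints

package fail

// Point triggers the fail point for ev if a test has activated it,
// returning whatever error the activated callback produces.
func Point(ev Event) error {
	if cb := activeCallback(ev); cb != nil { // hypothetical lookup of the activated callback
		return cb()
	}
	return nil
}

// fail_off.go
// +build !failpoints

package fail

// Point is a no-op without the failpoints tag, so calls to it can be
// inlined and eliminated by the compiler.
func Point(ev Event) error { return nil }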

This package is inspired by FreeBSD's failpoints: https://freebsd.org/cgi/man.cgi?query=fail


DNM: This doesn't fully solve the problem it was intended to, which is that right now we don't have a good way of testing process failures at very specific moments. For instance, we have no good way for @jeffrey-xiao to test failures between stages of applying a snapshot in #38932. This package allows us to easily inject these failures, but we're not capturing them in such a way that a test can actually assert any useful properties about them. I think it would be possible to catch injected panics in our test harnesses, but right now none of them are set up to do so. I'm not so sure it would be possible to catch injected log.Fatal/os.Exit calls in unit tests, which is really what we'd like to do.

I see a few possible approaches to addressing this:

  1. we try to mock out these panics/fatals in unit tests and catch them in our testing harness. This seems like a very involved task.
  2. we introduce some process separation between nodes and the test runner in unit tests that want to crash nodes so that a log.Fatal behaves properly in unit tests and we can easily restart these subprocesses.
  3. we abandon the idea of performing chaos testing in unit tests entirely and rely on higher-level acceptance tests (e.g. roachtests) for these kinds of tests. If we did this, we would want to change the interface of this fail point library. It's not clear how the conditional compilation story would play out.

This has been unused since 0d071f6.

Release note: None
@nvanbenschoten nvanbenschoten requested review from a team July 18, 2019 02:19

tbg commented Jul 18, 2019

Unit test failpoints will necessarily involve some compromise between simplicity and applicability. Perhaps what #38932 is trying to do is pushing the limits of what can be achieved in-process, but that doesn't need to imply that the whole approach isn't worth it. I'd like to stay away from option two (for now; it might become useful enough at some point to handle the out-of-scope cases, but it will require its own infra) and option three (forever, starting now).

Mind jotting down a toy version of what @jeffrey-xiao is trying to do? I assume it's that we're running a server or even cluster and one of its goroutines is trying to apply a snapshot, which involves, say, two steps, and you want to simulate the case in which step one succeeds and step two fails. Naively we want to simulate a "fatal error", i.e. we want to zap the node and all of its processes dead, which is difficult for the reasons you describe above. But is there a way we can "just" return from the method in a way that leaves the server clearly in an incorrect state, but not incorrect enough to crash on its own, so that you can stop it the regular way?

If we're talking about writes to the storage engine, there's also another way you can go about it -- when your failpoint fires, it doesn't try to exit the process. Instead, it takes a RocksDB checkpoint, and then you shut down the server as usual. What's next is probably obvious: you replace the node's data with the checkpoint. You restart it again, and voila, hopefully you've achieved something close to what you were looking for.

Since a desire to crash the node will often correspond to wanting to provoke some sort of partial disk state, I suspect this strategy will get us a long way.
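
Sketched as code, that strategy could look roughly like this; Engine, Checkpoint, and the knob struct below are stand-ins for illustration rather than the real storage or testing-knob APIs:

package storage_test

// Engine stands in for the storage engine; Checkpoint captures the current
// on-disk state into dir without stopping the process.
type Engine interface {
	Checkpoint(dir string) error
}

// snapshotKnobs stands in for a testing-knob struct threaded into the Store.
type snapshotKnobs struct {
	// beforeSecondStep fires between the two snapshot-application steps.
	beforeSecondStep func(eng Engine) error
}

// crashSimulationKnobs returns knobs that, instead of killing the node,
// capture what the disk would have looked like had the node died between
// the two steps.
func crashSimulationKnobs(checkpointDir string) snapshotKnobs {
	return snapshotKnobs{
		beforeSecondStep: func(eng Engine) error {
			return eng.Checkpoint(checkpointDir)
		},
	}
}

The test would then stop the node cleanly, replace its data directory with the checkpoint, restart it, and assert that startup recovers the interrupted snapshot.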

@tbg tbg left a comment

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/util/fail/event.go, line 23 at r2 (raw file):

	_ Event = iota

	///////////////////////////////////////////////////////////////////////////

This approach is simple, but there are some downsides.

  1. the events are hardcoded in the lib, so it will be hard to open-source this library (which I hope we do; there isn't a good substitute for it out there and I'd miss this whenever I work on anything else). Not saying that it's worth throwing it out there right away, but it'd be nice to be able to pull it in externally (yes, I know the license, but still).
  2. it's too global, i.e. failpoints can easily leak from one test to the next

I think we answered both questions in our cluster settings already and conveniently we already plumb those to our code (so we can add something that gets plumbed inside it). I'd follow a very similar pattern here, where packages define their failpoints, except we wouldn't key on a name, but on a type:

// storage_test.go
func TestFoo(t *testing.T) {
  cfg := makeCfg()
  cfg.fail.Set(FailpointRaftApplyEntry{}, func(...interface{}) error { return errors.New("boom") })
  doStuff(cfg)
  if cfg.fail.NumCalls(FailpointRaftApplyEntry{}) == 0 { t.Fatal("failpoint not called?!") } // maybe `defer cfg.fail.AssertAllCalled(t)` instead
}

// storage.go
type FailpointRaftApplyEntry struct{}

func init() {
    fail.Register(FailpointRaftApplyEntry{}) // basically how context.Value works
}

func (s *Store) productionCode() error {
  if f := s.cfg.fail.Enabled(FailpointRaftApplyEntry{}); f != nil {
    if err := f("some", "relevant", "input"); err != nil {
      return err
    }
  }
  // ... rest of the production logic ...
  return nil
}

We'd have to godbolt it, but couldn't we even go for

func (s *Store) productionCode() error {
  if err := s.cfg.fail.Call(FailpointRaftApplyEntry{}, "some", "relevant", "input"); err != nil {
      return err
  }
  // ... rest of the production logic ...
  return nil
}

fail.Call would return a static no-op func, which hopefully could also be optimized out? Hmm, that's a bummer; I just tried and it doesn't look like it:

package main

func Call(typ interface{}, args ...interface{}) bool {
	return false
}

func main() {
	i := 5
	j := 7
	Call(i, i, j)
}

i,j escape. With an explicit guard, they do not.

func main() {
	i := 5
	j := 7
	if false {
		Call(i, i, j)
	}
}

https://godbolt.org/z/xPvKE_


pkg/util/fail/fail.go, line 45 at r2 (raw file):

type Callback func(...interface{}) error

// Error is an error returned by the detault fail point callback.

Thought there was a speck of dust on my screen, but it was detault's fault.

@nvanbenschoten

Mind jotting down a toy version of what @jeffrey-xiao is trying to do? I assume it's that we're running a server or even cluster and one of its goroutines is trying to apply a snapshot, which involves, say, two steps, and you want to simulate the case in which step one succeeds and step two fails. Naively we want to simulate a "fatal error", i.e. we want to zap the node and all of its processes dead, which is difficult for the reasons you describe above. But is there a way we can "just" return from the method in a way that leaves the server clearly in an incorrect state, but not incorrect enough to crash on its own, so that you can stop it the regular way?

This is pretty much right. In the current proposal, the ingestion of snapshots will become a multi-step process that needs to be recoverable even after an untimely crash. @jeffrey-xiao is looking to add tests that crash a node in between these steps and then start it back up again to confirm that the snapshot process recovers correctly.

If we're talking about writes to the storage engine, there's also another way you can go about it -- when your failpoint fires, it doesn't try to exit the process. Instead, it takes a RocksDB checkpoint, and then you shut down the server as usual. What's next is probably obvious: you replace the node's data with the checkpoint. You restart it again, and voila, hopefully you've achieved something close to what you were looking for.

This is an interesting approach, although it does seem to require a lot of mocking and it will never correspond 1:1 with a real crash. Do we have any tests that operate in this way today?

@ajwerner and I spent some time talking about this topic this morning. He had initially commented the following in slack:

Another approach we could take for fail points which has a different set of trade offs would be to have the test launch a subprocess that then attaches with delve to the original test and cooperatively drives injection. It has the advantage of being able to read lots of state and there's experimental support for function calls in the remote process (though I'm sure there will be some weird edge cases). Downsides are that the test would need access to a delve binary that's been granted linux capabilities to have permission to do the debugging and that it's probably a weird style of programming. Also there may be places where the compiler optimizations make it less useful and may need to be disabled. The upside is it shouldn't require any changes to the core business logic that impact production performance.

We talked a good bit about what this might look like in practice and how it might hook into the existing unit test infrastructure. We also contrasted this approach to option #2 listed above. In the end, we didn't come out of the discussion feeling particularly good about the prospect of launching and manipulating subprocesses in Go unit tests. The complications here primarily revolve around manipulating these subprocesses and causing them to behave in a certain way.

We came to the conclusion that if we're going to perform out-of-process coordination then we might as well invest further in our acceptance-level testing. Specifically, we should invest more in acceptance-level roachtests that are meant to be run locally (like acceptance/rapid-restart). These appear to be the lowest-level tests that work at the level of cockroach processes, and it doesn't really make sense to talk about a process crash without having cockroach nodes in their own processes.

The question then becomes how we inject these failures into running cockroach processes in these acceptance-level roachtests. It would be possible to use delve for that, although I suspect that doing so would be fairly heavyweight and hard to maintain. My proposal is to extend this library with environment variable support: cockroach processes built with the right build tag would activate fail points based on the value of this environment variable. In practice, this would mean that a roachtest that wants to crash a process at a certain point would simply start it up with startArgs("--env=COCKROACH_FAILPOINTS=fail-point-name"). Interestingly, this also seems to be what fail-rs does.
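
A minimal sketch of what that environment-variable hook could look like; ActivateFromEnv, the parsing, and the eventsByName registry are assumptions about how the extension might be written, not part of the current package:

// pkg/util/fail/env.go (compiled only with the fail point build tag)
package fail

import (
	"os"
	"strings"
)

// ActivateFromEnv activates every fail point named in the comma-separated
// COCKROACH_FAILPOINTS environment variable. It would be called once during
// process startup.
func ActivateFromEnv() {
	v := os.Getenv("COCKROACH_FAILPOINTS")
	if v == "" {
		return
	}
	for _, name := range strings.Split(v, ",") {
		// eventsByName is a hypothetical registry mapping fail point names
		// to their Event values.
		if ev, ok := eventsByName[strings.TrimSpace(name)]; ok {
			Activate(ev)
		}
	}
}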

So to bring this all back to the testing that @jeffrey-xiao wants to perform, here are the suite of tests that I would propose:

  1. a few unit tests that work similarly to TestStoreInitAndBootstrap. They would avoid any attempt at actually crashing a node, and would instead synthesize the state of a node that crashed in one of these intermediate states by writing to an in-memory engine (and an implementation of the snapshot SST directory) before starting up a Store. They would then assert that the startup process behaves correctly and completes the snapshot process.
  2. a few acceptance/snapshot-crash-... roachtests that use the env var extension to this fail point library. They would launch cockroach processes with specific fail points enabled and would allow nodes to crash at the desired points. They would then reboot these nodes and assert that the raft groups correctly recover.

tbg commented Jul 18, 2019

This is an interesting approach, although it does seem to require a lot of mocking and it will never correspond 1:1 with a real crash. Do we have any tests that operate in this way today?

The mocking seems pretty straightforward, doesn't even need the failpoint library (a testing knob injecting a func that gets to see the engine in the right place is enough), but no, I don't think we do this today. I'm a bit surprised that you consider it complicated in the first place. Is there something you think I'm missing?

In the end, we didn't come out of the discussion feeling particularly good about the prospect of launching and manipulating subprocesses in Go unit tests

I agree.

We came to the conclusion that if we're going to perform out-of-process coordination then we might as well invest further in our acceptance-level testing.

To me, this feels a bit like making perfect and slow the enemy of good and fast. You only get the ultimate "crash" testing when you actually crash (and do it on all possible file systems, etc.), so there is a reason for doing acceptance-level testing of this feature and I welcome that. But it will be several orders of magnitude slower to run this test and it will be much less accessible to our tooling (think attaching a debugger), which will really hurt if it finds anything that the basic test would've also caught. And on a more fundamental level, if we can exercise something that we're concerned about via a unit test to any appreciable extent in a fairly clean way, it seems like a bummer not to do so.

If I'm not missing complexity in setting this up unit-test style via a testing knob or failpoint, would you (both) be up for trying that out (as a stepping stone for the roachtest, not a replacement)? More than happy to chip in when there's a snag.

@nvanbenschoten

If I'm not missing complexity in setting this up unit-test style via a testing knob or failpoint, would you (both) be up for trying that out (as a stepping stone for the roachtest, not a replacement)? More than happy to chip in when there's a snag.

@jeffrey-xiao and I just talked about this. We're going to try out the strategy of snapshotting the RocksDB engine and SST storage directory in a testing knob and bringing the node back up from this state after allowing it to shut down cleanly. Hopefully, this style of testing will catch on. Thanks for the idea!

tbg commented Jul 22, 2019 via email

@tbg added the X-noremind label (Bots won't notify about PRs with X-noremind) on May 6, 2021