[DNM] util/fail: create new fail point package #38953
Conversation
This has been unused since 0d071f6. Release note: None
Package fail provides a fail point implementation for testing-only failure injection.

Fail points are code instrumentations that allow errors and other behavior to be injected dynamically at runtime, primarily for testing purposes. Fail points are flexible and can be configured to exhibit a variety of behaviors, including returning errors, panicking, and sleeping.

Fail points have a few unique benefits when compared to the standard testing hook pattern. To start, adding a new fail point requires significantly less code than adding a new testing hook. To add a new fail point, simply create a new Event in event.go and trigger the fail point where desired by calling fail.Point. Unit tests can then activate this new failure point by calling fail.Activate. This combines to make fail points lightweight and easy to use, and it compares favorably to the effort needed to install a new testing hook, which often requires defining new types, threading testing knob dependencies through production code, and avoiding package dependency cycles.

More importantly, fail points have no associated runtime cost in production builds of the code. Fail points use conditional compilation, by way of build tags, to optimize away to nothing when not enabled. Only when enabled do the calls to fail.Point have any effect. This means that users of the library are free to scatter fail points in performance-critical code paths without fear of affecting performance.

This package is inspired by FreeBSD's failpoints: https://freebsd.org/cgi/man.cgi?query=fail

Release note: None
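To make the described workflow concrete, here is a minimal sketch; the SnapshotIngest event, the exact signatures of fail.Point and fail.Activate, and the Deactivate helper are assumptions for illustration, not necessarily the package's actual API:

```go
// event.go (in the fail package): fail points are identified by Events.
const (
	_ Event = iota
	// SnapshotIngest is a hypothetical event used here for illustration.
	SnapshotIngest
)

// Production code: trigger the fail point. With the build tag off, this
// call compiles away to nothing.
func ingestSnapshot() error {
	if err := fail.Point(fail.SnapshotIngest); err != nil {
		return err
	}
	// ... the real multi-step ingestion ...
	return nil
}

// Unit test: activate the fail point so that fail.Point reports an error.
func TestIngestSnapshotFailure(t *testing.T) {
	fail.Activate(fail.SnapshotIngest)
	defer fail.Deactivate(fail.SnapshotIngest)

	if err := ingestSnapshot(); err == nil {
		t.Fatal("expected injected failure")
	}
}
```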
Unit test failpoints will necessarily involve some compromise between simplicity and applicability. Perhaps what #38932 is trying to do is pushing the limits of what can be achieved in-process, but that doesn't need to imply that the whole approach isn't worth it. I'd like to stay away from options two (for now; it might be useful enough at some point to handle the out-of-scope cases, but it will require its own infra) and three (forever, starting now).

Mind jotting down a toy version of what @jeffrey-xiao is trying to do? I assume it's that we're running a server or even a cluster, one of its goroutines is trying to apply a snapshot, which involves, say, two steps, and you want to simulate the case in which step one succeeds and step two fails. Naively we want to simulate a "fatal error", i.e. we want to zap the node and all of its processes dead, which is difficult for the reasons you describe above. But is there a way we can "just" return from the method in a way that leaves the server in a clearly incorrect state, though not incorrect enough to crash on its own, so that you can stop it the regular way?

If we're talking about writes to the storage engine, there's also another way you can go about it: when your failpoint fires, it doesn't try to exit the process. Instead, it takes a RocksDB checkpoint, and then you shut down the server as usual. What's next is probably obvious: you replace the node's data with the checkpoint, restart it, and voila, hopefully you've achieved something close to what you were looking for. Since a desire to crash the node will often correspond to wanting to provoke some sort of partial disk state, I suspect this strategy will get us a long way.
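A toy sketch of that checkpoint idea, assuming a hypothetical testing knob and an engine interface with a Checkpoint method (RocksDB checkpoints are real, but none of these Go names are CockroachDB's actual APIs):

```go
// Engine stands in for whatever the storage engine exposes; RocksDB can
// take cheap point-in-time checkpoints of its on-disk state.
type Engine interface {
	Checkpoint(dir string) error
}

// StoreTestingKnobs is a hypothetical knob invoked between the two steps
// of applying a snapshot, so the test sees the engine at exactly that point.
type StoreTestingKnobs struct {
	OnSnapshotApplyStep func(eng Engine) error
}

func makeCrashSimulationKnobs(checkpointDir string) StoreTestingKnobs {
	return StoreTestingKnobs{
		OnSnapshotApplyStep: func(eng Engine) error {
			// Capture the disk state as of "step one succeeded, step two
			// hasn't run". The test then stops the server cleanly, swaps
			// the data directory for the checkpoint, and restarts the
			// node, approximating a crash between the two steps.
			return eng.Checkpoint(checkpointDir)
		},
	}
}
```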
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/util/fail/event.go, line 23 at r2 (raw file):
```go
_ Event = iota
```
This approach is simple, but there are some downsides.
- the events are hardcoded in the lib, so it will be hard to open source this library (which I hope we do; there isn't a good substitute for it out there and I'd miss this whenever I work on anything else). Not saying that it's worth throwing it out there right now, but it'd be nice to be able to pull it in externally (yes, I know the license, but still).
- it's too global, i.e. failpoints can easily leak from one test to the next
I think we answered both questions in our cluster settings already and conveniently we already plumb those to our code (so we can add something that gets plumbed inside it). I'd follow a very similar pattern here, where packages define their failpoints, except we wouldn't key on a name, but on a type:
```go
// storage_test.go
func TestFoo(t *testing.T) {
	cfg := makeCfg()
	cfg.fail.Set(FailpointRaftApplyEntry{}, func(...interface{}) error {
		return errors.New("boom")
	})
	doStuff(cfg)
	if cfg.fail.NumCalls(FailpointRaftApplyEntry{}) == 0 {
		t.Fatal("failpoint not called?!") // maybe `defer cfg.fail.AssertAllCalled(t)` instead
	}
}

// storage.go
type FailpointRaftApplyEntry struct{}

func init() {
	fail.Register(FailpointRaftApplyEntry{}) // basically how context.Value works
}

func (s *Store) productionCode() error {
	if f := s.cfg.fail.Enabled(FailpointRaftApplyEntry{}); f != nil {
		if err := f("some", "relevant", "input"); err != nil {
			return err
		}
	}
	// ... the actual work ...
	return nil
}
```
We'd have to godbolt it, but couldn't we even go for:

```go
func (s *Store) productionCode() error {
	if err := s.cfg.fail.Call(FailpointRaftApplyEntry{}, "some", "relevant", "input"); err != nil {
		return err
	}
	// ... the actual work ...
	return nil
}
```
fail.Call would be returning a static no-op func, which hopefully could also be optimized out? Hmm, that's a bummer; I just tried and it doesn't look like it:
```go
package main

func Call(typ interface{}, args ...interface{}) bool {
	return false
}

func main() {
	i := 5
	j := 7
	Call(i, i, j)
}
```
Here i and j escape. With an explicit guard, they do not:
```go
func main() {
	i := 5
	j := 7
	if false {
		Call(i, i, j)
	}
}
```
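For comparison, the build-tag scheme from the package description provides exactly that explicit guard as a compile-time constant. A sketch, where the failpoints tag name, the file names, and the Enabled constant are all assumptions rather than the package's actual layout:

```go
// fail_enabled.go (built only with the assumed `failpoints` tag)
// +build failpoints

package fail

// Enabled gates every fail point call site at compile time.
const Enabled = true

// fail_disabled.go (the default, production build)
// +build !failpoints

package fail

const Enabled = false
```

Call sites then guard with `if fail.Enabled { ... }`; in production builds Enabled is the constant false, so the compiler deletes the branch and the arguments no longer escape.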
pkg/util/fail/fail.go, line 45 at r2 (raw file):
```go
type Callback func(...interface{}) error

// Error is an error returned by the detault fail point callback.
```
Thought there was a speck of dust on my screen, but it was detault's fault.
This is pretty much right. In the current proposal, the ingestion of snapshots will become a multi-step process that needs to be recoverable even after an untimely crash. @jeffrey-xiao is looking to add tests that crash a node in between these steps and then start it back up again to confirm that the snapshot process recovers correctly.
This is an interesting approach, although it does seem to require a lot of mocking, and it will never correspond 1:1 with a real crash. Do we have any tests that operate in this way today?

@ajwerner and I spent some time talking about this topic this morning. He had initially commented the following in Slack:
We talked a good bit about what this might look like in practice and how it might hook into the existing unit test infrastructure. We also contrasted this approach to option #2 listed above. In the end, we didn't come out of the discussion feeling particularly good about the prospect of launching and manipulating subprocesses in Go unit tests. The complications here primarily revolve around manipulating these subprocesses and causing them to behave in a certain way.

We came to the conclusion that if we're going to perform out-of-process coordination then we might as well invest further in our acceptance-level testing. Specifically, we should invest more in acceptance-level roachtests that are meant to be run locally (like …). The question then becomes how we inject these failures into running cockroach processes in these acceptance-level roachtests. It would be possible to use delve for that, although I suspect that doing so would be fairly heavyweight and hard to maintain. My proposal is to extend this library with environment variable support (sketched below).

So to bring this all back to the testing that @jeffrey-xiao wants to perform, here is the suite of tests that I would propose:
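To illustrate the environment variable idea from above, a rough sketch; the variable name, the event=behavior syntax, and the lookup helpers are all assumptions:

```go
// in package fail

import (
	"fmt"
	"os"
	"strings"
)

// ActivateFromEnv enables fail points listed in a (hypothetical)
// COCKROACH_FAILPOINTS variable, e.g.
//
//	COCKROACH_FAILPOINTS="SnapshotIngest=error,RaftApplyEntry=panic"
//
// so that acceptance-level roachtests can inject failures into real
// cockroach processes without attaching a debugger.
func ActivateFromEnv() error {
	spec := os.Getenv("COCKROACH_FAILPOINTS")
	if spec == "" {
		return nil
	}
	for _, entry := range strings.Split(spec, ",") {
		parts := strings.SplitN(entry, "=", 2)
		if len(parts) != 2 {
			return fmt.Errorf("malformed fail point spec %q", entry)
		}
		// eventByName and behaviorByName are assumed lookup helpers that
		// map the strings onto the package's Events and callbacks.
		Activate(eventByName(parts[0]), behaviorByName(parts[1]))
	}
	return nil
}
```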
The mocking seems pretty straightforward, doesn't even need the failpoint library (a testing knob injecting a func that gets to see the engine in the right place is enough), but no, I don't think we do this today. I'm a bit surprised that you consider it complicated in the first place. Is there something you think I'm missing?
I agree.
To me, this feels like making perfect and slow the enemy of good and fast. You only get the ultimate "crash" test when you actually crash (and do it on all possible file systems, etc.), so there is a reason for doing acceptance-level testing of this feature, and I welcome that. But it will be several orders of magnitude slower to run that test, and it will be much less accessible to our tooling (think attaching a debugger), which will really hurt if it finds anything that the basic test would also have caught. And on a more fundamental level, if we can exercise something that we're concerned about via a unit test, to any appreciable extent and in a fairly clean way, it seems like a bummer not to do so.

If I'm not missing complexity in setting this up unit-test style via a testing knob or failpoint, would you (both) be up for trying that out (as a stepping stone for the roachtest, not a replacement)? More than happy to chip in when there's a snag.
@jeffrey-xiao and I just talked about this. We're going to try out the strategy of snapshotting the RocksDB engine and SST storage directory in a testing knob and bringing the node back up from this state after allowing it to shut down cleanly. Hopefully, this style of testing will catch on. Thanks for the idea!
Thanks for trying it out! I'll buy both of you an ice cream (looks like anyone in NYC deserves one these days) if it turns out to be a waste of time.
DNM: This doesn't fully solve the problem it was intended to, which is that right now we don't have a good way of testing process failures at very specific moments. For instance, we have no good way for @jeffrey-xiao to test failures between stages of applying a snapshot in #38932. This package allows us to easily inject these failures, but we're not capturing them in such a way that a test can actually assert any useful properties about them. I think it would be possible to catch injected panics in our test harnesses, but right now none of them are set up to do so. I'm not so sure it would be possible to catch injected log.Fatal/os.Exit calls in unit tests, which is really what we'd like to do.

I see a few possible approaches to addressing this:

- … log.Fatal behaves properly in unit tests and we can easily restart these subprocesses.
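On catching injected panics in a test harness, a minimal sketch of what that could look like, assuming the fail point panics with a distinguishable sentinel value; the injectedFailure type and catchInjected helper here are hypothetical:

```go
// injectedFailure is a hypothetical sentinel that an activated fail point
// panics with, letting harnesses tell injected panics from real bugs.
type injectedFailure struct{ event string }

// catchInjected runs fn and reports whether it panicked with an injected
// failure; any other panic is re-raised untouched.
func catchInjected(t *testing.T, fn func()) (caught bool) {
	defer func() {
		if r := recover(); r == nil {
			return
		} else if f, ok := r.(injectedFailure); ok {
			t.Logf("caught injected failure at %s", f.event)
			caught = true
		} else {
			panic(r)
		}
	}()
	fn()
	return false
}
```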