
circuit: add probing-based circuit breaker #73641

Merged
merged 1 commit into from
Dec 15, 2021

Conversation

tbg
Member

@tbg tbg commented Dec 9, 2021

These are initially for use in #71806 but were originally conceived of
for #70485, which we are not currently prioritizing.

Importantly, this circuit breaker does not recruit a fraction of
requests to do the probing, which is desirable for both PR #71806 and
PR #70485; requests recruited as probes tend to incur high latency and
errors, and we don't want SQL client traffic to experience those.

Release note: None

@tbg tbg requested a review from erikgrinaker December 9, 2021 15:34
@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg force-pushed the circ-breaker-lib branch 2 times, most recently from 77d948a to 046008f Compare December 9, 2021 16:15
@tbg tbg requested a review from a team as a code owner December 9, 2021 16:15
@tbg tbg force-pushed the circ-breaker-lib branch from 046008f to 83fb87b Compare December 9, 2021 17:13
Contributor

@erikgrinaker erikgrinaker left a comment


Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)


-- commits, line 5 at r1:
Consider a link to the relevant issue(s) too.


pkg/util/circuit/circuitbreaker.go, line 77 at r1 (raw file):

// Signal is low-overhead and thus suitable for use in hot code paths.
func (b *Breaker) Signal() (chan struct{}, func() error) {
	return b.errAndCh().ch, b.definitelyErr

If the caller has a need to see the tripping error, shouldn't we make sure to propagate the correct one?


pkg/util/circuit/circuitbreaker.go, line 91 at r1 (raw file):

// already tripped), the error that would be returned from subsequent calls to
// Err() is returned.
func (b *Breaker) Report(err error) error {

nit: maybe give the return value a name here, to clarify that, unusually, an error response indicates success.

Why do we even return the error here? Looks like there's only a single test call that checks it. Do we expect anyone to care about it?


pkg/util/circuit/circuitbreaker.go, line 106 at r1 (raw file):

	}
	opts := b.Opts()
	err = opts.ShouldTrip(err)

nit: let's allow this to be unspecified too.


pkg/util/circuit/circuitbreaker.go, line 220 at r1 (raw file):

// from Breaker.Err(), i.e. `errors.Is(err, ErrBreakerOpen())` can be
// used to check whether an error originated from some Breaker.
func ErrBreakerOpen() error {

nit: why is this a function, rather than exporting errBreakerOpen?


pkg/util/circuit/circuitbreaker.go, line 224 at r1 (raw file):

}

type errAndCh struct {

nit: since we now protect this with a mutex, I don't think there's any point in having this type anymore? We can just use separate fields in Breaker.


pkg/util/circuit/options.go, line 18 at r1 (raw file):

type Options struct {
	// Name is the name of a Breaker and will be mentioned in the errors
	// that particular Breaker generates.

nit: "that a particular"


pkg/util/circuit/options.go, line 24 at r3 (raw file):

	// `Breaker.Report`. In particular, it can return nil to ignore certain
	// errors.
	ShouldTrip func(err error) error

nit: "should" implies a boolean response, and the error semantics here are unusual. Consider clarifying the name, or using two separate functions: one to decide whether to trip the breaker, and one to annotate the error.

@tbg tbg force-pushed the circ-breaker-lib branch from 83fb87b to 7c26c5e Compare December 10, 2021 14:44
@tbg tbg requested a review from erikgrinaker December 10, 2021 14:45
Member Author

@tbg tbg left a comment


Thanks for the review! I made changes that I would call substantial and I'm pretty happy with the result.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/util/circuit/circuitbreaker.go, line 91 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: maybe give the return value a name here, to clarify that, unusually, an error response indicates success.

Why do we even return the error here? Looks like there's only a single test call that checks it. Do we expect anyone to care about it?

The original motivation for returning the error was so that the caller could log the annotated error. But that goes the wrong way: we don't want to make it look like the caller failed on the breaker, so it is better to log the un-wrapped error. Removed the error return, which also means the linter won't yell at us for not handling it.


pkg/util/circuit/circuitbreaker.go, line 220 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: why is this a function, rather than exporting errBreakerOpen?

So that you can't do circuit.ErrBreakerOpen = something. I realize there's plenty of precedent for ignoring that "problem". Switched to var ErrBreakerOpen.


pkg/util/circuit/circuitbreaker.go, line 224 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: since we now protect this with a mutex, I don't think there's any point in having this type anymore? We can just use separate fields in Breaker.

This was initially left-over from the unsafe.Pointer approach, but in the meantime I have removed this type and found that I needed to re-add it as it seems like the best way to get error stability from Signal() as well as avoiding allocations in that path.


pkg/util/circuit/options.go, line 24 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: "should" implies a boolean response, and the error semantics here are unusual. Consider clarifying the name, or using two separate functions: one to decide whether to trip the breaker, and one to annotate the error.

I removed this option altogether, as I don't anticipate a need for it (and if we ever do, easy enough to add back).

@tbg tbg force-pushed the circ-breaker-lib branch from 7c26c5e to 5bc60ff Compare December 10, 2021 14:46
Contributor

@erikgrinaker erikgrinaker left a comment


Nice, I really like the changes!

Reviewed 30 of 30 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/util/circuit/circuitbreaker.go, line 114 at r4 (raw file):

	// We get to write the error since we have exclusive access via b.mu.
	b.mu.errAndCh.err = storeErr
	close(b.mu.errAndCh.ch)

I had to read this a couple of times before I got why this closes the initial channel on the first error. Maybe add a comment clarifying how only on the nil -> !nil transition do we close the channel that actually has any waiters. After that, we close it pre-emptively for later callers.


pkg/util/circuit/circuitbreaker.go, line 115 at r4 (raw file):

	b.mu.errAndCh.err = storeErr
	close(b.mu.errAndCh.ch)
	b.mu.Unlock()

nano-optimization (?): we can unlock before closing.


pkg/util/circuit/circuitbreaker.go, line 213 at r4 (raw file):

	select {
	case <-eac.ch:
		eac.maybeTriggerProbe()

Why are we triggering the probe here?

Contributor

@erikgrinaker erikgrinaker left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/util/circuit/circuitbreaker.go, line 115 at r4 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nano-optimization (?): we can unlock before closing.

Actually, we clearly can't without another reference to the channel. Doesn't seem worth it, nevermind.

@tbg tbg force-pushed the circ-breaker-lib branch from 5bc60ff to 57745fe Compare December 14, 2021 20:52
These are initially for use in cockroachdb#71806 but were originally conceived of
for cockroachdb#70485, which we are not currently prioritizing.

Importantly, this circuit breaker does not recruit a fraction of
requests to do the probing, which is desirable for both PR cockroachdb#71806 and
PR cockroachdb#70485; requests recruited as probes tend to incur high latency and
errors, and we don't want SQL client traffic to experience those.

Touches cockroachdb#33007.

Release note: None
@tbg tbg force-pushed the circ-breaker-lib branch from 57745fe to eed8603 Compare December 14, 2021 21:14
@tbg tbg requested a review from erikgrinaker December 14, 2021 21:14
Member Author

@tbg tbg left a comment


TFTR! Will merge on green (tomorrow).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/util/circuit/circuitbreaker.go, line 213 at r4 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Why are we triggering the probe here?

I wrote more comments about how this works in general. Basically, when a probe has returned and failed to heal the breaker, we're going to launch the next one only reactively, after the next client bounces off the circuit breaker (which means a call to Err() that returns non-nil). I think this behavior makes sense (if a probe wants to keep trying for 10 minutes, it can do that rather easily), added a test for it as well.

@tbg
Member Author

tbg commented Dec 14, 2021

... or today..

bors r=erikgrinaker

@craig
Contributor

craig bot commented Dec 15, 2021

Build succeeded:

@craig craig bot merged commit acf40e9 into cockroachdb:master Dec 15, 2021
tbg added a commit to tbg/cockroach that referenced this pull request Jan 11, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within `base.SlowRequestThreshold` (15s at time of
writing), the breaker is tripped, cancelling all inflight and future
requests until the breaker heals. Perhaps surprisingly, the existing
"slowness" detection (the informational "have been waiting ... for
proposal" warning in `executeWriteBatch`) was moved deeper into
replication (`refreshProposalsLocked`), where it now trips the breaker.
This has the advantage of providing a unified path for lease requests
(which don't go through `executeWriteBatch`) and pipelined writes (which
return before waiting on the inflight replication process). To make this
work, we need to pick a little fight with how leases are (not)
refreshed (cockroachdb#74711) and we need to remember the ticks at which a proposal
was first inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, all of this is opt-in via the
`kv.replica_circuit_breakers.enabled` cluster setting while we determine
which of the many follow-up items to target for 22.1 (see cockroachdb#74705).
Primarily though, the breaker-sensitive context cancelation is a toy
implementation that comes with too much of a performance overhead (one
extra goroutine per request); cockroachdb#74707 will address that.

Other things not done here that we certainly want in the 22.1 release
are

- UI work (cockroachdb#74713)
- Metrics (cockroachdb#74505)

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Jan 11, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within `base.SlowRequestThreshold` (15s at time of
writing), the breaker is tripped, cancelling all inflight and future
requests until the breaker heals. Perhaps surprisingly, the existing
"slowness" detection (the informational "have been waiting ... for
proposal" warning in `executeWriteBatch`) was moved deeper into
replication (`refreshProposalsLocked`), where it now trips the breaker.
This has the advantage of providing a unified path for lease requests
(which don't go through `executeWriteBatch`) and pipelined writes (which
return before waiting on the inflight replication process). To make this
work, we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, all of this is opt-in via the
`kv.replica_circuit_breakers.enabled` cluster setting while we determine
which of the many follow-up items to target for 22.1 (see cockroachdb#74705).  For
example, the breaker-sensitive context cancelation is a toy
implementation that comes with too much of a performance overhead (one
extra goroutine per request); cockroachdb#74707 will address that.

Other things not done here that we certainly want in the 22.1 release
are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Jan 13, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within the timeout specified by the
`kv.replica_circuit_breaker.slow_replication_threshold` cluster setting,
the breaker is tripped, cancelling all inflight and future requests
until the breaker heals. Perhaps surprisingly, the existing "slowness"
detection (the informational "have been waiting ... for proposal"
warning in `executeWriteBatch`) was moved deeper into replication
(`refreshProposalsLocked`), where it now trips the breaker.  This has
the advantage of providing a unified path for lease requests (which
don't go through `executeWriteBatch`) and pipelined writes (which return
before waiting on the inflight replication process). To make this work,
we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the
entire mechanism) until some necessary follow-up items have been
addressed (see cockroachdb#74705). For example, the breaker-sensitive context
cancelation is a toy implementation that comes with too much of a
performance overhead (one extra goroutine per request); cockroachdb#74707 will
address that.  Other things not done here that we certainly want in the
22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Jan 13, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within the timeout specified by the
`kv.replica_circuit_breaker.slow_replication_threshold` cluster setting,
the breaker is tripped, cancelling all inflight and future requests
until the breaker heals. Perhaps surprisingly, the existing "slowness"
detection (the informational "have been waiting ... for proposal"
warning in `executeWriteBatch`) was moved deeper into replication
(`refreshProposalsLocked`), where it now trips the breaker.  This has
the advantage of providing a unified path for lease requests (which
don't go through `executeWriteBatch`) and pipelined writes (which return
before waiting on the inflight replication process). To make this work,
we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

Perhaps surprisingly, when the breaker is tripped, *all* traffic to
the Replica gets the fail-fast behavior, not just mutations. This is
because even though a read may look good to serve based on the lease,
we would also need to check for latches, and in particular we would
need to fail-fast if there is a transitive dependency on any write
(since that write is presumably stuck). This is not trivial and so
we don't do it in this first iteration (see cockroachdb#74799).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the
entire mechanism) until some necessary follow-up items have been
addressed (see cockroachdb#74705). For example, the breaker-sensitive context
cancelation is a toy implementation that comes with too much of a
performance overhead (one extra goroutine per request); cockroachdb#74707 will
address that.  Other things not done here that we certainly want in the
22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None

touchie
craig bot pushed a commit that referenced this pull request Jan 14, 2022
71806: kvserver: circuit-break requests to unavailable ranges r=erikgrinaker a=tbg

Fixes #33007.
Closes #61311.

This PR uses the circuit breaker package introduced in #73641 to address
issue #33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within `base.SlowRequestThreshold` (15s at time of
writing), the breaker is tripped, cancelling all inflight and future
requests until the breaker heals. Perhaps surprisingly, the existing
"slowness" detection (the informational "have been waiting ... for
proposal" warning in `executeWriteBatch`) was moved deeper into
replication (`refreshProposalsLocked`), where it now trips the breaker.
This has the advantage of providing a unified path for lease requests
(which don't go through `executeWriteBatch`) and pipelined writes (which
return before waiting on the inflight replication process). To make this
work, we need to pick a little fight with how leases are (not) refreshed
(#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in #72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, all of this is opt-in via the
`kv.replica_circuit_breakers.enabled` cluster setting while we determine
which of the many follow-up items to target for 22.1 (see #74705).  For
example, the breaker-sensitive context cancelation is a toy
implementation that comes with too much of a performance overhead (one
extra goroutine per request); #74707 will address that.

Other things not done here that we certainly want in the 22.1 release
are UI work (#74713) and metrics (#74505).

The release note is deferred to #74705 (where breakers are turned on).

Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>