
circuit: add probing-based circuit breaker #73641

Merged
merged 1 commit into from
Dec 15, 2021

Conversation

tbg
Member

@tbg tbg commented Dec 9, 2021

These are initially for use in #71806 but were originally conceived of
for #70485, which we are not currently prioritizing.

Importantly, this circuit breaker does not recruit a fraction of
requests to do the probing, which is desirable for both PR #71806 and
PR #70485; requests recruited as probes tend to incur high latency and
errors, and we don't want SQL client traffic to experience those.

Release note: None

@tbg tbg requested a review from erikgrinaker December 9, 2021 15:34
@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg force-pushed the circ-breaker-lib branch 2 times, most recently from 77d948a to 046008f Compare December 9, 2021 16:15
@tbg tbg requested a review from a team as a code owner December 9, 2021 16:15
@tbg tbg force-pushed the circ-breaker-lib branch from 046008f to 83fb87b Compare December 9, 2021 17:13
Contributor

@erikgrinaker erikgrinaker left a comment


Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)


-- commits, line 5 at r1:
Consider a link to the relevant issue(s) too.


pkg/util/circuit/circuitbreaker.go, line 77 at r1 (raw file):

// Signal is low-overhead and thus suitable for use in hot code paths.
func (b *Breaker) Signal() (chan struct{}, func() error) {
	return b.errAndCh().ch, b.definitelyErr

If the caller has a need to see the tripping error, shouldn't we make sure to propagate the correct one?


pkg/util/circuit/circuitbreaker.go, line 91 at r1 (raw file):

// already tripped), the error that would be returned from subsequent calls to
// Err() is returned.
func (b *Breaker) Report(err error) error {

nit: maybe give the return value a name here, to clarify that, unusually, an error response indicates success.

Why do we even return the error here? Looks like there's only a single test call that checks it. Do we expect anyone to care about it?


pkg/util/circuit/circuitbreaker.go, line 106 at r1 (raw file):

	}
	opts := b.Opts()
	err = opts.ShouldTrip(err)

nit: let's allow this to be unspecified too.


pkg/util/circuit/circuitbreaker.go, line 220 at r1 (raw file):

// from Breaker.Err(), i.e. `errors.Is(err, ErrBreakerOpen())` can be
// used to check whether an error originated from some Breaker.
func ErrBreakerOpen() error {

nit: why is this a function, rather than exporting errBreakerOpen?


pkg/util/circuit/circuitbreaker.go, line 224 at r1 (raw file):

}

type errAndCh struct {

nit: since we now protect this with a mutex, I don't think there's any point in having this type anymore? We can just use separate fields in Breaker.


pkg/util/circuit/options.go, line 18 at r1 (raw file):

type Options struct {
	// Name is the name of a Breaker and will be mentioned in the errors
	// that particular Breaker generates.

nit: "that a particular"


pkg/util/circuit/options.go, line 24 at r3 (raw file):

	// `Breaker.Report`. In particular, it can return nil to ignore certain
	// errors.
	ShouldTrip func(err error) error

nit: "should" implies a boolean response, and the error semantics here are unusual. Consider clarifying the name, or using two separate functions: one to decide whether to trip the breaker, and one to annotate the error.

@tbg tbg force-pushed the circ-breaker-lib branch from 83fb87b to 7c26c5e Compare December 10, 2021 14:44
@tbg tbg requested a review from erikgrinaker December 10, 2021 14:45
Member Author

@tbg tbg left a comment


Thanks for the review! I made changes that I would call substantial and I'm pretty happy with the result.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/util/circuit/circuitbreaker.go, line 91 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: maybe give the return value a name here, to clarify that, unusually, an error response indicates success.

Why do we even return the error here? Looks like there's only a single test call that checks it. Do we expect anyone to care about it?

The original motivation for returning the error was so that the caller could log the annotated error. But that goes the wrong way: we don't want to make it look like the caller failed on the breaker, so it is better to log the un-wrapped error. Removed the error return, which also means the linter won't yell at us for not handling it.


pkg/util/circuit/circuitbreaker.go, line 220 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: why is this a function, rather than exporting errBreakerOpen?

So that you can't do circuit.ErrBreakerOpen = something. I realize there's plenty of precedent for ignoring that "problem". Switched to var ErrBreakerOpen.


pkg/util/circuit/circuitbreaker.go, line 224 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: since we now protect this with a mutex, I don't think there's any point in having this type anymore? We can just use separate fields in Breaker.

This was initially left-over from the unsafe.Pointer approach, but in the meantime I have removed this type and found that I needed to re-add it as it seems like the best way to get error stability from Signal() as well as avoiding allocations in that path.


pkg/util/circuit/options.go, line 24 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: "should" implies a boolean response, and the error semantics here are unusual. Consider clarifying the name, or using two separate functions: one to decide whether to trip the breaker, and one to annotate the error.

I removed this option altogether, as I don't anticipate a need for it (and if we ever do, easy enough to add back).

@tbg tbg force-pushed the circ-breaker-lib branch from 7c26c5e to 5bc60ff Compare December 10, 2021 14:46
Contributor

@erikgrinaker erikgrinaker left a comment


Nice, I really like the changes!

Reviewed 30 of 30 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/util/circuit/circuitbreaker.go, line 114 at r4 (raw file):

	// We get to write the error since we have exclusive access via b.mu.
	b.mu.errAndCh.err = storeErr
	close(b.mu.errAndCh.ch)

I had to read this a couple of times before I got why this closes the initial channel on the first error. Maybe add a comment clarifying how only on the nil -> !nil transition do we close the channel that actually has any waiters. After that, we close it pre-emptively for later callers.


pkg/util/circuit/circuitbreaker.go, line 115 at r4 (raw file):

	b.mu.errAndCh.err = storeErr
	close(b.mu.errAndCh.ch)
	b.mu.Unlock()

nano-optimization (?): we can unlock before closing.


pkg/util/circuit/circuitbreaker.go, line 213 at r4 (raw file):

	select {
	case <-eac.ch:
		eac.maybeTriggerProbe()

Why are we triggering the probe here?

Contributor

@erikgrinaker erikgrinaker left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/util/circuit/circuitbreaker.go, line 115 at r4 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nano-optimization (?): we can unlock before closing.

Actually, we clearly can't without another reference to the channel. Doesn't seem worth it, nevermind.

@tbg tbg force-pushed the circ-breaker-lib branch from 5bc60ff to 57745fe Compare December 14, 2021 20:52
These are initially for use in cockroachdb#71806 but were originally conceived of
for cockroachdb#70485, which we are not currently prioritizing.

Importantly, this circuit breaker does not recruit a fraction of
requests to do the probing, which is desirable for both PR cockroachdb#71806 and
PR cockroachdb#70485; requests recruited as probes tend to incur high latency and
errors, and we don't want SQL client traffic to experience those.

Touches cockroachdb#33007.

Release note: None
@tbg tbg force-pushed the circ-breaker-lib branch from 57745fe to eed8603 Compare December 14, 2021 21:14
@tbg tbg requested a review from erikgrinaker December 14, 2021 21:14
Member Author

@tbg tbg left a comment


TFTR! Will merge on green (tomorrow).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/util/circuit/circuitbreaker.go, line 213 at r4 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Why are we triggering the probe here?

I wrote more comments about how this works in general. Basically, when a probe has returned and failed to heal the breaker, we're going to launch the next one only reactively, after the next client bounces off the circuit breaker (which means a call to Err() that returns non-nil). I think this behavior makes sense (if a probe wants to keep trying for 10 minutes, it can do that rather easily), added a test for it as well.

@tbg
Member Author

tbg commented Dec 14, 2021

... or today..

bors r=erikgrinaker

@craig
Contributor

craig bot commented Dec 15, 2021

Build succeeded:

@craig craig bot merged commit acf40e9 into cockroachdb:master Dec 15, 2021
tbg added a commit to tbg/cockroach that referenced this pull request Jan 11, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within `base.SlowRequestThreshold` (15s at time of
writing), the breaker is tripped, cancelling all inflight and future
requests until the breaker heals. Perhaps surprisingly, the existing
"slowness" detection (the informational "have been waiting ... for
proposal" warning in `executeWriteBatch`) was moved deeper into
replication (`refreshProposalsLocked`), where it now trips the breaker.
This has the advantage of providing a unified path for lease requests
(which don't go through `executeWriteBatch`) and pipelined writes (which
return before waiting on the inflight replication process). To make this
work, we need to pick a little fight with how leases are (not)
refreshed (cockroachdb#74711) and we need to remember the ticks at which a proposal
was first inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, all of this is opt-in via the
`kv.replica_circuit_breakers.enabled` cluster setting while we determine
which of the many follow-up items to target for 22.1 (see cockroachdb#74705).
Primarily though, the breaker-sensitive context cancelation is a toy
implementation that comes with too much of a performance overhead (one
extra goroutine per request); cockroachdb#74707 will address that.

Other things not done here that we certainly want in the 22.1 release
are

- UI work (cockroachdb#74713)
- Metrics (cockroachdb#74505)

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Jan 11, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within `base.SlowRequestThreshold` (15s at time of
writing), the breaker is tripped, cancelling all inflight and future
requests until the breaker heals. Perhaps surprisingly, the existing
"slowness" detection (the informational "have been waiting ... for
proposal" warning in `executeWriteBatch`) was moved deeper into
replication (`refreshProposalsLocked`), where it now trips the breaker.
This has the advantage of providing a unified path for lease requests
(which don't go through `executeWriteBatch`) and pipelined writes (which
return before waiting on the inflight replication process). To make this
work, we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, all of this is opt-in via the
`kv.replica_circuit_breakers.enabled` cluster setting while we determine
which of the many follow-up items to target for 22.1 (see cockroachdb#74705).  For
example, the breaker-sensitive context cancelation is a toy
implementation that comes with too much of a performance overhead (one
extra goroutine per request); cockroachdb#74707 will address that.

Other things not done here that we certainly want in the 22.1 release
are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Jan 13, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within the timeout specified by the
`kv.replica_circuit_breaker.slow_replication_threshold` cluster setting,
the breaker is tripped, cancelling all inflight and future requests
until the breaker heals. Perhaps surprisingly, the existing "slowness"
detection (the informational "have been waiting ... for proposal"
warning in `executeWriteBatch`) was moved deeper into replication
(`refreshProposalsLocked`), where it now trips the breaker.  This has
the advantage of providing a unified path for lease requests (which
don't go through `executeWriteBatch`) and pipelined writes (which return
before waiting on the inflight replication process). To make this work,
we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the
entire mechanism) until some necessary follow-up items have been
addressed (see cockroachdb#74705). For example, the breaker-sensitive context
cancelation is a toy implementation that comes with too much of a
performance overhead (one extra goroutine per request); cockroachdb#74707 will
address that.  Other things not done here that we certainly want in the
22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Jan 13, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within the timeout specified by the
`kv.replica_circuit_breaker.slow_replication_threshold` cluster setting,
the breaker is tripped, cancelling all inflight and future requests
until the breaker heals. Perhaps surprisingly, the existing "slowness"
detection (the informational "have been waiting ... for proposal"
warning in `executeWriteBatch`) was moved deeper into replication
(`refreshProposalsLocked`), where it now trips the breaker.  This has
the advantage of providing a unified path for lease requests (which
don't go through `executeWriteBatch`) and pipelined writes (which return
before waiting on the inflight replication process). To make this work,
we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

Perhaps surprisingly, when the breaker is tripped, *all* traffic to
the Replica gets the fail-fast behavior, not just mutations. This is
because even though a read may look good to serve based on the lease,
we would also need to check for latches, and in particular we would
need to fail-fast if there is a transitive dependency on any write
(since that write is presumably stuck). This is not trivial and so
we don't do it in this first iteration (see cockroachdb#74799).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the
entire mechanism) until some necessary follow-up items have been
addressed (see cockroachdb#74705). For example, the breaker-sensitive context
cancelation is a toy implementation that comes with too much of a
performance overhead (one extra goroutine per request); cockroachdb#74707 will
address that.  Other things not done here that we certainly want in the
22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None

touchie
craig bot pushed a commit that referenced this pull request Jan 14, 2022
71806: kvserver: circuit-break requests to unavailable ranges r=erikgrinaker a=tbg

Fixes #33007.
Closes #61311.

This PR uses the circuit breaker package introduced in #73641 to address
issue #33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within `base.SlowRequestThreshold` (15s at time of
writing), the breaker is tripped, cancelling all inflight and future
requests until the breaker heals. Perhaps surprisingly, the existing
"slowness" detection (the informational "have been waiting ... for
proposal" warning in `executeWriteBatch`) was moved deeper into
replication (`refreshProposalsLocked`), where it now trips the breaker.
This has the advantage of providing a unified path for lease requests
(which don't go through `executeWriteBatch`) and pipelined writes (which
return before waiting on the inflight replication process). To make this
work, we need to pick a little fight with how leases are (not) refreshed
(#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in #72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, all of this is opt-in via the
`kv.replica_circuit_breakers.enabled` cluster setting while we determine
which of the many follow-up items to target for 22.1 (see #74705).  For
example, the breaker-sensitive context cancelation is a toy
implementation that comes with too much of a performance overhead (one
extra goroutine per request); #74707 will address that.

Other things not done here that we certainly want in the 22.1 release
are UI work (#74713) and metrics (#74505).

The release note is deferred to #74705 (where breakers are turned on).

Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>