Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc: use async-probing based circuit breakers #70485

Closed
wants to merge 9 commits into from

Conversation

tbg
Copy link
Member

@tbg tbg commented Sep 21, 2021

First commit is separate PR: #70979


See #68419 (comment) for the original discussion.

This commit adds a new circuit package that uses probing-based
circuit breakers. This breaker does not recruit the occasional
request to carry out the probing. Instead, the circuit breaker
is configured with an "asychronous probe" that effectively
determines when the breaker should reset.

We prefer this approach precisely because it avoids recruiting
regular traffic, which is often tied to end-user requests, and
led to inacceptable latencies there.

The potential downside of the probing approach is that the breaker setup
is more complex and there is residual risk of configuring the probe
differently from the actual client requests. In the worst case, the
breaker would be perpetually tripped even though everything should be
fine. This isn't expected - our two uses of circuit breakers are pretty
clear about what they protect - but it is worth mentioning as this
consideration likely influenced the design of the original breaker.

Touches #69888
Touches #70111
Touches #53410

Also, this breaker was designed to be a good fit for:
#33007
which will use the Signal() call.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg force-pushed the better-circuit-breaking branch from 8cf20ca to 200e293 Compare September 21, 2021 12:49
@tbg tbg force-pushed the better-circuit-breaking branch 9 times, most recently from ab1130f to 6865300 Compare September 30, 2021 14:55
It's safe to use a RedactableString as a format argument instead
of a constant string. This teaches the linter that.

Release note: None
@tbg tbg force-pushed the better-circuit-breaking branch 2 times, most recently from e6ebcf8 to 5d6bd24 Compare October 1, 2021 09:35
tbg and others added 8 commits October 1, 2021 13:27
See cockroachdb#68419 (comment) for the original discussion.

This commit adds a new `circuit` package that uses probing-based
circuit breakers. This breaker does *not* recruit the occasional
request to carry out the probing. Instead, the circuit breaker
is configured with an "asychronous probe" that effectively
determines when the breaker should reset.

We prefer this approach precisely because it avoids recruiting
regular traffic, which is often tied to end-user requests, and
led to inacceptable latencies there.

The potential downside of the probing approach is that the breaker setup
is more complex and there is residual risk of configuring the probe
differently from the actual client requests. In the worst case, the
breaker would be perpetually tripped even though everything should be
fine. This isn't expected - our two uses of circuit breakers are pretty
clear about what they protect - but it is worth mentioning as this
consideration likely influenced the design of the original breaker.

Touches cockroachdb#69888
Touches cockroachdb#70111
Touches cockroachdb#53410

Also, this breaker was designed to be a good fit for:
cockroachdb#33007
which will use the `Signal()` call.

Release note: None
Keep the circuit breaker use in one place. Also, make the circuit
breaker's probe actually try to dial.

Release note: None
Release note: None
Release note: None
Release note: None
This test was tripping the breaker but the breaker was configured
to ignore that. Now the test gets what it is asking for.

Release note: None
Give this test circuit breakers that actually trip when the test is
asking them to do so.

Release note: None
@tbg tbg force-pushed the better-circuit-breaking branch from 5d6bd24 to 30e7b0c Compare October 1, 2021 11:30
@tbg tbg marked this pull request as ready for review October 1, 2021 11:30
@tbg tbg requested review from a team as code owners October 1, 2021 11:30
@tbg tbg requested review from erikgrinaker and removed request for a team October 1, 2021 11:30
@tbg
Copy link
Member Author

tbg commented Oct 1, 2021

Eh, I did mess something up here in a last round of changes, but I'd say this is still worth a high-level review. It's fairly polished.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have stared hard at the first commit here that's not from #70979 and I really do not understand why this needs to be a redact.SafeString. What breaks with it being a regular string?

Reviewed 4 of 4 files at r1, 25 of 25 files at r2, 6 of 6 files at r3, 3 of 3 files at r4, 2 of 2 files at r5, 2 of 2 files at r6, 5 of 5 files at r7, 2 of 2 files at r8, 1 of 1 files at r9.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious how you see this integrating with #70111 down the road. Is this a temporary construct that will be replaced by a broader connection manager, or do you see the connection manager making use of this breaker? It seems to me like the async probes here all use dialing as part of the probe, and that would presumably become the responsibility of the connection manager (along with heartbeats, which would possibly run in the same goroutine as the dialing), so I'm not sure if the extra Breaker layer buys us much there.

Given the fragility of the RPC connection code, I have a slight preference for not changing it too much until we overhaul it. But I see the advantages of making incremental improvements as well.

Reviewed 4 of 4 files at r1, 25 of 25 files at r2, 6 of 6 files at r3, 4 of 5 files at r7, 1 of 2 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/gossip/client.go, line 94 at r3 (raw file):

		var conn Gossip_GossipClient
		cc, err := g.dial(ctx, c.addr.String())

Does this need a timeout?


pkg/gossip/gossip.go, line 1550 at r2 (raw file):

				})
		}
		breaker = g.rpcContext.NewBreaker(name, asyncProbe)

I'm not too thrilled about the callers specifying their own probes, I'd prefer the RPC layer to deal with this on its own. Presumably, callers just care about having a live gRPC connection to some service. Is there a particular reason why this needs to be a higher-level concern?


pkg/gossip/gossip.go, line 1540 at r3 (raw file):

				func(ctx context.Context) {
					defer done()
					_, err := g.dial(ctx, addr.String())

I think this needs a timeout.


pkg/rpc/nodedialer/nodedialer.go, line 329 at r2 (raw file):

					t.Reset(nodeDialerCircuitBreakerProbeBackoff)
				}
			}) != nil {

nit: maybe use a temporary err variable here for readability.


pkg/util/circuit/circuitbreaker.go, line 1 at r2 (raw file):

// Copyright 2021 The Cockroach Authors.

nit: I'd call this file breaker.go, for consistency with the Breaker struct.


pkg/util/circuit/circuitbreaker.go, line 28 at r2 (raw file):

// from Breaker.Err(), i.e. `errors.Is(err, ErrBreakerOpen()) can be
// used to check whether an error originated from some Breaker.
func ErrBreakerOpen() error {

Why is this a function rather than exporting errBreakerOpen?


pkg/util/circuit/circuitbreaker.go, line 46 at r2 (raw file):

// and until then all calls to `Err()` return an error.
type Breaker struct {
	opts unsafe.Pointer // *Options

The unsafe.Pointer use across this package is a bit unfortunate. I take it a plain ol' mutex has too much overhead?´


pkg/util/circuit/event_handler.go, line 19 at r2 (raw file):

// An EventHandler is reported to by circuit breakers.
type EventHandler interface {

Do we actually need this? It seems like we're currently only using it for logging. I think I'd prefer to just keep it simple and do the logging directly in the breaker if that's sufficient.

@tbg tbg requested a review from erikgrinaker October 4, 2021 11:26
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reviews!

I'm curious how you see this integrating with #70111 down the road.

I was tempted to keep pulling and to push #70111 to a conclusion, but figured it was a bad idea to do so before having gotten additional eyes on it. Now that that is happening, and assuming now is a good time for you to work through this stuff with me, I think we should follow through as I agree that the current structure leaves work to be desired.

I need to dig into this a little bit more, but where I would likely take this is to integrate the breaker with *Connection, i.e. we would make this map store a *Connection to any target ever dialed (probably we need some crude eviction policy at some point, but let's ignore that now - easy to add):

conns syncmap.Map

and when we create a connection here:

func newConnectionToNodeID(stopper *stop.Stopper, remoteNodeID roachpb.NodeID) *Connection {
c := &Connection{
initialHeartbeatDone: make(chan struct{}),
stopper: stopper,
remoteNodeID: remoteNodeID,
}
c.heartbeatResult.Store(heartbeatResult{err: ErrNotHeartbeated})
return c
}

it's a

type Connection struct {
    breaker circuit.Breaker
    // other state that needs to be here
}

where dialing automatically checks the breaker & the breaker's probe is just the heartbeat loop (with the semantics that we discussed on the issue: you heartbeat for a while, and if nobody has tried to connect, you eventually stop the heartbeat until someone tries again - you'll notice that this is trivial to achieve with the control the breaker confers upon the probe).

There are some slight UX things to consider here - a given target can be dialed both through a NodeID but also without a NodeID (and in particular an address can "reincarnate" under a new NodeID; we shouldn't ever be logging the wrong one), should breakers take into account the connection classes, etc - I think those can all be figured out though, as long as we don't forget about them.

Happy to chat synchronously about all of this.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)


pkg/gossip/client.go, line 94 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Does this need a timeout?

I will definitely add one. I'm not sure


pkg/util/circuit/circuitbreaker.go, line 46 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

The unsafe.Pointer use across this package is a bit unfortunate. I take it a plain ol' mutex has too much overhead?´

I didn't benchmark and I'll freely admit that this is probably more complex than it should be (started simple, grew more complex, old story). Let's discuss the external API first and I'm happy to swap this out with a "trivial" implementation and to check where that lands us once we've talked through the remainder of the PR.


pkg/util/circuit/event_handler.go, line 19 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Do we actually need this? It seems like we're currently only using it for logging. I think I'd prefer to just keep it simple and do the logging directly in the breaker if that's sufficient.

I don't want the breaker to depend on the log package (not sure if that is what you're suggesting) but we could probably get away with a pure-logging interface here. The reason I didn't do that is because I don't think it's sufficient. The old breaker had an "event" API as well and I think for a good reason - you want to be able to trigger actions depending on what the breaker does. For example, when we use this breaker for #33007, when it trips and untrips I'd like to update relevant gauges. You can't drive that through logging. You could try to do it through the probe but it feels slightly dirty as it's not really the probe's job to track what state the breaker is in.

I do need to add testing for the events though, this was another original motivation for adding this but clearly I haven't capitalized on this yet. The testing for this package is a little thin in general as well.

Lastly, I do just in general appreciate to think of events over thinking of logs. Events are easy to translate to logs but if consumers are primarily handed events I tend to think they'll do better things with them? Not sure, perhaps I just have tendency for premature abstraction here. But looking at etcd/raft, it has bugged me many times that they use the logging-interface approach and not event-based. All the little things of relevance it's maybe telling us about (term changes, elections, who voted for whom) go nowhere because what we really needed were events and because there's not a chance in hell we'll do string-matching.

@tbg
Copy link
Member Author

tbg commented Oct 4, 2021

Formed a plan with Erik. We're going to do the following in some order that prioritizes "simpler" low-risk changes first, that we then land in reviewable chunks.

Refactor the API RPCContext/NodeDialer give to its callers. Instead of the current procedure which is:
nodeDialer.DialNode(...)
Call Connect(ctx) [may block or circuit-break on failed heartbeat]
wrap returned grpc.ClientConn with proto-generated constructor (for example gossippb.NewGossipClient)
we’ll integrate a-c into a method on rpcCtx (one method per client since we don’t have generics). That is, getting the GossipClient above will be as easy as calling rpcContext.InternalClient which returns the roachpb.InternalClient (or an error). This also bakes in the circuit breaker checks.
This is not only more pleasant to use, it also prevents handing a grpc.ClientConn to callers, who may call .Close which is bad as it blasts everyone else using the conn (we have a linter against this, which can then be dismantled). Maybe not in this initial push, but definitely in a follow-up, NodeDialer would be merged into rpcContext, and rpcContext would be cleaned up. In particular, we may limit instantiation of a full rpcContext for the cli, where many of the concerns don’t apply.

We’ll make sure the heartbeat loop pings with a timeout and destroys (Closes) the grpc.ClientConn on failure. This strengthens us in fail-slow scenarios as all users of the clientConn will receive an error in a timely manner and can thus react faster. Not sure if this is already happening, but we’ll make sure it does. We’ll also reconsider whether head-of-line-blocking (or anything else) can cause false breaker trips (as the heartbeat pings are using the same underlying http2 connection).

We’ll take a look at the metrics reported for RPC roundtrip ping latencies. Concretely, want to make sure that they return infinity when heartbeats fail. We also strongly want to use labels so that the metrics can reflect the connection quality between nodes.

We think that the default should be to establish connections to all nodes in the cluster proactively. This "generally" frees callers from ever having to decide between blocking or an unnecessary fail-fast. We will need to see how to get this right, as we don't want the first couple of seconds of a CRDB process to be plagued by random "circuit breaker" errors, but also don't want blocking to seep back into the callers anywhere. For example, we can have the factory methods (i.e. rpcContext.Gossip in the above example) block on the first attempt, but then never again. Or we start out with the breaker open and use the non-blocking gRPC dial, which will delegate the blocking to first use of the returned service client.
I (Tobias) think that we were originally “conservative” about dialing proactively as a premature optimization. Clusters have to get “very large” until this matters. We can dial back the proactivity easily if it ever comes to be a problem.
In the short term we leave #70485 open; this will be one of the later pieces, as it changes the internals of dialing and thus is likely to break at least something somewhere. We agreed on the basic circuit breaker approach and its utility for both the connection management and replica-based circuit breakers (#33007) so much of the PR can be reused then.

tbg added a commit to tbg/cockroach that referenced this pull request Nov 9, 2021
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR introduces a new circuit breaker package that was first
prototyped in cockroachdb#70485. These circuit breakers never recruit regular
requests to do the probing but instead have a configurable probe
attached that determines when the breaker untrips. (It can be tripped
proactively or by client traffic, similar to the old breaker).

They are then used to address cockroachdb#33007: when a replica becomes
unavailable, it should eagerly refuse traffic that it believes
would simply hang.

Concretely, whenever a request (a lease acquisition attempt or
a replicated write) does not manage to replicate within
`base.SlowRequestThreshold` (15s at time of writing), the
breaker is tripped. The corresponding probe uses a newly
introduced `NoopWrite` which is a writing request that does
not mutate state but which always goes through the replication
layer and which gets to bypass the lease.

TODO (generally pulling sizeable chunks out into their own PRs and
landing them in some good order):

- [ ] rewrite circuit breaker internals to avoid all of the `unsafe`
- [ ] make base.SlowRequestThreshold overridable via TestingKnob
- [ ] add end-to-end test using TestCluster verifying the tripping and
      fail-fast behavior under various unavailability conditions (for
      example blocking during evaluation, or making the liveness range
      unavailable).
- [ ] add version gate for NoopWriteRequest (own PR)
- [ ] add targeted tests for NoopWriteRequest (in PR above)
- [ ] add cluster setting to disable breakers
- [ ] introduce a structured error for circuit breaker failures and file
      issue for SQL Observability to render this error nicely
      (translating table names, etc)
- [ ] Make sure the breaker also trips on pipelined writes.
- [ ] address, file issues for, or explicitly discard any inline TODOs
      added in the diff.
- [ ] write the final release note.

Release note (ops change): TODO
@tbg tbg closed this Dec 9, 2021
@tbg
Copy link
Member Author

tbg commented Dec 9, 2021

Leaving this for another day as it's not in the cards anytime soon. We'll be using the circuit breakers in #71806 though.

tbg added a commit to tbg/cockroach that referenced this pull request Dec 9, 2021
These are initially for use in cockroachdb#71806 but were originally conceived of
for cockroachdb#70485, which we are not currently prioritizing.

Importantly, this circuit breaker does not recruit a fraction of
requests to do the probing, which is desirable for both PR cockroachdb#71806 and
PR cockroachdb#70485; requests recruited as probes tend to incur high latency and
errors, and we don't want SQL client traffic to experience those.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Dec 9, 2021
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR introduces a new circuit breaker package that was first
prototyped in cockroachdb#70485. These circuit breakers never recruit regular
requests to do the probing but instead have a configurable probe
attached that determines when the breaker untrips. (It can be tripped
proactively or by client traffic, similar to the old breaker).

They are then used to address cockroachdb#33007: when a replica becomes
unavailable, it should eagerly refuse traffic that it believes
would simply hang.

Concretely, whenever a request (a lease acquisition attempt or
a replicated write) does not manage to replicate within
`base.SlowRequestThreshold` (15s at time of writing), the
breaker is tripped. The corresponding probe uses a newly
introduced `NoopWrite` which is a writing request that does
not mutate state but which always goes through the replication
layer and which gets to bypass the lease.

TODO (generally pulling sizeable chunks out into their own PRs and
landing them in some good order):

- [ ] rewrite circuit breaker internals to avoid all of the `unsafe`
- [ ] make base.SlowRequestThreshold overridable via TestingKnob
- [ ] add end-to-end test using TestCluster verifying the tripping and
      fail-fast behavior under various unavailability conditions (for
      example blocking during evaluation, or making the liveness range
      unavailable).
- [ ] add version gate for NoopWriteRequest (own PR)
- [ ] add targeted tests for NoopWriteRequest (in PR above)
- [ ] add cluster setting to disable breakers
- [ ] introduce a structured error for circuit breaker failures and file
      issue for SQL Observability to render this error nicely
      (translating table names, etc)
- [ ] Make sure the breaker also trips on pipelined writes.
- [ ] address, file issues for, or explicitly discard any inline TODOs
      added in the diff.
- [ ] write the final release note.

Release note (ops change): TODO
tbg added a commit to tbg/cockroach that referenced this pull request Dec 10, 2021
These are initially for use in cockroachdb#71806 but were originally conceived of
for cockroachdb#70485, which we are not currently prioritizing.

Importantly, this circuit breaker does not recruit a fraction of
requests to do the probing, which is desirable for both PR cockroachdb#71806 and
PR cockroachdb#70485; requests recruited as probes tend to incur high latency and
errors, and we don't want SQL client traffic to experience those.

Touches cockroachdb#33007.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Dec 14, 2021
These are initially for use in cockroachdb#71806 but were originally conceived of
for cockroachdb#70485, which we are not currently prioritizing.

Importantly, this circuit breaker does not recruit a fraction of
requests to do the probing, which is desirable for both PR cockroachdb#71806 and
PR cockroachdb#70485; requests recruited as probes tend to incur high latency and
errors, and we don't want SQL client traffic to experience those.

Touches cockroachdb#33007.

Release note: None
craig bot pushed a commit that referenced this pull request Dec 14, 2021
73362: kv: don't unquiesce uninitialized replicas r=tbg a=nvanbenschoten

In a [support issue](https://github.com/cockroachlabs/support/issues/1340), we saw that 10s of thousands of uninitialized replicas were being ticked regularly and creating a large amount of background work on a node, driving up CPU. This commit updates the Raft quiescence logic to disallow uninitialized replicas from being unquiesced and Tick()'ing themselves.

Keeping uninitialized replicas quiesced even in the presence of Raft traffic avoids wasted work. We could Tick() these replicas, but doing so is unnecessary because uninitialized replicas can never win elections, so there is no reason for them to ever call an election. In fact, uninitialized replicas do not even know who their peers are, so there would be no way for them to call an election or for them to send any other non-reactive message. As a result, all work performed by an uninitialized replica is reactive and in response to incoming messages (see `processRequestQueue`).

There are multiple ways for an uninitialized replica to be created and then abandoned, and we don't do a good job garbage collecting them at a later point (see #73424), so it is important that they are cheap. Keeping them quiesced instead of letting them unquiesce and tick every 200ms indefinitely avoids a meaningful amount of periodic work for each uninitialized replica.

Release notes (bug fix): uninitialized replicas that are abandoned after an unsuccessful snapshot no longer perform periodic background work, so they no longer have a non-negligible cost.

73641: circuit: add probing-based circuit breaker r=erikgrinaker a=tbg

These are initially for use in #71806 but were originally conceived of
for #70485, which we are not currently prioritizing.

Importantly, this circuit breaker does not recruit a fraction of
requests to do the probing, which is desirable for both PR #71806 and
PR #70485; requests recruited as probes tend to incur high latency and
errors, and we don't want SQL client traffic to experience those.

Release note: None


73718: kv: pass roachpb.Header by pointer to DeclareKeysFunc r=nvanbenschoten a=nvanbenschoten

The `roachpb.Header` struct is up to 160 bytes in size. That's a little too
large to be passing by value repeatedly when doing so is easy to avoid. This
commit switches to passing roachpb.Header structs by pointer through the
DeclareKeysFunc implementations.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg added a commit to tbg/cockroach that referenced this pull request Dec 23, 2021
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR introduces a new circuit breaker package that was first
prototyped in cockroachdb#70485. These circuit breakers never recruit regular
requests to do the probing but instead have a configurable probe
attached that determines when the breaker untrips. (It can be tripped
proactively or by client traffic, similar to the old breaker).

They are then used to address cockroachdb#33007: when a replica becomes
unavailable, it should eagerly refuse traffic that it believes
would simply hang.

Concretely, whenever a request (a lease acquisition attempt or
a replicated write) does not manage to replicate within
`base.SlowRequestThreshold` (15s at time of writing), the
breaker is tripped. The corresponding probe uses a newly
introduced `NoopWrite` which is a writing request that does
not mutate state but which always goes through the replication
layer and which gets to bypass the lease.

TODO (generally pulling sizeable chunks out into their own PRs and
landing them in some good order):

- [ ] rewrite circuit breaker internals to avoid all of the `unsafe`
- [ ] make base.SlowRequestThreshold overridable via TestingKnob
- [ ] add end-to-end test using TestCluster verifying the tripping and
      fail-fast behavior under various unavailability conditions (for
      example blocking during evaluation, or making the liveness range
      unavailable).
- [ ] add version gate for NoopWriteRequest (own PR)
- [ ] add targeted tests for NoopWriteRequest (in PR above)
- [ ] add cluster setting to disable breakers
- [ ] introduce a structured error for circuit breaker failures and file
      issue for SQL Observability to render this error nicely
      (translating table names, etc)
- [ ] Make sure the breaker also trips on pipelined writes.
- [ ] address, file issues for, or explicitly discard any inline TODOs
      added in the diff.
- [ ] write the final release note.

Release note (ops change): TODO
tbg added a commit to tbg/cockroach that referenced this pull request Jan 3, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR introduces a new circuit breaker package that was first
prototyped in cockroachdb#70485. These circuit breakers never recruit regular
requests to do the probing but instead have a configurable probe
attached that determines when the breaker untrips. (It can be tripped
proactively or by client traffic, similar to the old breaker).

They are then used to address cockroachdb#33007: when a replica becomes
unavailable, it should eagerly refuse traffic that it believes
would simply hang.

Concretely, whenever a request (a lease acquisition attempt or
a replicated write) does not manage to replicate within
`base.SlowRequestThreshold` (15s at time of writing), the
breaker is tripped. The corresponding probe uses a newly
introduced `NoopWrite` which is a writing request that does
not mutate state but which always goes through the replication
layer and which gets to bypass the lease.

TODO (generally pulling sizeable chunks out into their own PRs and
landing them in some good order):

- [ ] rewrite circuit breaker internals to avoid all of the `unsafe`
- [ ] make base.SlowRequestThreshold overridable via TestingKnob
- [ ] add end-to-end test using TestCluster verifying the tripping and
      fail-fast behavior under various unavailability conditions (for
      example blocking during evaluation, or making the liveness range
      unavailable).
- [ ] add version gate for NoopWriteRequest (own PR)
- [ ] add targeted tests for NoopWriteRequest (in PR above)
- [ ] add cluster setting to disable breakers
- [ ] introduce a structured error for circuit breaker failures and file
      issue for SQL Observability to render this error nicely
      (translating table names, etc)
- [ ] Make sure the breaker also trips on pipelined writes.
- [ ] address, file issues for, or explicitly discard any inline TODOs
      added in the diff.
- [ ] write the final release note.

Release note (ops change): TODO
tbg added a commit to tbg/cockroach that referenced this pull request Jan 5, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR introduces a new circuit breaker package that was first
prototyped in cockroachdb#70485. These circuit breakers never recruit regular
requests to do the probing but instead have a configurable probe
attached that determines when the breaker untrips. (It can be tripped
proactively or by client traffic, similar to the old breaker).

They are then used to address cockroachdb#33007: when a replica becomes
unavailable, it should eagerly refuse traffic that it believes
would simply hang.

Concretely, whenever a request (a lease acquisition attempt or
a replicated write) does not manage to replicate within
`base.SlowRequestThreshold` (15s at time of writing), the
breaker is tripped. The corresponding probe uses a newly
introduced `NoopWrite` which is a writing request that does
not mutate state but which always goes through the replication
layer and which gets to bypass the lease.

TODO (generally pulling sizeable chunks out into their own PRs and
landing them in some good order):

- [ ] rewrite circuit breaker internals to avoid all of the `unsafe`
- [ ] make base.SlowRequestThreshold overridable via TestingKnob
- [ ] add end-to-end test using TestCluster verifying the tripping and
      fail-fast behavior under various unavailability conditions (for
      example blocking during evaluation, or making the liveness range
      unavailable).
- [ ] add version gate for NoopWriteRequest (own PR)
- [ ] add targeted tests for NoopWriteRequest (in PR above)
- [ ] add cluster setting to disable breakers
- [ ] introduce a structured error for circuit breaker failures and file
      issue for SQL Observability to render this error nicely
      (translating table names, etc)
- [ ] Make sure the breaker also trips on pipelined writes.
- [ ] address, file issues for, or explicitly discard any inline TODOs
      added in the diff.
- [ ] write the final release note.

Release note (ops change): TODO
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants