rpc: use async-probing based circuit breakers #70485

tbg · 2021-09-21T10:57:53Z

First commit is separate PR: #70979

See #68419 (comment) for the original discussion.

This commit adds a new circuit package that uses probing-based
circuit breakers. This breaker does not recruit the occasional
request to carry out the probing. Instead, the circuit breaker
is configured with an "asynchronous probe" that effectively
determines when the breaker should reset.

We prefer this approach precisely because it avoids recruiting
regular traffic, which is often tied to end-user requests, where
probing led to unacceptable latencies.

The potential downside of the probing approach is that the breaker setup
is more complex and there is residual risk of configuring the probe
differently from the actual client requests. In the worst case, the
breaker would be perpetually tripped even though everything should be
fine. This isn't expected - our two uses of circuit breakers are pretty
clear about what they protect - but it is worth mentioning as this
consideration likely influenced the design of the original breaker.

Touches #69888
Touches #70111
Touches #53410

Also, this breaker was designed to be a good fit for:
#33007
which will use the Signal() call.

Release note: None

cockroach-teamcity · 2021-09-21T10:58:00Z

This change is

tbg · 2021-10-01T15:36:05Z

Eh, I did mess something up here in a last round of changes, but I'd say this is still worth a high-level review. It's fairly polished.

knz

I have stared hard at the first commit here that's not from #70979 and I really do not understand why this needs to be a redact.SafeString. What breaks with it being a regular string?

Reviewed 4 of 4 files at r1, 25 of 25 files at r2, 6 of 6 files at r3, 3 of 3 files at r4, 2 of 2 files at r5, 2 of 2 files at r6, 5 of 5 files at r7, 2 of 2 files at r8, 1 of 1 files at r9.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

erikgrinaker

I'm curious how you see this integrating with #70111 down the road. Is this a temporary construct that will be replaced by a broader connection manager, or do you see the connection manager making use of this breaker? It seems to me like the async probes here all use dialing as part of the probe, and that would presumably become the responsibility of the connection manager (along with heartbeats, which would possibly run in the same goroutine as the dialing), so I'm not sure if the extra Breaker layer buys us much there.

Given the fragility of the RPC connection code, I have a slight preference for not changing it too much until we overhaul it. But I see the advantages of making incremental improvements as well.

Reviewed 4 of 4 files at r1, 25 of 25 files at r2, 6 of 6 files at r3, 4 of 5 files at r7, 1 of 2 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

pkg/gossip/client.go, line 94 at r3 (raw file):

		var conn Gossip_GossipClient
		cc, err := g.dial(ctx, c.addr.String())

Does this need a timeout?

pkg/gossip/gossip.go, line 1550 at r2 (raw file):

				})
		}
		breaker = g.rpcContext.NewBreaker(name, asyncProbe)

I'm not too thrilled about the callers specifying their own probes, I'd prefer the RPC layer to deal with this on its own. Presumably, callers just care about having a live gRPC connection to some service. Is there a particular reason why this needs to be a higher-level concern?

pkg/gossip/gossip.go, line 1540 at r3 (raw file):

				func(ctx context.Context) {
					defer done()
					_, err := g.dial(ctx, addr.String())

I think this needs a timeout.

pkg/rpc/nodedialer/nodedialer.go, line 329 at r2 (raw file):

					t.Reset(nodeDialerCircuitBreakerProbeBackoff)
				}
			}) != nil {

nit: maybe use a temporary err variable here for readability.

pkg/util/circuit/circuitbreaker.go, line 1 at r2 (raw file):

// Copyright 2021 The Cockroach Authors.

nit: I'd call this file breaker.go, for consistency with the Breaker struct.

pkg/util/circuit/circuitbreaker.go, line 28 at r2 (raw file):

// from Breaker.Err(), i.e. `errors.Is(err, ErrBreakerOpen()) can be
// used to check whether an error originated from some Breaker.
func ErrBreakerOpen() error {

Why is this a function rather than exporting errBreakerOpen?

pkg/util/circuit/circuitbreaker.go, line 46 at r2 (raw file):

// and until then all calls to `Err()` return an error.
type Breaker struct {
	opts unsafe.Pointer // *Options

The unsafe.Pointer use across this package is a bit unfortunate. I take it a plain ol' mutex has too much overhead?

pkg/util/circuit/event_handler.go, line 19 at r2 (raw file):

// An EventHandler is reported to by circuit breakers.
type EventHandler interface {

Do we actually need this? It seems like we're currently only using it for logging. I think I'd prefer to just keep it simple and do the logging directly in the breaker if that's sufficient.

tbg

Thanks for the reviews!

I'm curious how you see this integrating with #70111 down the road.

I was tempted to keep pulling and to push #70111 to a conclusion, but figured it was a bad idea to do so before having gotten additional eyes on it. Now that that is happening, and assuming now is a good time for you to work through this stuff with me, I think we should follow through, as I agree that the current structure leaves something to be desired.

I need to dig into this a little bit more, but where I would likely take this is to integrate the breaker with *Connection, i.e. we would make this map store a *Connection to any target ever dialed (probably we need some crude eviction policy at some point, but let's ignore that now - easy to add):

cockroach/pkg/rpc/context.go

Line 301 in 895027e

conns syncmap.Map

and when we create a connection here:

cockroach/pkg/rpc/context.go

Lines 240 to 248 in 895027e

func newConnectionToNodeID(stopper *stop.Stopper, remoteNodeID roachpb.NodeID) *Connection {
	c := &Connection{
		initialHeartbeatDone: make(chan struct{}),
		stopper:              stopper,
		remoteNodeID:         remoteNodeID,
	}
	c.heartbeatResult.Store(heartbeatResult{err: ErrNotHeartbeated})
	return c
}

it's a

type Connection struct {
    breaker circuit.Breaker
    // other state that needs to be here
}

where dialing automatically checks the breaker & the breaker's probe is just the heartbeat loop (with the semantics that we discussed on the issue: you heartbeat for a while, and if nobody has tried to connect, you eventually stop the heartbeat until someone tries again - you'll notice that this is trivial to achieve with the control the breaker confers upon the probe).

There are some slight UX things to consider here - a given target can be dialed both through a NodeID but also without a NodeID (and in particular an address can "reincarnate" under a new NodeID; we shouldn't ever be logging the wrong one), should breakers take into account the connection classes, etc - I think those can all be figured out though, as long as we don't forget about them.

Happy to chat synchronously about all of this.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)

pkg/gossip/client.go, line 94 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Does this need a timeout?

I will definitely add one. I'm not sure

pkg/util/circuit/circuitbreaker.go, line 46 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

The unsafe.Pointer use across this package is a bit unfortunate. I take it a plain ol' mutex has too much overhead?

I didn't benchmark and I'll freely admit that this is probably more complex than it should be (started simple, grew more complex, old story). Let's discuss the external API first and I'm happy to swap this out with a "trivial" implementation and to check where that lands us once we've talked through the remainder of the PR.

pkg/util/circuit/event_handler.go, line 19 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Do we actually need this? It seems like we're currently only using it for logging. I think I'd prefer to just keep it simple and do the logging directly in the breaker if that's sufficient.

I don't want the breaker to depend on the log package (not sure if that is what you're suggesting) but we could probably get away with a pure-logging interface here. The reason I didn't do that is because I don't think it's sufficient. The old breaker had an "event" API as well and I think for a good reason - you want to be able to trigger actions depending on what the breaker does. For example, when we use this breaker for #33007, when it trips and untrips I'd like to update relevant gauges. You can't drive that through logging. You could try to do it through the probe but it feels slightly dirty as it's not really the probe's job to track what state the breaker is in.

I do need to add testing for the events though, this was another original motivation for adding this but clearly I haven't capitalized on this yet. The testing for this package is a little thin in general as well.

Lastly, I generally prefer thinking in terms of events over thinking in terms of logs. Events are easy to translate to logs, but if consumers are primarily handed events I tend to think they'll do better things with them. Not sure, perhaps I just have a tendency for premature abstraction here. But looking at etcd/raft, it has bugged me many times that they use the logging-interface approach rather than an event-based one. All the little things of relevance it might be telling us about (term changes, elections, who voted for whom) go nowhere because what we really needed were events, and there's not a chance in hell we'll do string matching.
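
To make the events-versus-logs argument concrete, here is a small sketch. The two-method `EventHandler` below is a hypothetical simplification (the interface in this PR may have different methods), and it assumes `OnTrip` fires only on the closed-to-tripped transition. It shows one consumer updating a gauge and another translating the same events to log lines, which would not be possible in the other direction without string matching.

```go
package main

import "sync/atomic"

// EventHandler is a hypothetical, reduced version of the interface under
// discussion; the method names are assumptions.
type EventHandler interface {
	OnTrip(err error)
	OnReset()
}

// trippedGauge counts how many breakers are currently tripped.
type trippedGauge struct{ n int64 }

func (g *trippedGauge) OnTrip(error)  { atomic.AddInt64(&g.n, 1) }
func (g *trippedGauge) OnReset()      { atomic.AddInt64(&g.n, -1) }
func (g *trippedGauge) Value() int64  { return atomic.LoadInt64(&g.n) }

// logHandler translates events to log lines. The breaker itself does not
// need to depend on a logging package.
type logHandler struct {
	logf func(format string, args ...interface{})
}

func (l logHandler) OnTrip(err error) { l.logf("breaker tripped: %v", err) }
func (l logHandler) OnReset()         { l.logf("breaker reset") }

// multiHandler fans each event out to several handlers.
type multiHandler []EventHandler

func (m multiHandler) OnTrip(err error) {
	for _, h := range m {
		h.OnTrip(err)
	}
}

func (m multiHandler) OnReset() {
	for _, h := range m {
		h.OnReset()
	}
}
```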

tbg · 2021-10-04T13:54:10Z

Formed a plan with Erik. We're going to do the following in some order that prioritizes "simpler" low-risk changes first, that we then land in reviewable chunks.

Refactor the API RPCContext/NodeDialer give to its callers. Instead of the current procedure which is:
a. nodeDialer.DialNode(...)
b. Call Connect(ctx) [may block or circuit-break on failed heartbeat]
c. wrap returned grpc.ClientConn with proto-generated constructor (for example gossippb.NewGossipClient)
we’ll integrate a-c into a method on rpcCtx (one method per client since we don’t have generics). That is, getting the GossipClient above will be as easy as calling rpcContext.InternalClient which returns the roachpb.InternalClient (or an error). This also bakes in the circuit breaker checks.
This is not only more pleasant to use, it also prevents handing a grpc.ClientConn to callers, who may call .Close which is bad as it blasts everyone else using the conn (we have a linter against this, which can then be dismantled). Maybe not in this initial push, but definitely in a follow-up, NodeDialer would be merged into rpcContext, and rpcContext would be cleaned up. In particular, we may limit instantiation of a full rpcContext for the cli, where many of the concerns don’t apply.

We’ll make sure the heartbeat loop pings with a timeout and destroys (Closes) the grpc.ClientConn on failure. This strengthens us in fail-slow scenarios as all users of the clientConn will receive an error in a timely manner and can thus react faster. Not sure if this is already happening, but we’ll make sure it does. We’ll also reconsider whether head-of-line-blocking (or anything else) can cause false breaker trips (as the heartbeat pings are using the same underlying http2 connection).

We’ll take a look at the metrics reported for RPC roundtrip ping latencies. Concretely, we want to make sure that they return infinity when heartbeats fail. We also strongly want to use labels so that the metrics can reflect the connection quality between nodes.

We think that the default should be to establish connections to all nodes in the cluster proactively. This "generally" frees callers from ever having to decide between blocking or an unnecessary fail-fast. We will need to see how to get this right, as we don't want the first couple of seconds of a CRDB process to be plagued by random "circuit breaker" errors, but also don't want blocking to seep back into the callers anywhere. For example, we can have the factory methods (i.e. rpcContext.Gossip in the above example) block on the first attempt, but then never again. Or we start out with the breaker open and use the non-blocking gRPC dial, which will delegate the blocking to first use of the returned service client.
I (Tobias) think that we were originally “conservative” about dialing proactively as a premature optimization. Clusters have to get “very large” until this matters. We can dial back the proactivity easily if it ever comes to be a problem.
In the short term we leave #70485 open; this will be one of the later pieces, as it changes the internals of dialing and thus is likely to break at least something somewhere. We agreed on the basic circuit breaker approach and its utility for both the connection management and replica-based circuit breakers (#33007) so much of the PR can be reused then.

Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR introduces a new circuit breaker package that was first prototyped in cockroachdb#70485. These circuit breakers never recruit regular requests to do the probing but instead have a configurable probe attached that determines when the breaker untrips. (It can be tripped proactively or by client traffic, similar to the old breaker).

They are then used to address cockroachdb#33007: when a replica becomes unavailable, it should eagerly refuse traffic that it believes would simply hang. Concretely, whenever a request (a lease acquisition attempt or a replicated write) does not manage to replicate within `base.SlowRequestThreshold` (15s at time of writing), the breaker is tripped. The corresponding probe uses a newly introduced `NoopWrite` which is a writing request that does not mutate state but which always goes through the replication layer and which gets to bypass the lease.

TODO (generally pulling sizeable chunks out into their own PRs and landing them in some good order):

- [ ] rewrite circuit breaker internals to avoid all of the `unsafe`
- [ ] make base.SlowRequestThreshold overridable via TestingKnob
- [ ] add end-to-end test using TestCluster verifying the tripping and fail-fast behavior under various unavailability conditions (for example blocking during evaluation, or making the liveness range unavailable).
- [ ] add version gate for NoopWriteRequest (own PR)
- [ ] add targeted tests for NoopWriteRequest (in PR above)
- [ ] add cluster setting to disable breakers
- [ ] introduce a structured error for circuit breaker failures and file issue for SQL Observability to render this error nicely (translating table names, etc)
- [ ] Make sure the breaker also trips on pipelined writes.
- [ ] address, file issues for, or explicitly discard any inline TODOs added in the diff.
- [ ] write the final release note.

Release note (ops change): TODO

tbg · 2021-12-09T12:46:37Z

Leaving this for another day as it's not in the cards anytime soon. We'll be using the circuit breakers in #71806 though.

These are initially for use in cockroachdb#71806 but were originally conceived of for cockroachdb#70485, which we are not currently prioritizing. Importantly, this circuit breaker does not recruit a fraction of requests to do the probing, which is desirable for both PR cockroachdb#71806 and PR cockroachdb#70485; requests recruited as probes tend to incur high latency and errors, and we don't want SQL client traffic to experience those. Release note: None

These are initially for use in cockroachdb#71806 but were originally conceived of for cockroachdb#70485, which we are not currently prioritizing. Importantly, this circuit breaker does not recruit a fraction of requests to do the probing, which is desirable for both PR cockroachdb#71806 and PR cockroachdb#70485; requests recruited as probes tend to incur high latency and errors, and we don't want SQL client traffic to experience those. Touches cockroachdb#33007. Release note: None

73362: kv: don't unquiesce uninitialized replicas r=tbg a=nvanbenschoten

In a [support issue](https://github.com/cockroachlabs/support/issues/1340), we saw that 10s of thousands of uninitialized replicas were being ticked regularly and creating a large amount of background work on a node, driving up CPU. This commit updates the Raft quiescence logic to disallow uninitialized replicas from being unquiesced and Tick()'ing themselves. Keeping uninitialized replicas quiesced even in the presence of Raft traffic avoids wasted work.

We could Tick() these replicas, but doing so is unnecessary because uninitialized replicas can never win elections, so there is no reason for them to ever call an election. In fact, uninitialized replicas do not even know who their peers are, so there would be no way for them to call an election or for them to send any other non-reactive message. As a result, all work performed by an uninitialized replica is reactive and in response to incoming messages (see `processRequestQueue`).

There are multiple ways for an uninitialized replica to be created and then abandoned, and we don't do a good job garbage collecting them at a later point (see #73424), so it is important that they are cheap. Keeping them quiesced instead of letting them unquiesce and tick every 200ms indefinitely avoids a meaningful amount of periodic work for each uninitialized replica.

Release notes (bug fix): uninitialized replicas that are abandoned after an unsuccessful snapshot no longer perform periodic background work, so they no longer have a non-negligible cost.

73641: circuit: add probing-based circuit breaker r=erikgrinaker a=tbg

These are initially for use in #71806 but were originally conceived of for #70485, which we are not currently prioritizing. Importantly, this circuit breaker does not recruit a fraction of requests to do the probing, which is desirable for both PR #71806 and PR #70485; requests recruited as probes tend to incur high latency and errors, and we don't want SQL client traffic to experience those.

Release note: None

73718: kv: pass roachpb.Header by pointer to DeclareKeysFunc r=nvanbenschoten a=nvanbenschoten

The `roachpb.Header` struct is up to 160 bytes in size. That's a little too large to be passing by value repeatedly when doing so is easy to avoid. This commit switches to passing roachpb.Header structs by pointer through the DeclareKeysFunc implementations.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>

tbg force-pushed the better-circuit-breaking branch from 8cf20ca to 200e293 Compare September 21, 2021 12:49

tbg force-pushed the better-circuit-breaking branch 9 times, most recently from ab1130f to 6865300 Compare September 30, 2021 14:55

fmtsafe: allow string(RedactableString) as safe arg

205a052

It's safe to use a RedactableString as a format argument instead of a constant string. This teaches the linter that. Release note: None

tbg force-pushed the better-circuit-breaking branch 2 times, most recently from e6ebcf8 to 5d6bd24 Compare October 1, 2021 09:35

tbg mentioned this pull request Oct 1, 2021

fmtsafe: allow string(RedactableString) as safe arg #70979

Closed

tbg and others added 8 commits October 1, 2021 13:27

gossip: simplify the circuit breaker use

2c6af18

Keep the circuit breaker use in one place. Also, make the circuit breaker's probe actually try to dial. Release note: None

circuit: remove Trip() and Tripped()

82ae382

Release note: None

circuit: remove Success()

1140e9f

Release note: None

circuit: remove Fail()

0d55820

Release note: None

circuit: remove Ready()

5cf6345

Release note: None

kvserver: modernize TestReportUnreachableHeartbeats

e56c266

This test was tripping the breaker but the breaker was configured to ignore that. Now the test gets what it is asking for. Release note: None

kvserver: modernize TestReportUnreachableRemoveRace

30e7b0c

Give this test circuit breakers that actually trip when the test is asking them to do so. Release note: None

tbg force-pushed the better-circuit-breaking branch from 5d6bd24 to 30e7b0c Compare October 1, 2021 11:30

tbg marked this pull request as ready for review October 1, 2021 11:30

tbg requested review from a team as code owners October 1, 2021 11:30

tbg requested review from erikgrinaker and removed request for a team October 1, 2021 11:30

knz reviewed Oct 2, 2021

erikgrinaker reviewed Oct 4, 2021

tbg requested a review from erikgrinaker October 4, 2021 11:26

tbg commented Oct 4, 2021

tbg mentioned this pull request Oct 4, 2021

rpc: automatically maintain RPC connections across cluster #70111

Open

tbg mentioned this pull request Nov 8, 2021

kvserver: circuit-break requests to unavailable ranges #71806

Merged

tbg closed this Dec 9, 2021

tbg mentioned this pull request Dec 9, 2021

circuit: add probing-based circuit breaker #73641

Merged
