-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rpc: use async-probing based circuit breakers #70485
Conversation
8cf20ca
to
200e293
Compare
ab1130f
to
6865300
Compare
It's safe to use a RedactableString as a format argument instead of a constant string. This teaches the linter that. Release note: None
e6ebcf8
to
5d6bd24
Compare
See cockroachdb#68419 (comment) for the original discussion. This commit adds a new `circuit` package that uses probing-based circuit breakers. This breaker does *not* recruit the occasional request to carry out the probing. Instead, the circuit breaker is configured with an "asychronous probe" that effectively determines when the breaker should reset. We prefer this approach precisely because it avoids recruiting regular traffic, which is often tied to end-user requests, and led to inacceptable latencies there. The potential downside of the probing approach is that the breaker setup is more complex and there is residual risk of configuring the probe differently from the actual client requests. In the worst case, the breaker would be perpetually tripped even though everything should be fine. This isn't expected - our two uses of circuit breakers are pretty clear about what they protect - but it is worth mentioning as this consideration likely influenced the design of the original breaker. Touches cockroachdb#69888 Touches cockroachdb#70111 Touches cockroachdb#53410 Also, this breaker was designed to be a good fit for: cockroachdb#33007 which will use the `Signal()` call. Release note: None
Keep the circuit breaker use in one place. Also, make the circuit breaker's probe actually try to dial. Release note: None
Release note: None
Release note: None
Release note: None
Release note: None
This test was tripping the breaker but the breaker was configured to ignore that. Now the test gets what it is asking for. Release note: None
Give this test circuit breakers that actually trip when the test is asking them to do so. Release note: None
5d6bd24
to
30e7b0c
Compare
Eh, I did mess something up here in a last round of changes, but I'd say this is still worth a high-level review. It's fairly polished. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have stared hard at the first commit here that's not from #70979 and I really do not understand why this needs to be a redact.SafeString
. What breaks with it being a regular string?
Reviewed 4 of 4 files at r1, 25 of 25 files at r2, 6 of 6 files at r3, 3 of 3 files at r4, 2 of 2 files at r5, 2 of 2 files at r6, 5 of 5 files at r7, 2 of 2 files at r8, 1 of 1 files at r9.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious how you see this integrating with #70111 down the road. Is this a temporary construct that will be replaced by a broader connection manager, or do you see the connection manager making use of this breaker? It seems to me like the async probes here all use dialing as part of the probe, and that would presumably become the responsibility of the connection manager (along with heartbeats, which would possibly run in the same goroutine as the dialing), so I'm not sure if the extra Breaker layer buys us much there.
Given the fragility of the RPC connection code, I have a slight preference for not changing it too much until we overhaul it. But I see the advantages of making incremental improvements as well.
Reviewed 4 of 4 files at r1, 25 of 25 files at r2, 6 of 6 files at r3, 4 of 5 files at r7, 1 of 2 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/gossip/client.go, line 94 at r3 (raw file):
var conn Gossip_GossipClient cc, err := g.dial(ctx, c.addr.String())
Does this need a timeout?
pkg/gossip/gossip.go, line 1550 at r2 (raw file):
}) } breaker = g.rpcContext.NewBreaker(name, asyncProbe)
I'm not too thrilled about the callers specifying their own probes, I'd prefer the RPC layer to deal with this on its own. Presumably, callers just care about having a live gRPC connection to some service. Is there a particular reason why this needs to be a higher-level concern?
pkg/gossip/gossip.go, line 1540 at r3 (raw file):
func(ctx context.Context) { defer done() _, err := g.dial(ctx, addr.String())
I think this needs a timeout.
pkg/rpc/nodedialer/nodedialer.go, line 329 at r2 (raw file):
t.Reset(nodeDialerCircuitBreakerProbeBackoff) } }) != nil {
nit: maybe use a temporary err
variable here for readability.
pkg/util/circuit/circuitbreaker.go, line 1 at r2 (raw file):
// Copyright 2021 The Cockroach Authors.
nit: I'd call this file breaker.go
, for consistency with the Breaker
struct.
pkg/util/circuit/circuitbreaker.go, line 28 at r2 (raw file):
// from Breaker.Err(), i.e. `errors.Is(err, ErrBreakerOpen()) can be // used to check whether an error originated from some Breaker. func ErrBreakerOpen() error {
Why is this a function rather than exporting errBreakerOpen
?
pkg/util/circuit/circuitbreaker.go, line 46 at r2 (raw file):
// and until then all calls to `Err()` return an error. type Breaker struct { opts unsafe.Pointer // *Options
The unsafe.Pointer
use across this package is a bit unfortunate. I take it a plain ol' mutex has too much overhead?´
pkg/util/circuit/event_handler.go, line 19 at r2 (raw file):
// An EventHandler is reported to by circuit breakers. type EventHandler interface {
Do we actually need this? It seems like we're currently only using it for logging. I think I'd prefer to just keep it simple and do the logging directly in the breaker if that's sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reviews!
I'm curious how you see this integrating with #70111 down the road.
I was tempted to keep pulling and to push #70111 to a conclusion, but figured it was a bad idea to do so before having gotten additional eyes on it. Now that that is happening, and assuming now is a good time for you to work through this stuff with me, I think we should follow through as I agree that the current structure leaves work to be desired.
I need to dig into this a little bit more, but where I would likely take this is to integrate the breaker with *Connection
, i.e. we would make this map store a *Connection
to any target ever dialed (probably we need some crude eviction policy at some point, but let's ignore that now - easy to add):
Line 301 in 895027e
conns syncmap.Map |
and when we create a connection here:
Lines 240 to 248 in 895027e
func newConnectionToNodeID(stopper *stop.Stopper, remoteNodeID roachpb.NodeID) *Connection { | |
c := &Connection{ | |
initialHeartbeatDone: make(chan struct{}), | |
stopper: stopper, | |
remoteNodeID: remoteNodeID, | |
} | |
c.heartbeatResult.Store(heartbeatResult{err: ErrNotHeartbeated}) | |
return c | |
} |
it's a
type Connection struct {
breaker circuit.Breaker
// other state that needs to be here
}
where dialing automatically checks the breaker & the breaker's probe is just the heartbeat loop (with the semantics that we discussed on the issue: you heartbeat for a while, and if nobody has tried to connect, you eventually stop the heartbeat until someone tries again - you'll notice that this is trivial to achieve with the control the breaker confers upon the probe).
There are some slight UX things to consider here - a given target can be dialed both through a NodeID but also without a NodeID (and in particular an address can "reincarnate" under a new NodeID; we shouldn't ever be logging the wrong one), should breakers take into account the connection classes, etc - I think those can all be figured out though, as long as we don't forget about them.
Happy to chat synchronously about all of this.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
pkg/gossip/client.go, line 94 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Does this need a timeout?
I will definitely add one. I'm not sure
pkg/util/circuit/circuitbreaker.go, line 46 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
The
unsafe.Pointer
use across this package is a bit unfortunate. I take it a plain ol' mutex has too much overhead?´
I didn't benchmark and I'll freely admit that this is probably more complex than it should be (started simple, grew more complex, old story). Let's discuss the external API first and I'm happy to swap this out with a "trivial" implementation and to check where that lands us once we've talked through the remainder of the PR.
pkg/util/circuit/event_handler.go, line 19 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Do we actually need this? It seems like we're currently only using it for logging. I think I'd prefer to just keep it simple and do the logging directly in the breaker if that's sufficient.
I don't want the breaker to depend on the log package (not sure if that is what you're suggesting) but we could probably get away with a pure-logging interface here. The reason I didn't do that is because I don't think it's sufficient. The old breaker had an "event" API as well and I think for a good reason - you want to be able to trigger actions depending on what the breaker does. For example, when we use this breaker for #33007, when it trips and untrips I'd like to update relevant gauges. You can't drive that through logging. You could try to do it through the probe but it feels slightly dirty as it's not really the probe's job to track what state the breaker is in.
I do need to add testing for the events though, this was another original motivation for adding this but clearly I haven't capitalized on this yet. The testing for this package is a little thin in general as well.
Lastly, I do just in general appreciate to think of events over thinking of logs. Events are easy to translate to logs but if consumers are primarily handed events I tend to think they'll do better things with them? Not sure, perhaps I just have tendency for premature abstraction here. But looking at etcd/raft
, it has bugged me many times that they use the logging-interface approach and not event-based. All the little things of relevance it's maybe telling us about (term changes, elections, who voted for whom) go nowhere because what we really needed were events and because there's not a chance in hell we'll do string-matching.
Formed a plan with Erik. We're going to do the following in some order that prioritizes "simpler" low-risk changes first, that we then land in reviewable chunks. Refactor the API RPCContext/NodeDialer give to its callers. Instead of the current procedure which is: We’ll make sure the heartbeat loop pings with a timeout and destroys (Closes) the grpc.ClientConn on failure. This strengthens us in fail-slow scenarios as all users of the clientConn will receive an error in a timely manner and can thus react faster. Not sure if this is already happening, but we’ll make sure it does. We’ll also reconsider whether head-of-line-blocking (or anything else) can cause false breaker trips (as the heartbeat pings are using the same underlying http2 connection). We’ll take a look at the metrics reported for RPC roundtrip ping latencies. Concretely, want to make sure that they return infinity when heartbeats fail. We also strongly want to use labels so that the metrics can reflect the connection quality between nodes. We think that the default should be to establish connections to all nodes in the cluster proactively. This "generally" frees callers from ever having to decide between blocking or an unnecessary fail-fast. We will need to see how to get this right, as we don't want the first couple of seconds of a CRDB process to be plagued by random "circuit breaker" errors, but also don't want blocking to seep back into the callers anywhere. For example, we can have the factory methods (i.e. |
Fixes cockroachdb#33007. Closes cockroachdb#61311. This PR introduces a new circuit breaker package that was first prototyped in cockroachdb#70485. These circuit breakers never recruit regular requests to do the probing but instead have a configurable probe attached that determines when the breaker untrips. (It can be tripped proactively or by client traffic, similar to the old breaker). They are then used to address cockroachdb#33007: when a replica becomes unavailable, it should eagerly refuse traffic that it believes would simply hang. Concretely, whenever a request (a lease acquisition attempt or a replicated write) does not manage to replicate within `base.SlowRequestThreshold` (15s at time of writing), the breaker is tripped. The corresponding probe uses a newly introduced `NoopWrite` which is a writing request that does not mutate state but which always goes through the replication layer and which gets to bypass the lease. TODO (generally pulling sizeable chunks out into their own PRs and landing them in some good order): - [ ] rewrite circuit breaker internals to avoid all of the `unsafe` - [ ] make base.SlowRequestThreshold overridable via TestingKnob - [ ] add end-to-end test using TestCluster verifying the tripping and fail-fast behavior under various unavailability conditions (for example blocking during evaluation, or making the liveness range unavailable). - [ ] add version gate for NoopWriteRequest (own PR) - [ ] add targeted tests for NoopWriteRequest (in PR above) - [ ] add cluster setting to disable breakers - [ ] introduce a structured error for circuit breaker failures and file issue for SQL Observability to render this error nicely (translating table names, etc) - [ ] Make sure the breaker also trips on pipelined writes. - [ ] address, file issues for, or explicitly discard any inline TODOs added in the diff. - [ ] write the final release note. Release note (ops change): TODO
Leaving this for another day as it's not in the cards anytime soon. We'll be using the circuit breakers in #71806 though. |
These are initially for use in cockroachdb#71806 but were originally conceived of for cockroachdb#70485, which we are not currently prioritizing. Importantly, this circuit breaker does not recruit a fraction of requests to do the probing, which is desirable for both PR cockroachdb#71806 and PR cockroachdb#70485; requests recruited as probes tend to incur high latency and errors, and we don't want SQL client traffic to experience those. Release note: None
Fixes cockroachdb#33007. Closes cockroachdb#61311. This PR introduces a new circuit breaker package that was first prototyped in cockroachdb#70485. These circuit breakers never recruit regular requests to do the probing but instead have a configurable probe attached that determines when the breaker untrips. (It can be tripped proactively or by client traffic, similar to the old breaker). They are then used to address cockroachdb#33007: when a replica becomes unavailable, it should eagerly refuse traffic that it believes would simply hang. Concretely, whenever a request (a lease acquisition attempt or a replicated write) does not manage to replicate within `base.SlowRequestThreshold` (15s at time of writing), the breaker is tripped. The corresponding probe uses a newly introduced `NoopWrite` which is a writing request that does not mutate state but which always goes through the replication layer and which gets to bypass the lease. TODO (generally pulling sizeable chunks out into their own PRs and landing them in some good order): - [ ] rewrite circuit breaker internals to avoid all of the `unsafe` - [ ] make base.SlowRequestThreshold overridable via TestingKnob - [ ] add end-to-end test using TestCluster verifying the tripping and fail-fast behavior under various unavailability conditions (for example blocking during evaluation, or making the liveness range unavailable). - [ ] add version gate for NoopWriteRequest (own PR) - [ ] add targeted tests for NoopWriteRequest (in PR above) - [ ] add cluster setting to disable breakers - [ ] introduce a structured error for circuit breaker failures and file issue for SQL Observability to render this error nicely (translating table names, etc) - [ ] Make sure the breaker also trips on pipelined writes. - [ ] address, file issues for, or explicitly discard any inline TODOs added in the diff. - [ ] write the final release note. Release note (ops change): TODO
These are initially for use in cockroachdb#71806 but were originally conceived of for cockroachdb#70485, which we are not currently prioritizing. Importantly, this circuit breaker does not recruit a fraction of requests to do the probing, which is desirable for both PR cockroachdb#71806 and PR cockroachdb#70485; requests recruited as probes tend to incur high latency and errors, and we don't want SQL client traffic to experience those. Touches cockroachdb#33007. Release note: None
These are initially for use in cockroachdb#71806 but were originally conceived of for cockroachdb#70485, which we are not currently prioritizing. Importantly, this circuit breaker does not recruit a fraction of requests to do the probing, which is desirable for both PR cockroachdb#71806 and PR cockroachdb#70485; requests recruited as probes tend to incur high latency and errors, and we don't want SQL client traffic to experience those. Touches cockroachdb#33007. Release note: None
73362: kv: don't unquiesce uninitialized replicas r=tbg a=nvanbenschoten In a [support issue](https://github.com/cockroachlabs/support/issues/1340), we saw that 10s of thousands of uninitialized replicas were being ticked regularly and creating a large amount of background work on a node, driving up CPU. This commit updates the Raft quiescence logic to disallow uninitialized replicas from being unquiesced and Tick()'ing themselves. Keeping uninitialized replicas quiesced even in the presence of Raft traffic avoids wasted work. We could Tick() these replicas, but doing so is unnecessary because uninitialized replicas can never win elections, so there is no reason for them to ever call an election. In fact, uninitialized replicas do not even know who their peers are, so there would be no way for them to call an election or for them to send any other non-reactive message. As a result, all work performed by an uninitialized replica is reactive and in response to incoming messages (see `processRequestQueue`). There are multiple ways for an uninitialized replica to be created and then abandoned, and we don't do a good job garbage collecting them at a later point (see #73424), so it is important that they are cheap. Keeping them quiesced instead of letting them unquiesce and tick every 200ms indefinitely avoids a meaningful amount of periodic work for each uninitialized replica. Release notes (bug fix): uninitialized replicas that are abandoned after an unsuccessful snapshot no longer perform periodic background work, so they no longer have a non-negligible cost. 73641: circuit: add probing-based circuit breaker r=erikgrinaker a=tbg These are initially for use in #71806 but were originally conceived of for #70485, which we are not currently prioritizing. Importantly, this circuit breaker does not recruit a fraction of requests to do the probing, which is desirable for both PR #71806 and PR #70485; requests recruited as probes tend to incur high latency and errors, and we don't want SQL client traffic to experience those. Release note: None 73718: kv: pass roachpb.Header by pointer to DeclareKeysFunc r=nvanbenschoten a=nvanbenschoten The `roachpb.Header` struct is up to 160 bytes in size. That's a little too large to be passing by value repeatedly when doing so is easy to avoid. This commit switches to passing roachpb.Header structs by pointer through the DeclareKeysFunc implementations. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Fixes cockroachdb#33007. Closes cockroachdb#61311. This PR introduces a new circuit breaker package that was first prototyped in cockroachdb#70485. These circuit breakers never recruit regular requests to do the probing but instead have a configurable probe attached that determines when the breaker untrips. (It can be tripped proactively or by client traffic, similar to the old breaker). They are then used to address cockroachdb#33007: when a replica becomes unavailable, it should eagerly refuse traffic that it believes would simply hang. Concretely, whenever a request (a lease acquisition attempt or a replicated write) does not manage to replicate within `base.SlowRequestThreshold` (15s at time of writing), the breaker is tripped. The corresponding probe uses a newly introduced `NoopWrite` which is a writing request that does not mutate state but which always goes through the replication layer and which gets to bypass the lease. TODO (generally pulling sizeable chunks out into their own PRs and landing them in some good order): - [ ] rewrite circuit breaker internals to avoid all of the `unsafe` - [ ] make base.SlowRequestThreshold overridable via TestingKnob - [ ] add end-to-end test using TestCluster verifying the tripping and fail-fast behavior under various unavailability conditions (for example blocking during evaluation, or making the liveness range unavailable). - [ ] add version gate for NoopWriteRequest (own PR) - [ ] add targeted tests for NoopWriteRequest (in PR above) - [ ] add cluster setting to disable breakers - [ ] introduce a structured error for circuit breaker failures and file issue for SQL Observability to render this error nicely (translating table names, etc) - [ ] Make sure the breaker also trips on pipelined writes. - [ ] address, file issues for, or explicitly discard any inline TODOs added in the diff. - [ ] write the final release note. Release note (ops change): TODO
Fixes cockroachdb#33007. Closes cockroachdb#61311. This PR introduces a new circuit breaker package that was first prototyped in cockroachdb#70485. These circuit breakers never recruit regular requests to do the probing but instead have a configurable probe attached that determines when the breaker untrips. (It can be tripped proactively or by client traffic, similar to the old breaker). They are then used to address cockroachdb#33007: when a replica becomes unavailable, it should eagerly refuse traffic that it believes would simply hang. Concretely, whenever a request (a lease acquisition attempt or a replicated write) does not manage to replicate within `base.SlowRequestThreshold` (15s at time of writing), the breaker is tripped. The corresponding probe uses a newly introduced `NoopWrite` which is a writing request that does not mutate state but which always goes through the replication layer and which gets to bypass the lease. TODO (generally pulling sizeable chunks out into their own PRs and landing them in some good order): - [ ] rewrite circuit breaker internals to avoid all of the `unsafe` - [ ] make base.SlowRequestThreshold overridable via TestingKnob - [ ] add end-to-end test using TestCluster verifying the tripping and fail-fast behavior under various unavailability conditions (for example blocking during evaluation, or making the liveness range unavailable). - [ ] add version gate for NoopWriteRequest (own PR) - [ ] add targeted tests for NoopWriteRequest (in PR above) - [ ] add cluster setting to disable breakers - [ ] introduce a structured error for circuit breaker failures and file issue for SQL Observability to render this error nicely (translating table names, etc) - [ ] Make sure the breaker also trips on pipelined writes. - [ ] address, file issues for, or explicitly discard any inline TODOs added in the diff. - [ ] write the final release note. Release note (ops change): TODO
Fixes cockroachdb#33007. Closes cockroachdb#61311. This PR introduces a new circuit breaker package that was first prototyped in cockroachdb#70485. These circuit breakers never recruit regular requests to do the probing but instead have a configurable probe attached that determines when the breaker untrips. (It can be tripped proactively or by client traffic, similar to the old breaker). They are then used to address cockroachdb#33007: when a replica becomes unavailable, it should eagerly refuse traffic that it believes would simply hang. Concretely, whenever a request (a lease acquisition attempt or a replicated write) does not manage to replicate within `base.SlowRequestThreshold` (15s at time of writing), the breaker is tripped. The corresponding probe uses a newly introduced `NoopWrite` which is a writing request that does not mutate state but which always goes through the replication layer and which gets to bypass the lease. TODO (generally pulling sizeable chunks out into their own PRs and landing them in some good order): - [ ] rewrite circuit breaker internals to avoid all of the `unsafe` - [ ] make base.SlowRequestThreshold overridable via TestingKnob - [ ] add end-to-end test using TestCluster verifying the tripping and fail-fast behavior under various unavailability conditions (for example blocking during evaluation, or making the liveness range unavailable). - [ ] add version gate for NoopWriteRequest (own PR) - [ ] add targeted tests for NoopWriteRequest (in PR above) - [ ] add cluster setting to disable breakers - [ ] introduce a structured error for circuit breaker failures and file issue for SQL Observability to render this error nicely (translating table names, etc) - [ ] Make sure the breaker also trips on pipelined writes. - [ ] address, file issues for, or explicitly discard any inline TODOs added in the diff. - [ ] write the final release note. Release note (ops change): TODO
First commit is separate PR: #70979
See #68419 (comment) for the original discussion.
This commit adds a new
circuit
package that uses probing-basedcircuit breakers. This breaker does not recruit the occasional
request to carry out the probing. Instead, the circuit breaker
is configured with an "asychronous probe" that effectively
determines when the breaker should reset.
We prefer this approach precisely because it avoids recruiting
regular traffic, which is often tied to end-user requests, and
led to inacceptable latencies there.
The potential downside of the probing approach is that the breaker setup
is more complex and there is residual risk of configuring the probe
differently from the actual client requests. In the worst case, the
breaker would be perpetually tripped even though everything should be
fine. This isn't expected - our two uses of circuit breakers are pretty
clear about what they protect - but it is worth mentioning as this
consideration likely influenced the design of the original breaker.
Touches #69888
Touches #70111
Touches #53410
Also, this breaker was designed to be a good fit for:
#33007
which will use the
Signal()
call.Release note: None