kvserver: allow circuit-breaker to serve reads #76858
This flakes (rarely, but it still flakes) with:
Is there a better way to have a guaranteed follower read? I imagine there's precedent for this. Also curious that it's a learner, we do call... Ah, we need to pass cockroach/pkg/testutils/testcluster/testcluster.go, lines 667 to 669 in 208e2b4.
Reviewed 8 of 26 files at r1, 26 of 27 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/kv/kvserver/replica_raft.go, line 1275 at r4 (raw file):
aErr := roachpb.NewAmbiguousResultError("circuit breaker tripped")
aErr.WrappedErr = roachpb.NewError(err)
p.signalProposalResult(proposalResult{Err: roachpb.NewError(aErr)})
(in this PR): could you add a comment that this does not cause the request to release its latches?
pkg/kv/kvserver/replica_range_lease.go, line 1126 at r4 (raw file):
	ctx context.Context, reqTS hlc.Timestamp,
) (kvserverpb.LeaseStatus, *roachpb.Error) {
	brSig := r.breaker.Signal()
(in this PR): move this below the fast-path.
pkg/kv/kvserver/replica_range_lease.go, line 1300 at r4 (raw file):
	return nil
case <-brSig.C():
	llHandle.Cancel()
(in this PR): What did we decide about this potentially canceling an ongoing lease acquisition? Is the idea that this is not an issue because we have the brSig.Err() check above, so we won't spin on quickly-abandoned lease acquisitions?
pkg/kv/kvserver/replica_range_lease.go, line 1364 at r4 (raw file):
// We explicitly ignore the returned handle as we won't block on it.
//
// TODO(tbg): this ctx is likely cancelled very soon, which will in turn
Good catch.
pkg/kv/kvserver/replica_send.go, line 147 at r4 (raw file):
// have access to the batch, for example the redirectOnOrAcquireLeaseLocked,
// for which several callers don't even have a BatchRequest.
ctx = withBypassCircuitBreakerMarker(ctx)
(in a later PR): do we still want this withBypassCircuitBreakerMarker function? Can't we check bypassReplicaCircuitBreakerForBatch(ba) below when setting the poison policy?
pkg/kv/kvserver/replica_send.go, line 456 at r4 (raw file):
LatchSpans:   latchSpans, // nil if g != nil
LockSpans:    lockSpans,  // nil if g != nil
PoisonPolicy: pp,
nit: move this above Requests so it's in the same order as the struct declaration.
pkg/kv/kvserver/concurrency/concurrency_control.go, line 193 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
TODO: explain what poisoning means further.
(in this PR): Did you want to say anything more here? This file serves as top-level documentation for concurrency control, so it's the place to document poisoning and its interaction with request sequencing.
pkg/kv/kvserver/concurrency/concurrency_control.go, line 195 at r4 (raw file):
// PoisonReq marks a request's latches as poisoned. This allows requests
// waiting for the latches to fail fast.
PoisonReq(*Guard)
(in this PR): mention that this can be called multiple times.
pkg/kv/kvserver/concurrency/poison/error.proto, line 19 at r4 (raw file):
import "gogoproto/gogo.proto";

message PoisonedError {
(in this PR): mind giving this a comment explaining what this error means and when a client might see it?
pkg/kv/kvserver/concurrency/poison/policy.proto, line 28 at r4 (raw file):
// [^1]: https://doc.rust-lang.org/std/sync/struct.Mutex.html#poisoning
enum Policy {
	// Policy_Wait instructs a request to return an error upon encountering
(in this PR): this comment is incorrect.
pkg/kv/kvserver/concurrency/poison/policy.proto, line 32 at r4 (raw file):
	Wait = 0;

	// Policy_Error instructs a request to return an error upon encountering
(in this PR): did we want to make this the zero value, because it's the common case? I'm fine either way.
pkg/kv/kvserver/spanlatch/manager.go, line 90 at r4 (raw file):
span roachpb.Span
ts   hlc.Timestamp
done *signal
(in this PR or later): A single request may have many latches, so it's best to keep this struct's memory footprint small. We're only adding a single pointer, so this isn't a big deal, but we can also avoid it. For instance, we could point a latch directly at its *Guard and then have access to done and poison traverse the same pointer. Or limit visibility to prevent misuse with something like:
type signals struct {
done signal
poison poisonSignal
}
type Guard struct {
signals signals
...
}
type latch struct {
...
signals *signals
...
}
pkg/kv/kvserver/spanlatch/manager.go, line 534 at r4 (raw file):
for {
	select {
	case <-held.done.signalChan():
nit: for uniformity, consider pulling this into a doneCh local variable above poisonCh as well.
pkg/kv/kvserver/spanlatch/signal.go, line 93 at r4 (raw file):
}

// poisonSignal is like signal, but its signal method is idempotent.
(in this PR): consider renaming this idempotentSignal, or something more abstract from the concerns of poisoning.
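For illustration only (this is not the actual spanlatch implementation, and the names are invented): the idempotency being discussed can be sketched with a sync.Once guarding a one-shot channel close, so that extra signal calls are no-ops rather than panics.

```go
package main

import (
	"fmt"
	"sync"
)

// idempotentSignal fires at most once, no matter how many times
// signal is called. Without the sync.Once, a second call would
// close an already-closed channel and panic.
type idempotentSignal struct {
	once sync.Once
	ch   chan struct{}
}

func newIdempotentSignal() *idempotentSignal {
	return &idempotentSignal{ch: make(chan struct{})}
}

// signal fires the signal; repeated calls are harmless no-ops.
func (s *idempotentSignal) signal() {
	s.once.Do(func() { close(s.ch) })
}

// signalChan returns a channel that is closed once signal has been called.
func (s *idempotentSignal) signalChan() <-chan struct{} {
	return s.ch
}

func main() {
	s := newIdempotentSignal()
	s.signal()
	s.signal() // safe: second call does nothing
	<-s.signalChan()
	fmt.Println("signaled exactly once")
}
```

The real signal in spanlatch is a more compact lock-free implementation; the sketch above only captures the idempotency property.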
pkg/roachpb/api.go, line 1349 at r4 (raw file):
// circuit breaker. Transferring a lease while the replication layer is
// unavailable results in the "old" leaseholder relinquishing the ability
// to serve (strong) reads, without being able to hand over the lease.
Nice consideration. Disallowing this seems like a good win.
pkg/kv/kvserver/concurrency/concurrency_manager_test.go, line 263 at r4 (raw file):
	})
	return c.waitAndCollect(t, mon)
case "poison":
tiny nit: we're pretty consistent about new-lines between switch cases in this test.
pkg/kv/kvserver/spanlatch/manager_test.go, line 526 at r4 (raw file):
cha1, _ := m.MustAcquireChExt(ctx, spans("a", "", write, zeroTS), poison.Policy_Wait)
cha2, erra2 := m.MustAcquireChExt(ctx, spans("a", "", write, zeroTS), poison.Policy_Error)
cha3, erra3 := m.MustAcquireChExt(ctx, spans("a", "", write, zeroTS), poison.Policy_Error)
(in this PR): Could we also test the case where a waiter has Policy_Wait and has to poison itself in response to seeing a prereq poisoned? The propagation is interesting.
TFTR! I was able to address all of your comments. Will run some more end-to-end validation (the circuit breaker roachtest, stress for 30 minutes, etc) and merge.
Reviewed 1 of 27 files at r4.
Dismissed @nvanbenschoten from 2 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/kv/kvserver/replica_range_lease.go, line 1300 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
(in this PR): What did we decide about this potentially canceling an ongoing lease acquisition? Is the idea that this is not an issue because we have the brSig.Err() check above, so we won't spin on quickly-abandoned lease acquisitions?
Yes, this won't be an issue due to the check here. Definitely don't want to make any changes to lease ctx handling in this PR, though we could file something for separate consideration. I'm not 100% sure what the "right" thing to do would be. Perhaps we could, when creating the lease request, also "join" a context.WithTimeout(context.Background(), 5*time.Second), meaning that a lease gets at least 5s to complete, regardless of who triggers it.
pkg/kv/kvserver/replica_send.go, line 147 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
(in a later PR): do we still want this withBypassCircuitBreakerMarker function? Can't we check bypassReplicaCircuitBreakerForBatch(ba) below when setting the poison policy?
Thanks for pushing, I was able to get rid of it with just a little bit more work (plumbing a signaller into redirectOrAcquireLeaseForRequest).
pkg/kv/kvserver/concurrency/poison/policy.proto, line 32 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
(in this PR): did we want to make this the zero value, because it's the common case? I'm fine either way.
I had this mixed up, switched it around.
pkg/kv/kvserver/spanlatch/manager.go, line 534 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: for uniformity, consider pulling this into a doneCh local variable above poisonCh as well.
I intentionally didn't do that. There isn't complete symmetry between poison and done: we set poisonCh to nil when first reading from it (this isn't strictly necessary, but still seems nice). If we pull out doneCh, the reader has to check whether a similar mechanism is at play there too, which it isn't, so it seemed better not to pull out a local in the first place. This is all inconsequential, of course. If you still prefer the local, no problem.
I really like this approach compared to the previous one, for this specific use. I'm a bit worried about the loss of generality though. There are failure modes where we may want to block reads. For example, over on #75944 we discussed using circuit breakers to cordon off a range that's seeing application errors, instead of crashing the node. In that case, we definitely want to prevent reads on the failing replica, to avoid serving inconsistent data. Thoughts on how we could accomplish that with this scheme?
Reviewed 7 of 26 files at r1, 16 of 27 files at r4, 14 of 14 files at r5, 2 of 2 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
-- commits, line 52 at r5:
This change feels invasive for stability, but it's still early and I think the refinements here are worth it.
pkg/kv/kvserver/replica_raft.go, line 1266 at r6 (raw file):
func (r *Replica) poisonInflightLatches(err error) {
	r.raftMu.Lock()
Can we do this? What if the replica is deadlocked, or gets stuck on something else while holding raftMu?
pkg/kv/kvserver/replica_send.go, line 453 at r6 (raw file):
}, requestEvalKind)
if pErr != nil {
	if errors.HasType(pErr.GoError(), (*poison.PoisonedError)(nil)) {
This assumes that latch poisoning is always due to circuit breakers. Do we know of any other potential uses for them?
pkg/kv/kvserver/concurrency/poison/error.go, line 11 at r5 (raw file):
// licenses/APL.txt.

package poison
Mild preference for moving all of this down into the concurrency package. Or is it a dependency cycle thing?
Thoughts on how we could accomplish that with this scheme?
I've come around to considering these two different problems to solve. In practice, loss of quorum is the lion's share. The old approach had us regress significantly in terms of follower reads, while "possibly" (but not really; definitely it wasn't tested) doing better in the "catch-all" department.
We can add a big hammer (#75944) via a separate circuit breaker (they're cheap to check, so this isn't an issue - the registry was the expensive part). We will also need to think about when to trigger it (needs to be deadlock proof), etc - it's not like the old approach really got us within striking distance here (not even to mention the need to communicate unhealthy replicas across the cluster discussed on the issue). In short, let's move these questions into the "later" department and discuss during planning.
Reviewed 7 of 26 files at r1, 15 of 27 files at r4, 13 of 14 files at r5, 1 of 2 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/replica_raft.go, line 1266 at r6 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Can we do this? What if the replica is deadlocked, or gets stuck on something else while holding raftMu?
We're assuming here that the Replica functions normally, see the top-level comment.
pkg/kv/kvserver/replica_send.go, line 453 at r6 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This assumes that latch poisoning is always due to circuit breakers. Do we know of any other potential uses for them?
No, but there's no reason we wouldn't be able to generalize this later. The PoisonedError is always local, so there isn't a migration concern.
pkg/kv/kvserver/concurrency/poison/error.go, line 11 at r5 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Mild preference for moving all of this down into the concurrency package. Or is it a dependency cycle thing?
Since Nathan seems happy with it as is, I'm inclined to leave it. This mirrors what we do for the sibling lock package.
All sanity checks came back ok, so I'm preparing to get this merged. Tomorrow is off and after Monday I'm on PTO until Mar 15, at which point I can address any additional non-urgent comments.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
This commit revamps an earlier implementation (cockroachdb#71806) of per-Replica circuit breakers (cockroachdb#33007). The earlier implementation relied on context cancellation and coarsely failed all requests addressing the Replica when the breaker was tripped.

This had two downsides: First, there was a (small) performance overhead for implementing the cancellation that was paid even in the common case of a healthy Replica. Second, and more importantly, the coarseness meant that we'd potentially fail many requests that would otherwise succeed, and in particular follower reads.

@nvanbenschoten suggested in cockroachdb#74799 that latching could be extended with the concept of "poisoning", and that this could result in fine-grained circuit breaker behavior where only requests that are truly affected by unavailability (at the replication layer) would be rejected.

This commit implements that strategy: a request's latches are poisoned if its completion is predicated on the replication layer being healthy. In other words, when the breaker trips, all inflight proposals have their latches poisoned and new proposals are failed fast. However, and this is the big difference, reads can potentially still be accepted in either of two scenarios:

- a valid follower read remains valid regardless of the circuit breaker status, and also regardless of inflight proposals (follower reads don't observe latches).
- a read that can be served under the current lease and which does not conflict with any of the stuck proposals in the replication layer (= poisoned latches) can also be served.

In short, reads only fail fast if they encounter a poisoned latch or need to request a lease. (If they opted out of fail-fast behavior, they behave as today.)

Latch poisoning is added as a first-class concept in the `concurrency` package, and a structured error `PoisonedError` is introduced. This error in particular contains the span and timestamp of the poisoned latch that prompted the fail-fast.

Lease proposals now always use `poison.Policy_Wait`, leaving the fail-fast behavior to the caller. This simplifies things since multiple callers with their own `poison.Policy` can multiplex onto a single inflight lease proposal.

Addresses cockroachdb#74799.

Release note: None
Release justification: 22.1 project work
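As a toy illustration of the fail-fast decision described in the commit message (this is not the actual concurrency-package code; the types and channel plumbing are invented), a waiter's behavior under the two policies might look like:

```go
package main

import (
	"errors"
	"fmt"
)

// Policy loosely mirrors the two poison.Policy variants.
type Policy int

const (
	// PolicyError: fail fast upon encountering a poisoned prerequisite latch.
	PolicyError Policy = iota
	// PolicyWait: keep waiting despite the poisoning (what lease
	// proposals use, leaving fail-fast behavior to their callers).
	PolicyWait
)

var errPoisoned = errors.New("encountered poisoned latch")

// waitOnLatch sketches a request waiting on a prerequisite latch:
// released fires when the latch is dropped, poisoned fires when the
// latch is poisoned (e.g. because the circuit breaker tripped).
func waitOnLatch(released, poisoned <-chan struct{}, pol Policy) error {
	for {
		select {
		case <-released:
			return nil
		case <-poisoned:
			if pol == PolicyError {
				return errPoisoned
			}
			poisoned = nil // PolicyWait: note the poison, keep waiting
		}
	}
}

func main() {
	released := make(chan struct{})
	poisoned := make(chan struct{})
	close(poisoned) // the prerequisite's latch is poisoned
	// A Policy_Error waiter fails fast with a structured error.
	fmt.Println(waitOnLatch(released, poisoned, PolicyError))
	// A Policy_Wait waiter ignores the poison and sees the release.
	close(released)
	fmt.Println(waitOnLatch(released, poisoned, PolicyWait))
}
```

The real implementation also propagates the poison (a Policy_Wait waiter poisons its own latches when it sees a poisoned prereq), which this sketch omits.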
I've come around to considering these two different problems to solve. In practice, loss of quorum is the lion's share. The old approach had us regress significantly in terms of follower reads, while "possibly" (but not really; definitely it wasn't tested) doing better in the "catch-all" department.
I'm inclined to go along with this, because it's a strict improvement over the current state and being able to serve reads seems like a worthwhile win. But I wouldn't consider the circuit breaker work "done" until it generalizes to arbitrary failures such as deadlocks. I suppose it is a broader problem though, in that those sorts of failures should also cause lease transfers (which they currently don't), so maybe we shouldn't be getting ahead of ourselves.
Reviewed 3 of 3 files at r8, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
Right, we will call the "circuit breaker for loss of quorum" work done, and there will still be a need for other blast radius mitigations for other ways in which replicas can fail to perform adequately, such as #75944 or a more generalized circuit breaker scheme that can handle "anything" including, say, a
bors r=erikgrinaker
Build failed:
Unrelated flake
bors r=erikgrinaker
(sent from mobile, please excuse my brevity)
bors r=erikgrinaker
Already running a review
Build succeeded: