
kvserver: allow circuit-breaker to serve reads #76858

Merged: 1 commit into cockroachdb:master on Mar 3, 2022

Conversation

@tbg (Member) commented Feb 21, 2022

This commit revamps an earlier implementation (#71806) of per-Replica
circuit breakers (#33007). The earlier implementation relied on context
cancellation and coarsely failed all requests addressing the Replica
when the breaker was tripped.

This had two downsides: First, there was a (small) performance overhead
for implementing the cancellation that was paid even in the common case
of a healthy Replica. Second, and more importantly, the coarseness meant
that we'd potentially fail many requests that would otherwise succeed,
and in particular follower reads.

@nvanbenschoten suggested in #74799 that latching could be extended with
the concept of "poisoning" and that this could result in fine-grained
circuit breaker behavior where only requests that are truly affected by
unavailability (at the replication layer) would be rejected.

This commit implements that strategy:

A request's latches are poisoned if its completion is predicated on the
replication layer being healthy. In other words, when the breaker
trips, all inflight proposals have their latches poisoned and new
proposals are failed fast. However, and this is the big difference,
reads can potentially still be accepted in either of two scenarios:

  • a valid follower read remains valid regardless of the circuit breaker
    status, and also regardless of inflight proposals (follower reads
    don't observe latches).
  • a read that can be served under the current lease and which does
    not conflict with any of the stuck proposals in the replication
    layer (= poisoned latches) can also be served.

In short, reads only fail fast if they encounter a poisoned latch or
need to request a lease. (If they opted out of fail-fast behavior,
they behave as they do today.)
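
Schematically, the rule for a read while the breaker is tripped looks roughly
like this (a simplified sketch with made-up names, not code from this change):

func readFailsFast(validFollowerRead, needsLease, hitsPoisonedLatch, optedOutOfFailFast bool) bool {
    if validFollowerRead {
        return false // follower reads don't observe latches or the breaker
    }
    if optedOutOfFailFast {
        return false // poison.Policy_Wait: keep waiting, i.e. behave as today
    }
    // Otherwise, fail fast only on a poisoned latch or a pending lease acquisition.
    return hitsPoisonedLatch || needsLease
}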

Latch poisoning is added as a first-class concept in the concurrency
package, and a structured error, PoisonedError, is introduced. In
particular, this error contains the span and timestamp of the poisoned
latch that prompted the fail-fast.

Lease proposals now always use poison.Policy_Wait, leaving the
fail-fast behavior to the caller. This simplifies things since multiple
callers with their own poison.Policy can multiplex onto a single
inflight lease proposal.
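
For illustration, the multiplexing looks roughly like this (hypothetical names
and channels, not the code in this commit): the shared lease proposal itself
never fails fast, and each caller layers its own fail-fast behavior on top.

import (
    "context"
    "errors"
)

// waitForSharedLease sketches one caller waiting on the single inflight lease
// proposal (leaseDone) while applying its own breaker-based fail-fast.
func waitForSharedLease(ctx context.Context, leaseDone <-chan error, breakerTripped <-chan struct{}) error {
    select {
    case err := <-leaseDone: // shared by all callers; uses poison.Policy_Wait
        return err
    case <-breakerTripped: // this caller fails fast; others may keep waiting
        return errors.New("replication unavailable; not waiting for lease acquisition")
    case <-ctx.Done():
        return ctx.Err()
    }
}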

Addresses #74799.

Release note: None
Release justification: 22.1 project work

@cockroach-teamcity (Member): This change is Reviewable

@tbg tbg force-pushed the circ-breaker-poison-latch branch 3 times, most recently from c3ff1d2 to f844c8c on February 22, 2022 20:19
@tbg commented (this comment was marked as resolved)

@tbg tbg force-pushed the circ-breaker-poison-latch branch 7 times, most recently from c565d73 to 463a6ba on February 24, 2022 12:15
@tbg tbg force-pushed the circ-breaker-poison-latch branch 4 times, most recently from a98579e to a472d0e on March 1, 2022 11:21
Resolved review threads:
pkg/kv/kvserver/spanlatch/manager.go (2 threads)
pkg/kv/kvserver/spanlatch/manager_test.go
pkg/kv/kvserver/concurrency/concurrency_control.go
pkg/kv/kvserver/concurrency/concurrency_manager.go
pkg/kv/kvserver/replica_raft.go
pkg/kv/kvserver/replica_range_lease.go
@tbg tbg force-pushed the circ-breaker-poison-latch branch 4 times, most recently from 9c19657 to bbc37ba on March 2, 2022 13:56
@tbg tbg marked this pull request as ready for review March 2, 2022 13:57
@tbg tbg requested review from a team as code owners March 2, 2022 13:57
@tbg tbg requested review from nvanbenschoten and removed request for a team March 2, 2022 13:57
@tbg (Member, Author) commented Mar 2, 2022

require.NoError(t, tc.FollowerRead(n2))


This flakes (rarely but still flakes) with

    client_replica_circuit_breaker_test.go:123: 
                Error Trace:    client_replica_circuit_breaker_test.go:123
                Error:          Received unexpected error:
                                [NotLeaseHolderError] lease held by different store; r45: replica (n2,s2):2LEARNER not lease holder; current lease is repl=(n1,s1):1 seq=1 start=0,0 epo=1 pro=1646229957.683567896,0
                Test:           TestReplicaCircuitBreaker_LeaseholderTripped

Is there a better way to have a guaranteed follower read? I imagine there's precedent for this. Also curious that it's a learner; we do call tc.AddVotersOrFatal(t, k, tc.Target(1)) when setting up these clusters.

Ah, we need to pass true here at least in the scenario at hand:

if err := tc.waitForNewReplicas(startKey, false /* waitForVoter */, targets...); err != nil {
    return roachpb.RangeDescriptor{}, err
}

@nvanbenschoten (Member) left a comment

Reviewed 8 of 26 files at r1, 26 of 27 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/replica_raft.go, line 1275 at r4 (raw file):

			aErr := roachpb.NewAmbiguousResultError("circuit breaker tripped")
			aErr.WrappedErr = roachpb.NewError(err)
			p.signalProposalResult(proposalResult{Err: roachpb.NewError(aErr)})

(in this PR): could you add a comment that this does not cause the request to release its latches?


pkg/kv/kvserver/replica_range_lease.go, line 1126 at r4 (raw file):

	ctx context.Context, reqTS hlc.Timestamp,
) (kvserverpb.LeaseStatus, *roachpb.Error) {
	brSig := r.breaker.Signal()

(in this PR): move this below the fast-path.


pkg/kv/kvserver/replica_range_lease.go, line 1300 at r4 (raw file):

					return nil
				case <-brSig.C():
					llHandle.Cancel()

(in this PR): What did we decide about this potentially canceling an ongoing lease acquisition? Is the idea that this is not an issue because we have the brSig.Err() check above, so we won't spin on quickly-abandoned lease acquisitions?


pkg/kv/kvserver/replica_range_lease.go, line 1364 at r4 (raw file):

	// We explicitly ignore the returned handle as we won't block on it.
	//
	// TODO(tbg): this ctx is likely cancelled very soon, which will in turn

Good catch.


pkg/kv/kvserver/replica_send.go, line 147 at r4 (raw file):

		// have access to the batch, for example the redirectOnOrAcquireLeaseLocked,
		// for which several callers don't even have a BatchRequest.
		ctx = withBypassCircuitBreakerMarker(ctx)

(in a later PR): do we still want this withBypassCircuitBreakerMarker function? Can't we check bypassReplicaCircuitBreakerForBatch(ba) below when setting the poison policy?


pkg/kv/kvserver/replica_send.go, line 456 at r4 (raw file):

			LatchSpans:      latchSpans, // nil if g != nil
			LockSpans:       lockSpans,  // nil if g != nil
			PoisonPolicy:    pp,

nit: move this above Requests so it's in the same order as the struct declaration.


pkg/kv/kvserver/concurrency/concurrency_control.go, line 193 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

TODO: explain what poisoning means further.

(in this PR): Did you want to say anything more here? This file serves as top-level documentation for concurrency control, so it's the place to document poisoning and its interaction with request sequencing.


pkg/kv/kvserver/concurrency/concurrency_control.go, line 195 at r4 (raw file):

	// PoisonReq marks a request's latches as poisoned. This allows requests
	// waiting for the latches to fail fast.
	PoisonReq(*Guard)

(in this PR): mention that this can be called multiple times.


pkg/kv/kvserver/concurrency/poison/error.proto, line 19 at r4 (raw file):

import "gogoproto/gogo.proto";

message PoisonedError {

(in this PR): mind giving this a comment explaining what this error means and when a client might see it?


pkg/kv/kvserver/concurrency/poison/policy.proto, line 28 at r4 (raw file):

// [^1]: https://doc.rust-lang.org/std/sync/struct.Mutex.html#poisoning
enum Policy {
  // Policy_Wait instructs a request to return an error upon encountering

(in this PR): this comment is incorrect.


pkg/kv/kvserver/concurrency/poison/policy.proto, line 32 at r4 (raw file):

  Wait = 0;

  // Policy_Error instructs a request to return an error upon encountering

(in this PR): did we want to make this the zero value, because it's the common case? I'm fine either way.


pkg/kv/kvserver/spanlatch/manager.go, line 90 at r4 (raw file):

	span       roachpb.Span
	ts         hlc.Timestamp
	done       *signal

(in this PR or later): A single request may have many latches, so it's best to keep this struct's memory footprint small. We're only adding a single pointer, so this isn't a big deal, but we can also avoid this. For instance, we could point a latch directly at its *Guard and then have access to done and poison traverse the same pointer. Or limit visibility to prevent misuse with something like:

type signals struct {
    done       signal
    poison     poisonSignal
}

type Guard struct {
    signals signals
    ...
}

type latch struct {
    ...
    signals *signals
    ...
}

pkg/kv/kvserver/spanlatch/manager.go, line 534 at r4 (raw file):

	for {
		select {
		case <-held.done.signalChan():

nit: for uniformity, consider pulling this into a doneCh local variable above poisonCh as well.


pkg/kv/kvserver/spanlatch/signal.go, line 93 at r4 (raw file):

}

// poisonSignal is like signal, but its signal method is idempotent.

(in this PR): consider renaming this idempotentSignal or something more abstract from the concerns of poisoning.
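
For concreteness, roughly what such an idempotent signal could look like (a
sketch, not the actual spanlatch implementation):

import "sync"

// idempotentSignal closes its channel exactly once, no matter how many times
// signal is called.
type idempotentSignal struct {
    once sync.Once
    ch   chan struct{}
}

func newIdempotentSignal() *idempotentSignal {
    return &idempotentSignal{ch: make(chan struct{})}
}

func (s *idempotentSignal) signal() {
    s.once.Do(func() { close(s.ch) })
}

func (s *idempotentSignal) signalChan() <-chan struct{} {
    return s.ch
}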


pkg/roachpb/api.go, line 1349 at r4 (raw file):

	// circuit breaker. Transferring a lease while the replication layer is
	// unavailable results in the "old" leaseholder relinquishing the ability
	// to serve (strong) reads, without being able to hand over the lease.

Nice consideration. Disallowing this seems like a good win.


pkg/kv/kvserver/concurrency/concurrency_manager_test.go, line 263 at r4 (raw file):

				})
				return c.waitAndCollect(t, mon)
			case "poison":

tiny nit: we're pretty consistent about new-lines between switch cases in this test.


pkg/kv/kvserver/spanlatch/manager_test.go, line 526 at r4 (raw file):

	cha1, _ := m.MustAcquireChExt(ctx, spans("a", "", write, zeroTS), poison.Policy_Wait)
	cha2, erra2 := m.MustAcquireChExt(ctx, spans("a", "", write, zeroTS), poison.Policy_Error)
	cha3, erra3 := m.MustAcquireChExt(ctx, spans("a", "", write, zeroTS), poison.Policy_Error)

(in this PR): Could we also test the case where a waiter has Policy_Wait and has to poison itself in response to seeing a prereq poisoned? The propagation is interesting.

@tbg tbg force-pushed the circ-breaker-poison-latch branch from bbc37ba to d2041fe on March 3, 2022 09:02
@tbg tbg requested a review from nvanbenschoten March 3, 2022 09:13
@tbg (Member, Author) left a comment

TFTR! I was able to address all of your comments. Will run some more end-to-end validation (the circuit breaker roachtest, stress for 30 minutes, etc) and merge.

Reviewed 1 of 27 files at r4.
Dismissed @nvanbenschoten from 2 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/replica_range_lease.go, line 1300 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

(in this PR): What did we decide about this potentially canceling an ongoing lease acquisition? Is the idea that this is not an issue because we have the brSig.Err() check above, so we won't spin on quickly-abandoned lease acquisitions?

Yes, this won't be an issue due to the check here. I definitely don't want to make any changes to lease ctx handling in this PR, though we could file something for separate consideration. I'm not 100% sure what the "right" thing to do would be. Perhaps we could, when creating the lease request, also "join" a context.WithTimeout(context.Background(), 5*time.Second), meaning that a lease gets at least 5s to complete regardless of who triggers it.
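
Something like this hypothetical helper, just to sketch the idea (not code in this PR):

import (
    "context"
    "time"
)

// detachForMinimum returns a context that ignores the parent's cancellation
// until at least minRuntime has elapsed since creation, so a shared lease
// acquisition isn't torn down the moment its first requester gives up.
func detachForMinimum(parent context.Context, minRuntime time.Duration) (context.Context, context.CancelFunc) {
    ctx, cancel := context.WithCancel(context.Background())
    minElapsed := time.After(minRuntime) // measured from creation, not from parent cancellation
    go func() {
        defer cancel()
        select {
        case <-parent.Done(): // the requester gave up...
        case <-ctx.Done(): // ...or the operation finished and cancelled itself
            return
        }
        select {
        case <-minElapsed: // ...still give the operation its minimum runtime
        case <-ctx.Done():
        }
    }()
    return ctx, cancel
}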


pkg/kv/kvserver/replica_send.go, line 147 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

(in a later PR): do we still want this withBypassCircuitBreakerMarker function? Can't we check bypassReplicaCircuitBreakerForBatch(ba) below when setting the poison policy?

Thanks for pushing, I was able to get rid of it with just a little bit more work (plumbing a signaller into redirectOrAcquireLeaseForRequest).


pkg/kv/kvserver/concurrency/poison/policy.proto, line 32 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

(in this PR): did we want to make this the zero value, because it's the common case? I'm fine either way.

I had this mixed up, switched it around.


pkg/kv/kvserver/spanlatch/manager.go, line 534 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: for uniformity, consider pulling this into a doneCh local variable above poisonCh as well.

I intentionally didn't do that. There isn't complete symmetry between poison and done: we set poisonCh to nil when first reading from it (this isn't strictly necessary, but still seems nice). If we pull out doneCh, the reader has to check whether a similar mechanism is at play there too, which it isn't, so it seemed better not to pull out a local in the first place. This is all inconsequential, of course. If you still prefer the local, no problem.
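
For reference, a simplified version of the loop (hypothetical signature, not the actual spanlatch code), showing how the nil reset disables the poison case once it has fired:

import "errors"

// waitForLatch waits for the conflicting latch to be released; failFast
// corresponds to poison.Policy_Error, and propagatePoison to a Policy_Wait
// waiter poisoning its own latches so requests queued behind it can fail fast.
func waitForLatch(doneCh, poisonCh <-chan struct{}, failFast bool, propagatePoison func()) error {
    for {
        select {
        case <-doneCh:
            return nil
        case <-poisonCh:
            if failFast {
                return errors.New("encountered poisoned latch")
            }
            propagatePoison()
            poisonCh = nil // a nil channel blocks forever, so this case won't fire again
        }
    }
}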

@tbg (Member, Author) commented Mar 3, 2022

./dev test pkg/kv/kvserver/ --stress --filter Circuit --test-args '-test.timeout 80s' 2>&1 | tee stress.log
4006 runs so far, 0 failures, over 28m30s

@tbg tbg force-pushed the circ-breaker-poison-latch branch 2 times, most recently from 5667512 to 812c295 on March 3, 2022 11:07
@erikgrinaker (Contributor) left a comment

I really like this approach compared to the previous one, for this specific use. I'm a bit worried about the loss of generality though. There are failure modes where we may want to block reads. For example, over on #75944 we discussed using circuit breakers to cordon off a range that's seeing application errors, instead of crashing the node. In that case, we definitely want to prevent reads on the failing replica, to avoid serving inconsistent data. Thoughts on how we could accomplish that with this scheme?

Reviewed 7 of 26 files at r1, 16 of 27 files at r4, 14 of 14 files at r5, 2 of 2 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


-- commits, line 52 at r5:
This change feels invasive for stability, but it's still early and I think the refinements here are worth it.


pkg/kv/kvserver/replica_raft.go, line 1266 at r6 (raw file):

func (r *Replica) poisonInflightLatches(err error) {
	r.raftMu.Lock()

Can we do this? What if the replica is deadlocked, or gets stuck on something else while holding raftMu?


pkg/kv/kvserver/replica_send.go, line 453 at r6 (raw file):

		}, requestEvalKind)
		if pErr != nil {
			if errors.HasType(pErr.GoError(), (*poison.PoisonedError)(nil)) {

This assumes that latch poisoning is always due to circuit breakers. Do we know of any other potential uses for them?


pkg/kv/kvserver/concurrency/poison/error.go, line 11 at r5 (raw file):

// licenses/APL.txt.

package poison

Mild preference for moving all of this down into the concurrency package. Or is it a dependency cycle thing?

@tbg (Member, Author) left a comment

Thoughts on how we could accomplish that with this scheme?

I've come around to considering these as two different problems to solve. In practice, loss of quorum is the lion's share. The old approach had us regress significantly in terms of follower reads, while "possibly" (but not really; it definitely wasn't tested) doing better in the "catch-all" department.

We can add a big hammer (#75944) via a separate circuit breaker (they're cheap to check, so this isn't an issue - the registry was the expensive part). We will also need to think about when to trigger it (needs to be deadlock proof), etc - it's not like the old approach really got us within striking distance here (not even to mention the need to communicate unhealthy replicas across the cluster discussed on the issue). In short, let's move these questions into the "later" department and discuss during planning.

Reviewed 7 of 26 files at r1, 15 of 27 files at r4, 13 of 14 files at r5, 1 of 2 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/replica_raft.go, line 1266 at r6 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Can we do this? What if the replica is deadlocked, or gets stuck on something else while holding raftMu?

We're assuming here that the Replica functions normally, see the top-level comment.


pkg/kv/kvserver/replica_send.go, line 453 at r6 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This assumes that latch poisoning is always due to circuit breakers. Do we know of any other potential uses for them?

No, but there's no reason we wouldn't be able to generalize this later. The PoisonError is always local so there isn't a migration concern.


pkg/kv/kvserver/concurrency/poison/error.go, line 11 at r5 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Mild preference for moving all of this down into the concurrency package. Or is it a dependency cycle thing?

Since Nathan seems happy with it as is I'm inclined to leave it. This mirrors what we do for the sibling lock package.

@tbg (Member, Author) left a comment

All sanity checks came back ok, so I'm preparing to get this merged. Tomorrow is off and after Monday I'm on PTO until Mar 15, at which point I can address any additional non-urgent comments.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@tbg tbg force-pushed the circ-breaker-poison-latch branch from 812c295 to 55fc27d on March 3, 2022 11:36
@erikgrinaker (Contributor) left a comment

I've come around to considering these as two different problems to solve. In practice, loss of quorum is the lion's share. The old approach had us regress significantly in terms of follower reads, while "possibly" (but not really; it definitely wasn't tested) doing better in the "catch-all" department.

I'm inclined to go along with this, because it's a strict improvement over the current state and being able to serve reads seems like a worthwhile win. But I wouldn't consider the circuit breaker work "done" until it generalizes to arbitrary failures such as deadlocks. I suppose it is a broader problem though, in that those sorts of failures should also cause lease transfers (which they currently don't), so maybe we shouldn't be getting ahead of ourselves.

Reviewed 3 of 3 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@tbg (Member, Author) commented Mar 3, 2022

But I wouldn't consider the circuit breaker work "done"

Right, we will call it "circuit breaker for loss of quorum" work done, and there will still be a need for other blast radius mitigations for other ways in which replicas can fail to perform adequately, such as #75944 or a more generalized circuit breaker scheme that can handle "anything" including, say, a replica.mu deadlock. Since the original circuit breaker work was always primarily motivated by loss of quorum and the issue thread is gigantic, I'll file a new issue for the latter.

@tbg (Member, Author) commented Mar 3, 2022

bors r=erikgrinaker

@craig (bot) commented Mar 3, 2022

Build failed:

@tbg (Member, Author) commented Mar 3, 2022 via email

@tbg (Member, Author) commented Mar 3, 2022

bors r=erikgrinaker

@craig (bot) commented Mar 3, 2022

Already running a review

@craig (bot) commented Mar 3, 2022

Build succeeded:

@craig craig bot merged commit fd15d3a into cockroachdb:master Mar 3, 2022