kvserver: default replica circuit breakers to on #74705
Labels: C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception)
Comments
tbg added the C-enhancement label on Jan 11, 2022
Definitely want #74707, together with some benchmarks showing that we're not paying dearly in perf.
tbg added a commit to tbg/cockroach that referenced this issue on Jan 11, 2022:
Fixes cockroachdb#33007. Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that is sensitive to the circuit breaker tripping (relying on context cancellation makes sure we don't end up with a second channel that needs to be plumbed to everyone and their dog; Go makes it really difficult to join two cancellation chains); if the breaker is tripped when the request arrives, it is refused right away. In either case, the outgoing error is augmented to carry information about the tripped breaker. This isn't 100% trivial, since retreating from in-flight replication typically causes an `AmbiguousResultError`, and while we could avoid it in some cases we can't in all. A similar problem occurs when the canceled request was waiting on a lease, in which case the result is a NotLeaseholderError.

For any request that made it in, if it enters replication but does not manage to succeed within `base.SlowRequestThreshold` (15s at time of writing), the breaker is tripped, cancelling all inflight and future requests until the breaker heals. Perhaps surprisingly, the existing "slowness" detection (the informational "have been waiting ... for proposal" warning in `executeWriteBatch`) was moved deeper into replication (`refreshProposalsLocked`), where it now trips the breaker. This has the advantage of providing a unified path for lease requests (which don't go through `executeWriteBatch`) and pipelined writes (which return before waiting on the inflight replication process). To make this work, we need to pick a little fight with how leases are (not) refreshed (cockroachdb#74711) and we need to remember the ticks at which a proposal was first inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a `ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is roughly the case whenever replicating a command through Raft is possible for the Replica, either by appending to the log as the Raft leader, or by forwarding to the Raft leader. A tiny bit of special casing allows requests made by the probe to bypass the breaker.

As of this PR, all of this is off by default, gated behind the `kv.replica_circuit_breakers.enabled` cluster setting, while we determine which of the many follow-up items to target for 22.1 (see cockroachdb#74705). Primarily, though, the breaker-sensitive context cancelation is a toy implementation that comes with too much of a performance overhead (one extra goroutine per request); cockroachdb#74707 will address that. Other things not done here that we certainly want in the 22.1 release are:

- UI work (cockroachdb#74713)
- Metrics (cockroachdb#74505)

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
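To make the "one extra goroutine per request" overhead concrete, here is a minimal Go sketch (hypothetical names, not the actual kvserver code) of the toy approach described above: a per-request context is canceled either when the caller's context is canceled or when the breaker trips.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// withBreakerCancellation derives a context that is canceled when either the
// parent context is canceled or the breaker trips (signaled here by closing
// trippedC). The price is one extra goroutine per request, which is the
// overhead that the follow-up work in cockroachdb#74707 removes.
func withBreakerCancellation(
	parent context.Context, trippedC <-chan struct{},
) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(parent)
	go func() {
		select {
		case <-trippedC:
			cancel() // breaker tripped: fail the in-flight request fast
		case <-ctx.Done():
			// Request finished (or parent canceled); nothing to do.
		}
	}()
	return ctx, cancel
}

func main() {
	tripped := make(chan struct{})
	ctx, cancel := withBreakerCancellation(context.Background(), tripped)
	defer cancel()

	close(tripped) // simulate the breaker tripping mid-request
	select {
	case <-ctx.Done():
		fmt.Println("request canceled:", ctx.Err())
	case <-time.After(time.Second):
		fmt.Println("request was not canceled")
	}
}
```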
tbg added a commit to tbg/cockroach that referenced this issue on Jan 11, 2022
tbg added a commit to tbg/cockroach that referenced this issue on Jan 13, 2022
tbg added a commit to tbg/cockroach that referenced this issue on Jan 13, 2022:
Fixes cockroachdb#33007. Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that is sensitive to the circuit breaker tripping (relying on context cancellation makes sure we don't end up with a second channel that needs to be plumbed to everyone and their dog; Go makes it really difficult to join two cancellation chains); if the breaker is tripped when the request arrives, it is refused right away. In either case, the outgoing error is augmented to carry information about the tripped breaker. This isn't 100% trivial, since retreating from in-flight replication typically causes an `AmbiguousResultError`, and while we could avoid it in some cases we can't in all. A similar problem occurs when the canceled request was waiting on a lease, in which case the result is a NotLeaseholderError.

For any request that made it in, if it enters replication but does not manage to succeed within the timeout specified by the `kv.replica_circuit_breaker.slow_replication_threshold` cluster setting, the breaker is tripped, cancelling all inflight and future requests until the breaker heals. Perhaps surprisingly, the existing "slowness" detection (the informational "have been waiting ... for proposal" warning in `executeWriteBatch`) was moved deeper into replication (`refreshProposalsLocked`), where it now trips the breaker. This has the advantage of providing a unified path for lease requests (which don't go through `executeWriteBatch`) and pipelined writes (which return before waiting on the inflight replication process). To make this work, we need to pick a little fight with how leases are (not) refreshed (cockroachdb#74711) and we need to remember the ticks at which a proposal was first inserted (despite potential reproposals).

Perhaps surprisingly, when the breaker is tripped, *all* traffic to the Replica gets the fail-fast behavior, not just mutations. This is because even though a read may look good to serve based on the lease, we would also need to check for latches, and in particular we would need to fail fast if there is a transitive dependency on any write (since that write is presumably stuck). This is not trivial and so we don't do it in this first iteration (see cockroachdb#74799).

A tripped breaker deploys a background probe which sends a `ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is roughly the case whenever replicating a command through Raft is possible for the Replica, either by appending to the log as the Raft leader, or by forwarding to the Raft leader. A tiny bit of special casing allows requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the entire mechanism) until some necessary follow-up items have been addressed (see cockroachdb#74705). For example, the breaker-sensitive context cancelation is a toy implementation that comes with too much of a performance overhead (one extra goroutine per request); cockroachdb#74707 will address that. Other things not done here that we certainly want in the 22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
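A rough sketch of the slowness check described above, in plain Go with made-up names (`proposal`, `maybeTripBreaker`, the tick interval): a proposal pending longer than the configured threshold trips the breaker, and a zero threshold disables the mechanism. The real logic in `refreshProposalsLocked` is considerably more involved; this only illustrates the threshold comparison.

```go
package main

import (
	"fmt"
	"time"
)

// proposal tracks the tick at which a command was first inserted into the
// proposals map (reproposals keep the original tick, per the commit message).
type proposal struct {
	id           string
	insertedTick int64
}

// maybeTripBreaker reports whether a proposal has been pending for longer
// than the slow-replication threshold and, if so, trips the breaker via the
// given callback. A zero threshold disables the mechanism, mirroring the
// cluster setting's default in this PR. All names are stand-ins.
func maybeTripBreaker(
	p proposal, currentTick int64, tickInterval, threshold time.Duration, trip func(error),
) bool {
	if threshold == 0 {
		return false // breakers disabled
	}
	pending := time.Duration(currentTick-p.insertedTick) * tickInterval
	if pending < threshold {
		return false
	}
	trip(fmt.Errorf("proposal %s pending for %s (threshold %s)", p.id, pending, threshold))
	return true
}

func main() {
	trip := func(err error) { fmt.Println("breaker tripped:", err) }
	p := proposal{id: "cmd-1", insertedTick: 3}
	// With 200ms ticks and a 15s threshold, tick 80 means ~15.4s of waiting.
	maybeTripBreaker(p, 80, 200*time.Millisecond, 15*time.Second, trip)
}
```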
tbg added a commit to tbg/cockroach that referenced this issue on Jan 13, 2022
craig bot pushed a commit that referenced this issue on Jan 14, 2022:
71806: kvserver: circuit-break requests to unavailable ranges r=erikgrinaker a=tbg

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue on Jan 31, 2022:
kvserver: replace circuit breaker cancel goroutine with per-Replica registry

PR cockroachdb#71806 originally landed with a very simple solution for the problem of joining the breaker's cancellation chain with that of the incoming context: spawning a goroutine. This was never meant to be the solution in prod (spawning goroutines is not cheap) and this PR switches to a real solution (which may still have some room for improvement).

Replica circuit breakers are still off (cockroachdb#74705) and so this change should not change any behavior, unless breakers are turned on.

The provided benchmarks show both the "end-to-end" `Replica.Send` as well as the micro-bench `breaker.{R,Un}egister` perf and contrast it with that of having the breakers disabled. This sets the stage for evaluating and, if necessary, minimizing the overhead, which (along with some additional end-to-end testing) then allows us to turn breakers on by default (cockroachdb#74705).

`BenchmarkReplicaCircuitBreakerSendOverhead` measures `Replica.Send` of a no-result single-point read. We see that we lose a few percent, though this benchmark is pretty micro already.

`BenchmarkReplicaCircuitBreaker_Register` is even more micro, measuring the overhead of making a cancel, and calling `Register()` followed by `Unregister()`. The "old" measurement is essentially just creating a cancelable context (so the +XXXX% numbers don't mean a lot; we're doing work for which there was previously no analog). Here we can clearly see that sharding the mutex map can help, though note that this would already only be visible at the `Replica.Send` level at much higher concurrencies than in this test (16).

I also ran the kv95 roachtest with what corresponds to the `mutexmap-1` strategy above (i.e. a single mutex); [raw data here]. In this configuration (relative to disabling breakers) `kv95/enc=false/nodes=1` and `kv95/enc=false/nodes=1/cpu=32` sustained a -1.62% hit (in qps, median over five runs). Once we've chosen a suitable shard count, I expect that to drop a bit as well, and think we can turn on breakers in production, at least from a performance point of view.
[raw data here]: https://gist.github.com/tbg/44b53a10d18490a6dabd57e1d1d3225e

old = disabled, new = enabled:

```
ReplicaCircuitBreakerSendOverhead/X/mutexmap-1-16   1.63µs ± 4%  1.77µs ± 5%  +8.38%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-2-16   1.66µs ± 4%  1.75µs ± 5%  +5.74%  (p=0.000 n=10+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-4-16   1.63µs ± 6%  1.75µs ± 4%  +7.48%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-8-16   1.58µs ± 7%  1.73µs ± 4%  +9.47%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-12-16  1.57µs ± 4%  1.62µs ± 6%  +3.19%  (p=0.046 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-16-16  1.56µs ± 5%  1.63µs ± 7%  +4.20%  (p=0.007 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-20-16  1.53µs ± 5%  1.62µs ± 8%  +6.01%  (p=0.001 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-24-16  1.55µs ± 7%  1.61µs ± 5%  +3.50%  (p=0.049 n=11+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-32-16  1.57µs ± 7%  1.62µs ± 6%  +3.53%  (p=0.042 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-64-16  1.53µs ± 7%  1.64µs ± 5%  +6.84%  (p=0.000 n=11+12)

name  old alloc/op  new alloc/op  delta
ReplicaCircuitBreakerSendOverhead/X/mutexmap-1-16   1.42kB ± 0%  1.44kB ± 0%  +1.25%  (p=0.000 n=7+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-2-16   1.42kB ± 0%  1.44kB ± 0%  +1.19%  (p=0.000 n=9+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-4-16   1.42kB ± 0%  1.43kB ± 0%  +1.14%  (p=0.000 n=11+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-8-16   1.42kB ± 0%  1.43kB ± 0%  +1.24%  (p=0.000 n=9+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-12-16  1.42kB ± 0%  1.43kB ± 0%  +1.08%  (p=0.000 n=11+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-16-16  1.42kB ± 0%  1.43kB ± 0%  +1.14%  (p=0.000 n=10+10)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-20-16  1.42kB ± 0%  1.43kB ± 0%  +1.12%  (p=0.000 n=10+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-24-16  1.42kB ± 0%  1.43kB ± 0%  +1.09%  (p=0.000 n=11+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-32-16  1.41kB ± 0%  1.43kB ± 0%  +1.13%  (p=0.000 n=9+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-64-16  1.41kB ± 0%  1.43kB ± 0%  +1.23%  (p=0.000 n=9+12)

name  old allocs/op  new allocs/op  delta
ReplicaCircuitBreakerSendOverhead/X/mutexmap-1-16   11.0 ± 0%  12.0 ± 0%  +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-2-16   11.0 ± 0%  12.0 ± 0%  +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-4-16   11.0 ± 0%  12.0 ± 0%  +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-8-16   11.0 ± 0%  12.0 ± 0%  +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-12-16  11.0 ± 0%  12.0 ± 0%  +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-16-16  11.0 ± 0%  12.0 ± 0%  +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-20-16  11.0 ± 0%  12.0 ± 0%  +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-24-16  11.0 ± 0%  12.0 ± 0%  +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-32-16  11.0 ± 0%  12.0 ± 0%  +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-64-16  11.0 ± 0%  12.0 ± 0%  +9.09%  (p=0.000 n=11+12)
```

```
ReplicaCircuitBreaker_Register/X/mutexmap-1-16   30.8ns ±21%  787.6ns ± 2%  +2457.74%  (p=0.000 n=11+11)
ReplicaCircuitBreaker_Register/X/mutexmap-2-16   31.6ns ± 3%  782.0ns ± 2%  +2376.38%  (p=0.000 n=9+12)
ReplicaCircuitBreaker_Register/X/mutexmap-4-16   31.0ns ± 0%  778.8ns ± 2%  +2409.61%  (p=0.000 n=9+12)
ReplicaCircuitBreaker_Register/X/mutexmap-8-16   31.0ns ± 0%  775.8ns ± 1%  +2403.11%  (p=0.000 n=10+10)
ReplicaCircuitBreaker_Register/X/mutexmap-12-16  31.0ns ± 1%  288.8ns ± 2%  +833.06%   (p=0.000 n=10+12)
ReplicaCircuitBreaker_Register/X/mutexmap-16-16  31.1ns ± 1%  324.6ns ± 4%  +945.05%   (p=0.000 n=10+12)
ReplicaCircuitBreaker_Register/X/mutexmap-20-16  31.1ns ± 0%  193.3ns ± 2%  +522.26%   (p=0.000 n=10+12)
ReplicaCircuitBreaker_Register/X/mutexmap-24-16  31.0ns ± 0%  286.1ns ± 1%  +822.80%   (p=0.000 n=9+12)
ReplicaCircuitBreaker_Register/X/mutexmap-32-16  31.0ns ± 0%  194.4ns ± 2%  +527.12%   (p=0.000 n=10+12)
ReplicaCircuitBreaker_Register/X/mutexmap-64-16  31.0ns ± 1%  121.3ns ± 2%  +291.01%   (p=0.000 n=10+12)

name  old alloc/op  new alloc/op  delta
ReplicaCircuitBreaker_Register/X/mutexmap-1-16   80.0B ± 0%  96.0B ± 0%  +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-2-16   80.0B ± 0%  96.0B ± 0%  +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-4-16   80.0B ± 0%  96.0B ± 0%  +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-8-16   80.0B ± 0%  96.0B ± 0%  +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-12-16  80.0B ± 0%  96.0B ± 0%  +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-16-16  80.0B ± 0%  96.0B ± 0%  +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-20-16  80.0B ± 0%  96.0B ± 0%  +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-24-16  80.0B ± 0%  96.0B ± 0%  +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-32-16  80.0B ± 0%  96.0B ± 0%  +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-64-16  80.0B ± 0%  96.0B ± 0%  +20.00%  (p=0.000 n=11+12)

name  old allocs/op  new allocs/op  delta
ReplicaCircuitBreaker_Register/X/mutexmap-1-16   2.00 ± 0%  3.00 ± 0%  +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-2-16   2.00 ± 0%  3.00 ± 0%  +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-4-16   2.00 ± 0%  3.00 ± 0%  +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-8-16   2.00 ± 0%  3.00 ± 0%  +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-12-16  2.00 ± 0%  3.00 ± 0%  +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-16-16  2.00 ± 0%  3.00 ± 0%  +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-20-16  2.00 ± 0%  3.00 ± 0%  +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-24-16  2.00 ± 0%  3.00 ± 0%  +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-32-16  2.00 ± 0%  3.00 ± 0%  +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-64-16  2.00 ± 0%  3.00 ± 0%  +50.00%  (p=0.000 n=11+12)
```

Created via:

```
$ ./dev bench ./pkg/kv/kvserver/ -f BenchmarkReplicaCircuitBreaker --count 10 | tee out.txt
$ for f in enabled=true enabled=false; do grep -F "${f}" out.txt | sed "s/${f}/X/" > "${f}"; done && benchstat 'enabled=false' 'enabled=true'
```

For the record, I also tried a `sync.Map`, but it was vastly inferior for this use case as it is allocation-heavy and its writes are fairly slow.

Touches cockroachdb#33007.
Fixes cockroachdb#74707.

Release note: None
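The registry being benchmarked can be sketched roughly as follows (assumed structure in plain Go, not the actual kvserver types): each request registers its context's cancel function in a mutex-protected map shard, unregisters it on completion, and a breaker trip cancels everything currently registered.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

const numShards = 8 // shard count is the tunable explored in the benchmarks above

type shard struct {
	mu      sync.Mutex
	cancels map[int64]context.CancelFunc
}

// cancelRegistry replaces the per-request goroutine: requests register their
// cancel functions, and a tripped breaker cancels all registered requests.
type cancelRegistry struct {
	nextID atomic.Int64
	shards [numShards]shard
}

func newCancelRegistry() *cancelRegistry {
	r := &cancelRegistry{}
	for i := range r.shards {
		r.shards[i].cancels = map[int64]context.CancelFunc{}
	}
	return r
}

func (r *cancelRegistry) Register(cancel context.CancelFunc) int64 {
	id := r.nextID.Add(1)
	s := &r.shards[id%numShards]
	s.mu.Lock()
	s.cancels[id] = cancel
	s.mu.Unlock()
	return id
}

func (r *cancelRegistry) Unregister(id int64) {
	s := &r.shards[id%numShards]
	s.mu.Lock()
	delete(s.cancels, id)
	s.mu.Unlock()
}

// CancelAll is what a tripped breaker would invoke.
func (r *cancelRegistry) CancelAll() {
	for i := range r.shards {
		s := &r.shards[i]
		s.mu.Lock()
		for _, cancel := range s.cancels {
			cancel()
		}
		s.mu.Unlock()
	}
}

func main() {
	reg := newCancelRegistry()
	ctx, cancel := context.WithCancel(context.Background())
	id := reg.Register(cancel)
	defer reg.Unregister(id)

	reg.CancelAll() // simulate the breaker tripping
	fmt.Println("request canceled:", ctx.Err() != nil)
}
```

Unlike the goroutine-based approach, the per-request cost here is the pair of map operations that the `Register`/`Unregister` micro-benchmark above measures.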
tbg added a commit to tbg/cockroach that referenced this issue on Feb 1, 2022:
This is a smoke test; I'm planning to enable these across the board in two weeks. Touches cockroachdb#33007. Touches cockroachdb#74705. Release note: None
tbg added a commit to tbg/cockroach that referenced this issue on Feb 4, 2022
craig bot pushed a commit that referenced this issue on Feb 4, 2022:
75224: kvserver: replace circuit breaker cancel goroutine with per-Replica registry r=erikgrinaker a=tbg

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue on Feb 21, 2022:
Fixes cockroachdb#74705. Release note: None
blathers-crl bot pushed a commit that referenced this issue on Mar 22, 2022:
Fixes #74705. Release note: None
Touches #33007.
As of #71806, per-replica circuit breakers are off by default. Once they are in
a state suitable for production use, the default should be flipped to "on", and
this must happen before the 22.1 release branch is cut.
This should be accompanied by additional testing and by a review of all issues
linked to #33007 to determine which ones are required before the flip.
The commit that defaults breakers to on should also add the release note, which has intentionally been held back.
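As a hedged illustration only (plain Go, not the actual CockroachDB settings code; the 15s value is illustrative), the flip amounts to changing the default of the slow-replication threshold from zero, which disables the mechanism, to a nonzero duration:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-in for the default of
// kv.replica_circuit_breaker.slow_replication_threshold.
// Before this issue: 0 (circuit breakers disabled).
// After the flip: a nonzero duration (15s here is only an example).
const defaultSlowReplicationThreshold = 15 * time.Second

// breakersEnabled mirrors the zero-means-disabled convention from #71806.
func breakersEnabled(threshold time.Duration) bool {
	return threshold > 0
}

func main() {
	fmt.Println("circuit breakers on by default:",
		breakersEnabled(defaultSlowReplicationThreshold))
}
```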
Jira issue: CRDB-12220