
kvserver: add metrics for per-Replica circuit breakers #74505

Closed
tbg opened this issue Jan 6, 2022 · 0 comments · Fixed by #74908
Assignees
Labels
A-kv-observability, C-enhancement (solution expected to add code/behavior + preserve backward-compat; pg compat issues are an exception)

Comments


tbg commented Jan 6, 2022

Part of #33007.

With #71806, Replicas can fail-fast when they detect loss of quorum. We need metrics for these circuit breakers:

  • number of currently tripped breakers (gauge)
  • count of trip events, i.e. transitions of a breaker from healthy to unhealthy (counter)

We also want telemetry on the latter.

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-observability labels Jan 6, 2022
@tbg tbg self-assigned this Jan 6, 2022
tbg added a commit to tbg/cockroach that referenced this issue Jan 11, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within `base.SlowRequestThreshold` (15s at time of
writing), the breaker is tripped, cancelling all inflight and future
requests until the breaker heals. Perhaps surprisingly, the existing
"slowness" detection (the informational "have been waiting ... for
proposal" warning in `executeWriteBatch`) was moved deeper into
replication (`refreshProposalsLocked`), where it now trips the breaker.
This has the advantage of providing a unified path for lease requests
(which don't go through `executeWriteBatch`) and pipelined writes (which
return before waiting on the inflight replication process). To make this
work, we need to pick a little fight with how leases are (not)
refreshed (cockroachdb#74711) and we need to remember the ticks at which a proposal
was first inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, all of this is opt-in via the
`kv.replica_circuit_breakers.enabled` cluster setting while we determine
which of the many follow-up items to target for 22.1 (see cockroachdb#74705).
Primarily though, the breaker-sensitive context cancelation is a toy
implementation that comes with too much of a performance overhead (one
extra goroutine per request); cockroachdb#74707 will address that.

Other things not done here that we certainly want in the 22.1 release
are

- UI work (cockroachdb#74713)
- Metrics (cockroachdb#74505)

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 13, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within the timeout specified by the
`kv.replica_circuit_breaker.slow_replication_threshold` cluster setting,
the breaker is tripped, cancelling all inflight and future requests
until the breaker heals. Perhaps surprisingly, the existing "slowness"
detection (the informational "have been waiting ... for proposal"
warning in `executeWriteBatch`) was moved deeper into replication
(`refreshProposalsLocked`), where it now trips the breaker.  This has
the advantage of providing a unified path for lease requests (which
don't go through `executeWriteBatch`) and pipelined writes (which return
before waiting on the inflight replication process). To make this work,
we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

Perhaps surprisingly, when the breaker is tripped, *all* traffic to
the Replica gets the fail-fast behavior, not just mutations. This is
because even though a read may look good to serve based on the lease,
we would also need to check for latches, and in particular we would
need to fail-fast if there is a transitive dependency on any write
(since that write is presumably stuck). This is not trivial and so
we don't do it in this first iteration (see cockroachdb#74799).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the
entire mechanism) until some necessary follow-up items have been
addressed (see cockroachdb#74705). For example, the breaker-sensitive context
cancelation is a toy implementation that comes with too much of a
performance overhead (one extra goroutine per request); cockroachdb#74707 will
address that.  Other things not done here that we certainly want in the
22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None

craig bot pushed a commit that referenced this issue Jan 14, 2022
71806: kvserver: circuit-break requests to unavailable ranges r=erikgrinaker a=tbg

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Jan 19, 2022
Add a metric that tracks how many replicas have tripped circuit
breakers.
Add a metric that counts the trip events as well. This can highlight
problems where the circuit is tripped in error and immediately untrips
(which may not be caught by the first metric).

Since tripped circuit breakers highlight an availability problem,
we're also adding an alert/aggregation rule.

Also, as requested by @mwang1026, report the count of trip events via
telemetry as well.

Fixes cockroachdb#74505.

Release note: None
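The commit above also adds an alert/aggregation rule since a tripped breaker means an availability problem. As a rough illustration of the idea only (the metric name and duration here are hypothetical, not CockroachDB's actual exported names or shipped rule), a Prometheus-style alert on the gauge might look like:

```yaml
groups:
  - name: kv-replica-circuit-breakers
    rules:
      - alert: ReplicaCircuitBreakerTripped
        # Any nonzero gauge means at least one range is fail-fasting traffic.
        expr: kv_replica_circuit_breaker_num_tripped > 0
        for: 5m
```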
craig bot pushed a commit that referenced this issue Jan 20, 2022
74908: kvserver: add metric for replica circuit breaker r=erikgrinaker a=tbg



Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@craig craig bot closed this as completed in 442e5b9 Jan 20, 2022
gtr pushed a commit to gtr/cockroach that referenced this issue Jan 24, 2022