-
Notifications
You must be signed in to change notification settings - Fork 466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add docs on replica circuit breakers #13676
Conversation
Files changed: |
✅ Netlify Preview
To edit notification comments on pull requests, go to your Netlify site settings. |
7e7a3c6
to
a5d4eee
Compare
Hi @tbg ! Wrote this up based on my understanding of your writings in cockroachdb/cockroach#71806 and some of the other issues/PRs linked from there Please let me know if anything important is missing, or if something here is just wrong I wasn't entirely sure about putting the config info here (env var / cluster settings), but there didn't seem to be a better place than directly at the point of description Similar to the above the stuff about "limitations". I think I lifted it from something linked above but happy to update if I've misunderstood these, or if another term than "limitations" is more apt Finally, there will be add'l docs work to link to the DB Console / observability bits in DOC-2643 and DOC-2495 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thank you Rich! The one question I have is whether we usually point folks at the env var. The env var is mostly a fallback - if the cluster is in a bad enough state so that setting cluster settings is no longer an option, the env var can bail us out. I think it would be better to, if we even mention the env var, mention it last. We don't want to steer users towards setting env vars since that is hard to keep track of.
a5d4eee
to
72e545e
Compare
Not necessarily, our docs on the various env vars are pretty ad hoc AFAICT. I mostly included it for completeness' sake
Ah ok. Did not realize - sounds like more of a troubleshooting thing then
I went ahead and removed it from the doc altogether. If it's there someone might use it just because it's there and they prefer env vars for some reason. If it really needs to be added later we can always do that |
PS thanks for the review Tobi! Forgot to say that :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I had two hyper nits, but you can definitely ignore those if you want.
|
||
Per-replica circuit breakers have the following limitations: | ||
|
||
- They cannot prevent hung requests when the node's [liveness range](#epoch-based-leases-table-data) is unavailable. For more information about troubleshooting a cluster that's having node liveness issues, see [Node liveness issues](../cluster-setup-troubleshooting.html#node-liveness-issues). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- They cannot prevent hung requests when the node's [liveness range](#epoch-based-leases-table-data) is unavailable. For more information about troubleshooting a cluster that's having node liveness issues, see [Node liveness issues](../cluster-setup-troubleshooting.html#node-liveness-issues). | |
- They cannot prevent requests from hanging when the node's [liveness range](#epoch-based-leases-table-data) is unavailable. For more information about troubleshooting a cluster that's having node liveness issues, see [Node liveness issues](../cluster-setup-troubleshooting.html#node-liveness-issues). |
|
||
##### Configuration | ||
|
||
Per-replica circuit breakers are enabled by default. Most users will not have to do any configuration to get the benefits of this feature. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per-replica circuit breakers are enabled by default. Most users will not have to do any configuration to get the benefits of this feature. | |
Per-replica circuit breakers are enabled by default. Most users will not have to configure anything to get the benefits of this feature. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving some suggestions on how the changes in cockroachdb/cockroach#75809 can be quickly covered here. (Also one wording nit/question.)
I felt this was the best way to address DOC-2643. But @rmloveland if you feel this is out of scope of the Architecture docs, just let me know and I'll try to come up with another solution.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @rmloveland, and @taroface)
v22.1/architecture/replication-layer.md
line 75 at r1 (raw file):
<span class="version-tag">New in v22.1</span>: When individual [ranges](overview.html#architecture-range) become temporarily unavailable, requests to those ranges are refused by a per-replica "circuit breaker" mechanism instead of hanging indefinitely. From a user's perspective, this means that if a [SQL query](sql-layer.html) is going to ultimately fail due to accessing a temporarily unavailable range, a [replica](overview.html#architecture-replica) in that range will trip its circuit breaker (after 60 seconds [by default](#per-replica-circuit-breaker-timeout)) and bubble a `ReplicaUnavailableError` error back up through the system to inform the user why their query did not succeed. These (hopefully transient) errors are also signalled as events in the DB Console's [Replication Dashboard](../ui-replication-dashboard.html). Meanwhile, CockroachDB continues asynchronously probing the range's availability. If the replica becomes available again, the breaker is reset so that it can go back to serving requests normally.
Suggestion:
These (hopefully transient) errors are also signalled as events in the DB Console's [**Replication** dashboard](../ui-replication-dashboard.html) and as "circuit breaker errors" in its [**Problem Ranges** and **Range Status** pages](../ui-debug-pages.html).
v22.1/architecture/replication-layer.md
line 75 at r1 (raw file):
<span class="version-tag">New in v22.1</span>: When individual [ranges](overview.html#architecture-range) become temporarily unavailable, requests to those ranges are refused by a per-replica "circuit breaker" mechanism instead of hanging indefinitely. From a user's perspective, this means that if a [SQL query](sql-layer.html) is going to ultimately fail due to accessing a temporarily unavailable range, a [replica](overview.html#architecture-replica) in that range will trip its circuit breaker (after 60 seconds [by default](#per-replica-circuit-breaker-timeout)) and bubble a `ReplicaUnavailableError` error back up through the system to inform the user why their query did not succeed. These (hopefully transient) errors are also signalled as events in the DB Console's [Replication Dashboard](../ui-replication-dashboard.html). Meanwhile, CockroachDB continues asynchronously probing the range's availability. If the replica becomes available again, the breaker is reset so that it can go back to serving requests normally.
Should this be "of that range"?
Code quote:
in that range
v22.1/architecture/replication-layer.md
line 79 at r1 (raw file):
This feature is designed to increase the availability of your CockroachDB clusters by making them more robust to transient errors. For more information about how to view per-replica circuit breaker events happening on your cluster in the [DB Console](../ui-overview.html), see the [Replication Dashboard](../ui-replication-dashboard.html) documentation.
Suggestion:
For more information about per-replica circuit breaker events happening on your cluster in the [DB Console](../ui-overview.html):
- Open the [**Replication** dashboard](../ui-replication-dashboard.html).
- Open the [**Advanced Debug** page](../ui-debug-pages.html). The **Problem Ranges** page lists the range replicas whose circuit breakers were tripped. The **Range Status** page displays the circuit breaker error message for a given range.
Fixes DOC-2584 Fixes DOC-2643 (Related: issue 2495 in the DOC project) Summary of changes: - Add a new section 'Per-replica circuit breakers' to the 'Replication Layer' page in the architecture docs, which has - Overview of how the feature works at a (very) high level - Something about how to configure it (and a note that most shouldn't need to) - Notes on a couple limitations of its operation - Links to the Replication Dashboard and Advanced Debug pages, which have the circuit breaker observability stuff
72e545e
to
b3627b1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Eric! Love a good hyper nit - took 'em both!
Thanks Ryan! I took basically everything with a few minor tweaks. I will also update the PR description and commit message to mention DOC-2643 so that gets closed out when this merges.
re: whether out of scope, normally we don't really link too often to more "userish" things from architecture docs but it would arguably make this info way less useful if we were to fail to say that we have some way to observe these breakers, so let's go with it!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling and @taroface)
v22.1/architecture/replication-layer.md
line 75 at r1 (raw file):
Previously, taroface (Ryan Kuo) wrote…
Should this be "of that range"?
I think of the range as the overall "thing", and each replica is "in" the range, in that the range is "made up of" its various replicas. That said I think "of that range" probably also works. I'll probably leave it as is for now since Tobi looked at it already
v22.1/architecture/replication-layer.md
line 75 at r1 (raw file):
<span class="version-tag">New in v22.1</span>: When individual [ranges](overview.html#architecture-range) become temporarily unavailable, requests to those ranges are refused by a per-replica "circuit breaker" mechanism instead of hanging indefinitely. From a user's perspective, this means that if a [SQL query](sql-layer.html) is going to ultimately fail due to accessing a temporarily unavailable range, a [replica](overview.html#architecture-replica) in that range will trip its circuit breaker (after 60 seconds [by default](#per-replica-circuit-breaker-timeout)) and bubble a `ReplicaUnavailableError` error back up through the system to inform the user why their query did not succeed. These (hopefully transient) errors are also signalled as events in the DB Console's [Replication Dashboard](../ui-replication-dashboard.html). Meanwhile, CockroachDB continues asynchronously probing the range's availability. If the replica becomes available again, the breaker is reset so that it can go back to serving requests normally.
NOICE - updated
v22.1/architecture/replication-layer.md
line 79 at r1 (raw file):
This feature is designed to increase the availability of your CockroachDB clusters by making them more robust to transient errors. For more information about how to view per-replica circuit breaker events happening on your cluster in the [DB Console](../ui-overview.html), see the [Replication Dashboard](../ui-replication-dashboard.html) documentation.
Good info - added it (edited very slightly), thanks!
v22.1/architecture/replication-layer.md
line 85 at r1 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
Per-replica circuit breakers are enabled by default. Most users will not have to configure anything to get the benefits of this feature.
Better, thanks!
v22.1/architecture/replication-layer.md
line 97 at r1 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
- They cannot prevent requests from hanging when the node's [liveness range](#epoch-based-leases-table-data) is unavailable. For more information about troubleshooting a cluster that's having node liveness issues, see [Node liveness issues](../cluster-setup-troubleshooting.html#node-liveness-issues).
Better - thank you
Add docs on replica circuit breakers
Fixes DOC-2584
Fixes DOC-2643
(Related: issue 2495 in the DOC project)
Summary of changes:
Add a new section 'Per-replica circuit breakers' to the 'Replication
Layer' page in the architecture docs, which has
Overview of how the feature works at a (very) high level
Something about how to configure it (and a note that most shouldn't
need to)
Notes on a couple limitations of its operation
Links to the Replication Dashboard and Advanced Debug pages, which
have the circuit breaker observability stuff