Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add docs on replica circuit breakers #13676

Merged
merged 1 commit into from
May 4, 2022

Conversation

rmloveland
Copy link
Contributor

@rmloveland rmloveland commented Apr 21, 2022

Add docs on replica circuit breakers

Fixes DOC-2584
Fixes DOC-2643

(Related: issue 2495 in the DOC project)

Summary of changes:

  • Add a new section 'Per-replica circuit breakers' to the 'Replication
    Layer' page in the architecture docs, which has

    • Overview of how the feature works at a (very) high level

    • Something about how to configure it (and a note that most shouldn't
      need to)

    • Notes on a couple limitations of its operation

    • Links to the Replication Dashboard and Advanced Debug pages, which
      have the circuit breaker observability stuff

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@github-actions
Copy link

Files changed:

@netlify
Copy link

netlify bot commented Apr 21, 2022

Netlify Preview

Name Link
🔨 Latest commit b3627b1
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-docs/deploys/6272cb313a27b600081bb116
😎 Deploy Preview https://deploy-preview-13676--cockroachdb-docs.netlify.app/docs/v22.1/architecture/replication-layer
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@rmloveland rmloveland marked this pull request as draft April 21, 2022 20:56
@rmloveland rmloveland force-pushed the 20220420-replica-circuit-breakers branch 3 times, most recently from 7e7a3c6 to a5d4eee Compare April 26, 2022 20:17
@rmloveland rmloveland marked this pull request as ready for review April 26, 2022 20:19
@rmloveland rmloveland requested a review from tbg April 26, 2022 20:27
@rmloveland
Copy link
Contributor Author

Hi @tbg !

Wrote this up based on my understanding of your writings in cockroachdb/cockroach#71806 and some of the other issues/PRs linked from there

Please let me know if anything important is missing, or if something here is just wrong

I wasn't entirely sure about putting the config info here (env var / cluster settings), but there didn't seem to be a better place than directly at the point of description

Similar to the above the stuff about "limitations". I think I lifted it from something linked above but happy to update if I've misunderstood these, or if another term than "limitations" is more apt

Finally, there will be add'l docs work to link to the DB Console / observability bits in DOC-2643 and DOC-2495

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you Rich! The one question I have is whether we usually point folks at the env var. The env var is mostly a fallback - if the cluster is in a bad enough state so that setting cluster settings is no longer an option, the env var can bail us out. I think it would be better to, if we even mention the env var, mention it last. We don't want to steer users towards setting env vars since that is hard to keep track of.

@rmloveland rmloveland force-pushed the 20220420-replica-circuit-breakers branch from a5d4eee to 72e545e Compare May 3, 2022 14:30
@rmloveland
Copy link
Contributor Author

The one question I have is whether we usually point folks at the env var

Not necessarily, our docs on the various env vars are pretty ad hoc AFAICT. I mostly included it for completeness' sake

The env var is mostly a fallback - if the cluster is in a bad enough state so that setting cluster settings is no longer an option, the env var can bail us out.

Ah ok. Did not realize - sounds like more of a troubleshooting thing then

I think it would be better to, if we even mention the env var, mention it last. We don't want to steer users towards setting env vars since that is hard to keep track of.

I went ahead and removed it from the doc altogether. If it's there someone might use it just because it's there and they prefer env vars for some reason. If it really needs to be added later we can always do that

@rmloveland rmloveland requested a review from ericharmeling May 3, 2022 14:32
@rmloveland
Copy link
Contributor Author

PS thanks for the review Tobi! Forgot to say that :-)

Copy link
Contributor

@ericharmeling ericharmeling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I had two hyper nits, but you can definitely ignore those if you want.


Per-replica circuit breakers have the following limitations:

- They cannot prevent hung requests when the node's [liveness range](#epoch-based-leases-table-data) is unavailable. For more information about troubleshooting a cluster that's having node liveness issues, see [Node liveness issues](../cluster-setup-troubleshooting.html#node-liveness-issues).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- They cannot prevent hung requests when the node's [liveness range](#epoch-based-leases-table-data) is unavailable. For more information about troubleshooting a cluster that's having node liveness issues, see [Node liveness issues](../cluster-setup-troubleshooting.html#node-liveness-issues).
- They cannot prevent requests from hanging when the node's [liveness range](#epoch-based-leases-table-data) is unavailable. For more information about troubleshooting a cluster that's having node liveness issues, see [Node liveness issues](../cluster-setup-troubleshooting.html#node-liveness-issues).


##### Configuration

Per-replica circuit breakers are enabled by default. Most users will not have to do any configuration to get the benefits of this feature.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Per-replica circuit breakers are enabled by default. Most users will not have to do any configuration to get the benefits of this feature.
Per-replica circuit breakers are enabled by default. Most users will not have to configure anything to get the benefits of this feature.

Copy link
Contributor

@taroface taroface left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving some suggestions on how the changes in cockroachdb/cockroach#75809 can be quickly covered here. (Also one wording nit/question.)

I felt this was the best way to address DOC-2643. But @rmloveland if you feel this is out of scope of the Architecture docs, just let me know and I'll try to come up with another solution.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @rmloveland, and @taroface)


v22.1/architecture/replication-layer.md line 75 at r1 (raw file):

<span class="version-tag">New in v22.1</span>: When individual [ranges](overview.html#architecture-range) become temporarily unavailable, requests to those ranges are refused by a per-replica "circuit breaker" mechanism instead of hanging indefinitely. 

From a user's perspective, this means that if a [SQL query](sql-layer.html) is going to ultimately fail due to accessing a temporarily unavailable range, a [replica](overview.html#architecture-replica) in that range will trip its circuit breaker (after 60 seconds [by default](#per-replica-circuit-breaker-timeout)) and bubble a `ReplicaUnavailableError` error back up through the system to inform the user why their query did not succeed. These (hopefully transient) errors are also signalled as events in the DB Console's [Replication Dashboard](../ui-replication-dashboard.html). Meanwhile, CockroachDB continues asynchronously probing the range's availability. If the replica becomes available again, the breaker is reset so that it can go back to serving requests normally.

Suggestion:

These (hopefully transient) errors are also signalled as events in the DB Console's [**Replication** dashboard](../ui-replication-dashboard.html) and as "circuit breaker errors" in its [**Problem Ranges** and **Range Status** pages](../ui-debug-pages.html).

v22.1/architecture/replication-layer.md line 75 at r1 (raw file):

<span class="version-tag">New in v22.1</span>: When individual [ranges](overview.html#architecture-range) become temporarily unavailable, requests to those ranges are refused by a per-replica "circuit breaker" mechanism instead of hanging indefinitely. 

From a user's perspective, this means that if a [SQL query](sql-layer.html) is going to ultimately fail due to accessing a temporarily unavailable range, a [replica](overview.html#architecture-replica) in that range will trip its circuit breaker (after 60 seconds [by default](#per-replica-circuit-breaker-timeout)) and bubble a `ReplicaUnavailableError` error back up through the system to inform the user why their query did not succeed. These (hopefully transient) errors are also signalled as events in the DB Console's [Replication Dashboard](../ui-replication-dashboard.html). Meanwhile, CockroachDB continues asynchronously probing the range's availability. If the replica becomes available again, the breaker is reset so that it can go back to serving requests normally.

Should this be "of that range"?

Code quote:

in that range

v22.1/architecture/replication-layer.md line 79 at r1 (raw file):

This feature is designed to increase the availability of your CockroachDB clusters by making them more robust to transient errors.

For more information about how to view per-replica circuit breaker events happening on your cluster in the [DB Console](../ui-overview.html), see the [Replication Dashboard](../ui-replication-dashboard.html) documentation.

Suggestion:

For more information about per-replica circuit breaker events happening on your cluster in the [DB Console](../ui-overview.html): 

- Open the [**Replication** dashboard](../ui-replication-dashboard.html).
- Open the [**Advanced Debug** page](../ui-debug-pages.html). The **Problem Ranges** page lists the range replicas whose circuit breakers were tripped. The **Range Status** page displays the circuit breaker error message for a given range.

Fixes DOC-2584
Fixes DOC-2643

(Related: issue 2495 in the DOC project)

Summary of changes:

- Add a new section 'Per-replica circuit breakers' to the 'Replication
  Layer' page in the architecture docs, which has

  - Overview of how the feature works at a (very) high level

  - Something about how to configure it (and a note that most shouldn't
    need to)

  - Notes on a couple limitations of its operation

  - Links to the Replication Dashboard and Advanced Debug pages, which
    have the circuit breaker observability stuff
@rmloveland rmloveland force-pushed the 20220420-replica-circuit-breakers branch from 72e545e to b3627b1 Compare May 4, 2022 18:51
Copy link
Contributor Author

@rmloveland rmloveland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Eric! Love a good hyper nit - took 'em both!

Thanks Ryan! I took basically everything with a few minor tweaks. I will also update the PR description and commit message to mention DOC-2643 so that gets closed out when this merges.

re: whether out of scope, normally we don't really link too often to more "userish" things from architecture docs but it would arguably make this info way less useful if we were to fail to say that we have some way to observe these breakers, so let's go with it!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling and @taroface)


v22.1/architecture/replication-layer.md line 75 at r1 (raw file):

Previously, taroface (Ryan Kuo) wrote…

Should this be "of that range"?

I think of the range as the overall "thing", and each replica is "in" the range, in that the range is "made up of" its various replicas. That said I think "of that range" probably also works. I'll probably leave it as is for now since Tobi looked at it already


v22.1/architecture/replication-layer.md line 75 at r1 (raw file):

<span class="version-tag">New in v22.1</span>: When individual [ranges](overview.html#architecture-range) become temporarily unavailable, requests to those ranges are refused by a per-replica "circuit breaker" mechanism instead of hanging indefinitely. 

From a user's perspective, this means that if a [SQL query](sql-layer.html) is going to ultimately fail due to accessing a temporarily unavailable range, a [replica](overview.html#architecture-replica) in that range will trip its circuit breaker (after 60 seconds [by default](#per-replica-circuit-breaker-timeout)) and bubble a `ReplicaUnavailableError` error back up through the system to inform the user why their query did not succeed. These (hopefully transient) errors are also signalled as events in the DB Console's [Replication Dashboard](../ui-replication-dashboard.html). Meanwhile, CockroachDB continues asynchronously probing the range's availability. If the replica becomes available again, the breaker is reset so that it can go back to serving requests normally.

NOICE - updated


v22.1/architecture/replication-layer.md line 79 at r1 (raw file):

This feature is designed to increase the availability of your CockroachDB clusters by making them more robust to transient errors.

For more information about how to view per-replica circuit breaker events happening on your cluster in the [DB Console](../ui-overview.html), see the [Replication Dashboard](../ui-replication-dashboard.html) documentation.

Good info - added it (edited very slightly), thanks!


v22.1/architecture/replication-layer.md line 85 at r1 (raw file):

Previously, ericharmeling (Eric Harmeling) wrote…
Per-replica circuit breakers are enabled by default. Most users will not have to configure anything to get the benefits of this feature.

Better, thanks!


v22.1/architecture/replication-layer.md line 97 at r1 (raw file):

Previously, ericharmeling (Eric Harmeling) wrote…
- They cannot prevent requests from hanging when the node's [liveness range](#epoch-based-leases-table-data) is unavailable. For more information about troubleshooting a cluster that's having node liveness issues, see [Node liveness issues](../cluster-setup-troubleshooting.html#node-liveness-issues).

Better - thank you

@rmloveland rmloveland merged commit cc7245b into master May 4, 2022
@rmloveland rmloveland deleted the 20220420-replica-circuit-breakers branch May 4, 2022 19:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants