Make some of crdb_internal.ranges_no_leases available in case of a large scale outage #81216

Closed
joshimhoff opened this issue May 12, 2022 · 1 comment
Labels
A-kv-observability C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity O-sre For issues SRE opened or otherwise cares about tracking. X-stale

joshimhoff commented May 12, 2022

Is your feature request related to a problem? Please describe.
crdb_internal.ranges_no_leases is very useful. It also plays nice with the new crdb_internal.probe_ranges. By joining the two, you know which ranges are unavailable / slow + what those ranges are used for + which nodes have replicas for those ranges.
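A rough sketch of that join, written as a small Python helper that only builds the SQL text. The call shape `crdb_internal.probe_ranges(INTERVAL ..., 'read' | 'write')` and the column names `error`, `end_to_end_latency_ms`, and `end_pretty` are assumptions here and may differ by version; the helper name is hypothetical.

```python
import re


def build_unavailable_ranges_query(timeout: str = "1s", probe_type: str = "read") -> str:
    """Return SQL that joins probe results with range metadata.

    Assumed (not confirmed in this issue): the probe_ranges argument list and
    the error / end_to_end_latency_ms / end_pretty column names.
    """
    if probe_type not in ("read", "write"):
        raise ValueError(f"unsupported probe type: {probe_type!r}")
    # Accept only simple durations so nothing unexpected lands in the SQL text.
    if not re.fullmatch(r"\d+(ms|s|m)", timeout):
        raise ValueError(f"unsupported timeout: {timeout!r}")
    return (
        "SELECT p.range_id, p.error, p.end_to_end_latency_ms, "
        "r.start_pretty, r.end_pretty, r.replicas\n"
        f"FROM crdb_internal.probe_ranges(INTERVAL '{timeout}', '{probe_type}') AS p\n"
        "JOIN crdb_internal.ranges_no_leases AS r ON p.range_id = r.range_id\n"
        "WHERE p.error != ''\n"
        "ORDER BY p.end_to_end_latency_ms DESC"
    )


if __name__ == "__main__":
    print(build_unavailable_ranges_query())
```

The `WHERE` clause keeps only ranges whose probe failed; dropping it and relying on the latency ordering would surface slow ranges too.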

crdb_internal.probe_ranges is available, unless meta2 is down, and if meta2 is down, we get a trace of the failed query to meta2 as per #81107. This is nice! More about this at #79546 (comment).

OTOH, crdb_internal.ranges_no_leases requires the availability of more than meta2, even in cases where the selected fields only require meta2. For one, system.namespace is a hard dep, according to my reading of the code + testing.

Describe the solution you'd like
I'd like a version of crdb_internal.ranges_no_leases that is available so long as meta2 is available. Here are a few possible approaches:

  1. Introduce a new table called crdb_internal.ranges_minimal that only reads meta2.
  2. Change the availability properties of crdb_internal.ranges_no_leases, so that if only meta2-derived columns are needed (SELECT range_id, start_pretty, replicas ...), meta2 is the only range queried.
  3. Include info like start key, end key, & replicas in crdb_internal.probe_ranges.

2 seems like the nicest option, tho looking at the code, I don't immediately see how to do it. If 2 is hard, maybe 1 is good enough.
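To make the idea behind option 2 concrete, here is a toy sketch of the decision it implies: look at the selected columns and check whether meta2 alone can serve them. This is not how the CockroachDB code is structured; the function name is hypothetical, and the exact set of meta2-derived columns is an assumption that extends the three named above with the start/end key columns.

```python
# Assumption: these columns can be filled from meta2 range descriptors alone.
# Only range_id, start_pretty, and replicas are named as meta2-derived above.
META2_COLUMNS = frozenset(
    {"range_id", "start_key", "start_pretty", "end_key", "end_pretty", "replicas"}
)


def needs_only_meta2(selected_columns):
    """Return True if every selected column is assumed to be meta2-derived."""
    cols = {c.strip().lower() for c in selected_columns}
    # An empty selection is treated as "not safe" rather than trivially true.
    return bool(cols) and cols <= META2_COLUMNS


if __name__ == "__main__":
    print(needs_only_meta2(["range_id", "start_pretty", "replicas"]))
    print(needs_only_meta2(["range_id", "table_name"]))
```

Under this sketch, a query touching any column outside the set (for example a name column that needs system.namespace) would fall back to today's behavior.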

Describe alternatives you've considered
Could do nothing.

Additional context
Relevant investigation, spurred by an outage where we had to look at goroutine dumps to understand what range was down, where the down range was a system range: #79546 (comment). My motivation here is to enable a newer SRE to use crdb_internal.probe_ranges to understand proximate cause and to mitigate (try restarting / downing leaseholder), before escalating to KV L2, as that reduces time to fix in a cloud context.

CC @tbg @Santamaura

Jira issue: CRDB-15293

@joshimhoff joshimhoff added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-sre For issues SRE opened or otherwise cares about tracking. A-kv-observability labels May 12, 2022
@jlinder jlinder added sync-me and removed sync-me labels May 20, 2022

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 27, 2023