Make some of crdb_internal.ranges_no_leases available in case of a large scale outage
#81216
Labels
A-kv-observability
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
no-issue-activity
O-sre
For issues SRE opened or otherwise cares about tracking.
X-stale
Is your feature request related to a problem? Please describe.
crdb_internal.ranges_no_leases is very useful. It also plays nice with the new crdb_internal.probe_ranges. By joining the two, you know which ranges are unavailable / slow + what those ranges are used for + which nodes have replicas for those ranges.
crdb_internal.probe_ranges is available unless meta2 is down, and if meta2 is down, we get a trace of the failed query to meta2 as per #81107. This is nice! More about this at #79546 (comment).
OTOH, crdb_internal.ranges_no_leases requires the availability of more than meta2, even in cases where the selected fields only require meta2. For one, system.namespace is a hard dep, according to my reading of the code + testing.
Describe the solution you'd like
I'd like a version of crdb_internal.ranges_no_leases that is available so long as meta2 is available. Here are a few possible approaches:
1. crdb_internal.ranges_minimal that only reads meta2.
2. crdb_internal.ranges_no_leases, so that if only meta2-derived columns are needed (SELECT range_id, start_pretty, replicas ...), meta2 is the only range queried.
3. crdb_internal.probe_ranges.
2 seems like the nicest option, tho looking at the code, I don't immediately see how to do it. If 2 is hard, maybe 1 is good enough.
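To make the request concrete, here is a rough sketch of the kind of join I'd like to keep working while only meta2 is healthy. It selects only the meta2-derived columns named above (range_id, start_pretty, replicas) from crdb_internal.ranges_no_leases. The arguments to crdb_internal.probe_ranges, its error and end_to_end_latency_ms columns, and the convention that an empty error means success are assumptions on my part and may not match the actual schema.

```sql
-- Sketch only. Assumed, not confirmed by this issue:
--   * probe_ranges takes (timeout INTERVAL, probe type STRING)
--   * probe_ranges exposes error and end_to_end_latency_ms
--   * error is the empty string when the probe succeeds
SELECT r.range_id,
       r.start_pretty,           -- what the range is used for
       r.replicas,               -- which nodes hold replicas
       p.error,
       p.end_to_end_latency_ms
FROM crdb_internal.probe_ranges(INTERVAL '1s', 'read') AS p
JOIN crdb_internal.ranges_no_leases AS r
  ON r.range_id = p.range_id
WHERE p.error != ''              -- keep only ranges whose probe failed
ORDER BY p.end_to_end_latency_ms DESC;
```

With approach 2, a query shaped like this would only need to read meta2 on the ranges_no_leases side, since no column that depends on system.namespace is selected.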
Describe alternatives you've considered
Could do nothing.
Additional context
Relevant investigation, spurred by an outage where we had to look at goroutine dumps to understand what range was down, where the down range was a system range: #79546 (comment). My motivation here is to enable a newer SRE to use crdb_internal.probe_ranges to understand proximate cause and to mitigate (try restarting / downing leaseholder) before escalating to KV L2, as that reduces time to fix in a cloud context.
CC @tbg @Santamaura
Jira issue: CRDB-15293