diff --git a/docs/RFCS/20190619_constraint_conformance_report.md b/docs/RFCS/20190619_constraint_conformance_report.md
new file mode 100644
index 000000000000..8a9f663e4af8
--- /dev/null
+++ b/docs/RFCS/20190619_constraint_conformance_report.md
@@ -0,0 +1,516 @@
- Feature Name: Insights into Constraint Conformance
- Status: completed
- Start Date: 2019-06-19
- Authors: Andrei Matei
- RFC PR: [#38309](https://github.com/cockroachdb/cockroach/pull/38309)
- Cockroach Issue: [#14113](https://github.com/cockroachdb/cockroach/issues/14113)
  [#19644](https://github.com/cockroachdb/cockroach/issues/19644)
  [#31597](https://github.com/cockroachdb/cockroach/issues/31597)
  [#26757](https://github.com/cockroachdb/cockroach/issues/26757)
  [#31551](https://github.com/cockroachdb/cockroach/issues/31551)

# Summary

The set of features described here aims to provide admins with visibility into
aspects of replication and replica placement. In particular, admins will be
able to access a report that details which replication constraints are
violated.


# Motivation

As of this writing, CRDB administrators have poor insight into data placement
in relation to the constraints they have set and in relation to other
considerations - specifically replication factor and replica diversity. This
can lead to real problems: it’s unclear when a whole locality can be taken down
without losing quorum for any ranges.
As a result of this work, admins will also be able to script actions to be
taken when partitioning, or another zone config change, has been “fully
applied”.


# Guide-level explanation

## Background

Reminder for a few CRDB concepts:

**Replication zone configs and constraints:** key spans can be grouped into
“replication zones” and properties can be set for each zone through a “zone
config”. Among others, a zone config specifies the replication factor for the
data in the zone (i.e. the replication factor of all the ranges in the zone), a
list of constraints, and a list of lease preference policies. (A hypothetical
example of such a zone config is sketched at the end of this section.)

**Replication constraints:** used to restrict the set of nodes/stores that can
hold a replica for the respective ranges. Constraints can say that only nodes
marked with certain localities or attributes are to be used, or that certain
nodes/stores have to be avoided. Constraints can also be specified at the level
of a replica, not only a range; for example, constraints can be set to require
that some ranges have one replica in locality A, one in B and one in C.

**Lease preferences:** lease preferences are expressed just like constraints,
except they only affect the choice of a leaseholder. For example, one can say
that the leaseholder always needs to be in a specific locality. Unlike
constraints, not being able to satisfy a lease preference doesn’t stop a lease
acquisition.

**Replica diversity:** given a set of constraints (or no constraints), CRDB
tries to split replicas evenly between the localities/sublocalities that match
the constraint. The implementation computes a "diversity score" between a pair
of replicas as the inverse of the length of the common prefix of the localities
of the two respective stores. It then computes a range's diversity score as the
average diversity score across all replica pairs. The allocator considers a
range under-diversified if there is another store that could replace an
existing replica and result in a higher range diversity score.
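
To make the concepts above concrete, here is what a zone config with
per-replica constraints and a lease preference might look like. This is only an
illustrative sketch; the table name and the region names are made up, and the
constraints one actually sets depend on the deployment:

```
-- Hypothetical example: spread the replicas of table "users" across three
-- regions (one replica each) and prefer that leases stay in us-east1.
-- The table and region names are illustrative only.
ALTER TABLE users CONFIGURE ZONE USING
  num_replicas = 3,
  constraints = '{"+region=us-east1": 1, "+region=us-west1": 1, "+region=europe-west1": 1}',
  lease_preferences = '[[+region=us-east1]]';
```
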
## Constraint conformance

Currently the only insight an administrator has into the health of replication
is via the Admin UI, which displays a counter for under-replicated ranges and
one for unavailable ranges, and timeseries for
over-replicated/under-replicated/unavailable ranges.
There’s nothing about constraints, lease preferences or diversity. Also there’s
no information on over-replicated ranges - although hopefully those will go away
once we get atomic group membership changes. Also, the way in which these
counters are computed is defective: the counters are incoherent and, for
example, if all the nodes holding a partition go away, the ranges in that
partition are not counted (see
https://github.com/cockroachdb/cockroach/issues/19644).

Besides the general need for an administrator to know that everything is
copacetic, there are some specific cases where the lack of information is
particularly unfortunate:


1. Testing fault tolerance: A scenario people ran into was testing CRDB’s
resiliency to networking failures. A cluster is configured with 3 Availability
Zones (AZs) and the data is constrained to require a replica in every AZ (or
even without constraints, the implicit diversity requirement would drive towards
the same effect). One would think that an AZ can go away at any time and the
cluster should survive. But that’s only loosely true. After an AZ goes away
and subsequently recovers, another AZ cannot disappear right away without
availability loss; the 2nd AZ has to stay up until constraint conformance is
re-established. Otherwise, the 2nd AZ likely has 2 replicas for the ranges that
were migrated away during the first outage.
As things stand, it’s entirely unclear how one is supposed to know when it’s
safe to bring down the 2nd AZ. (Of course, here we’re talking about disaster
testing, not the use of node decommissioning.)
2. Partitioning a table and running performance tests that assume the partitioning
is in place. For example, say I want to partition my data between Romania and
Bulgaria and test that my Bulgarian clients indeed start getting local
latencies. Well, partitioning is obviously not instantaneous and so it’s
entirely unclear how long I’m supposed to wait until I can conduct my test.
Same with lease preferences.

## Proposal

We’ll be introducing three ways for admins to observe different aspects of
replication state: a cluster-wide constraint conformance report visible in the
AdminUI, new jobs for tracking the progress of constraint changes, and a
`crdb_internal.replication_report()` SQL function that can be queried for the
same information as the report - the constraint violations for each zone.
To provide some insight into the diversity of ranges in the different zones,
the report also lists the localities whose unavailability would result in range
unavailability (e.g. "if locality US goes away, half of the ranges in the
default zone would lose quorum").


### crdb_internal.replication_report(cached bool (default true))

We'll introduce the `crdb_internal.replication_report()` function, which
returns a report about constraint violations and critical localities as a
(single) JSON record.
```
{
  // The time when this report was generated.
  generated_at timestamp,
  // All the cluster's zones are listed.
  zones [
    {
      // zone_name and config_sql correspond to what `show zone configuration`
      // returns.
      zone_name string,
      config_sql string,

      // Each constraint that is violated is present here.
      violations [
        {
          // enum: under replication, over replication, replica placement,
          // lease preference
          violation_type string,
          // constraint is the part of the zone config that is violated.
          constraint string,
          // A human readable message describing the violation.
          // For example, for a violation_type="under replication", the message
          // might be "64MB/1TB of data (1/20000 ranges) only have 2 replica(s).
          // 128MB/1TB of data (2/20000 ranges) only have 1 replica(s)."
          message string,
          // the number of ranges in violation of this constraint.
          ranges int,
          // the sum of the sizes of all ranges in violation (in MB).
          data_size_mb int,
          // the average rate (in ranges/s and MB/s) by which this quantity
          // changed since some previous time the report was generated. A negative rate
          // means things are improving (i.e. the amount of data in violation is going
          // down). NULL if no previous data is available to compute the velocity.
          change_rate_ranges_per_sec int,
          change_rate_mb_per_sec int,
        }
      ],

      // The localities that are critical for the current zone.
      critical_localities [
        {
          locality string,
          // the number of ranges for which this locality is critical.
          ranges int,
          // the sum of the sizes of all critical ranges (in MB).
          data_size_mb int,
          change_rate_ranges_per_sec int,
          change_rate_mb_per_sec int,
        }
      ]
    }
  ]
}
```

The report is generated periodically (every minute) by a process described
below and stored in the database. Through the optional function argument, one
can ask for the cached report to be returned, or for a fresh report to be
generated.


Notes about violations:
1. All zones are present in the `zones` array, but only violated constraints are
   present in the `violations` array.
2. Per-replica replication constraints (i.e. the constraints of the form `2
   replicas: region:us, 1 replica: region:eu`) are split into their
   constituents for the purposes of this report. In this example, the report
   will list violations separately for the 2 us replicas vs the 1 eu replica.
   Also, the per-replica constraints are considered "minimums", not exact
   numbers. I.e. if a range has 2 replicas in eu, it is not in violation of the
   eu constraint. Of course, it might be in violation of other constraints.
3. Per-replica constraint conformance doesn't care about dead nodes. For
   example, if a constraint specifies that 2 replicas must be placed in the US
   and, according to a range descriptor, the range has two replicas in the US
   but one of the nodes is unresponsive/dead/decommissioned, the constraint will
   be considered satisfied. Dead nodes will promptly trigger under-replication
   violations, though (see next bullet).
   The idea here is that, if a node dies, we'd rather not instantly generate a
   ton of violations for the purpose of this report.
4. The notion of a node being alive or dead will be more nuanced in this
   reporting code than it is in the allocator. The allocator only concerns
   itself with nodes that have been dead for 5 minutes (`cluster setting
   server.time_until_store_dead`) - that’s when it starts moving replicas away.
   Nodes that died more recently are as good as live ones. But not so for this
   report; we can’t claim that everything is fine for 5 minutes after a node’s
   death.
   For the purposes of reporting under-replication, constraint
   conformance (and implicitly lease preference conformance), a node will be
   considered dead a few seconds after it failed to ping its liveness record.
   Replicas on dead nodes are discarded; expired leases by themselves don't
   trigger any violation (as specified elsewhere).
   The message describing the violation due to dead nodes will have information
   about what ranges are being rebalanced by the allocator and what ranges are
   not yet receiving that treatment.
5. Violations of inherited constraints are reported only at the level of the
   zone hierarchy where the constraint is introduced; they are not reported at
   the lower levels.
6. The `message` field will generally contain the info from the `ranges` and
   `data_size_mb` fields, but in an unstructured way. In the example message in
   the JSON ("64MB/1TB of data (1/20000 ranges) only have 2 replica(s).
   128MB/1TB of data (2/20000 ranges) only have 1 replica(s)."), the
   message has more detailed information than the numerical fields; the
   numerical fields count all ranges in violation of the respective constraint
   without any further bucketing.
7. Since this report is possibly stale, the zone information presented can
   differ from what `show all zone configurations` returns - it might not
   include newly-created zones or newly applied configurations. That's a
   feature, as it allows the consumer to reason about whether a recent change to
   zone configs has been picked up or not.


#### Critical localities

The `critical_localities` field exposes the localities that, were they to
become unavailable, would cause data unavailability - a measure related to
replica diversity.
For every zone, we'll count how many ranges would lose quorum upon the
disappearance of any single locality. For the purposes of this report, we
consider both narrower and wider localities. E.g. if a node has
`--locality=country=us,region=east`, then we report for both localities `c=us`
and `c=us,r=east`. For the purposes of this report, we also consider a node id
as the last component of the node's locality. So, if, say, node 5 has
`--locality=country=us,region=east`, then for its ranges we actually consider
all of `c=us`, `c=us,r=east` and `c=us,r=east,n=5` as localities to report for.
If a locality is not critical for a range, it does not appear under that zone.
If a locality is critical for some data, that data also counts for the
criticality of any wider locality (e.g. if `c=us,r=east` is critical for 1MB of
data, `c=us` will be critical for (at least) that 1MB).

This report strictly covers critical localities; it does not answer all
questions one might have about the data's risk. For one, there's no way to
answer questions about the risk stemming from combinations of localities (e.g.
if I'd like my data to survive the loss of one region plus a node in another
region, or of (any) two regions), this report does not help.

This report is not directly related to the notion of diversity considered by
the allocator - which tries to maximize diversity. We don't report on whether
the allocator is expected to take any action.

### Data collection

A background process will compute the report once per minute or so (see the
[Detailed design section](#detailed-design)). The "velocity information" in the
report is generated by looking at the raw numbers in the previous version of
the report and computing the delta.
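
Assuming the report schema sketched above, a consumer (or a script waiting for
a zone config change to be fully applied) could pull the violations out of the
JSON record with a query along the following lines. This is only a sketch: it
assumes the proposed `crdb_internal.replication_report()` built-in returns the
report as a single `jsonb` value shaped as described, which may change as the
feature is implemented:

```
-- Sketch: list the violated constraints per zone, under the assumptions above.
SELECT z->>'zone_name'      AS zone_name,
       v->>'violation_type' AS violation_type,
       v->>'constraint'     AS violated_constraint,
       v->>'message'        AS message
FROM jsonb_array_elements(crdb_internal.replication_report()->'zones') AS z,
     jsonb_array_elements(z->'violations') AS v;
```
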
### AdminUI conformance report

We’ll add a page in the AdminUI showing the contents of the report. I'm
imagining that in the future we can allow one to drill down into the different
zones by expanding a zone into the databases/tables/indexes/partitions
contained in the zone. This is future work though.

The already existing range counters on the Cluster Overview page -
under-replicated/over-replicated/unavailable ranges - will change to be backed
by this implementation. This will be beneficial; the current implementation,
based on different nodes reporting their counts at different times, is
inconsistent and blind to failures of individual nodes to report. We'll also
show an asterisk linking to the conformance report if there are any constraint
violations.
The timeseries for the under-replicated/over-replicated/unavailable ranges will
also be backed by the process generating this report. The range 1 leaseholder
node is the one computing these values and so it will be the one writing them to
the timeseries, but we need to do it in such a way as to not double count when
the lease changes hands. I'm not sure how to ensure that exactly. Maybe the
node writing the counts will itself write 0 for all the other nodes.

## New jobs

We’re also going to have some data-moving operations roughly represented as
jobs: repartitioning and altering zone config properties. The idea here is to be
able to track the progress of an operation that was explicitly triggered through
a SQL command, as opposed to observing the state of the cluster as a whole.
For example, when a table is repartitioned, we’ll look at how much data of that
table is now placed incorrectly (i.e. it violates a constraint) and we’ll
consider the job complete once all the table’s data is conformant with all the
constraints. Note that the initial partitioning of a table does not create any
data movement (as all the partitions inherit the same zone config) and so we
won’t create a job for it.

Altering a zone config (for a table, partition or index) will similarly cause a
job to be created. The job will be considered completed once all the ranges
corresponding to the table/index/partition are in conformance with all the
constraints. That is, the first time all ranges in the affected zone are
found to be in compliance, we’ll declare success.

Note that the association of range movement with a job, and thus the computation
of the job’s progress, is fairly loose. The exact replication effects being
enacted by a particular partition or zone change will be immaterial to the
progress status of the respective job; the only thing that will be considered for
computing the progress is the conformance of the ranges affected by the change
with all the constraints (not just the constraints being modified) - including
pre-existing constraints for the respective zone and constraints inherited from
parent zones. Doing something more exact seems hard. However, this can lead to
odd effects like progress going backwards if the conformance worsens (i.e. the
number of non-conformant ranges increases for whatever reason) while a job is
“running”.

The updating of the progress report of these jobs will be done through the
periodic background process that also updates the cluster conformance report.
The completion percentage is based on how many ranges were found to be
non-conformant with the change the job is tracking when the job was created.

The jobs are not cancelable by users.
However, a job is considered completed if
the particular partitioning or zone config change it is tracking is superseded by
another operation (i.e. if the partitioning is changed again in the case of
partitioning jobs, or if the same zone config is changed again). Removing a
table partitioning or a zone config similarly marks all ongoing jobs tracking
the respective object as complete and creates a new job (tracking the table or
the parent zone config, respectively).


## Detailed design

The data powering the report and the jobs' progress will be produced by
a process running on the leaseholder for range 1. Every minute, this process
will scan all of the meta2 range descriptors together with all the zone configs
(using a consistent scan slightly in the past), as well as collect lease
information (see below). For each descriptor, it’ll use logic factored out of
the allocator for deciding what constraints the range is violating (if any).
For each constraint, it’ll aggregate the (sum of sizes of) ranges violating
that constraint. Same for lease preferences and the replication factor. Ranges
with no active lease are considered to be conformant with any lease preference.
There will also be a way to generate the data on demand.

Critical localities are determined as follows: for each range we consider every
locality that contains a replica (so, every prefix of the --locality for all
the stores involved, plus the node id). If that locality contains half or more
of the replicas, it is considered critical. A count of critical ranges per
locality is maintained.

Since we're scanning meta2, the respective node has an opportunity to update
its range descriptor cache, which suggests that perhaps all nodes should do
this.

The resulting report will be saved in storage under a regular (versioned) key.
We'll store it in proto form (as opposed to the JSON in which it is returned by
the SQL function). The velocity information is not stored; it is computed on
demand by reading past versions of the report, finding the one that is more
than a minute old (so, not necessarily the most recent one if that one is very
recent) and computing the delta.

### Collecting lease and size information

The current leaseholder is not reflected in the range descriptor (i.e. in
meta2). So, in order to report on leaseholder preference conformance, the node
generating the report needs to collect this information in another way.
Similarly for range size info.
We'll add a new RPC - `Internal.RangesInfo` - asking a node to return all the
leases for its replicas (including info on ranges with no lease). The node
generating the report will ask all other nodes for information on the leases
and sizes of all their replicas. The report generator will join that
information with meta2.
This will be a streaming RPC returning information in range key order, so that
the aggregator node doesn't have to hold too much range information in memory
at a time - instead the aggregator can do paginated (i.e. limit) scans over the
meta2 index and stream info from a merger of all the `RangesInfo` responses.

Since lease information is gathered distinctly from range information, the two
might not be consistent: ranges could have appeared and disappeared in between.
To keep it simple, we'll consider meta2 the source of truth. Among the
information we get for a range, the most recent lease is considered.
For ranges
for which the info we get from all the replicas disagrees with meta2, we'll
consider the range to be without a lease and to have size 0.

The implementation of the view `crdb_internal.ranges` will also change to take
advantage of this new RPC. Currently, it's a view on top of the virtual table
`crdb_internal.ranges_no_leases` executing a
`crdb_internal.lease_holder(start_key)` call (i.e. a `LeaseInfo` request) for
every range. That's too expensive. Once we move to the new mechanism, we can
also deprecate `crdb_internal.ranges_no_leases`.

As an incidental benefit of collecting lease information this way, the node
collecting all this information can update its leaseholder cache, which
suggests that perhaps all nodes should exchange this info with each other.

When generating the report, RPCs to all the nodes will be issued in parallel.
Timeouts will be used. The fact that info on each range is reported by all its
replicas means we can tolerate some RPC failures.

Service definition (a sketch; the actual messages would reference the existing
`RangeDescriptor` and `Lease` protos):

```
service Internal {
  rpc RangesInfo(RangesInfoRequest) returns (stream RangesInfoResponse) {}
}

message RangesInfoRequest {}

message RangesInfoResponse {
  message RangeInfo {
    RangeDescriptor range_descriptor = 1;
    Lease lease = 2;
    float size_mb = 3;
  }
  repeated RangeInfo ranges = 1;
}
```

### Notes

The notion of a node being alive or dead will be more nuanced in this reporting
code than it is in the allocator. The allocator only concerns itself with nodes
that have been dead for 5 minutes (`cluster setting server.time_until_store_dead`)
- that’s when it starts moving replicas away. Nodes that died more recently are
as good as live ones. But not so for this report; we can’t claim that everything
is fine for 5 minutes after a node’s death. For the purposes of reporting
under-replication, constraint conformance (and implicitly lease preference
conformance), a node will be considered dead a few seconds after it failed to
ping its liveness record. Replicas on dead nodes are discarded; expired leases
by themselves don't trigger any violation (as specified elsewhere).

When sub-zones are involved and constraints are inherited from parent to child,
violations of inherited constraints will be counted towards the parent zone, not
the child zone.

## Rationale and alternatives

For the presentation of the report, a natural alternative is to present it
through system table(s) or virtual table(s). That would be a more usual way of
presenting it to SQL clients than a function returning a JSON record, and it
would probably allow easier filtering. There's also not much precedent in CRDB
for returning JSON from system SQL interfaces. But unfortunately a tabular
interface does not seem to be very convenient for this case: how do you expose
the timestamp at which the report was generated? How do you expose the versions
of the zone configs used? Even if you find a way to expose them, how does one
query across them consistently? There are obviously different options available,
but none of them look very clean. So I think the time is ripe to create this
JSON precedent. The computation of the "velocity" fields is also simpler with
the report being one big struct. Otherwise it wasn't entirely clear how one
would store and access the historical information required.
Note that CRDB already has some JSON processing capabilities built-in, so it is
feasible to retrieve only parts of the report (between `jsonb_array_elements`
and the JSON path operators).
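
For instance, fetching just the critical localities out of the report should be
expressible with the existing JSON functions, along these lines. As with the
query sketched earlier, this assumes the proposed built-in returns the report as
a single `jsonb` value with the schema described above:

```
-- Sketch: extract the critical localities per zone from the proposed report.
SELECT z->>'zone_name'    AS zone_name,
       c->>'locality'     AS critical_locality,
       c->>'ranges'       AS ranges,
       c->>'data_size_mb' AS data_size_mb
FROM jsonb_array_elements(crdb_internal.replication_report()->'zones') AS z,
     jsonb_array_elements(z->'critical_localities') AS c;
```
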
Another way to get this report would be akin to what we do now for counting
unavailable ranges: have each node report the counts for the ranges that it's
responsible for (i.e. it is the leaseholder or, if there's no lease, it's the
node with the smallest id among the replicas). That would save the need to read
meta2; each node can use the in-memory replicas structure. The downside is that
dealing with failures of a node to report is tricky and also, since each node
would report at a different time, the view across sub-reports would be
inconsistent, possibly leading to nonsensical counts.

An alternative to collecting the lease information through an RPC is to have
nodes gossip their info. This would have the advantage that all nodes can keep
their cache up to date. But the quantities of data involved might be too
large. Or yet another option is to have all the nodes periodically write
all their lease info into the database, and have the report generator node read
it from there instead of requesting it through RPCs. This sounds better for
scalability (to higher numbers of nodes) but also sounds more difficult to
implement. We'll start with the RPCs and keep this for later.

For reporting on diversity, instead of (or perhaps in addition to) the critical
localities proposed above, another idea is to report, on a per-"locality-level"
basis, the amount of data that would lose quorum if different numbers of
instances of that level were to become unavailable. For example, in a two-level
hierarchy of localities (say, country+region) plus an implicit third level
which is the node, we'd report something like: if one country is lost, this
much data is lost. If 2 countries are lost - this much. Similar for one region,
two regions, ... n regions. Similar for up to n nodes.
Comparing this with the critical localities proposal (although the two
don't necessarily need to be either/or), the advantages would be:
- you get some information on the criticality of combinations of localities.
Namely, you get info on combinations of values on the same level in the
hierarchy. You don't, however, get info on combinations across levels: e.g. you
still can't tell if you can survive the failure of any one dc + any other one
node.

And the disadvantages:
- no locality names are presented. You can see that there is at least one
  country that's critical, or perhaps (at least one) a pair of countries, but
  you can't tell which one.
- with locality hierarchies, it's unclear if the lower level should be counted
  across higher levels. For example, if there's region and zone, and in the us
  region there are no critical zones, but in the eu region there are some,
  should we just say "there are critical zones", or should we say "there are
  critical zones in eu"?
- it's also unclear how exactly to deal with jagged localities - where the keys
  of the levels don't match between nodes. And there's the even more pathological
  case where the same locality key appears on different levels (one node has
  `country=us, dc=dc1`, another one has just `dc=dc2`). One proposal is to not
  report anything in such cases, but that has downsides - adding a node with no
  `--locality` would make the whole report disappear.

## Out of scope

1. More user actions could create the types of rebalancing-related jobs that
   we've discussed here: for example adding nodes in a new locality, which would
   trigger rebalancing for the purposes of diversity. That's left for the future.
2. More granular reporting at the level of a database/table/index/partition
   instead of simply at the zone level.
3. Timeseries of rebalancing progress per constraint violation. This RFC
   proposes reporting one number per constraint violation - the average rate of
   improvement over the past minute - but no way to get historical information
   about these rates. Part of the difficulty in actually recording timeseries
   is that it's unclear how our timeseries platform would work for a dynamic
   set of timeseries (i.e. constraint violations come and go).

## Unresolved questions

1. Should the SQL statements that now create jobs also become synchronous (i.e.
   only return to the client once the corresponding job is done)? A la schema
   changes. If not, should they return the id of the job they've created?