kvserver: support non-voting replicas #51943

aayushshah15 · 2020-07-27T21:07:05Z

Currently, CRDB only supports voting replicas: replicas that participate in quorum. This means that a high replication factor comes with the cost of high write latencies, since more replicas need to achieve consensus on every write. This trade-off is undesirable if the use-case only cares about using these replicas to serve follower reads, and not for fault tolerance. It is especially undesirable when the inter-replica latency between these voting replicas is high (i.e. when they are spread out across multiple distant regions).

Non-voting replicas would follow the raft log (thus be able to serve follower reads), but would not participate in quorum. This means that they have, essentially, no impact on write latencies at the cost of not being able to be used for fault-tolerance. They unlock use-cases where, for instance, a quorum of voting replicas lives in North America, but we'd still like to serve low-latency follower reads in EU, Asia and Australia, without having this cripple write latencies.

This is a (rough) tracking issue for the major changes required to introduce non-voting replicas (also referred to as long-lived learners due to some implementation details). They are listed roughly in the order in which (I believe) they should be implemented. This issue will be updated as new considerations are discovered.

New Replica type for non-voting replicas (done)

The desire for this stems from the assumptions made about learners being short lived in a couple of different places (covered below). There are valid reasons for some of those assumptions to stay in place for the short lived learners, but they don’t apply to long lived learners.
Pull requests: #53715

Zone config syntax to enable creation of non-voters

Pull requests: #57184

Compatibility with range merges (done)

Currently, we don’t merge a range if it has a learner replica. In the AdminMerge transaction, if the LHS leaseholder detects learner replicas for either of the involved ranges, it just returns an error. It seems like we did it this way in order to avoid working out any potential edge cases since the learners were only supposed to be ephemeral. We’ll need to make merges work with (at least) long lived learners.
Pull requests: #56197

Interactions with raft snapshot queue and replicate queue (done)

The atomic replication change logic “manually” sends a snapshot to the short lived learner that it instantiates in the first phase of the change. The raft snapshot queue, thus, entirely ignores the learners it comes across. Once the long lived learners are added, they'll need to be able to receive snapshots from raft snapshot queue just like VOTER replicas.

Replicas are added by the atomic replication change logic in two phases, the first phase creates a short lived learner and the second promotes it to a voter. If the co-ordinating node dies before the second phase, we’re left with an orphaned learner replica. The replicate queue, thus, removes any learners it sees.

Allocator level changes

We need to teach the allocator to obey the specified constraints to place long lived learners in the correct localities. Here is a preliminary, non-exhaustive list of considerations we need to be aware of while making this change:

One heuristic we care a lot about when placing VOTER replicas across a cluster is diversity. The primary motivation for this is fault tolerance, which is not a consideration for learner replicas. We probably want to place learners primarily based on the localities of the incoming traffic (“follow the workload”).
We currently do not include traffic served via follower reads in stats for load based splitting. We'll probably want to change that as we start seeing topologies that make heavy use of learner replicas.
If all the nodes that qualify under the constraints specified for learners are ones that already have VOTER replicas on them, do we just silently omit these learners?

Allow distsender to route follower read requests to learners

DistSender currently ignores learners when routing queries. We’ll want to start including long lived learners here if the request is such that it can be served via follower-reads.
Pull requests: #58772

Roachtest that exercises & asserts use of follower reads

Currently, we lack large scale integration tests that test for “proper” use of follower reads (“are we correctly serving follower reads for a given compatible workload?”). We’ll want to write a roachtest that asserts, for a pre-determined workload, a minimum threshold of requests that need to be served via follower reads.

Quota pool prevents raft leaders from getting too far ahead of followers

This means a slow learner can slow down regular raft traffic. We’ll want to consider whether we should allow long lived learners to potentially trail further behind than other types of replicas are allowed to.

The text was updated successfully, but these errors were encountered:

blathers-crl · 2020-07-27T21:07:06Z

Hi @aayushshah15, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.}

aayushshah15 · 2020-07-27T21:13:46Z

cc @tbg

This is pretty rough but @andreimatei had mentioned that I should get your eyes on this in case there's potential for ill-defined interactions (or things I should double-check/be careful about) with your work on atomic replication changes.

aayushshah15 · 2020-07-28T16:59:03Z

cc @lunevalex @nvanbenschoten

As we work towards adding support for persistent learner replicas, we'd like to decouple it from the ephemeral learner state. Conceptually, learners in their current state are an _internal_ intermediate state used for atomic replication changes, whereas persistent learners are meant to be a user-visible state explicitly chosen to be placed in specific localities. There are also some practical reasons that motivated this change: 1. The replicate queue, when it discovers a learner replica, infers that it must be a relic of a failed `ChangeReplicas` run, and clears it up. 2. The raft snapshot queue doesn't send it snapshots since these learners are supposed to receive it directly from the `ChangeReplicas` txn. 3. We disallow range merges when we detect that either of the concerned ranges have a learner replica, as in our current state, this is likely because there is an ongoing atomic configuration change. In all of these scenarios it seems safer & more efficient to be able to differentiate between persistent & ephemeral learners based on the ReplicaType (as opposed to some other solution like a `isPersistent` bit on the replica descriptor) This PR performs this change with a combination of Goland-rename and then grepping for string literals and comments that talk about learners. Note that the `SnapshotRequest_LEARNER` type has not been changed in this PR. Touches cockroachdb#51943. Release note: None

Before this patch, the DistSender would only route BatchRequests to replicas of types `VOTER_FULL` and `VOTER_INCOMING`. This patch changes that to let the DistSender route requests (for instance, follower read requests) to `NON_VOTER` replicas as well. Makes progress on cockroachdb#51943 Release note: None

This PR teaches `AdminChangeReplicas` to atomically promote voters to non-voters, demote voters to non-voters or swap voters with non-voters via joint consensus. Fixes cockroachdb#58499 Informs cockroachdb#51943 Release note: None

58627: kvserver: support atomic promotions and demotions of non-voting replicas r=aayushshah15 a=aayushshah15 This PR teaches `AdminChangeReplicas` to atomically promote voters to non-voters, demote voters to non-voters or swap voters with non-voters via joint consensus. The approach followed by this PR tries to coalesce `ReplicationChanges` of types `ADD_VOTER` and `REMOVE_NON_VOTER` on a given target as promotions of non-voters into voters and likewise, `ADD_NON_VOTER` and `REMOVE_VOTER` changes for a given target as demotions of voters into non-voters. When all 4 of these operations are simultaneously passed into a `ChangeReplicasRequest`, the patch will try to execute them as one atomic swap of a voter with a non-voter. Fixes #58499 Informs #51943 Release note: None 60799: sql: include virtual table names in crash reports r=knz a=rafiss This uses the functionality that we already had in place for telemetry data. This might potentially make some crash reports more useful without compromising privacy. Release note (general change): Crash reports that are sent to Cockroach Labs now no longer redact the names of builtin virtual tables from the crdb_internal, information_schema, and pg_catalog schemas. 60938: sql: drop partitions and zone configs from RBR tables on region drop r=otan a=arulajmani Closes #58340 Release note (sql change): `ALTER DATABASE ... DROP REGION ...` repartitions regional by row tables to remove the partition for the removed region and removes the zone congfiguration for the partition as well. 60939: bazel: correct BUILD file generation order r=irfansharif a=irfansharif pkg/BUILD.bazel has the (auto-generated) directive that instructs bazel to strip out the /pkg prefix. If we start off without pkg/BUILD.bazel[1], the earlier gazelle invocations don't learn about the prefix strip. [1]: Maybe we shouldn't be able to, I was testing to see if there were no extraneous diffs by re-generating the file entirely. Release note: None Co-authored-by: Aayush Shah <aayush.shah15@gmail.com> Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com> Co-authored-by: arulajmani <arulajmani@gmail.com> Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>

andreimatei · 2021-03-22T20:17:40Z

Non-voting replicas exist. This issue is no longer useful.

aayushshah15 added A-kv-distribution Relating to rebalancing and leasing. A-kv-replication-constraints labels Jul 27, 2020

aayushshah15 self-assigned this Jul 27, 2020

aayushshah15 added this to the 20.2 milestone Jul 27, 2020

aayushshah15 mentioned this issue Jul 29, 2020

kv: rename ReplicaType LEARNER to LEARNER_EPHEMERAL #52048

Closed

aayushshah15 changed the title ~~kvserver: make learners long lived~~ kvserver: support long-lived learners Aug 7, 2020

johnrk-zz assigned awoods187 and aayushshah15 and unassigned aayushshah15 and awoods187 Aug 10, 2020

nvanbenschoten mentioned this issue Aug 31, 2020

rfc: Non-Blocking Transactions #52745

Merged

aayushshah15 mentioned this issue Oct 15, 2020

kvserver: allow non-voting replicas to be created via ChangeReplicas #53715

Merged

andreimatei closed this as completed Mar 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvserver: support non-voting replicas #51943

kvserver: support non-voting replicas #51943

aayushshah15 commented Jul 27, 2020 •

edited

Loading

blathers-crl bot commented Jul 27, 2020

aayushshah15 commented Jul 27, 2020

aayushshah15 commented Jul 28, 2020

andreimatei commented Mar 22, 2021

kvserver: support non-voting replicas #51943

kvserver: support non-voting replicas #51943

Comments

aayushshah15 commented Jul 27, 2020 • edited Loading

New Replica type for non-voting replicas (done)

Zone config syntax to enable creation of non-voters

Compatibility with range merges (done)

Interactions with raft snapshot queue and replicate queue (done)

Allocator level changes

Allow distsender to route follower read requests to learners

Roachtest that exercises & asserts use of follower reads

Quota pool prevents raft leaders from getting too far ahead of followers

blathers-crl bot commented Jul 27, 2020

aayushshah15 commented Jul 27, 2020

aayushshah15 commented Jul 28, 2020

andreimatei commented Mar 22, 2021

aayushshah15 commented Jul 27, 2020 •

edited

Loading