From c9e3b3fdf6712616c8fd138f092cb640ef15ae3e Mon Sep 17 00:00:00 2001
From: Nikhil Benesch
Date: Sat, 22 Dec 2018 12:30:55 -0500
Subject: [PATCH] tech-notes: add note on range merges

Release note: None
---
 docs/tech-notes/range-merges.md  | 930 +++++++++++++++++++++++++++++++
 pkg/storage/client_merge_test.go |   2 +-
 2 files changed, 931 insertions(+), 1 deletion(-)
 create mode 100644 docs/tech-notes/range-merges.md

diff --git a/docs/tech-notes/range-merges.md b/docs/tech-notes/range-merges.md
new file mode 100644
index 000000000000..32a5afd9045e
--- /dev/null
+++ b/docs/tech-notes/range-merges.md
@@ -0,0 +1,930 @@

# Range merges

**Last update:** January 7, 2019

**Original author:** Nikhil Benesch

This document serves as an end-to-end description of the implementation of
range merges. The target reader is someone who is reasonably familiar with
CockroachDB's core layer but unfamiliar with either the "how" or the "why" of
the range merge implementation.

The most complete documentation, of course, is in the code, tests, and the
surrounding comments, but those pieces are necessarily split across several
files and packages. That scattered knowledge is centralized here, without
excessive detail that is likely to become stale.

## Table of Contents

* [Overview](#overview)
* [Implementation details](#implementation-details)
  * [Preconditions](#preconditions)
  * [Initiating a merge](#initiating-a-merge)
    * [AdminMerge race](#adminmerge-race)
  * [Merge transaction](#merge-transaction)
    * [Transfer of power](#transfer-of-power)
  * [Snapshots](#snapshots)
  * [Merge queue](#merge-queue)
* [Subtle complexities](#subtle-complexities)
  * [Range descriptor generations](#range-descriptor-generations)
  * [Misaligned replica sets](#misaligned-replica-sets)
  * [Replica GC](#replica-gc)
  * [Transaction record GC](#transaction-record-gc)
  * [Unanimity](#unanimity)
* [Safety recap](#safety-recap)
* [Appendix](#appendix)
  * [Key encoding oddities](#key-encoding-oddities)

## Overview

A range merge begins when two adjacent ranges are selected to be merged
together. For example, suppose our adjacent ranges are _P_ and _Q_, somewhere
in the middle of the keyspace:

```
--+-----+-----+--
  |  P  |  Q  |
--+-----+-----+--
```

We'll call _P_ the left-hand side (LHS) of the merge, and _Q_ the right-hand
side (RHS) of the merge. For reasons that will become clear later, we also
refer to _P_ as the subsuming range and _Q_ as the subsumed range.

The merge is coordinated by the LHS. The coordinator begins by verifying that
a) the two ranges are, in fact, adjacent, and b) the replica sets of the two
ranges are aligned. Replica set alignment is a term that is currently only
relevant to merges; it means that the set of stores with replicas of the LHS
exactly matches the set of stores with replicas of the RHS. For example, this
replica set is aligned:

```
Store 1   Store 2   Store 3   Store 4
+-----+   +-----+   +-----+   +-----+
| P Q |   | P Q |   |     |   | P Q |
+-----+   +-----+   +-----+   +-----+
```

By requiring replica set alignment, the merge operation is reduced to a
metadata update, albeit a tricky one, as the stores that will have a copy of
the merged range PQ already have all the constituent data, by virtue of having
a copy of both P and Q before the merge begins.

After verifying that the merge is sensible, the coordinator transactionally
updates the implicated range descriptors, adjusting P's range descriptor so
that it extends to Q's end key and deleting Q's range descriptor.

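To make this concrete, here is a small, self-contained sketch of the two
preconditions (adjacency and replica set alignment) and of the descriptor
rewrite that the coordinator performs. The `RangeDescriptor` struct and the
helpers below are simplified, illustrative stand-ins, not the real `roachpb`
definitions or the actual merge code.

```go
package main

import (
    "bytes"
    "fmt"
    "sort"
)

// RangeDescriptor is a simplified, illustrative stand-in for
// roachpb.RangeDescriptor.
type RangeDescriptor struct {
    RangeID  int64
    StartKey []byte
    EndKey   []byte
    Stores   []int // IDs of the stores that hold a replica
}

// aligned reports whether two ranges are replicated on exactly the same set
// of stores, which is what the merge coordinator requires.
func aligned(lhs, rhs RangeDescriptor) bool {
    if len(lhs.Stores) != len(rhs.Stores) {
        return false
    }
    l := append([]int(nil), lhs.Stores...)
    r := append([]int(nil), rhs.Stores...)
    sort.Ints(l)
    sort.Ints(r)
    for i := range l {
        if l[i] != r[i] {
            return false
        }
    }
    return true
}

// widen returns the descriptor of the merged range: the LHS keeps its range
// ID and start key and simply adopts the RHS's end key.
func widen(lhs, rhs RangeDescriptor) (RangeDescriptor, error) {
    if !bytes.Equal(lhs.EndKey, rhs.StartKey) {
        return RangeDescriptor{}, fmt.Errorf("ranges r%d and r%d are not adjacent", lhs.RangeID, rhs.RangeID)
    }
    if !aligned(lhs, rhs) {
        return RangeDescriptor{}, fmt.Errorf("replica sets of r%d and r%d are not aligned", lhs.RangeID, rhs.RangeID)
    }
    merged := lhs
    merged.EndKey = rhs.EndKey
    return merged, nil
}

func main() {
    p := RangeDescriptor{RangeID: 1, StartKey: []byte("a"), EndKey: []byte("m"), Stores: []int{1, 2, 4}}
    q := RangeDescriptor{RangeID: 2, StartKey: []byte("m"), EndKey: []byte("z"), Stores: []int{1, 2, 4}}
    pq, err := widen(p, q)
    fmt.Printf("%+v %v\n", pq, err) // merged range spans [a, z) on stores 1, 2, and 4
}
```
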
Then, the coordinator needs to atomically move responsibility for the data in
the RHS to the LHS. This is tricky, as the lease on the LHS may be held by a
different store than the lease on the RHS. The coordinator notifies the RHS
that it is about to be subsumed and is prohibited from serving any additional
read or write traffic. Only when the coordinator has received an
acknowledgement from _every_ replica of the RHS, indicating that no traffic is
possibly being served on the RHS, does the coordinator commit the merge
transaction.

As with splits, the merge transaction is committed with a special "commit
trigger" that instructs the receiving store to update its in-memory bookkeeping
to match the updates to the range descriptors in the transaction. The moment
the merge transaction is considered committed, the merge is complete!

At the time of writing, merges are only initiated by the merge queue, which is
responsible both for locating ranges that are in need of a merge and for
aligning their replica sets before initiating the merge.

The remaining sections cover each of these steps in more detail.

## Implementation details

### Preconditions

Not just any two ranges can be merged. The first and most obvious criterion is
that the two ranges must be adjacent. Consider a simplified cluster that has
only three ranges, A, B, and C:

```
+-----+-----+-----+
|  A  |  B  |  C  |
+-----+-----+-----+
```

Ranges A and B can be merged, as can ranges B and C, but not ranges A and C, as
they are not adjacent. Note that adjacent ranges are equivalently referred to
as "neighbors", as in, range B is range A's right-hand neighbor.

The second criterion is that the replica sets must be aligned. To illustrate,
consider a four-node cluster with the default 3x replication. The allocator has
attempted to balance ranges as evenly as possible:

```
Node 1    Node 2    Node 3    Node 4
+-----+   +-----+   +-----+   +-----+
| P Q |   | P   |   |   Q |   | P Q |
+-----+   +-----+   +-----+   +-----+
```

Notice how node 2 does not have a copy of Q, and node 3 does not have a copy of
P. These replica sets are considered "misaligned." Aligning them requires
rebalancing Q from node 3 to node 2, or rebalancing P from node 2 to node 3:

```
Node 1    Node 2    Node 3    Node 4
+-----+   +-----+   +-----+   +-----+
| P Q |   | P Q |   |     |   | P Q |
+-----+   +-----+   +-----+   +-----+

Node 1    Node 2    Node 3    Node 4
+-----+   +-----+   +-----+   +-----+
| P Q |   |     |   | P Q |   | P Q |
+-----+   +-----+   +-----+   +-----+
```

We explored an alternative merge implementation that did not require aligned
replica sets, but found it to be unworkable. See the [misaligned replica sets
misstep](#misaligned-replica-sets) for details.

### Initiating a merge

A merge is initiated by sending an AdminMerge request to a range. Like other
admin commands, DistSender will automatically route the request to the
leaseholder of the range, but there is no guarantee that the store will retain
its lease while the admin command is executing.

Note that an AdminMerge request takes no arguments, as there is no choice in
what range will be merged. The recipient of the AdminMerge will always be the
LHS, or subsuming, range, and its right neighbor at the moment that the
AdminMerge command begins executing will always be the RHS, or subsumed, range.

It would have been reasonable to have instead used the RHS to coordinate the
merge. That is, the RHS would have been the subsuming range, and the LHS would
have been the subsumed range.
Using the LHS turned out to be easier, however,
due to an oddity of key encoding and range bounds. It is trivial for a range to
send a request to its right neighbor, as it simply addresses the request to its
end key, but it is difficult to send a request to its left neighbor, as there
is no function to get the key that immediately precedes the range's start key.
See the [key encoding oddities](#key-encoding-oddities) section of the appendix
for details.

Using the LHS to coordinate also yields nice symmetry with splits, where the
range that coordinates a split becomes the LHS of the split. Maintaining this
symmetry means that a range's start key never changes during its lifetime,
while its end key may change arbitrarily in response to splits and merges.

At the time of writing, only the [merge queue](#merge-queue) initiates merges,
and it does so by bypassing DistSender and invoking the AdminMerge command
directly on the local replica. At some point in the future, we may wish to
expose manual merges via SQL, at which point the SQL layer will need to send
proper AdminMerge requests through the KV API.

#### AdminMerge race

At present, AdminMerge requests are subject to a small race. It is possible for
the ranges implicated by an AdminMerge request to split or merge between when
the client decides to send an AdminMerge request and when the AdminMerge
request is processed.

For example, suppose the client decides that _P_ and _Q_ should be merged and
sends an AdminMerge request to _P_. It is possible that, before the AdminMerge
request is processed, _P_ splits into _P1_ and _P2_. The AdminMerge request
will thus result in _P1_ and _P2_ merging together, and not the desired _P_
and _Q_.

The race could have been avoided if the AdminMerge request required that the
descriptors for the implicated ranges were provided as arguments to the
request. Then the merge could be aborted if the merge transaction discovered
that either of the implicated ranges did not match the corresponding descriptor
in the AdminMerge request arguments, forming a sort of optimistic lock.

Fortunately, the race is rare in practice. If it proves to be a problem, the
scheme described above would be easy to implement while maintaining backwards
compatibility.

### Merge transaction

The merge transaction piggybacks on CockroachDB's serializability to provide
much of the necessary synchronization for the bookkeeping updates. For example,
merges cannot occur concurrently with any splits or replica changes on the
implicated ranges, because the merge transaction will naturally conflict with
those split and change-replicas transactions, as all of them attempt to write
updated range descriptors. No additional code was needed to enforce this, as
our standard transaction conflict detection mechanisms kick in here (write
intents, the timestamp cache, the spanlatch manager, etc.).

Note that there was one surprising synchronization problem that was not
immediately handled by serializability. See [range descriptor
generations](#range-descriptor-generations) for details.

The standard KV operations that the merge transaction performs are:

 * Reading the LHS descriptor and RHS descriptor, and verifying that their
   replica sets are aligned.
 * Updating the local and meta copy of the LHS descriptor to reflect
   the widened end key.
 * Deleting the local and meta copy of the RHS descriptor.
 * Writing an entry to the `system.rangelog` table.

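Sketched as code, the body of the merge transaction performs those operations
in roughly the following shape. The `kvTxn` interface and its method names are
invented for illustration (the real coordinator issues raw KV requests through
the transactional client), and the `RangeDescriptor` and `widen` helpers are
the simplified ones from the earlier sketch.

```go
// kvTxn is a hypothetical, simplified view of the transactional KV client
// used by the merge coordinator; these are not the real method names.
type kvTxn interface {
    GetRangeDescriptors(lhsStart, rhsStart []byte) (lhs, rhs RangeDescriptor, err error)
    PutLocalDescriptor(desc RangeDescriptor) error    // range-local copy
    PutMetaDescriptor(desc RangeDescriptor) error     // meta2 addressing record
    DeleteLocalDescriptor(desc RangeDescriptor) error // range-local copy
    DeleteMetaDescriptor(desc RangeDescriptor) error  // meta2 addressing record
    LogRangeEvent(event string) error                 // system.rangelog entry
}

// runMergeTxn sketches the KV operations issued by the merge coordinator.
func runMergeTxn(txn kvTxn, lhsStart, rhsStart []byte) error {
    lhs, rhs, err := txn.GetRangeDescriptors(lhsStart, rhsStart)
    if err != nil {
        return err
    }
    // Adjacency and replica set alignment checks, as in the earlier sketch.
    merged, err := widen(lhs, rhs)
    if err != nil {
        return err
    }
    // The local copy of the LHS descriptor is written first so that the
    // transaction record is anchored on the subsuming range.
    if err := txn.PutLocalDescriptor(merged); err != nil {
        return err
    }
    if err := txn.PutMetaDescriptor(merged); err != nil {
        return err
    }
    if err := txn.DeleteLocalDescriptor(rhs); err != nil {
        return err
    }
    if err := txn.DeleteMetaDescriptor(rhs); err != nil {
        return err
    }
    return txn.LogRangeEvent("merge")
}
```
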
These operations are the essence of a merge, and in fact update all necessary
on-disk data! All the remaining complexity exists to update in-memory metadata
while the cluster is live.

Note that the merge transaction's KV operations are not fundamentally dependent
on one another and so could theoretically be performed in any order. There are,
however, several implementation details that enforce some ordering constraints.

First, the merge transaction record needs to be located on the LHS.
Specifically, the transaction record needs to live on the subsuming range, as
the commit trigger that actually applies the merge to the replica's in-memory
state runs on the range where the transaction record lives. The transaction
record is created on the range that the transaction writes first; therefore,
the merge transaction is careful to update the local copy of the LHS descriptor
as its first operation, since the local copy of the LHS descriptor lives on the
LHS.

Second, the merge transaction must ensure that, when it issues the delete
request to remove the local copy of the RHS descriptor, the resulting intent is
actually written to disk. (See the [transfer of power](#transfer-of-power)
subsection for why this is required.) Thanks to [transactional
pipelining][#26599], KV writes can return early, before their intents have
actually been laid down. The intents are not required to make it to disk until
the moment before the transaction commits. The merge transaction simply
disables pipelining to avoid this hazard.

As the last step before the commit, the merge transaction needs to freeze the
RHS, then wait for _every_ replica of the RHS to apply all outstanding
commands. This ensures that, when the merge commits, every LHS replica can
blindly assume that it has perfectly up-to-date data for the RHS. To quickly
recap, this is guaranteed because 1) the replica sets were aligned when the
merge transaction started, 2) rebalances that would misalign the replica sets
will conflict with the merge transaction, causing one of the transactions to
abort, 3) the RHS is frozen and cannot process any new commands, and 4) every
replica of the RHS is caught up on all commands. The process of freezing the
RHS and waiting for every replica to catch up is covered more thoroughly in the
[transfer of power](#transfer-of-power) subsection.

Finally, the merge transaction commits, attaching a special [merge commit
trigger] to the end transaction request. This trigger has three
responsibilities:

 1. It ensures the end transaction request knows which intents it can resolve
    locally. Intents that live on the RHS range would naively appear to belong
    to a different range than the one containing the transaction record (i.e.,
    the LHS), but if the merge is committing then the LHS is subsuming the RHS
    and thus the intents can be resolved locally.

    In fact, it's critical that these intents are considered local, because
    local intents are resolved synchronously while remote intents are resolved
    asynchronously. We need to maintain the invariant that, when a store boots
    up and discovers an intent on its local copy of a range descriptor, it can
    simply ignore the intent. Because we enforce that these intents are
    resolved synchronously with the commit of the merge transaction, we are
    guaranteed that, if we see an intent on a local range descriptor, it is
    because the merge has not yet been committed, and it is safe to load the
    replica.
    If the intent were resolved asynchronously, we could attempt to
    load both the post-merge subsuming replica and the subsumed replica, which
    would overlap and crash the node.

 2. It adjusts the LHS's MVCCStats to incorporate the subsumed range's
    MVCCStats.

 3. It attaches a [merge payload to the replicated proposal result][pd-flag].
    When each replica of the LHS applies the command, it will notice the merge
    payload and adjust the store's in-memory state to match the changes to the
    on-disk state that were committed by the transaction. This entails
    atomically removing the RHS replica from the store and widening the LHS
    replica.

    This operation involves a delicate dance of acquiring store locks and locks
    for both replicas, in a certain order, at various points in the Raft
    command application flow. The details are too intricate to be worth
    describing here, especially considering that these tangled interactions
    between a store and its replicas are due for a refactor. The best thing to
    do, if you're interested in the details, is to trace through all references
    to `storagepb.ReplicatedEvalResult.Merge`.

[#26599]: https://github.com/cockroachdb/cockroach/pull/26599
[merge commit trigger]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/batcheval/cmd_end_transaction.go#L984-L994
[pd-flag]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/batcheval/cmd_end_transaction.go#L1033-L1035

#### Transfer of power

The trickiest part of the merge transaction is making the LHS responsible for
the keyspace that was previously owned by the RHS. The transfer of power must
be performed atomically; otherwise, all manner of consistency violations can
occur. This is hopefully obvious, but here's a quick example to drive the point
home. Suppose _P_ and _Q_ simultaneously consider themselves responsible for
the key _q_. _Q_ could then allow a write to _q_ at time 1 at the same time
that _P_ allowed a read of _q_ at time 2. Consistency requires either that the
read see the write, as the read is executing at a higher timestamp, or that the
write be bumped to time 3. But because _P_ and _Q_ have separate spanlatch
managers, no synchronization will occur, and the read might fail to see the
write!

The transfer of power is complicated by the fact that there is no guarantee
that the leases on the LHS and the RHS will be aligned, nor is there any
straightforward way to provide such a guarantee. (Aligned leaseholders would
allow the merge transaction to use a simple in-memory lock to make the transfer
of power atomic.) The leaseholder of either range might fail at any moment, at
which point the lease can be acquired, after it expires, by any other live
member of the range.

Since leaseholder alignment is infeasible, the merge transaction implements
what is essentially a distributed lock.

The lock is initialized by sending a [Subsume][subsume-request] request to the
RHS. This is a single-purpose request that exists solely for use in the merge
transaction. It is unlikely to ever be useful in another situation, and
(ab)uses several implementation details to provide the necessary
synchronization.

When the Subsume request returns, the RHS has made three important promises:

 1. There are no commands in flight.
 2. Any future requests will block until the merge transaction completes.
    If the merge transaction commits, the requests will be bounced with a
    RangeNotFound error. If the merge transaction aborts, the requests will
    be processed as normal.
 3. If the RHS loses its lease, the new leaseholder will adhere to promises
    1 and 2.

The Subsume request provides promise 1 by declaring that it reads and writes
all addressable keys in the range. This is a bald-faced lie, as the Subsume
request only reads one key and writes no keys, but it forces a flush of the
spanlatch manager, as no other commands can possibly execute in parallel with a
command that claims to write all keys.

**TODO(benesch,nvanbenschoten):** Actually, concurrent reads at lower
timestamps are permitted. Is this a problem? Maybe. Answering this question is
difficult and requires reasoning about the causal chain established by the
sequence of requests sent by the merge transaction.

It provides promise 2 by flipping [a bit][merge-bit] on the replica that
indicates that a subsumption is in progress. When the bit is active, the
replica blocks processing of all requests.

Importantly, the bit needs to be cleared when the merge transaction completes,
so that requests are not blocked forever. This is the responsibility of the
[*merge watcher goroutine*][watcher]. Determining whether a transaction has
committed or not is conceptually simple, but the details are brutally
complicated. See the [transaction record GC](#transaction-record-gc) section
for details.

Note that the Subsume request only runs on the leaseholder, and therefore the
merge bit is only set on the leaseholder and the watcher goroutine only runs on
the leaseholder. This is perfectly safe, as none of the follower replicas can
process requests.

Promise 3 is actually not provided by the Subsume request at all, but by a hook
in the lease acquisition code path. Whenever a replica acquires a lease, it
checks to see whether its local descriptor has a deletion intent. If it does,
it can infer that a subsumption is in progress, as nothing else leaves a
deletion intent on a range descriptor. In that case, the replica, instead of
serving traffic, flips the merge bit and launches its own merge watcher
goroutine, just as the Subsume command would have. This means there can
actually be multiple replicas of the RHS with the merge bit set and a merge
watcher goroutine running—assuming the old leaseholder did not crash but lost
its lease for other reasons—but this does not cause any problems.

[subsume-request]: https://github.com/cockroachdb/cockroach/blob/d6adf24cae788d7cd967feadae8e9c0388ce5273/pkg/storage/batcheval/cmd_subsume.go#L56-L86
[merge-bit]: https://github.com/cockroachdb/cockroach/blob/d6adf24cae788d7cd967feadae8e9c0388ce5273/pkg/storage/replica.go#L361-L364
[watcher]: https://github.com/cockroachdb/cockroach/blob/d6adf24cae788d7cd967feadae8e9c0388ce5273/pkg/storage/replica.go#L2817-L2926

### Snapshots

A replica of the LHS might be advanced past the command that commits a merge by
a Raft snapshot. That means that all the complicated bookkeeping that normally
takes place when a replica processes a command with a non-nil
`ReplicatedEvalResult.Merge` is entirely skipped! Most problematically, the
snapshot will need to widen the receiving replica, but there will be a replica
of the RHS in the way—remember, this is guaranteed by replica set alignment. In
fact, the snapshot could be advancing over several merges, in which case there
will be several RHSes to subsume.

This turns out to be relatively straightforward to handle. If an initialized
replica receives a snapshot that widens it, it can infer that a merge occurred,
and it simply subsumes all replicas that are overlapped by the snapshot in one
shot. This requires the same intricate, delicate dance mentioned at the end of
the [merge transaction](#merge-transaction) section. After all, applying a
widening snapshot is simply the bulk version of applying a merge command
directly. The details are again too complicated to go into here, but you can
begin your own exploration by starting with this call to
[Replica.maybeAcquireSnapshotMergeLock][code-start] and tracing how the
returned `subsumedRepls` value is used.

[code-start]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/replica.go#L4071-L4072

### Merge queue

The merge queue, like most of the other queues in the system, runs on every
store and periodically scans all replicas for which that store holds the lease.
For each replica, the merge queue evaluates whether it should be merged with
its right neighbor. Looking rightward is a natural fit for the reasons
described in the [key encoding oddities](#key-encoding-oddities) section of
the appendix.

In some ways, the merge queue has an easy job. For any given range _P_ and its
right neighbor _Q_, the merge queue synthesizes the hypothetical merged range
_PQ_ and asks whether the split queue would immediately split that merged
range. If the split queue would immediately split _PQ_, then obviously _P_ and
_Q_ should not be merged; otherwise, the ranges _should_ be merged! This means
that any improvement to our split heuristics also improves our merge heuristics
with essentially no extra work. For example, load-based splitting hardly
required any changes to the merge queue.

Note that, to avoid thrashing, ranges at or above the minimum size threshold
(8MB) are never considered for a merge. The minimum size threshold is
configurable on a per-zone basis.

Unfortunately, constructing the hypothetical merged range _PQ_ requires
information about _Q_ that only _Q_'s leaseholder maintains, like the amount of
load that _Q_ is currently experiencing. The merge queue must send a
`RangeStats` RPC to collect this information from _Q_'s leaseholder, because
there is no guarantee that the current store is _Q_'s leaseholder—or that the
current store even has a replica of _Q_ at all.

To prevent unnecessary RPC chatter, the merge queue uses several heuristics to
avoid sending `RangeStats` requests when it seems like the merge is unlikely to
be permitted. For example, if it determines that _P_ and _Q_ store different
tables, the split between them is mandatory and it won't bother sending the
`RangeStats` requests. Similarly, if _P_ is above the minimum size threshold,
it doesn't bother asking about _Q_.

## Subtle complexities

### Range descriptor generations

There was one knotty race that was not immediately eliminated by transaction
serializability. Suppose we have our standard aligned replica set situation:

```
Store 1   Store 2   Store 3   Store 4
+-----+   +-----+   +-----+   +-----+
| P Q |   | P Q |   |     |   | P Q |
+-----+   +-----+   +-----+   +-----+
```

In an unfortunate twist of fate, a rebalance of P from store 2 to store 3
begins at the same time as a merge of P and Q. Let's quickly cover the valid
outcomes of this race.

 1. The rebalance commits before the merge.
    The merge must abort, as the
    replica sets of P and Q are no longer aligned.

 2. The merge commits before the rebalance starts. The rebalance should
    voluntarily abort, as the decision to rebalance P needs to be updated in
    light of the merge. It is not, however, a correctness problem if the
    rebalance commits; it simply results in rebalancing a larger range than
    may have been intended.

 3. The merge commits before the rebalance ends, but after the rebalance has
    sent a preemptive snapshot to store 3. The rebalance must abort, as
    otherwise the preemptive snapshot it sent to store 3 is a ticking time
    bomb.

    To see why, suppose the rebalance commits. Since the preemptive snapshot
    predates the commit of the merge transaction, the new replica on store 3
    will need to be streamed the Raft command that commits the merge
    transaction. But applying this merge command is disastrous, as store 3
    does not have a replica of Q to merge! This is a very subtle way in which
    replica set alignment can be subverted.

Guaranteeing the correct outcome in case 1 is easy. The merge transaction
simply checks for replica set alignment by transactionally reading the range
descriptor for _P_ and the range descriptor for _Q_ and verifying that they
list the same replicas. Serializability guarantees the rest.

Case 2 is similarly easy to handle. The rebalance transaction simply verifies
that the range descriptor used to make the rebalance decision matches the range
descriptor that it reads transactionally.

Case 3, however, has an extremely subtle pitfall. It seems like the solution
for case 2 should apply: simply abort the transaction if the range descriptor
changes between when the preemptive snapshot is sent and when the rebalance
transaction starts. But, as it turns out, this is not quite foolproof. What if,
between when the preemptive snapshot is sent and when the rebalance transaction
starts, _P_ and _Q_ merge together and then split at exactly the same key? The
range descriptor for _P_ will look entirely unchanged to the rebalance
transaction!

The solution was to add a generation counter to the range descriptor:

```protobuf
message RangeDescriptor {
  // ...

  // generation is incremented on every split and every merge, i.e., whenever
  // the end_key of this range changes. It is initialized to zero when the
  // range is first created.
  optional int64 generation = 6;
}
```

It is no longer possible for a range descriptor to be unchanged by a sequence
of splits and merges, as every split and merge will bump the generation
counter. Rebalances can thus detect if a merge commits between when the
preemptive snapshot is sent and when the transaction begins, and abort
accordingly.

### Misaligned replica sets

An early implementation allowed merges between ranges with misaligned replica
sets. The intent was to simplify matters by avoiding replica rebalancing.

Consider again our example misaligned replica set:

```
Store 1   Store 2   Store 3   Store 4
+-----+   +-----+   +-----+   +-----+
| P Q |   | P   |   |   Q |   | P Q |
+-----+   +-----+   +-----+   +-----+

P: (s1, s2, s4)
Q: (s1, s3, s4)
```

Note that there are two perspectives shown here. The store boxes represent the
replicas that are *actually* present on that store, from the perspective of the
store itself. The descriptor tuples at the bottom represent the stores that are
considered to be members of the range, from the perspective of the most
recently committed range descriptor.

Now, to merge P and Q in this situation without aligning their replica sets,
store 2 needed to be provided with a copy of Q's data. To accomplish this, a
copy of Q's data was stashed in the merge trigger, and P would write this data
into its store when applying the merge trigger.

There was, sadly, a large and unforeseen problem with lagging replicas.
Suppose store 2 loses network connectivity a moment before ranges P and Q
are merged. Note that store 2 is not required for P and Q to merge, because
only a quorum is required on the LHS to commit a merge. Now the situation
looks like this:

```
Store 1   Store 2   Store 3   Store 4
+-----+   xxxxxxx   +-----+   +-----+
| PQ  |   | P   |   |     |   | PQ  |
+-----+   xxxxxxx   +-----+   +-----+

PQ: (s1, s2, s4)
```

There is nothing stopping the newly merged PQ range from immediately splitting
into P and Q'. Note that P is the same range as the original P (i.e., it has
the same range ID), and so store 2's replica is still considered a member of P,
while Q' is a new range, with a new ID, that is unrelated to Q:

```
Store 1   Store 2   Store 3   Store 4
+-----+   xxxxxxx   +-----+   +-----+
| P Q'|   | P   |   |     |   | P Q'|
+-----+   xxxxxxx   +-----+   +-----+

P: (s1, s2, s4)
Q': (s1, s2, s4)
```

When store 2 comes back online, it will start catching up on missed messages.
But notice how store 2 is considered to be a member of Q', because it was a
member of P before the split. The leaseholder for Q' will notice that store 2's
replica is out of date and send over a snapshot so that store 2 can initialize
its replica... and all that might happen before store 2 manages to apply the
merge command for PQ. If so, applying the merge command for PQ will explode,
because the keyspace of the merged range PQ intersects with the keyspace of Q'!

By requiring aligned replica sets, we sidestep this problem. The RHS is, in
effect, a lock on the post-merge keyspace. Suppose we find ourselves in the
analogous situation with replica sets aligned:

```
Store 1   Store 2   Store 3   Store 4
+-----+   xxxxxxx   +-----+   +-----+
| P Q'|   | P Q |   |     |   | P Q'|
+-----+   xxxxxxx   +-----+   +-----+

P: (s1, s2, s4)
Q': (s1, s2, s4)
```

Here, PQ split into P and Q' immediately after merging, but notice how store 2
has a replica of both P and Q because we required replica set alignment during
the merge. That replica of Q prevents store 2 from initializing a replica of Q'
until either store 2's replica of P applies the merge command (to PQ) and the
split command (to P and Q'), or store 2's replica of P is rebalanced away.

### Replica GC

Per the discussion in the last section, we use the replica of the RHS as a lock
on the keyspace extension. This means that we need to be careful not to GC this
replica too early.

It's easiest to see why this is a problem if we consider the case where one
replica is extremely slow in applying a merge:

```
Store 1   Store 2   Store 3   Store 4
+-----+   +-----+   +-----+   +-----+
| PQ  |   | PQ  |   |     |   | P Q |
+-----+   +-----+   +-----+   +-----+

PQ: (s1, s2, s4)
```

Here, _P_ and _Q_ have just merged. Store 4 hasn't yet processed the merge
while stores 1 and 2 have.

The replica GC queue is continually scanning for replicas that are no longer a
member of their range. What if the replica GC queue on store 4 scans its
replica of _Q_ at this very moment? It would notice that the _Q_ range has been
merged away and, conceivably, conclude that _Q_ could be garbage collected.
This would
be disastrous, as when _P_ finally applied the merge trigger it would no longer
have a replica of _Q_ to subsume!

One potential solution would be for the replica GC queue to refuse to GC
replicas for ranges that have been merged away. But that could result in
replicas getting permanently stuck. Suppose that, before store 4 applies the
merge transaction, the _PQ_ range is rebalanced away to store 3:

```
Store 1   Store 2   Store 3   Store 4
+-----+   +-----+   +-----+   +-----+
| PQ  |   | PQ  |   | PQ  |   | P Q |
+-----+   +-----+   +-----+   +-----+

PQ: (s1, s2, s3)
```

Store 4's replica of _P_ will likely never hear about the merge, as it is no
longer a member of the range and therefore not receiving any additional Raft
messages from the leader, so it will never subsume _Q_. The replica GC queue
_must_ be capable of garbage collecting _Q_ in this case. Otherwise _Q_ will
be stuck on store 4 forever, permanently preventing the store from ever
acquiring a new replica that overlaps with that keyspace.

Solving this problem turns out to be quite tricky. What the replica GC queue
wants to know when it discovers that _Q_'s range has been subsumed is whether
the local replica of _Q_ might possibly still be subsumed by its local left
neighbor _P_. It can't just ask the local _P_ whether it's about to apply a
merge, since _P_ might be lagging behind, as it is here, and have no idea that
a merge is about to occur.

So the problem reduces to proving that _P_ cannot apply a merge trigger that
will subsume _Q_. The chosen approach is to fetch the current range descriptor
for _P_ from the meta index. If that descriptor exactly matches the local
descriptor, thanks to
[range descriptor generations](#range-descriptor-generations), we are assured
that there are no merge triggers that _P_ has yet to apply, and _Q_ can safely
be GC'd.

Note that it is possible to form long chains of replicas that can only be GC'd
from left to right; the GC queue is not aware of these dependencies and
therefore processes such chains extremely inefficiently (i.e., by processing
replicas in an arbitrary order instead of the necessary order). These chains
turn out to be extremely rare in practice.

There is one additional subtlety here. Suppose we have two adjacent ranges, _O_
and _Q_. _O_ has just split into _O_ and _P_, but store 3 is lagging and has
not yet processed the split.

```
STATE 1

 Store 1    Store 2    Store 3    Store 4
+-------+  +-------+  +-------+  +-------+
| O P Q |  | O P Q |  | O   Q |  |       |
+-------+  +-------+  +-------+  +-------+

O: (s1, s2, s3)
P: (s1, s2, s3)
Q: (s1, s2, s3)
```

At this point, suppose the leader for the new range _P_ decides that store 3
will need a snapshot to catch up, and starts sending the snapshot over the
network. This will be important later. At the same time, _P_ and _Q_ merge
while store 3 is still lagging.

It may seem strange that this merge is permitted, but notice how the replica
sets are aligned according to the descriptors, even though store 3 does not
physically have a replica of _P_ yet.
Here's the new state of the world:

```
STATE 2

 Store 1    Store 2    Store 3    Store 4
+-------+  +-------+  +-------+  +-------+
| O  PQ |  | O  PQ |  | O   Q |  |       |
+-------+  +-------+  +-------+  +-------+

O: (s1, s2, s3)
PQ: (s1, s2, s3)
```

Finally, _O_ is rebalanced from store 3 to store 4 and garbage collected on
store 3:

```
STATE 3

 Store 1    Store 2    Store 3    Store 4
+-------+  +-------+  +-------+  +-------+
| O  PQ |  | O  PQ |  |     Q |  | O     |
+-------+  +-------+  +-------+  +-------+

O: (s1, s2, s4)
PQ: (s1, s2, s3)
```

The replica GC queue might reasonably think that store 3's replica of _Q_ is
out of date, as _Q_ has no left neighbor that could subsume it. But, at any
moment in time, store 3 could finish receiving the snapshot for _P_ that was
started between state 1 and state 2. Crucially, this snapshot predates the
merge, so it will need to apply the merge trigger... and the replica for _Q_
had better be present on the store!

This hazard is avoided by requiring that all replicas of the LHS are
initialized before a merge begins. This prevents a transition from state 1 to
state 2, as the merge of _P_ and _Q_ cannot occur until store 3 initializes its
replica of _P_. The AdminMerge command will wait a few seconds in the hope that
store 3 catches up quickly; otherwise, it will refuse to launch the merge
transaction. It is therefore impossible to end up in a dangerous state, like
state 3, and it is thus safe for the replica GC queue to GC _Q_ if its left
neighbor is generationally up to date.

### Transaction record GC

The merge watcher goroutine needs to wait until the merge transaction completes
and determine whether the transaction committed or aborted. This turns out to
be brutally complicated, thanks to the aggressive garbage collection of
transaction records.

The watcher goroutine begins by sending a PushTxn request to the merge
transaction. It can easily discover the necessary arguments for the PushTxn
request, that is, the ID and key of the current merge transaction record,
because they're recorded in the intent that the merge transaction has left on
the RHS's local copy of the descriptor.

If the PushTxn request reports that the merge transaction committed, we're
guaranteed that the merge transaction did, in fact, commit. That means that we
can mark the RHS replica as destroyed, bounce all requests back to DistSender
(where they'll get retried on the subsuming range), and clean up the watcher
goroutine. The RHS replica will be cleaned up either when the LHS replica
subsumes it or when the replica GC queue notices that it has been abandoned.

If the PushTxn request instead reports that the merge transaction aborted,
we're not guaranteed that the merge transaction actually aborted. The merge
transaction may have committed so quickly that its transaction record was
garbage collected before our PushTxn request arrived. The PushTxn request
incorrectly interprets this state to mean that the transaction was aborted,
when, in fact, it was committed and GCed. To be fair, we're somewhat abusing
the PushTxn request here. Outside of merges, a PushTxn request is only sent
when a pending intent is discovered, and transaction records can't be GCed
until all their intents have been resolved.

So we need some way to determine whether a merge transaction was actually
aborted. What we do is look for the effects of the merge transaction in meta2.
If the merge aborted, we'll see our own range descriptor, with our range ID, in
meta2.
By contrast, if the merge committed, we'll see a range descriptor for a
different range in meta2.

This complexity is extremely unfortunate, and turns what should be a simple
goroutine

```go
go func() {
    <-txn.Done() // wait for txn to complete
    if txn.Committed() {
        repl.MarkDestroyed("replica subsumed")
    }
    repl.UnblockRequests()
}()
```

into [150 lines of hard-to-follow code][code].

[code]: https://github.com/cockroachdb/cockroach/blob/82bcd948384e6a482cdd7c916c0aaca32367a7b0/pkg/storage/replica.go#L2813-L2920

### Unanimity

The largest conceptual incongruity with the current merge implementation is the
fact that it requires unanimous consent from all replicas, instead of a
majority quorum, like everything else in Raft. Further confusing matters, only
the RHS needs unanimous consent; a merge can proceed with only majority consent
from the LHS. In fact, it's even a bit more subtle: while only a majority of
the LHS replicas need to vote on the merge command, all LHS replicas need to
confirm that they are initialized for the merge to start.

There is no theoretical reason that merges need unanimous consent, but the
complexity of the implementation quickly skyrockets without it. For example,
suppose you adjusted the transfer of power so that only a majority of replicas
on the RHS need to be fully up to date before the merge commits. Now, when
applying the merge trigger, the LHS needs to check to see if its copy of the
RHS is up to date; if it's not, the LHS needs to throw away its entire Raft
state and demand a snapshot from the leader. This is both unsightly—our code is
worse off every time we reach into Raft—and less efficient than the existing
implementation, as it requires sending a potentially multi-megabyte snapshot if
one replica of the RHS is just a little bit behind in applying the latest
commands.

It's possible that these problems could be mitigated while retaining the
ability to merge with a minority of replicas offline, but an obvious solution
did not present itself. On the bright side, having too many ranges is unlikely
to cause acute performance problems; that is, a situation where a merge is
critical to the health of a cluster is difficult to imagine. Unlike large
ranges, which can appear suddenly and require an immediate split or log
truncation, merges are only required when there are on the order of tens of
thousands of excessively small ranges, which takes a long time to build up.

## Safety recap

This section recaps the various mechanisms, described in detail above, that
work together to ensure that merges do not violate consistency.

The first broad safety mechanism is replica set alignment, which is required so
that every store participating in the merge has a copy of both halves of the
data in the merged range. Replica sets are first optimistically aligned by the
merge queue. The replica sets might drift apart, e.g., because the ranges in
question were also targeted for a rebalance by the replicate queue, so the
merge transaction verifies that the replica sets are still aligned from within
the transaction. If a concurrent split or rebalance occurs on the implicated
ranges, transactional isolation kicks in and aborts one of the transactions, so
we know that the replica sets are still aligned at the moment that the merge
commits.

Crucially, we need to maintain alignment until the merge applies on all
replicas that were present at the time of the merge.
This is enforced by refusing to
GC a replica of the RHS of a merge unless it can be proven that the store does
not have a replica of the LHS that predates the merge, _nor_ will it ever
acquire one. Proving that the store does not currently have a replica of the
LHS that predates the merge is fairly straightforward: we simply prove that the
local left neighbor's generation is the newest generation, as indicated by the
LHS's meta descriptor. Proving that the store will _never_ acquire a replica of
the LHS that predates the merge is harder—there could be a snapshot of the LHS
in flight that the store is entirely unaware of. So instead we require that
replicas of the LHS in a merge are initialized on every store before the merge
can begin.

The second broad safety mechanism is the range freeze. This ensures that the
subsuming range and the subsumed range do not serve traffic at the same time,
which would lead to clear consistency violations. The mechanism works by tying
the freeze to the lifetime of the merge transaction; the merge will not commit
until all replicas of the RHS are verified to be frozen, and the replicas of
the RHS will not unfreeze unless the merge transaction is verified to be
aborted. Lease transfers are freeze-aware, so the freeze will persist even if
the lease moves around on the RHS during the merge or if the leaseholder
restarts. The implementation of the freeze (ab)uses the spanlatch manager to
flush out in-flight commands on the RHS, an intent on the local range
descriptor to ensure the freeze persists if the lease is transferred, and an
RPC that repeatedly polls the RHS to wait until it is fully caught up.

## Appendix

The appendix contains digressions that are not directly pertinent to range
merges, but are not covered in documentation elsewhere.

### Key encoding oddities

Lexicographic ordering of keys of unbounded length has the interesting property
that it is always possible to construct the key that immediately succeeds a
given key, but it is not always possible to construct the key that immediately
precedes a given key.

In the following diagrams, `\xNN` represents a byte whose value in hexadecimal
is `NN`. The null byte is thus `\x00` and the maximal byte is thus `\xff`.

Now consider a sequence of keys that has no gaps:

```
a
a\x00
a\x00\x00
```

No gaps means that there are no possible keys that can sort between any of the
members of the sequence. For example, there is, simply, no key that sorts
between `a` and `a\x00`.

Because we can construct such a sequence, we must have next and previous
operations over the sequence, which, given a key, construct the immediately
following key and the immediately preceding key, respectively. We can see from
the diagram that the next operation appends a null byte (`\x00`), while the
previous operation strips off that null byte.

But what if we want to perform the previous operation on a key that does not
end in a null byte? For example, what is the key that immediately precedes `b`?
It's not `a`, because `a\x00` sorts between `a` and `b`. Similarly, it's not
`a\xff`, because `a\xff\xff` sorts between `a\xff` and `b`. This process
continues inductively until we conclude that the key that immediately precedes
`b` is `a\xff\xff\xff...`, where there are an infinite number of trailing
`\xff` bytes.

It is not possible to represent this key in CockroachDB without infinite space.

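The asymmetry is easy to demonstrate in code. The following is a minimal sketch
of the lexical "next key" operation; `next` is an illustrative helper, not the
actual key utilities in the CockroachDB source.

```go
package main

import "fmt"

// next returns the key that immediately follows k in lexicographic order:
// append a null byte. There is no analogous prev for keys that do not end in
// a null byte, because the immediately preceding key would need an infinite
// run of trailing \xff bytes.
func next(k []byte) []byte {
    return append(append([]byte(nil), k...), 0x00)
}

func main() {
    fmt.Printf("%q\n", next([]byte("a"))) // "a\x00": nothing sorts between "a" and "a\x00"
    // By contrast, there is no finite key that immediately precedes "b": not
    // "a" (since "a\x00" sorts in between), not "a\xff" (since "a\xff\xff"
    // sorts in between), and so on.
}
```
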
You could imagine designing the key encoding with an additional bit that means,
"pretend this key has an infinite number of trailing maximal bytes," but
CockroachDB does not have such a bit.

The upshot is that it is trivial to advance in the keyspace using purely
lexical operations, but it is impossible to reverse in the keyspace with purely
lexical operations.

This problem pervades the system. Given a range that spans from `StartKey`,
inclusive, to `EndKey`, exclusive, it is trivial to address a request to the
following range, but *not* the preceding range. To route a request to a range,
we must construct a key that lives inside that range. Constructing such a key
for the following range is trivial, as the end key of a range is, by
definition, contained in the following range. But constructing such a key for
the preceding range would require constructing the key that immediately
precedes `StartKey`, which is not possible with CockroachDB's key encoding.

diff --git a/pkg/storage/client_merge_test.go b/pkg/storage/client_merge_test.go
index c045fef23a51..6b2486381757 100644
--- a/pkg/storage/client_merge_test.go
+++ b/pkg/storage/client_merge_test.go
@@ -2417,7 +2417,7 @@ func (h *slowSnapRaftHandler) HandleSnapshot(
 // merge, and correctly refuse to GC its replica of B. In this case, however,
 // S3's replica of A is uninitialized and thus doesn't know its start and end
 // key, so S3 will instead discover some more-distant left neighbor of B. This
-// distant neighbor might very well be up-to-date, and S3 will incorreclty
+// distant neighbor might very well be up-to-date, and S3 will incorrectly
 // conclude that it can GC its replica of B!
 //
 // So say S3 GCs its replica of B. There are now two paths that A might take.