tech-notes: add note on range merges #33334
Conversation
docs/tech-notes/range-merges.md
By requiring replica set alignment, the merge operation is reduced to a
metadata update, albiet a tricky one, as the stores that will have a copy of
the merged range PQ already have all the constitudent data, by virtue of having
s/constiduent/constituent
docs/tech-notes/range-merges.md
it is about to be subsumed and it is prohibited from serving any additional read
or write traffic. Only when the coordinator has received an acknowledgement from
_every_ replica of the RHS, indicating that no traffic is possibly being served
on the RHS, does the coordinator commit the merge transaction.
Not a huge deal, but I take it that this means merges don't work on 3-node clusters with one node down, similar to how the lack of quiescence can cause massive performance hits in such clusters.
Yep, that's right. There's a theoretical scheme in which you could merge with a simple majority, but it didn't seem worth it. It would be massively complicated. Not sure if I'll have time to write it up formally, so let me know if you're curious and I can scrawl it down in a comment at least.
docs/tech-notes/range-merges.md
Notice how node 2 does not have a copy of Q, and node 3 does not have a copy of
P. These replica sets are considered "misaligned." Aligning them requires
rebalancing Q from node 3 to node 2, or rebalancing P from node 2 to node 3:
I think it's worth calling out that this is done proactively by the LHS leaseholder in order to enable merges, even if the details of how are too implementation-specific.
Yeah, definitely need to add a section on the merge queue.
docs/tech-notes/range-merges.md
command begins executing will always be the RHS, subsumed range.

It would perhaps have been wise to require that the AdminMerge specify the
descriptor of the intented subsumption target. It is, at present, possible for
s/intented/intended
docs/tech-notes/range-merges.md
It would perhaps have been wise to require that the AdminMerge specify the
descriptor of the intented subsumption target. It is, at present, possible for
someone to send an admin merge to P, intending to merge P's right neighbor Q,
but Q can split in the meantime and so instead P merges with a different range.
I assume it's also possible that a different client had already asked P and Q to merge, and so a second/repeated request could cause PQ and R to merge?
docs/tech-notes/range-merges.md
the command queue, etc.).

There was, however, one knotty edge case that was not immediately eliminated by
transaction serializability. Suppose we have our standard misaligned replica
The example below is already aligned. I assume you meant s/misaligned/aligned/?
In an unfortunate twist of fate, a rebalance of P from store 2 to store 3
begins at the same time as a merge of P and Q begins. Let's quickly cover the
valid outcomes of this race.
These are the valid outcomes, but are they all possible? Are there any important mechanisms in place ensuring no other outcomes occur?
Yes, they're all possible. Perhaps this section needs a bit more love. The idea is that either the rebalance commits first or the merge commits first. (One of those things must happen. I guess there are cases where the rebalance aborts or the merge aborts for unrelated reasons, but those don't matter.) Once you assume that the rebalance commits first, then the merge must abort. If you instead assume that the merge commits first, the rebalance must abort; that case is broken into two because the rationale for why the abort happens depends on precisely when the merge commit occurs.
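To make the "exactly one commits, the other must abort" argument concrete, here is a toy Python model (hypothetical names, not CockroachDB code). The real mechanism is transaction conflict detection on the range descriptors: both the merge transaction and the rebalance write P's descriptor, so whichever commits first invalidates the other's read. Here the descriptor is modeled as an optimistically locked record with a generation counter:

```python
# Toy model of the merge/rebalance race: both operations read the descriptor
# at some generation, and commit only if the generation is unchanged.

class Range:
    def __init__(self, replicas):
        self.replicas = set(replicas)  # store IDs holding a replica
        self.gen = 0                   # bumped on every descriptor change

def commit_merge(p, q, p_gen_read, q_gen_read):
    if (p.gen, q.gen) != (p_gen_read, q_gen_read):
        return "merge aborts: descriptor changed underfoot"
    if p.replicas != q.replicas:
        return "merge aborts: replica sets misaligned"
    p.gen += 1  # P widens into PQ
    return "merge commits"

def commit_rebalance(p, p_gen_read, src, dst):
    if p.gen != p_gen_read:
        return "rebalance aborts: descriptor changed underfoot"
    p.replicas.discard(src)
    p.replicas.add(dst)
    p.gen += 1
    return "rebalance commits"

# Aligned replica sets on stores {1, 2}; a rebalance of P from store 2 to
# store 3 races with a merge of P and Q. Both read the descriptors at gen 0.

# Outcome A: the rebalance commits first, so the merge must abort.
p, q = Range({1, 2}), Range({1, 2})
print(commit_rebalance(p, 0, 2, 3))  # rebalance commits
print(commit_merge(p, q, 0, 0))      # merge aborts: descriptor changed underfoot

# Outcome B: the merge commits first, so the rebalance must abort.
p, q = Range({1, 2}), Range({1, 2})
print(commit_merge(p, q, 0, 0))      # merge commits
print(commit_rebalance(p, 0, 2, 3))  # rebalance aborts: descriptor changed underfoot
```

The two branches of the "merge commits first" case in the note differ only in *why* the rebalance observes the conflict, which this toy collapses into the single generation check.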
docs/tech-notes/range-merges.md
Outdated
operation must abort, as it will be sending a preemptive snapshot for P
to store 3.

1. The merge commits and the rebalance fails.
Incredibly minor, but 1, 2, 3, 1?
so far. @tbg and @bdarnell should give this a thorough read.
Reviewable status: complete! 1 of 0 LGTMs obtained
docs/tech-notes/range-merges.md, line 62 at r1 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
Not a huge deal, but I take it that this means merges don't work on 3-node clusters with one node down, similar to how the lack of quiescence can cause massive performance hits in such clusters.
I believe we can now quiesce when a node is down. This change was made in 2.1 and is powered by node liveness.
docs/tech-notes/range-merges.md, line 65 at r1 (raw file):
Like with splits, the merge transaction is committed with a special "commit trigger" that instructs the receiving store to update its in-memory bookkeping
s/bookkeping/bookkeeping/g
docs/tech-notes/range-merges.md, line 100 at r1 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
I think it's worth calling out that this is done proactively by the LHS leaseholder in order to enable merges, even if the details of how are too implementation-specific.
I think the details are interesting. If alignment is required, something needs to be powering it. What is that something? How does that mechanism interact with other rebalancing mechanisms?
docs/tech-notes/range-merges.md, line 185 at r1 (raw file):
1. The merge commits and the rebalance fails. The merge transaction uses internal commit triggers to...
Reminder to flesh out the "...".
docs/tech-notes/range-merges.md, line 190 at r1 (raw file):
## Transfer of power The trickiest part of a merge is
Leaving us hanging, eh?
docs/tech-notes/range-merges.md, line 196 at r1 (raw file):
## Missteps ### Misaligned replica sets
This section is awesome. It wasn't clear to me why alignment was necessary. This makes it clear.
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained
docs/tech-notes/range-merges.md, line 20 at r1 (raw file):
A range merge begins when two adjacent ranges are selected to be merged together. For example, suppose our adjacent ranges are _Q_ and _R_, somewhere in
Nit: Use P and Q instead of Q and R for consistency with the rest of the doc.
docs/tech-notes/range-merges.md, line 50 at r1 (raw file):
metadata update, albiet a tricky one, as the stores that will have a copy of the merged range PQ already have all the constitudent data, by virtue of having a copy of both P and Q before the merge begins.
You discuss this a couple of paragraphs down but I think it's worth adding here that P and Q may not be a consistent snapshot of the combined range before the merge begins, but we synchronize them during the merge as described below.
docs/tech-notes/range-merges.md, line 58 at r1 (raw file):
Then, the coordinator needs to atomically move responsibility for the data in the RHS to the LHS. This is tricky, as the lease on the LHS may be held by a different store than the lease on the RHS, The coordinator notifies the RHS that
For future discussion/footnote: Did we consider transferring the RHS's lease to the LHS's leaseholder, and if so why did we choose this path?
docs/tech-notes/range-merges.md, line 62 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I believe we can now quiesce when a node is down. This change was made in 2.1 and is powered by node liveness.
In this case we can't simply make a liveness-based decision. We need to completely remove the range, which requires waiting for time-until-store-dead (and even then, I don't think we currently remove a dead replica unless there is a live one to replace it with, in the hopes that waiting for that replica to come back will be better than running with a lowered replication factor. I'm not sure that that's buying us anything, though).
docs/tech-notes/range-merges.md, line 137 at r1 (raw file):
an oddity of key encoding and range bounds. It is trivial for a range to send a request to its right neighbor, as it simply addresses the request to its end key, but difficult to send a request to its left neighbor.
This is also more consistent with splits, which modify the end key while leaving the start key alone. The start key of a range never changes after its creation, only the end key does. (I don't think we knowingly rely on this anywhere).
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained
docs/tech-notes/range-merges.md, line 33 at r1 (raw file):
to _Q_ as the subsuming range and _R_ as the subsumed range. The merge is coordinated by the LHS. The coordinator begins by verifying that a)
The coordinator also aligns them if necessary.
docs/tech-notes/range-merges.md, line 48 at r1 (raw file):
By requiring replica set alignment, the merge operation is reduced to a metadata update, albiet a tricky one, as the stores that will have a copy of
albeit
docs/tech-notes/range-merges.md, line 58 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
For future discussion/footnote: Did we consider transferring the RHS's lease to the LHS's leaseholder, and if so why did we choose this path?
We considered it, but it was difficult to achieve. The difficult part is making sure that no other node grabs the lease before the merge is complete or aborted. This required a mechanism that was basically identical to letting any node be the leaseholder in the first place. Or so I remember.
docs/tech-notes/range-merges.md, line 62 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
In this case we can't simply make a liveness-based decision. We need to completely remove the range, which requires waiting for time-until-store-dead (and even then, I don't think we currently remove a dead replica unless there is a live one to replace it with, in the hopes that waiting for that replica to come back will be better than running with a lowered replication factor. I'm not sure that that's buying us anything, though).
I think we could in theory truncate our Raft log such that the down node is either caught up to the "golden index" or cut off. In doing so, we know it can't come back and serve anything invalid. I don't think we should be in a rush to implement something like this, but it seems like an appealing option. I don't like how this turns log truncations into something we need to guarantee for correctness, though.
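The "golden index" idea above is speculative, but a tiny sketch may help pin it down (toy Python, hypothetical names, not CockroachDB code): if the log is truncated at the golden index, a down follower either already holds everything up to that index, or can no longer be caught up by log entries alone and must take a snapshot, so it can never come back and serve pre-merge state.

```python
# Toy sketch of golden-index log truncation. Truncating everything below
# golden_index means a follower whose last matched entry is below it cannot
# be caught up via the log; it must receive a snapshot at or after the
# golden index.

def safe_truncation_index(golden_index):
    # First log index to keep after truncation.
    return golden_index

def can_catch_up_via_log(truncation_index, follower_match_index):
    # A follower can use the log only if the log still contains the entry
    # immediately after the last one it has.
    return follower_match_index + 1 >= truncation_index

trunc = safe_truncation_index(golden_index=100)
print(can_catch_up_via_log(trunc, 100))  # True: already at the golden index
print(can_catch_up_via_log(trunc, 40))   # False: cut off; needs a snapshot
```

This is exactly the property Tobias flags as uncomfortable: the truncation becomes load-bearing for correctness rather than a mere space optimization.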
docs/tech-notes/range-merges.md, line 137 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This is also more consistent with splits, which modify the end key while leaving the start key alone. The start key of a range never changes after its creation, only the end key does. (I don't think we knowingly rely on this anywhere).
I also don't know that we knowingly rely on this anywhere, but I would be scared to change it.
It's also nice for debuggability to have a consistent anchor key for each range.
docs/tech-notes/range-merges.md, line 147 at r1 (raw file):
## Merge transaction The merge transaction piggybacks on CockroachDB's strict serializability to
CockroachDB isn't strictly serializable, but it's linearizable on each key, which is ultimately what counts here.
docs/tech-notes/range-merges.md, line 155 at r1 (raw file):
additional code was needed to enforce this, as our standard transaction conflict detection mechanisms kick in here (write intents, the timestamp cache, the command queue, etc.).
do you mean span latch manager? ;-)
docs/tech-notes/range-merges.md, line 259 at r1 (raw file):
to store 2 so that store 2 can initialize its replica... and all that might happen before store 2 manages to apply the merge command for PQ. When store 2 came back online.
Feels like the conclusion is missing. This seems unfortunate, but not the end of the world so far.
Pushed a bunch of updates. Comments have only been sporadically addressed. I'll have to finish this up tomorrow.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
docs/tech-notes/range-merges.md, line 48 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
albeit
Done.
docs/tech-notes/range-merges.md, line 49 at r1 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
s/constiduent/constituent
Done.
docs/tech-notes/range-merges.md, line 65 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
s/bookkeping/bookkeeping/g
Done.
docs/tech-notes/range-merges.md, line 127 at r1 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
s/intented/intended
Done.
docs/tech-notes/range-merges.md, line 259 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Feels like the conclusion is missing. This seems unfortunate, but not the end of the world so far.
Yep, conclusion is added now.
Reviewed 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
docs/tech-notes/range-merges.md, line 48 at r1 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Done.
You sure?
docs/tech-notes/range-merges.md, line 223 at r2 (raw file):
If the merge txn aborts, the RHS processes the blocked requests. * Determining when the merge txn is actually committed is really, really hard thanks to txn record GC. TODO: Add a subtle complexity section about this?
Yes please!
More updates, but still a lot of feedback to address.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
docs/tech-notes/range-merges.md, line 48 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
You sure?
Oops, guess I didn't push my last rev last night. Done for real now.
The general structure is done now, I think. Just three more sections to flesh out.
Reviewed 2 of 2 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
docs/tech-notes/range-merges.md, line 58 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
We considered it, but it was difficult to achieve. The difficult part is making sure that no other node grabs the lease before the merge is complete or aborted. This required a mechanism that was basically identical to letting any node be the leaseholder in the first place. Or so I remember.
Yeah, let's just make sure that gets added to the doc.
docs/tech-notes/range-merges.md, line 62 at r1 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Yep, that's right. There's a theoretical that you could merge with a simple majority, but it didn't seem worth it. It would be massively complicated. Not sure if I'll have time to write it up formally, so let me know if you're curious and I can scrawl it down in a comment at least.
A "3-node cluster with one node down" is a cluster that can't survive any failures. Framed that way, it doesn't seem like a problem to me that merges are disabled when the cluster is in a delicate state that can't survive a failure, since that state should always be remedied as quickly as possible.
@nvanbenschoten points out that we maybe never handled the case where a
read is executing on the RHS at the same time as the Subsume request because
the command queue doesn't care about past reads.
Just as an FYI, I plan to finish this up this week. I'm not sure exactly when I'll have time yet, though.
c9e3b3f to 6885ebf (force-push)
Ok, this is pretty much done. Just a heads up I haven't copyedited yet, so apologies for any typos/weirdly worded sentences. I think I've covered pretty much all of the complexity I encountered though!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
docs/tech-notes/range-merges.md, line 33 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
The coordinator also aligns them if necessary.
I don't consider the merge queue to be part of the coordinator; when I think of the coordinator I just think of the stuff inside AdminMerge. E.g. if we add an AdminMerge RPC, there will be the client who aligns the replica sets, then sends the RPC, and then the "coordinator" is the thing that actually handles the AdminMerge RPC.
FWIW I added a note to this section about how the merge queue typically does the alignment.
docs/tech-notes/range-merges.md, line 50 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
You discuss this a couple of paragraphs down but I think it's worth adding here that P and Q may not be a consistent snapshot of the combined range before the merge begins, but we synchronize them during the merge as described below.
Done.
docs/tech-notes/range-merges.md, line 58 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Yeah, let's just make sure that gets added to the doc.
This is covered in the "transfer of power" section now!
docs/tech-notes/range-merges.md, line 62 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
A "3-node cluster with one node down" is a cluster that can't survive any failures. Framed that way, it doesn't seem like a problem to me that merges are disabled when the cluster is in a delicate state that can't survive a failure, since that state should always be remedied as quickly as possible.
This is now touched upon in the "unanimity" section.
docs/tech-notes/range-merges.md, line 100 at r1 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Yeah, definitely need to add a section on the merge queue.
Done.
docs/tech-notes/range-merges.md, line 129 at r1 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
I assume it's also possible that a different client had already asked P and Q to merge, and so a second/repeated request could cause PQ and R to merge?
Yep.
docs/tech-notes/range-merges.md, line 137 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I also don't know that we knowingly rely on this anywhere, but I would be scared to change it.
It's also nice for debuggability to have a consistent anchor key for each range.
Done.
docs/tech-notes/range-merges.md, line 147 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
CockroachDB isn't strictly serializable, but it's linearizable on each key, which is ultimately what counts here.
Done.
docs/tech-notes/range-merges.md, line 155 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
do you mean span latch manager? ;-)
Heh, thanks. Done.
docs/tech-notes/range-merges.md, line 158 at r1 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
The example below is already aligned. I assume you meant s/misaligned/aligned/?
Yep, thanks.
docs/tech-notes/range-merges.md, line 183 at r1 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
Incredibly minor, but 1, 2, 3, 1?
Done.
docs/tech-notes/range-merges.md, line 185 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Reminder to flesh out the "...".
Done.
docs/tech-notes/range-merges.md, line 190 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Leaving us hanging, eh?
Done.
docs/tech-notes/range-merges.md, line 196 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
This section is awesome. It wasn't clear to me why alignment was necessary. This makes it clear.
Glad to hear it. FYI I've updated the section so that it has more of a conclusion.
docs/tech-notes/range-merges.md, line 223 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Yes please!
Done.
Ping! I have some bandwidth to address review feedback this week and next, but after that my time will be much more limited.
@knz: would be curious about your review here.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale)
docs/tech-notes/range-merges.md, line 284 at r6 (raw file):
resolved synchronously with the commit of the merge transaction, we are guaranteed that, if we see an intent on a local range descriptor, it is because the merge has not yet been committed, and it is safe to load the
nit: the merge could've been committed, but "this replica" hasn't applied it yet, in which case we want to read the pre-intent descriptor. Maybe this is too far out in the weeds, but I feel that this could've gotten me to scratch my head if I were less familiar with matters.
docs/tech-notes/range-merges.md, line 356 at r6 (raw file):
claims to write all keys. **TODO(benesch,nvanbenschoten):** Actually, concurrent reads at lower timestamps
Heh, was just about to add this same comment. The subsume request populates the read timestamp cache at its timestamp, right? That would mean that reads below are fair game.
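The gap being pointed out can be illustrated with a toy read timestamp cache (hypothetical names, not CockroachDB code): recording Subsume's read at its timestamp only blocks *writes* below that timestamp; it does nothing to a *read* at a lower timestamp, which never consults the read side of the cache.

```python
# Toy read timestamp cache: writes must land above the highest recorded
# read of a key; reads are never checked against it.

class ReadTimestampCache:
    def __init__(self):
        self.high_water = {}  # key -> highest timestamp at which it was read

    def record_read(self, key, ts):
        self.high_water[key] = max(self.high_water.get(key, 0), ts)

    def write_allowed(self, key, ts):
        # A write must be above every recorded read of the key.
        return ts > self.high_water.get(key, 0)

cache = ReadTimestampCache()
cache.record_read("rhs_key", ts=100)  # Subsume "reads" the RHS at ts=100

print(cache.write_allowed("rhs_key", 50))   # False: writes below are blocked
print(cache.write_allowed("rhs_key", 150))  # True: must be blocked by other means
# A read at ts=50, however, never consults this cache at all, which is the
# hole being discussed: reads below Subsume's timestamp remain fair game.
```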
docs/tech-notes/range-merges.md, line 400 at r6 (raw file):
receiving replica, but there will be a replica of the RHS in the way—remember, this is guaranteed by replica set alignment. In fact, the snapshot could be advancing over several merges, in which case there will be several RHSes to
Even worse, over any combination of merges and splits
docs/tech-notes/range-merges.md, line 406 at r6 (raw file):
replica receives a snapshot that widens it, it can infer that a merge occured, and it simply subsumes all replicas that are overlapped by the snapshot in one shot. This requires the same intricate, delicate dance mentioned at the end of
I think you should give an idea of this case: replica P gets a snapshot that would subsume a right neighbor Q. But what if the snapshot is slightly stale and Q actually split off from a newer version of P? In that case, you can't subsume Q because that would remove a perfectly healthy replica. This won't happen because the original replica that was merged into P would block Q from ever getting onto that node. So add a forward-reference to the replicaGC section below to let readers know that the explanations are coming up.
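The basic containment rule a widening snapshot applies can be sketched in toy Python (hypothetical names, not CockroachDB code): only local replicas that fall entirely within the snapshot's new bounds may be subsumed; a replica that sticks out past the bounds means the snapshot cannot be applied yet.

```python
# Toy sketch of the widening-snapshot rule: which local replicas does a
# snapshot for P subsume? Only those fully covered by its new key span.

def replicas_to_subsume(snap_start, snap_end, local_replicas):
    """local_replicas: list of (range_id, start_key, end_key) on this store.

    Returns the overlapping replicas the snapshot would subsume, or raises
    if an overlapping replica extends past the snapshot bounds.
    """
    subsumed = []
    for range_id, start, end in local_replicas:
        if end <= snap_start or start >= snap_end:
            continue  # no overlap
        if snap_start <= start and end <= snap_end:
            subsumed.append(range_id)
        else:
            raise ValueError(f"replica {range_id} extends past snapshot bounds")
    return subsumed

# Snapshot widens P to cover [a, c); the local RHS replica Q covers [b, c)
# and is fully inside the new bounds, so it is subsumed in one shot.
print(replicas_to_subsume("a", "c", [("Q", "b", "c")]))  # ['Q']
```

The subtle hazard in the comment above (a stale snapshot meeting a Q that split off a newer P) is precisely what this containment check alone cannot rule out; it takes the replicaGC reasoning to show the bad interleaving never arises.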
Reviewed 1 of 2 files at r5, 1 of 1 files at r6.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale)
docs/tech-notes/range-merges.md, line 55 at r6 (raw file):
The merge is coordinated by the LHS. The coordinator begins by verifying that a) the two ranges are, in fact, adjacent, and b) that the replica sets of the two
I think also c) that the replicas are full, that is, they have received the underlying KV data for the range.
docs/tech-notes/range-merges.md, line 72 at r6 (raw file):
range _PQ_ already have all the constituent data, by virtue of having a copy of both _P_ and _Q_ before the merge begins. Note that _P_ and _Q_ do not need to form a consistent snapshot before the merge begins; they'll be appropriately
I do not know what the term "consistent snapshot" means.
docs/tech-notes/range-merges.md, line 95 at r6 (raw file):
At the time of writing, merges are only initiated by the merge queue, which is responsible both for locating ranges that are in need of a merge and aligning their replica sets before initating the merge.
*initiating
docs/tech-notes/range-merges.md, line 170 at r6 (raw file):
for details. Using the LHS to coordinate also yields nice symmetry for splits, where the
I'd put this paragraph first. Even if there was no key encoding oddity, this symmetry would be sufficient to motivate the choice.
docs/tech-notes/range-merges.md, line 198 at r6 (raw file):
Then the merge could be aborted if the merge transaction discovered that either of the implicated ranges did not match the corresponding descriptor in the AdminMerge request arguments, forming a sort of optimistic lock.
I would recommend the paragraph suggest passing the end key of the merge as an argument, instead of the descriptors. As you said before, the start key does not change, so the replica that receives the AdminMerge req knows it. Passing the end key signals intent (possibly from SQL, later) to form one unit of storage for a given key span. The merge queue could then be in charge of performing multiple merges, as many as necessary to coalesce the start/end key span into a single range.
This may race with splits, but defining the semantics this way would just mean the merge takes longer until all the splits, and then the resulting merges, are finalized.
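The proposed end-key semantics amount to "keep merging right until the end key is reached," which can be sketched in toy Python (hypothetical names, not part of any proposed API):

```python
# Toy sketch of end-key-based merge semantics: absorb right neighbors until
# the range starting at start_key extends to end_key.

def coalesce(ranges, start_key, end_key):
    """ranges: sorted list of (start, end) pairs tiling the keyspace."""
    out = []
    i = 0
    while i < len(ranges):
        s, e = ranges[i]
        if s == start_key:
            # Keep merging with the right neighbor until the span
            # [start_key, end_key) is a single range. Splits racing with
            # this just mean the loop runs longer.
            while e < end_key:
                i += 1
                e = ranges[i][1]
            out.append((s, e))
        else:
            out.append((s, e))
        i += 1
    return out

ranges = [("a", "c"), ("c", "f"), ("f", "k"), ("k", "z")]
print(coalesce(ranges, "a", "k"))  # [('a', 'k'), ('k', 'z')]
```

Each iteration of the inner loop stands in for one AdminMerge issued by the merge queue; a concurrent split only adds another neighbor to absorb, which is the "merge just takes longer" behavior described above.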
docs/tech-notes/range-merges.md, line 352 at r6 (raw file):
The Subsume request provides promise 1 by declaring that it reads and writes all addressable keys in the range. This is a bald-faced lie, as the Subsume request only reads one key and writes no keys, but it forces a flush of the command
translate the phrase "flush of the command queue" into something about the span latch manager
docs/tech-notes/range-merges.md, line 586 at r6 (raw file):
When store 2 comes back online, it will start catching up on missed messages. But notice how store 2 is considered to be a member of Q', because it was a
"store 2 is considered to be a member of Q'" - isn't this averted by the introduction of the range generation number?
docs/tech-notes/range-merges.md, line 800 at r6 (raw file):
into [150 lines of hard to follow code][code].
I think it would be good to explicitly remind the reader that this goroutine is spawned on every incoming request to the RHS; if the node holding the RHS goes down, then back up, the command hold is guaranteed to be cleared.
docs/tech-notes/range-merges.md, line 862 at r6 (raw file):
there could be a snapshot in flight that the LHS is entirely unaware of. So instead we require that replicas of the LHS in a merge are initialized on every store before the merge can begin.
Just for my education: what happens when a merge and an up/down-replication are issued concurrently? How do you ensure that the cardinality of the set of replicas is known and fixed for the duration of the merge?
Another thing. I learned that each replica has some local keys in addition to the logical KV space for the upper layers. This includes the range desc, transaction abort spans, etc.
When the RHS is subsumed, I expect the following to happen:
- not sure about the txn abort span, are their keys prefixed by the range ID in any way? Are they prefixed by the txn ID? Depending on how they're keyed, they may need to be migrated to the LHS. At the very least, this should be spelled out in the tech note.
- there must be a rocksdb delete event on the range desc of the RHS, otherwise that space (albeit tiny) will never be reclaimed. Is this done in the impl?
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale)
e0c9841 to 7b0d951 (force-push)
Release note: None
7b0d951 to ef6ded1 (force-push)
Re @knz:
- not sure about the txn abort span, are their keys prefixed by the range ID in any way? Are they prefixed by the txn ID? Depending on how they're keyed, they may need to be migrated to the LHS. At the very least, this should be spelled out in the tech note.
Good point. The abort span is indeed copied from RHS to LHS. I've added a bit to the merge description trigger about this.
- there must be a rocksdb delete event on the range desc of the RHS, otherwise that space (albeit tiny) will never be reclaimed. Is this done in the impl?
Yep, it's done in the merge transaction itself:
The standard KV operations that the merge transaction performs are:

- Reading the LHS descriptor and RHS descriptor, and verifying that their replica sets are aligned.
- Updating the local and meta copy of the LHS descriptor to reflect the widened end key.
- Deleting the local and meta copy of the RHS descriptor.
- Writing an entry to the `system.rangelog` table.
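A toy sketch of that descriptor bookkeeping may help. Everything below is hypothetical and heavily simplified — `RangeDescriptor` here is a stand-in for the real struct in `roachpb`, and the map is a stand-in for the meta2 index, which addresses each descriptor by its range's end key (which is why the LHS's old meta entry must be deleted along with the RHS's):

```go
package main

import "fmt"

// RangeDescriptor is a hypothetical, pared-down stand-in for CockroachDB's
// real descriptor, which also carries replica sets, generations, etc.
type RangeDescriptor struct {
	RangeID  int
	StartKey string
	EndKey   string
}

// meta is a toy stand-in for the meta2 index: end key -> descriptor.
type meta map[string]RangeDescriptor

// applyMergeTrigger mimics the merge transaction's descriptor updates:
// delete the RHS descriptor (so its footprint is reclaimed), widen the
// LHS's end key, and rewrite the LHS's meta entry under its new end key.
func applyMergeTrigger(m meta, lhs, rhs RangeDescriptor) RangeDescriptor {
	delete(m, rhs.EndKey)   // delete meta copy of RHS descriptor
	delete(m, lhs.EndKey)   // LHS is addressed by its end key, which changes
	lhs.EndKey = rhs.EndKey // widen LHS end key
	m[lhs.EndKey] = lhs     // write updated meta copy of LHS descriptor
	return lhs
}

func main() {
	lhs := RangeDescriptor{RangeID: 1, StartKey: "a", EndKey: "m"}
	rhs := RangeDescriptor{RangeID: 2, StartKey: "m", EndKey: "z"}
	m := meta{lhs.EndKey: lhs, rhs.EndKey: rhs}
	merged := applyMergeTrigger(m, lhs, rhs)
	fmt.Println(merged.StartKey, merged.EndKey, len(m)) // a z 1
}
```

In the real system these are ordinary KV reads and writes inside one transaction, so they conflict serializably with any concurrent descriptor update.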
More generally: I've addressed mostly everything! I'm going to hit the big green button on the strength of the earlier LGTMs, knowing full well there are probably a few nits remaining. I'm simply out of time. If you come across any problems, please just open a PR and go to town! I'm happy to review if you CC me, but don't feel obligated.
🍻
bors r+
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @a-robinson, @bdarnell, @knz, @petermattis, and @tbg)
docs/tech-notes/range-merges.md, line 55 at r6 (raw file):
Previously, knz (kena) wrote…
I think also c) that the replicas are full, that is, they have received the underlying KV data for the range.
In fact this is explicitly not checked here; that's what's meant below about how P and Q needn't form a consistent snapshot. It would probably be a worthwhile optimization to not attempt to merge if one replica is lagging behind, but no such optimization is performed today.
docs/tech-notes/range-merges.md, line 72 at r6 (raw file):
Previously, knz (kena) wrote…
I do not know that the term "consistent snapshot" means.
Reworded to be more clear!
docs/tech-notes/range-merges.md, line 95 at r6 (raw file):
Previously, knz (kena) wrote…
*initiating
Thanks, done.
docs/tech-notes/range-merges.md, line 170 at r6 (raw file):
Previously, knz (kena) wrote…
I'd put this paragraph first. Even if there was no key encoding oddity, this symmetry would be sufficient to motivate the choice.
Sure, done.
docs/tech-notes/range-merges.md, line 198 at r6 (raw file):
Previously, knz (kena) wrote…
I would recommend the paragraph to suggest passing the end key of the merge as argument, instead of the descriptors. As you said before, the start key does not change so the replica that receives the AdminMerge req knows it. Passing the end key signals intent (possibly from SQL, later) to form one unit of storage for a given key span. The merge queue could then be in charge of performing multiple merges, as much as necessary to coalesce the start/end key span into a single range.
This may race with splits, but defining the semantics this way would just mean the merge will take longer until all the splits, then the resolving merges, are finalized.
That makes sense, but it would require more than just the change you suggest of passing the end key instead of the full descriptor to `AdminMerge`. The `AdminMerge` request is dumb, and merges exactly what you ask it to, provided that the merge wouldn't violate correctness. It's the merge queue that decides what ranges make sense to merge and calls `AdminMerge` on them, and I think it's important that if the merge queue decides to merge P and Q because they're both nice and small and adjacent, then the `AdminMerge` request will bail if any of those decisions are based on invalid assumptions, like P or Q growing in size or splitting.

What you're asking for is an API to answer user requests like, "please merge everything from /Table/51/gorilla to /Table/52/monkey". That's an important request! And I agree we should handle that by repeatedly retrying merges of all ranges in that span, even if some automatic splits occur concurrently, until just one range remains (or until whatever the minimum number of ranges is remains). When the SQL to make such a request exists, it will be possible to build an implementation on top of an `AdminMerge` request that behaves like the one proposed in this section, because it can simply retry the merge requests until they succeed. But note that the converse is not true; it is not possible to implement a non-racy merge queue on top of an `AdminMerge` request that handles its own retries.
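The retry-until-coalesced behavior described here can be sketched in a few lines. This is purely illustrative — `cluster` and `mergeAdjacent` are invented stand-ins for issuing real `AdminMerge` RPCs over a span, and the `flaky` counter simulates a transient failure (e.g. a concurrent split invalidating the merge's assumptions):

```go
package main

import "fmt"

// cluster is a toy model of a key span: n interior split points => n+1 ranges.
type cluster struct {
	splits []string // sorted interior split points
	flaky  int      // number of simulated transient AdminMerge failures
}

// mergeAdjacent attempts one merge of the leftmost adjacent pair. It may
// fail transiently, the way a real AdminMerge bails when its assumptions
// (range bounds, sizes) turn out to be stale.
func (c *cluster) mergeAdjacent() (ok, done bool) {
	if len(c.splits) == 0 {
		return false, true
	}
	if c.flaky > 0 { // e.g. a concurrent split raced with us
		c.flaky--
		return false, false
	}
	c.splits = c.splits[1:]
	return true, len(c.splits) == 0
}

// mergeSpan retries until the span is one range, tolerating individual
// failures — the "keep retrying until just one range remains" behavior.
func mergeSpan(c *cluster) int {
	attempts := 0
	for {
		attempts++
		if _, done := c.mergeAdjacent(); done {
			return attempts
		}
	}
}

func main() {
	c := &cluster{splits: []string{"b", "c", "d"}, flaky: 2}
	fmt.Println(mergeSpan(c)) // 5: two failed attempts, then three merges
}
```

The point of the sketch is the layering: the retry loop lives *above* a dumb, bail-on-conflict merge request, which is what makes a non-racy merge queue possible underneath it.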
docs/tech-notes/range-merges.md, line 284 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: the merge could've been committed, but "this replica" hasn't applied it yet, in which case we want to read the pre-intent descriptor. Maybe this is too far out in the weeds, but I feel that this could've gotten me to scratch my head if I were less familiar with matters.
Good point. Done.
docs/tech-notes/range-merges.md, line 352 at r6 (raw file):
Previously, knz (kena) wrote…
translate the phrase "flush of the command queue" into something about the span latch manager
Done.
docs/tech-notes/range-merges.md, line 356 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Heh, was just about to add this same comment. The subsume request populates the read timestamp cache at its timestamp, right? That would mean that reads below are fair game.
Yep, seems scary.
docs/tech-notes/range-merges.md, line 400 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Even worse, over any combination of merges and splits
Done.
docs/tech-notes/range-merges.md, line 406 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I think you should give an idea of this case: replica P gets a snapshot that would subsume a right neighbor Q. But what if the snapshot is slightly stale and Q actually split off from a newer version of P? In that case, you can't subsume Q because that would remove a perfectly healthy replica. This won't happen because the original replica that was merged into P would block Q from ever getting onto that node. So add a forward-reference to the replicaGC section below to let readers know that the explanations are coming up.
Oh, hm, I was trying to avoid explaining or even motivating those hard-to-explain scenarios more than once, especially since the replica GC section has nice pictures. What I was trying to reference here was just the delicate synchronization to update internal bookkeeping. I've updated the wording to hopefully be a bit more clear.
docs/tech-notes/range-merges.md, line 586 at r6 (raw file):
Previously, knz (kena) wrote…
"store 2 is considered to be a member of Q'" - isn't this averted by the introduction of the range generation number?
Unfortunately not. When a range splits, its replica set is copied verbatim from its parent range, which in this case includes s2. At the time the merge and split transactions commit, it's not yet known that s2 has gone down; from the perspective of s1 and s4, s2 is just a little bit slow, so they don't see any need to downreplicate it away.
docs/tech-notes/range-merges.md, line 800 at r6 (raw file):
Previously, knz (kena) wrote…
I think it would be good to explicitly remind the reader that this goroutine is spawn on every incoming request to the RHS; if the node holding the RHS goes down, then back up, the command hold is guaranteed to be cleared.
I don't follow. This goroutine is only spawned once per merge transaction on the RHS leaseholder. (I've added a note to the doc to clarify this.) In fact, if the RHS leaseholder goes down and comes back up, we'll relaunch this goroutine before processing any requests!
docs/tech-notes/range-merges.md, line 862 at r6 (raw file):
Previously, knz (kena) wrote…
Just for my education: what happens when a merge and an up/down-replication are issued concurrently? How do you ensure that the cardinality of the set of replicas is known and fixed for the duration of the merge?
Up/down replication will need to modify the same range descriptor(s) that are implicated in the merge. This creates a serializability conflict, and so we're guaranteed that at least one of the transactions (either the merge or replication change) will fail. There's a slightly wordier explanation in the "Merge transaction" section.
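The "both touch the same descriptor, so one must fail" argument can be modeled with a simple optimistic version check. This is a hypothetical sketch, not CockroachDB code — the real guarantee comes from serializable KV transactions, but the effect is the same: whichever transaction commits second sees a changed descriptor and must retry:

```go
package main

import "fmt"

// descriptor is a toy range descriptor guarded by a version number.
type descriptor struct {
	version  int
	replicas []int
}

// txn records the version at which the descriptor was read, plus a write.
type txn struct {
	readVersion int
	write       func(*descriptor)
}

// tryCommit succeeds only if nobody rewrote the descriptor since it was
// read — a stand-in for a serializability check on the descriptor key.
func tryCommit(d *descriptor, t txn) bool {
	if d.version != t.readVersion {
		return false // conflict: the loser must retry from scratch
	}
	t.write(d)
	d.version++
	return true
}

func main() {
	d := &descriptor{version: 1, replicas: []int{1, 2, 3}}

	// A merge and a replication change both read the descriptor at version 1.
	merge := txn{readVersion: d.version, write: func(d *descriptor) { /* widen end key */ }}
	addReplica := txn{readVersion: d.version, write: func(d *descriptor) {
		d.replicas = append(d.replicas, 4)
	}}

	fmt.Println(tryCommit(d, addReplica)) // true: first writer wins
	fmt.Println(tryCommit(d, merge))      // false: descriptor changed underneath
}
```

Because the merge reads both descriptors up front, any replica-set change that lands mid-merge forces the merge (or the replication change) to abort, which is what pins the replica sets for the merge's duration.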
Bors crashed.

bors r+
24394: RFCs: add range merges RFC r=benesch a=benesch

This is very much a WIP. I wrote it mostly to collect my own thoughts on range merges. At this stage I'm most interested in getting buy-in on building a general-case prototype, as opposed to the special-cases I lay out in the alternatives section, but I'm of course eager to get feedback on the rest of this design.

33334: tech-notes: add note on range merges r=benesch a=benesch

Very much still in progress. Opening a PR in case folks want to follow along.

Release note: None

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
Build succeeded