
tech-notes: add note on range merges #33334

Merged 1 commit into cockroachdb:master on Apr 10, 2019

Conversation

benesch
Contributor

@benesch benesch commented Dec 22, 2018

Very much still in progress. Opening a PR in case folks want to follow along.

Release note: None

@benesch added the do-not-merge label (bors won't merge a PR with this label) on Dec 22, 2018
@cockroach-teamcity
Member

This change is Reviewable


By requiring replica set alignment, the merge operation is reduced to a
metadata update, albiet a tricky one, as the stores that will have a copy of
the merged range PQ already have all the constitudent data, by virtue of having
Contributor

s/constiduent/constituent

it is about to be subsumed and it is prohibited from serving any additional read
or write traffic. Only when the coordinator has received an acknowledgement from
_every_ replica of the RHS, indicating that no traffic is possibly being served
on the RHS, does the coordinator commit the merge transaction.
Contributor

Not a huge deal, but I take it that this means merges don't work on 3-node clusters with one node down, similar to how the lack of quiescence can cause massive performance hits in such clusters.

Contributor Author

Yep, that's right. There's a theoretical possibility that you could merge with a simple majority, but it didn't seem worth it. It would be massively complicated. Not sure if I'll have time to write it up formally, so let me know if you're curious and I can scrawl it down in a comment at least.
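
To make the unanimity requirement concrete, here is a minimal sketch in Go, assuming a hypothetical Replica handle and Subsume request rather than the real API:

```go
package merge

import (
	"context"
	"fmt"
)

// Replica is a hypothetical handle for one replica of the RHS. Subsume
// stands in for the real request that tells the replica to stop serving
// reads and writes and to acknowledge.
type Replica interface {
	Subsume(ctx context.Context) error
}

// waitForUnanimousSubsume sends Subsume to every RHS replica and returns
// only once all of them have acknowledged. If any replica is unreachable
// (say, because its node is down), the coordinator cannot commit the
// merge: unanimity, not quorum, is required.
func waitForUnanimousSubsume(ctx context.Context, replicas []Replica) error {
	errs := make(chan error, len(replicas))
	for _, r := range replicas {
		go func(r Replica) { errs <- r.Subsume(ctx) }(r)
	}
	for range replicas {
		if err := <-errs; err != nil {
			return fmt.Errorf("merge must abort: %w", err)
		}
	}
	return nil // safe to commit the merge transaction
}
```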


Notice how node 2 does not have a copy of Q, and node 3 does not have a copy of
P. These replica sets are considered "misaligned." Aligning them requires
rebalancing Q from node 3 to node 2, or rebalancing P from node 2 to node 3:
Contributor

I think it's worth calling out that this is done proactively by the LHS leaseholder in order to enable merges, even if the details of how are too implementation-specific.

Contributor Author

Yeah, definitely need to add a section on the merge queue.
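
A minimal sketch of the alignment check itself, under the simplifying assumption that a replica set is just a set of store IDs (the real merge queue logic is more involved):

```go
package merge

// replicaSetsAligned reports whether the LHS and RHS have replicas on
// exactly the same set of stores, which is the precondition for a merge.
// In the example above, P lives on stores {1, 2} and Q on stores {1, 3},
// so the sets are misaligned until Q is rebalanced from store 3 to store 2
// (or P from store 2 to store 3).
func replicaSetsAligned(lhsStores, rhsStores []int) bool {
	if len(lhsStores) != len(rhsStores) {
		return false
	}
	onLHS := make(map[int]struct{}, len(lhsStores))
	for _, s := range lhsStores {
		onLHS[s] = struct{}{}
	}
	for _, s := range rhsStores {
		if _, ok := onLHS[s]; !ok {
			return false
		}
	}
	return true
}
```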

command begins executing will always be the RHS, subsumed range.

It would perhaps have been wise to require that the AdminMerge specify the
descriptor of the intented subsumption target. It is, at present, possible for
Contributor

s/intented/intended

It would perhaps have been wise to require that the AdminMerge specify the
descriptor of the intented subsumption target. It is, at present, possible for
someone to send an admin merge to P, intending to merge P's right neighbor Q,
but Q can split in the meantime and so instead P merges with a different range.
Contributor

I assume it's also possible that a different client had already asked P and Q to merge, and so a second/repeated request could cause PQ and R to merge?
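
A sketch of the optimistic lock being proposed, with a hypothetical checkMergeTarget helper and a simplified descriptor type; nothing here mirrors the actual AdminMerge signature:

```go
package merge

import "errors"

// RangeDescriptor is a simplified stand-in for roachpb.RangeDescriptor.
type RangeDescriptor struct {
	RangeID  int64
	StartKey string
	EndKey   string
}

// checkMergeTarget is the proposed optimistic lock: the AdminMerge
// arguments would carry the descriptor of the intended subsumption target,
// and the merge transaction would abort if the RHS it actually finds does
// not match (e.g. because Q split in the meantime, or because an earlier
// merge request already consumed Q).
func checkMergeTarget(expected, actual RangeDescriptor) error {
	if expected != actual {
		return errors.New("merge aborted: RHS descriptor changed since AdminMerge was issued")
	}
	return nil
}
```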

the command queue, etc.).

There was, however, one knotty edge case that was not immediately eliminated by
transaction serializability. Suppose we have our standard misaligned replica
Contributor

The example below is already aligned. I assume you meant s/misaligned/aligned/?


In an unfortunate twist of fate, a rebalance of P from store 2 to store 3
begins at the same time as a merge of P and Q begins. Let's quickly cover the
valid outcomes of this race.
Contributor

These are the valid outcomes, but are they all possible? Are there any important mechanisms in place ensuring no other outcomes occur?

Contributor Author

Yes, they're all possible. Perhaps this section needs a bit more love. The idea is that either the rebalance commits first or the merge commits first. (One of those things must happen. I guess there are cases where the rebalance aborts or the merge aborts for unrelated reasons, but those don't matter.) Once you assume that the rebalance commits first, then the merge must abort. If you instead assume that the merge commits first, the rebalance must abort—but this case is broken into two because the rationale for why the abort happens depends on precisely when the merge commit occurs.

operation must abort, as it will be sending a preemptive snapshot for P
to store 3.

1. The merge commits and the rebalance fails.
Contributor

Incredibly minor, but 1, 2, 3, 1?

Collaborator

@petermattis petermattis left a comment

:lgtm: so far. @tbg and @bdarnell should give this a thorough read.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


docs/tech-notes/range-merges.md, line 62 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Not a huge deal, but I take it that this means merges don't work on 3-node clusters with one node down, similar to how the lack of quiescence can cause massive performance hits in such clusters.

I believe we can now quiesce when a node is down. This change was made in 2.1 and is powered by node liveness.


docs/tech-notes/range-merges.md, line 65 at r1 (raw file):

Like with splits, the merge transaction is committed with a special "commit
trigger" that instructs the receiving store to update its in-memory bookkeping

s/bookkeping/bookkeeping/g


docs/tech-notes/range-merges.md, line 100 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I think it's worth calling out that this is done proactively by the LHS leaseholder in order to enable merges, even if the details of how are too implementation-specific.

I think the details are interesting. If alignment is required, something needs to be powering it. What is that something? How does that mechanism interact with other rebalancing mechanisms?


docs/tech-notes/range-merges.md, line 185 at r1 (raw file):

  1. The merge commits and the rebalance fails.

The merge transaction uses internal commit triggers to...

Reminder to flesh out the ....


docs/tech-notes/range-merges.md, line 190 at r1 (raw file):

## Transfer of power

The trickiest part of a merge is

Leaving us hanging, eh?


docs/tech-notes/range-merges.md, line 196 at r1 (raw file):

## Missteps

### Misaligned replica sets

This section is awesome. It wasn't clear to me why alignment was necessary. This makes it clear.

Contributor

@bdarnell bdarnell left a comment

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


docs/tech-notes/range-merges.md, line 20 at r1 (raw file):

A range merge begins when two adjacent ranges are selected to be merged
together. For example, suppose our adjacent ranges are _Q_ and _R_, somewhere in

Nit: Use P and Q instead of Q and R for consistency with the rest of the doc.


docs/tech-notes/range-merges.md, line 50 at r1 (raw file):

metadata update, albiet a tricky one, as the stores that will have a copy of
the merged range PQ already have all the constitudent data, by virtue of having
a copy of both P and Q before the merge begins.

You discuss this a couple of paragraphs down but I think it's worth adding here that P and Q may not be a consistent snapshot of the combined range before the merge begins, but we synchronize them during the merge as described below.


docs/tech-notes/range-merges.md, line 58 at r1 (raw file):

Then, the coordinator needs to atomically move responsibility for the data in
the RHS to the LHS. This is tricky, as the lease on the LHS may be held by a
different store than the lease on the RHS, The coordinator notifies the RHS that

For future discussion/footnote: Did we consider transferring the RHS's lease to the LHS's leaseholder, and if so why did we choose this path?


docs/tech-notes/range-merges.md, line 62 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I believe we can now quiesce when a node is down. This change was made in 2.1 and is powered by node liveness.

In this case we can't simply make a liveness-based decision. We need to completely remove the range, which requires waiting for time-until-store-dead (and even then, I don't think we currently remove a dead replica unless there is a live one to replace it with, in the hopes that waiting for that replica to come back will be better than running with a lowered replication factor. I'm not sure that that's buying us anything, though).


docs/tech-notes/range-merges.md, line 137 at r1 (raw file):

an oddity of key encoding and range bounds. It is trivial for a range to send a
request to its right neighbor, as it simply addresses the request to its end
key, but difficult to send a request to its left neighbor.

This is also more consistent with splits, which modify the end key while leaving the start key alone. The start key of a range never changes after its creation, only the end key does. (I don't think we knowingly rely on this anywhere).

Member

@tbg tbg left a comment

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


docs/tech-notes/range-merges.md, line 33 at r1 (raw file):

to _Q_ as the subsuming range and _R_ as the subsumed range.

The merge is coordinated by the LHS. The coordinator begins by verifying that a)

The coordinator also aligns them if necessary.


docs/tech-notes/range-merges.md, line 48 at r1 (raw file):

By requiring replica set alignment, the merge operation is reduced to a
metadata update, albiet a tricky one, as the stores that will have a copy of

albeit


docs/tech-notes/range-merges.md, line 58 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

For future discussion/footnote: Did we consider transferring the RHS's lease to the LHS's leaseholder, and if so why did we choose this path?

We considered it, but it was difficult to achieve. The difficult part is making sure that no other node grabs the lease before the merge is complete or aborted. This required a mechanism that was basically identical to letting any node be the leaseholder in the first place. Or so I remember.


docs/tech-notes/range-merges.md, line 62 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

In this case we can't simply make a liveness-based decision. We need to completely remove the range, which requires waiting for time-until-store-dead (and even then, I don't think we currently remove a dead replica unless there is a live one to replace it with, in the hopes that waiting for that replica to come back will be better than running with a lowered replication factor. I'm not sure that that's buying us anything, though).

I think we could in theory truncate our Raft log such that the down node is either caught up to the "golden index" or cut off. In doing so, we know it can't come back and serve anything invalid. I don't think we should be in a rush to implement something like this, but it seems like an appealing option. I don't like how this turns log truncations into something we need to guarantee for correctness, though.


docs/tech-notes/range-merges.md, line 137 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This is also more consistent with splits, which modify the end key while leaving the start key alone. The start key of a range never changes after its creation, only the end key does. (I don't think we knowingly rely on this anywhere).

I also don't know that we knowingly rely on this anywhere, but I would be scared to change it.
It's also nice for debuggability to have a consistent anchor key for each range.


docs/tech-notes/range-merges.md, line 147 at r1 (raw file):

## Merge transaction

The merge transaction piggybacks on CockroachDB's strict serializability to

CockroachDB isn't strictly serializable, but it's linearizable on each key, which is ultimately what counts here.


docs/tech-notes/range-merges.md, line 155 at r1 (raw file):

additional code was needed to enforce this, as our standard transaction
conflict detection mechanisms kick in here (write intents, the timestamp cache,
the command queue, etc.).

do you mean span latch manager? ;-)


docs/tech-notes/range-merges.md, line 259 at r1 (raw file):

to store 2 so that store 2 can initialize its replica... and all that might
happen before store 2 manages to apply the merge command for PQ. When store 2
came back online.

Feels like the conclusion is missing. This seems unfortunate, but not the end of the world so far.

Contributor Author

@benesch benesch left a comment

Pushed a bunch of updates. Comments have only been sporadically addressed. I'll have to finish this up tomorrow.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


docs/tech-notes/range-merges.md, line 48 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

albeit

Done.


docs/tech-notes/range-merges.md, line 49 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/constiduent/constituent

Done.


docs/tech-notes/range-merges.md, line 65 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/bookkeping/bookkeeping/g

Done.


docs/tech-notes/range-merges.md, line 127 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/intented/intended

Done.


docs/tech-notes/range-merges.md, line 259 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Feels like the conclusion is missing. This seems unfortunate, but not the end of the world so far.

Yep, conclusion is added now.

Member

@tbg tbg left a comment

Reviewed 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


docs/tech-notes/range-merges.md, line 48 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Done.

You sure?


docs/tech-notes/range-merges.md, line 223 at r2 (raw file):

  If the merge txn aborts, the RHS processes the blocked requests.
* Determining when the merge txn is actually committed is really, really hard
  thanks to txn record GC. TODO: Add a subtle complexity section about this?

Yes please!

Contributor Author

@benesch benesch left a comment

More updates, but still a lot of feedback to address.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


docs/tech-notes/range-merges.md, line 48 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You sure?

Oops, guess I didn't push my last rev last night. Done for real now.

@benesch benesch requested a review from a team January 3, 2019 21:29
@benesch
Contributor Author

benesch commented Jan 3, 2019

The general structure is done now, I think. Just three more sections to flesh out.

Contributor

@bdarnell bdarnell left a comment

Reviewed 2 of 2 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


docs/tech-notes/range-merges.md, line 58 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We considered it, but it was difficult to achieve. The difficult part is making sure that no other node grabs the lease before the merge is complete or aborted. This required a mechanism that was basically identical to letting any node be the leaseholder in the first place. Or so I remember.

Yeah, let's just make sure that gets added to the doc.


docs/tech-notes/range-merges.md, line 62 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Yep, that's right. There's a theoretical possibility that you could merge with a simple majority, but it didn't seem worth it. It would be massively complicated. Not sure if I'll have time to write it up formally, so let me know if you're curious and I can scrawl it down in a comment at least.

A "3-node cluster with one node down" is a cluster that can't survive any failures. Framed that way, it doesn't seem like a problem to me that merges are disabled when the cluster is in a delicate state that can't survive a failure, since that state should always be remedied as quickly as possible.

@benesch
Contributor Author

benesch commented Jan 7, 2019 via email

@benesch
Contributor Author

benesch commented Jan 7, 2019

Just as an FYI, I plan to finish this up this week. I'm not sure exactly when I'll have time yet, though.

@benesch force-pushed the tn-range-merges branch 3 times, most recently from c9e3b3f to 6885ebf on January 7, 2019 22:11
Contributor Author

@benesch benesch left a comment

Ok, this is pretty much done. Just a heads up I haven't copyedited yet, so apologies for any typos/weirdly worded sentences. I think I've covered pretty much all of the complexity I encountered though!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


docs/tech-notes/range-merges.md, line 33 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

The coordinator also aligns them if necessary.

I don't consider the merge queue to be part of the coordinator; when I think of the coordinator I just think of the stuff inside AdminMerge. E.g. if we add an AdminMerge RPC, there will be the client who aligns the replica sets, then sends the RPC, and then the "coordinator" is the thing that actually handles the AdminMerge RPC.

FWIW I added a note to this section about how the merge queue typically does the alignment.


docs/tech-notes/range-merges.md, line 50 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

You discuss this a couple of paragraphs down but I think it's worth adding here that P and Q may not be a consistent snapshot of the combined range before the merge begins, but we synchronize them during the merge as described below.

Done.


docs/tech-notes/range-merges.md, line 58 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yeah, let's just make sure that gets added to the doc.

This is covered in the "transfer of power" section now!


docs/tech-notes/range-merges.md, line 62 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

A "3-node cluster with one node down" is a cluster that can't survive any failures. Framed that way, it doesn't seem like a problem to me that merges are disabled when the cluster is in a delicate state that can't survive a failure, since that state should always be remedied as quickly as possible.

This is now touched upon in the "unanimity" section.


docs/tech-notes/range-merges.md, line 100 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Yeah, definitely need to add a section on the merge queue.

Done.


docs/tech-notes/range-merges.md, line 129 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I assume it's also possible that a different client had already asked P and Q to merge, and so a second/repeated request could cause PQ and R to merge?

Yep.


docs/tech-notes/range-merges.md, line 137 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I also don't know that we knowingly rely on this anywhere, but I would be scared to change it.
It's also nice for debuggability to have a consistent anchor key for each range.

Done.


docs/tech-notes/range-merges.md, line 147 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

CockroachDB isn't strictly serializable, but it's linearizable on each key, which is ultimately what counts here.

Done.


docs/tech-notes/range-merges.md, line 155 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

do you mean span latch manager? ;-)

Heh, thanks. Done.


docs/tech-notes/range-merges.md, line 158 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

The example below is already aligned. I assume you meant s/misaligned/aligned/?

Yep, thanks.


docs/tech-notes/range-merges.md, line 183 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Incredibly minor, but 1, 2, 3, 1?

Done.


docs/tech-notes/range-merges.md, line 185 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Reminder to flesh out the ....

Done.


docs/tech-notes/range-merges.md, line 190 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Leaving us hanging, eh?

Done.


docs/tech-notes/range-merges.md, line 196 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This section is awesome. It wasn't clear to me why alignment was necessary. This makes it clear.

Glad to hear it. FYI I've updated the section so that it has more of a conclusion.


docs/tech-notes/range-merges.md, line 223 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Yes please!

Done.

@benesch changed the title from "[wip] tech-notes: add note on range merges" to "tech-notes: add note on range merges" on Jan 7, 2019
@benesch removed the do-not-merge label (bors won't merge a PR with this label) on Jan 7, 2019
@benesch
Contributor Author

benesch commented Jan 10, 2019

Ping! I have some bandwidth to address review feedback this week and next, but after that my time will be much more limited.

Member

@tbg tbg left a comment

:lgtm_strong:

thanks a lot!

@knz: would be curious about your review here.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)


docs/tech-notes/range-merges.md, line 284 at r6 (raw file):

     resolved synchronously with the commit of the merge transaction, we are
     guaranteed that, if we see an intent on a local range descriptor, it is
     because the merge has not yet been committed, and it is safe to load the

nit: the merge could've been committed, but "this replica" hasn't applied it yet, in which case we want to read the pre-intent descriptor. Maybe this is too far out in the weeds, but I feel that this could've gotten me to scratch my head if I were less familiar with matters.


docs/tech-notes/range-merges.md, line 356 at r6 (raw file):

claims to write all keys.

**TODO(benesch,nvanbenschoten):** Actually, concurrent reads at lower timestamps

Heh, was just about to add this same comment. The subsume request populates the read timestamp cache at its timestamp, right? That would mean that reads below are fair game.


docs/tech-notes/range-merges.md, line 400 at r6 (raw file):

receiving replica, but there will be a replica of the RHS in the way—remember,
this is guaranteed by replica set alignment. In fact, the snapshot could be
advancing over several merges, in which case there will be several RHSes to

Even worse, over any combination of merges and splits


docs/tech-notes/range-merges.md, line 406 at r6 (raw file):

replica receives a snapshot that widens it, it can infer that a merge occured,
and it simply subsumes all replicas that are overlapped by the snapshot in one
shot. This requires the same intricate, delicate dance mentioned at the end of

I think you should give an idea of this case: replica P gets a snapshot that would subsume a right neighbor Q. But what if the snapshot is slightly stale and Q actually split off from a newer version of P? In that case, you can't subsume Q because that would remove a perfectly healthy replica. This won't happen because the original replica that was merged into P would block Q from ever getting onto that node. So add a forward-reference to the replicaGC section below to let readers know that the explanations are coming up.

Contributor

@knz knz left a comment

Reviewed 1 of 2 files at r5, 1 of 1 files at r6.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)


docs/tech-notes/range-merges.md, line 55 at r6 (raw file):

The merge is coordinated by the LHS. The coordinator begins by verifying that a)
the two ranges are, in fact, adjacent, and b) that the replica sets of the two

I think also c) that the replicas are full, that is, they have received the underlying KV data for the range.


docs/tech-notes/range-merges.md, line 72 at r6 (raw file):

range _PQ_ already have all the constituent data, by virtue of having a copy of
both _P_ and _Q_ before the merge begins. Note that _P_ and _Q_ do not need to
form a consistent snapshot before the merge begins; they'll be appropriately

I do not know what the term "consistent snapshot" means.


docs/tech-notes/range-merges.md, line 95 at r6 (raw file):

At the time of writing, merges are only initiated by the merge queue, which is
responsible both for locating ranges that are in need of a merge and aligning
their replica sets before initating the merge.

*initiating


docs/tech-notes/range-merges.md, line 170 at r6 (raw file):

for details.

Using the LHS to coordinate also yields nice symmetry for splits, where the

I'd put this paragraph first. Even if there was no key encoding oddity, this symmetry would be sufficient to motivate the choice.


docs/tech-notes/range-merges.md, line 198 at r6 (raw file):

Then the merge could be aborted if the merge transaction discovered that either
of the implicated ranges did not match the corresponding descriptor in the
AdminMerge request arguments, forming a sort of optimistic lock.

I would recommend that the paragraph suggest passing the end key of the merge as an argument, instead of the descriptors. As you said before, the start key does not change, so the replica that receives the AdminMerge req knows it. Passing the end key signals intent (possibly from SQL, later) to form one unit of storage for a given key span. The merge queue could then be in charge of performing multiple merges, as much as necessary to coalesce the start/end key span into a single range.

This may race with splits, but defining the semantics this way would just mean the merge will take longer until all the splits, then the resolving merges, are finalized.


docs/tech-notes/range-merges.md, line 352 at r6 (raw file):

The Subsume request provides promise 1 by declaring that it reads and writes all
addressable keys in the range. This is a bald-faced lie, as the Subsume request
only reads one key and writes no keys, but it forces a flush of the command

translate the phrase "flush of the command queue" into something about the span latch manager


docs/tech-notes/range-merges.md, line 586 at r6 (raw file):

When store 2 comes back online, it will start catching up on missed messages.
But notice how store 2 is considered to be a member of Q', because it was a

"store 2 is considered to be a member of Q'" - isn't this averted by the introduction of the range generation number?


docs/tech-notes/range-merges.md, line 800 at r6 (raw file):

into [150 lines of hard to follow code][code].

I think it would be good to explicitly remind the reader that this goroutine is spawned on every incoming request to the RHS; if the node holding the RHS goes down, then back up, the command hold is guaranteed to be cleared.


docs/tech-notes/range-merges.md, line 862 at r6 (raw file):

there could be a snapshot in flight that the LHS is entirely unaware of. So
instead we require that replicas of the LHS in a merge are initialized on every
store before the merge can begin.

Just for my education: what happens when a merge and an up/down-replication are issued concurrently? How do you ensure that the cardinality of the set of replicas is known and fixed for the duration of the merge?

Contributor

@knz knz left a comment

Another thing. I learned that each replica has some local keys in addition to the logical KV space for the upper layers. This includes the range desc, transaction abort spans, etc.

When the RHS is subsumed, I expect the following to happen:

  1. not sure about the txn abort span, are their keys prefixed by the range ID in any way? Are they prefixed by the txn ID? Depending on how they're keyed, they may need to be migrated to the LHS. At the very least, this should be spelled out in the tech note.
  2. there must be a rocksdb delete event on the range desc of the RHS, otherwise that space (albeit tiny) will never be reclaimed. Is this done in the impl?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)

@benesch force-pushed the tn-range-merges branch 3 times, most recently from e0c9841 to 7b0d951 on April 10, 2019 21:40
Contributor Author

@benesch benesch left a comment

Re @knz:

  1. not sure about the txn abort span, are their keys prefixed by the range ID in any way? Are they prefixed by the txn ID? Depending on how they're keyed, they may need to be migrated to the LHS. At the very least, this should be spelled out in the tech note.

Good point. The abort span is indeed copied from RHS to LHS. I've added a bit to the merge description trigger about this.

  2. there must be a rocksdb delete event on the range desc of the RHS, otherwise that space (albeit tiny) will never be reclaimed. Is this done in the impl?

Yep, it's done in the merge transaction itself:

The standard KV operations that the merge transaction performs are:

  • Reading the LHS descriptor and RHS descriptor, and verifying that their
    replica sets are aligned.
  • Updating the local and meta copy of the LHS descriptor to reflect
    the widened end key.
  • Deleting the local and meta copy of the RHS descriptor.
  • Writing an entry to the system.rangelog table.
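
A rough sketch of those steps, using a hypothetical Txn interface and simplified key names in place of the real KV client and key layout:

```go
package merge

import "context"

// Txn is a hypothetical stand-in for the KV transaction API used by the
// merge transaction; the key layout (local vs. meta descriptor copies,
// the rangelog) is simplified to plain strings.
type Txn interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Put(ctx context.Context, key string, value []byte) error
	Del(ctx context.Context, key string) error
}

// runMergeTxnBody performs the plain KV portion of the merge transaction
// listed above. The alignment check and the commit trigger are elided.
func runMergeTxnBody(ctx context.Context, txn Txn, lhsID, rhsID string, widenedLHSDesc, rangelogEntry []byte) error {
	// 1. Read the LHS and RHS descriptors (alignment verification elided).
	for _, id := range []string{lhsID, rhsID} {
		if _, err := txn.Get(ctx, "local-desc/"+id); err != nil {
			return err
		}
	}
	// 2. Update the local and meta copies of the LHS descriptor to
	//    reflect the widened end key.
	for _, key := range []string{"local-desc/" + lhsID, "meta-desc/" + lhsID} {
		if err := txn.Put(ctx, key, widenedLHSDesc); err != nil {
			return err
		}
	}
	// 3. Delete the local and meta copies of the RHS descriptor.
	for _, key := range []string{"local-desc/" + rhsID, "meta-desc/" + rhsID} {
		if err := txn.Del(ctx, key); err != nil {
			return err
		}
	}
	// 4. Record the merge in system.rangelog.
	return txn.Put(ctx, "rangelog/"+lhsID, rangelogEntry)
}
```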

More generally: I've addressed mostly everything! I'm going to hit the big green button on the strength of the earlier LGTMs, knowing full well there are probably a few nits remaining. I'm simply out of time. If you come across any problems, please just open a PR and go to town! I'm happy to review if you CC me, but don't feel obligated.

🍻

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @a-robinson, @bdarnell, @knz, @petermattis, and @tbg)


docs/tech-notes/range-merges.md, line 55 at r6 (raw file):

Previously, knz (kena) wrote…

I think also c) that the replicas are full, that is, they have received the underlying KV data for the range.

In fact this is explicitly not checked here; that's what's meant below about how P and Q needn't form a consistent snapshot. It would probably be a worthwhile optimization to not attempt to merge if one replica is lagging behind, but no such optimization is performed today.


docs/tech-notes/range-merges.md, line 72 at r6 (raw file):

Previously, knz (kena) wrote…

I do not know what the term "consistent snapshot" means.

Reworded to be more clear!


docs/tech-notes/range-merges.md, line 95 at r6 (raw file):

Previously, knz (kena) wrote…

*initiating

Thanks, done.


docs/tech-notes/range-merges.md, line 170 at r6 (raw file):

Previously, knz (kena) wrote…

I'd put this paragraph first. Even if there was no key encoding oddity, this symmetry would be sufficient to motivate the choice.

Sure, done.


docs/tech-notes/range-merges.md, line 198 at r6 (raw file):

Previously, knz (kena) wrote…

I would recommend that the paragraph suggest passing the end key of the merge as an argument, instead of the descriptors. As you said before, the start key does not change, so the replica that receives the AdminMerge req knows it. Passing the end key signals intent (possibly from SQL, later) to form one unit of storage for a given key span. The merge queue could then be in charge of performing multiple merges, as much as necessary to coalesce the start/end key span into a single range.

This may race with splits, but defining the semantics this way would just mean the merge will take longer until all the splits, then the resolving merges, are finalized.

That makes sense, but it would require more than just the change you suggest of passing the end key instead of the full descriptor to AdminMerge. The AdminMerge request is dumb, and merges exactly what you ask it to, provided that the merge wouldn't violate correctness. It's the merge queue that decides what ranges make sense to merge and calls AdminMerge on them—and I think it's important that if the merge queue decides to merge P and Q because they're both nice and small and adjacent, then the AdminMerge request will bail if any of those decisions are based on invalid assumptions, like P or Q growing in size or splitting.

What you're asking for is an API to answer user requests like, "please merge everything from /Table/51/gorilla to /Table/52/monkey". That's an important request! And I agree we should handle that by repeatedly retrying merges of all ranges in that span, even if some automatic splits occur concurrently, until just one range remains (or until the minimum possible number of ranges remains). When the SQL to make such a request exists, it will be possible to build an implementation on top of an AdminMerge request that behaves like the one proposed in this section, because it can simply retry the merge requests until they succeed. But note that the converse is not true; it is not possible to implement a non-racy merge queue on top of an AdminMerge request that handles its own retries.
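
A sketch of that client-level retry loop, with hypothetical adminMerge and rangeCount callbacks standing in for whatever API eventually exists:

```go
package merge

import (
	"context"
	"time"
)

// mergeSpan sketches the client-level API discussed above: repeatedly
// merge ranges until the span [start, end) is served by a single range.
// Individual merges may fail when they race with splits or other merges;
// we simply retry, so the loop converges even under concurrent splits.
func mergeSpan(
	ctx context.Context,
	start, end string,
	adminMerge func(ctx context.Context, key string) error,
	rangeCount func(ctx context.Context, start, end string) (int, error),
) error {
	for {
		n, err := rangeCount(ctx, start, end)
		if err != nil {
			return err
		}
		if n <= 1 {
			return nil // the span is a single range (or empty)
		}
		if err := adminMerge(ctx, start); err != nil {
			// Expected under races; back off briefly and try again.
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(100 * time.Millisecond):
			}
		}
	}
}
```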


docs/tech-notes/range-merges.md, line 284 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: the merge could've been committed, but "this replica" hasn't applied it yet, in which case we want to read the pre-intent descriptor. Maybe this is too far out in the weeds, but I feel that this could've gotten me to scratch my head if I were less familiar with matters.

Good point. Done.


docs/tech-notes/range-merges.md, line 352 at r6 (raw file):

Previously, knz (kena) wrote…

translate the phrase "flush of the command queue" into something about the span latch manager

Done.


docs/tech-notes/range-merges.md, line 356 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Heh, was just about to add this same comment. The subsume request populates the read timestamp cache at its timestamp, right? That would mean that reads below are fair game.

Yep, seems scary.


docs/tech-notes/range-merges.md, line 400 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Even worse, over any combination of merges and splits

Done.


docs/tech-notes/range-merges.md, line 406 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I think you should give an idea of this case: replica P gets a snapshot that would subsume a right neighbor Q. But what if the snapshot is slightly stale and Q actually split off from a newer version of P? In that case, you can't subsume Q because that would remove a perfectly healthy replica. This won't happen because the original replica that was merged into P would block Q from ever getting onto that node. So add a forward-reference to the replicaGC section below to let readers know that the explanations are coming up.

Oh, hm, I was trying to avoid explaining or even motivating those hard-to-explain scenarios more than once, especially since the replica GC section has nice pictures. What I was trying to reference here was just the delicate synchronization to update internal bookkeeping. I've updated the wording to hopefully be a bit more clear.


docs/tech-notes/range-merges.md, line 586 at r6 (raw file):

Previously, knz (kena) wrote…

"store 2 is considered to be a member of Q'" - isn't this averted by the introduction of the range generation number?

Unfortunately not. When a range splits, its replica set is copied verbatim from its parent range, which in this case includes s2. At the time the merge and split transactions commit, it's not yet known that s2 has gone down; from the perspective of s1 and s4, s2 is just a little bit slow, so they don't see any need to downreplicate it away.


docs/tech-notes/range-merges.md, line 800 at r6 (raw file):

Previously, knz (kena) wrote…

I think it would be good to explicitly remind the reader that this goroutine is spawned on every incoming request to the RHS; if the node holding the RHS goes down, then back up, the command hold is guaranteed to be cleared.

I don't follow. This goroutine is only spawned once per merge transaction on the RHS leader. (I've added a note to the doc to clarify this.) In fact if the RHS leaseholder goes down and comes back up, we'll relaunch this goroutine before processing any requests!
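
A sketch of that hold, assuming a hypothetical mergeComplete channel closed by the watcher goroutine once the transaction's outcome is known:

```go
package merge

import "context"

// rhsReplica sketches the RHS leaseholder's hold on incoming requests
// while a merge is in flight. A single watcher goroutine, spawned once per
// merge transaction, closes mergeComplete when the transaction's outcome
// (commit or abort) is known.
type rhsReplica struct {
	mergeComplete chan struct{} // nil when no merge is in progress
}

// maybeWaitForMerge blocks an incoming request until the merge outcome is
// known. If the merge aborted, the request proceeds normally; if it
// committed, the replica is subsumed and the request will be rerouted. If
// the node restarts, the hold is re-established before any requests are
// processed, so nothing can slip through.
func (r *rhsReplica) maybeWaitForMerge(ctx context.Context) error {
	if r.mergeComplete == nil {
		return nil // no merge in progress
	}
	select {
	case <-r.mergeComplete:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```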


docs/tech-notes/range-merges.md, line 862 at r6 (raw file):

Previously, knz (kena) wrote…

Just for my education: what happens when a merge and an up/down-replication are issued concurrently? How do you ensure that the cardinality of the set of replicas is known and fixed for the duration of the merge?

Up/down replication will need to modify the same range descriptor(s) that are implicated in the merge. This creates a serializability conflict, and so we're guaranteed that at least one of the transactions (either the merge or replication change) will fail. There's a slightly wordier explanation in the "Merge transaction" section.
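
A toy illustration of why the two transactions cannot both commit, using a hypothetical conditional put (the real conflict arises through write intents on the descriptor keys):

```go
package merge

import "errors"

// errConditionFailed mimics a failed conditional put: the stored value no
// longer matches what the writer expected.
var errConditionFailed = errors.New("conditional put: unexpected existing value")

// cput is a toy compare-and-swap over an in-memory "descriptor store".
// Both the merge transaction and a concurrent replica-change transaction
// rewrite the same range descriptor key with a condition on its old value,
// so at most one of the two can succeed; the loser sees errConditionFailed
// and must abort or retry against the new descriptor.
func cput(store map[string][]byte, key string, expected, updated []byte) error {
	if string(store[key]) != string(expected) {
		return errConditionFailed
	}
	store[key] = updated
	return nil
}
```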

@benesch
Contributor Author

benesch commented Apr 10, 2019

Bors crashed.

bors r+

craig bot pushed a commit that referenced this pull request Apr 10, 2019
24394: RFCs: add range merges RFC r=benesch a=benesch

This is very much a WIP. I wrote it mostly to collect my own thoughts on range merges. At this stage I'm most interested in getting buy-in on building a general-case prototype, as opposed to the special-cases I lay out in the alternatives section, but I'm of course eager to get feedback on the rest of this design.

33334: tech-notes: add note on range merges r=benesch a=benesch

Very much still in progress. Opening a PR in case folks want to follow along.

Release note: None

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
@craig
Contributor

craig bot commented Apr 10, 2019

Build succeeded

@craig craig bot merged commit ef6ded1 into cockroachdb:master Apr 10, 2019
@benesch benesch deleted the tn-range-merges branch June 21, 2019 22:00