
RFCs: add range merges RFC #24394

Merged
merged 2 commits on Apr 10, 2019

Conversation

benesch
Contributor

@benesch benesch commented Apr 1, 2018

This is very much a WIP. I wrote it mostly to collect my own thoughts on range merges. At this stage I'm most interested in getting buy-in on building a general-case prototype, as opposed to the special-cases I lay out in the alternatives section, but I'm of course eager to get feedback on the rest of this design.

@benesch benesch requested review from bdarnell, tbg and petermattis April 1, 2018 03:04
@benesch benesch requested a review from a team as a code owner April 1, 2018 03:04
@cockroach-teamcity
Member

This change is Reviewable

@benesch
Contributor Author

benesch commented Apr 1, 2018

Just stumbled across #2433. Some of the XXXs I left for myself look to be answered there.

@tbg
Member

tbg commented Apr 1, 2018

Why [wip][wip]? To really drive the wip'ness home 😃?

Could you discuss the thundering herd? I.e. if we have 1000 adjacent empty ranges and merges suddenly become available, what will happen?

I wrote a long comment about the colocation of replicas, apologies for the wall of text. In the absence of brain farts on my part, though, I think this is all workable. Let me know what you think (happy to chat offline if it's too confused).


Review status: 0 of 1 files reviewed at latest revision, all discussions resolved, all commit checks successful.


docs/RFCS/20180330_range_merges.md, line 21 at r1 (raw file):

This causes several problems, ordered from least to most important:

  1. **User confusion.** Creating a table introduces new ranges, but dropping a

There's also the real point that you can't change the range size and have it be applied, unless it happens to be smaller than before. And then you're locked into that.


docs/RFCS/20180330_range_merges.md, line 26 at r1 (raw file):

  2. **Bookkeeping overhead.** Every range requires a fixed amount of metadata
     and must be periodically scanned by the queues (XXX is this true?).

Yeah this is true, though holding all the ranges in memory is likely the bigger deal.


docs/RFCS/20180330_range_merges.md, line 40 at r1 (raw file):

undo button, we're currently unwilling to introduce load-based splitting—an
important feature to unlock performance on skewed workloads—until we can clean
up after ourselves when the load disappears.

Queue-like workloads (which are common) are also a problem because they leave behind an ever-growing tail of empty ranges and that can really mess with latencies as these empty ranges tend to get queried.


docs/RFCS/20180330_range_merges.md, line 65 at r1 (raw file):

MERGE AT returns an error unless x is a split point created via a prior SPLIT AT. Additionally, it does not guarantee that the two ranges on either side of x are actually merged; rather it lifts the restriction that they must be separate. For example, ...


docs/RFCS/20180330_range_merges.md, line 80 at r1 (raw file):

A merge of R and its right neighbor S


docs/RFCS/20180330_range_merges.md, line 95 at r1 (raw file):

have a special replicated range-ID local key set that indicates it is not to be
merged into the range to its left. I'll hereafter refer to this key as the
"sticky" bit.

The sticky bit will be inserted into the range split transaction, right? That is you'll plumb it down into AdminSplit(sticky=true). (That makes sense since if you're not doing it atomically, you have to teach the split queue to be sensitive to it, etc).


docs/RFCS/20180330_range_merges.md, line 101 at r1 (raw file):

command to allow the merge queue to proceed with its evaluation.

XXX: We might want to run a migration in 2.1 to backfill the sticky bit. It's

I think we may want to introduce merges behind a cluster setting which defaults to false for clusters that are not bootstrapped into 2.1 or higher, and have a note in the docs that tells operators to activate it and re-run any manual splits they care about. (And periodically log at Warning until the setting is flipped; nobody should leave it off).
This is a bit of work, but it'd be a shame if we locked existing clusters into their potentially horribly laid out key space. Or, of course, we provide a tool that undoes all automatic splits and we encourage folks to use it after the upgrade. But I'm also worried about the size of the set-sticky migration; it'd have to write a whole boatload of keys and could run for a long time. If it weren't for that concern, the second solution seems better.


docs/RFCS/20180330_range_merges.md, line 129 at r1 (raw file):

Ideally we'll just reuse the split queue's logic here, conjuring up a range
to ask about by summing together _Q_ and _R_'s range stats. _L<sub>Q</sub>_
responds immediately to _L<sub>Q</sub>_ with whether it will proceed with the merge.

I think you mean L_R here, but I'm actually confused about everything after this sentence. It looks like you put R in charge of subsuming its left neighbor, but it works the other way around (so that a range's start key never changes through splits or merges, which is important for the sticky bit). So you want the lease at Q and have it execute an AdminMerge on Q that eats R. So the way I imagine this is:

  1. leaseholder of R sends SuggestMerge to Q
  2. on success, leaseholder of R colocates R with Q and then
  3. transfers its lease to the replica colocated with leaseholder of Q
  4. noticing that the replica+lease colocation is complete, leaseholder of Q invokes AdminMerge

docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file):
I'm pretty sure AdminMerge was never completely correct. The main concern is that R is autonomous and hard to lock down while you need it stalled. For starters, you have to prove that when you call AdminMerge, the replica configuration doesn't change out from under you. If it does, you've lost - you can't merge ranges that aren't colocated and things will explode.

I don't actually think that can happen even today because the merge transaction writes intents on both range descriptors, and any replication change needs to write an intent there as well. So that base is covered: you know that if the replicas are colocated initially, they are colocated when your merge commits.

For the leaseholders, it's trickier. If the leaseholders are colocated at the time of the merge, then there can be a correct merge (I'm also pretty sure that when you add copious amounts of testing you'll find a bug there, but that's because it's been a long time since we've even cared about making this correct). But what keeps R's lease from moving elsewhere? Nothing at the moment, and so what can happen is that you try to physically merge the ranges (which Q is in charge of), but now you notice that your replica of R isn't actually the one in charge of writing and so reads and writes might be going into R as you try to hold things still for a moment. That creates inconsistencies.

More precisely, we need

on the lease holding node, in the interval [leaseholder of Q evaluates the commit, leaseholder of Q successfully applies the commit], the leaseholder of R is colocated.

I think something that might work here is that you use an intent in the merge transaction again, but this time it's to collide with anyone trying to change the lease of R. Let's assume for simplicity that anyone trying to perform a lease change on R (i.e. transfer or taking over the lease after expiration) has to check the replicated range descriptor key of R first. If they see an intent there, they have to abort it or they're not allowed to go ahead. If they don't see an intent, they can get the lease.

The logic error here is that to read the key on our own range, someone needs to have the lease in the first place. This is clearly not going to be the case if we're trying to get the lease, so I think we have to flip things around: you get the lease unconditionally, and when you have it, you check the range descriptor key. If there is a (merge) intent, you try to abort the merge transaction (which is not anchored to the local range R, but to Q, and so you're not really using the lease in doing so). If that succeeds, you go ahead and use your lease. If you discover that the transaction has committed (you have to check the range descriptor of Q as well), you forfeit the lease and queue yourself for replica GC.

Note that you have to check Q's descriptor because there's a race: you might find an intent and prepare to push; in the meantime the transaction commits, resolves said intent, and removes the txn record; you push and create an aborted txn record. This race is not a correctness problem because when you try to abort the intent you find that it has since turned into a value, but it would be fatal if we assumed that there wasn't a merge in progress in the first place. So we must check that we're really aborting the intent (but that again is racy because multiple pushers could have tried at the same time), or that the LHS doesn't contain our range (not racy).
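
To make the flipped-around check concrete, here is a minimal sketch of the control flow. All type, method, and function names below are hypothetical stand-ins, not actual CockroachDB APIs:

```go
package lease

// Hypothetical sketch: acquire the lease unconditionally, then inspect the
// local range descriptor key before using it.

type mergeIntent struct{}

type replica interface {
	// descriptorIntent returns the intent on our RangeDescriptorKey, if any.
	descriptorIntent() (*mergeIntent, bool)
	// pushMergeTxn tries to abort the merge transaction, which is anchored on
	// the left neighbor Q rather than on us.
	pushMergeTxn(*mergeIntent) error
	// subsumedByLeftNeighbor consults Q's descriptor to see whether it now
	// covers our keyspace.
	subsumedByLeftNeighbor() (bool, error)
	forfeitLeaseAndQueueForGC()
}

// useLeaseAfterAcquisition reports whether the freshly acquired lease may be
// used, or whether the replica discovered it has been merged away.
func useLeaseAfterAcquisition(r replica) (bool, error) {
	in, ok := r.descriptorIntent()
	if !ok {
		return true, nil // no merge in flight
	}
	// Try to abort the merge. The outcome of the push alone is racy (the merge
	// may have committed and cleaned up before our push), so the authoritative
	// check is whether Q's descriptor now covers our keyspace.
	if err := r.pushMergeTxn(in); err != nil {
		return false, err
	}
	subsumed, err := r.subsumedByLeftNeighbor()
	if err != nil {
		return false, err
	}
	if subsumed {
		r.forfeitLeaseAndQueueForGC()
		return false, nil
	}
	return true, nil
}
```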

Note that the mechanism implicitly assumes that the resolution of the intent on R happens after the leaseholder on Q has executed the merge trigger. That's not trivially true, at least not once the parallel commits extension lands. There we'll resolve intents in parallel with running the final commit. But you can just send the EndTransaction at the end in its own batch, which is probably already done today.

If the leaseholder crashes, the commit may still get through, but that's OK because both leaseholders crashed (they were colocated). After the restart, they can't continue using their leases, so now they behave like followers who crashed.

It's not a problem if a follower crashes and applies the merge much later than the rest. It can't observe any configuration changes until it has caught up. It will have both ranges for longer, but won't be able to do anything with the right hand side (except catch up with the Raft log, where there shouldn't be much) until it processes the merge on the left hand side which then kills the RHS. It also can't get a lease on the LHS until it has processed the merge, so all is well. There are two minor concerns:

  • the RHS becomes eligible for GC before the merge applies. That needs to be handled appropriately (it's reminiscent of storage: *roachpb.RaftGroupDeletedError for RHS in splitPostApply #21146) or the RHS will be missing when the LHS tries to subsume it, and that will be a fatal error.
  • the RHS might try to get a lease; though that won't be able to go through successfully, it might confuse the merged range QR, which receives it. I think that will just error out, but it's a scenario we need to test.

Comments from Reviewable

@bdarnell
Contributor

bdarnell commented Apr 1, 2018

Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


docs/RFCS/20180330_range_merges.md, line 26 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Yeah this is true, though holding all the ranges in memory is likely the bigger deal.

Currently the biggest impact is actually that we loop over all ranges every 100ms for the raft ticker. I think we can fix this without merges by storing quiesced and unquiesced ranges separately, though.
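
As a minimal sketch of that idea (hypothetical types, not the actual Store code), the ticker could walk only a separately maintained set of unquiesced replicas:

```go
package store

import "sync"

// tickLoop keeps quiesced and unquiesced replicas in separate sets so that the
// 100ms Raft ticker only walks the unquiesced ones. Names are made up for
// illustration only.
type tickLoop struct {
	mu         sync.Mutex
	unquiesced map[int64]struct{} // RangeIDs that still need ticking
}

func newTickLoop() *tickLoop {
	return &tickLoop{unquiesced: make(map[int64]struct{})}
}

func (t *tickLoop) quiesce(rangeID int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.unquiesced, rangeID)
}

func (t *tickLoop) unquiesce(rangeID int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.unquiesced[rangeID] = struct{}{}
}

// tick visits only the unquiesced replicas instead of every replica on the store.
func (t *tickLoop) tick(tickOne func(rangeID int64)) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for id := range t.unquiesced {
		tickOne(id)
	}
}
```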


docs/RFCS/20180330_range_merges.md, line 105 at r1 (raw file):

manual `SPLIT AT` and which ranges were split due to exceeding a size threshold
that they no longer exceed. Maybe we take the conservative approach and give
every existing range the sticky bit?

I'd go the other way and say that nothing is initially sticky. Users are far more likely to have data that has split too small than to have manual splits which need to persist even though size threshold are no longer met. In my experience, manual splits are used to help with cold starts, but once the system has been running those original split points are no longer special. I don't think there's a need to apply a permanent sticky bit retroactively.


docs/RFCS/20180330_range_merges.md, line 139 at r1 (raw file):

polling?) Once it receives the lease, it coordinates with the replicate queue to
align the remainder of _Q_'s replicas with _R_'s. Once all replicas are aligned,
it executes an `AdminMerge` on _Q_ and the merge is complete.

One of the tricky parts of this process is that once the replicas are aligned, we must disallow any other replica moves of either range until the merge has completed, but we must also not leave the ranges permanently stuck in place if the node running the AdminMerge dies. (But I think @tschottdorf is right and the fact that merges and replica changes use transactions on the same keys saves us here)


docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file):

I'm pretty sure AdminMerge was never completely correct.

Right, it was never finished when we realized how hard the remaining bits were going to be and we had more important things to work on at the time.

I think something that might work here is that you use an intent in the merge transaction again, but this time it's to collide with anyone trying to change the lease of R.

I really don't want to complicate the lease process if we can help it.

I think there's a simpler way here. Once the leases are colocated, we set an in-memory "do not transfer" flag on the Replica. This guarantees that the leases will be colocated until our liveness expires, so we can use the end of our current liveness as the transaction expiration. That way, if the transaction can't complete before our time is up, it will be guaranteed to abort.


docs/RFCS/20180330_range_merges.md, line 145 at r1 (raw file):

XXX:
  * Does `AdminMerge` properly handle in-flight traffic to _Q_ and _R_, or do we

AdminMerge probably needs the same treatment on Q that AdminSplit does: a command-queue write lock for the EndTransaction that runs the merge trigger to ensure the final stats are accurate. We also need some to-be-developed mechanism to lock down R completely during the merge (while still allowing it to unlock if the merge aborts).


docs/RFCS/20180330_range_merges.md, line 148 at r1 (raw file):

    need to somehow "lock" them while the merge is in progress?

  * Is `AdminMerge` correct? Tobi suggests its correctness has likely rotted.

There has very likely been rot, in addition to the unfinished parts related to locking down the RHS. We'll need to go through splitTrigger, Store.SplitRange, and splitPostApply to make sure that everything there has a counterpart in the merge path.


docs/RFCS/20180330_range_merges.md, line 156 at r1 (raw file):

merging at once.

Thrashing should be minimal, as _L<sub>Q</sub>_ will reject any

What about alternating splits and merges? We should ensure that there is a gap between the split and merge threshold so that ranges that are just split are unlikely to immediately become mergeable (even if there is a dip in load or some data deleted). We may want a safeguard like "don't auto-merge a range that has split in the last 24h" (in fact, if we made all splits "sticky" for 24h, would we even want a concept of a permanent sticky bit?)
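
For illustration, a rough sketch of such a guard; the threshold names and values here are made up, not settings the RFC defines:

```go
package mergequeue

import "time"

// Hypothetical knobs: a merge threshold well below the split threshold, plus a
// cooldown after a recent split, so a range doesn't oscillate between
// splitting and merging.
const (
	splitSizeThreshold = 64 << 20               // split ranges above this size
	mergeSizeThreshold = splitSizeThreshold / 4 // merge only well below the split point
	splitCooldown      = 24 * time.Hour         // don't auto-merge a recently split range
)

// shouldConsiderMerge sketches the hysteresis discussed above.
func shouldConsiderMerge(rangeBytes int64, lastSplit, now time.Time) bool {
	if rangeBytes >= mergeSizeThreshold {
		return false
	}
	return now.Sub(lastSplit) >= splitCooldown
}
```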


docs/RFCS/20180330_range_merges.md, line 160 at r1 (raw file):

and _P_.

The merge queue will temporarily pause if the cluster is unbalanced to avoid

How will this imbalance be detected? Why would merges cause further imbalances?


docs/RFCS/20180330_range_merges.md, line 192 at r1 (raw file):

  table cannot receive any external traffic.

  What if we only merged empty ranges? The allocator could be much dumber about

If we know the range is empty (and will stay empty), we can leave the allocator out of the process entirely. Just change the endpoints of the other range and let the empty one be GC'd wherever it is. "And will stay empty" is the tricky part - aside from the special case of dropped tables, guaranteeing this seems to require a large enough portion of the general solution that the special case may not be worth it.


Comments from Reviewable

@tbg
Member

tbg commented Apr 1, 2018

Added some more impl suggestions for the actual commit trigger. And I realized I hadn't actually finished reading the document the last time, so the thrashing question is retracted 🙈


Review status: 0 of 1 files reviewed at latest revision, 16 unresolved discussions, all commit checks successful.


docs/RFCS/20180330_range_merges.md, line 105 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'd go the other way and say that nothing is initially sticky. Users are far more likely to have data that has split too small than to have manual splits which need to persist even though size threshold are no longer met. In my experience, manual splits are used to help with cold starts, but once the system has been running those original split points are no longer special. I don't think there's a need to apply a permanent sticky bit retroactively.

👍


docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'm pretty sure AdminMerge was never completely correct.

Right, it was never finished when we realized how hard the remaining bits were going to be and we had more important things to work on at the time.

I think something that might work here is that you use an intent in the merge transaction again, but this time it's to collide with anyone trying to change the lease of R.

I really don't want to complicate the lease process if we can help it.

I think there's a simpler way here. Once the leases are colocated, we set an in-memory "do not transfer" flag on the Replica. This guarantees that the leases will be colocated until our liveness expires, so we can use the end of our current liveness as the transaction expiration. That way, if the transaction can't complete before our time is up, it will be guaranteed to abort.

Are you saying we use a txn deadline that is equal to the expiration of the lease on the RHS? That's not how the txn deadline works, the txn deadline compares against the provisional commit timestamp, and that has nothing to do with real time. I hope I'm mistaken about what you're suggesting because I'm always interested in a simpler solution.


docs/RFCS/20180330_range_merges.md, line 145 at r1 (raw file):

We also need some to-be-developed mechanism to lock down R completely during the merge (while still allowing it to unlock if the merge aborts).

If we cover all of it via the command queue (the nice thing about merges is that we don't have to recompute anything, though note that we likely want to if ContainsEstimates is set) and have the lease pinned, we can

  1. at commit evaluation time on replQ (if evaluation succeeds) we act similar to the gc queue:
    1. set replR.mu.destroyed (in-mem only)
    2. cancel all pending proposals
    3. new: send a canary proposal (to flush out anything else)
    4. now we know R is "locked" though at the Raft-level it is still active. (this prevents recreation; we can't just drop it from the store map now).
    5. run the merge trigger evaluation logic. On error, return side effect that undoes the above (i.e. clears replR.mu.destroyed CPut-style to avoid clearing a replica corruption error that has happened in the meantime)
    6. note that this needs to lay down a raft tombstone for R in the command's writebatch which prevents all future reincarnations (MinReplicaID=\infty). This is an unreplicated key but here we want this done on all replicas, so if there are issues with that we can instead use a side effect which puts it in the writebatch downstream of Raft.
  2. at apply time, we already have acquireMergeLock which freezes R's Raft
    1. if there is a forced error at apply time (that's not replica corruption), we still need to undo the locking on the RHS.
  3. add to GC queue
  4. GC queue should also be sensitive to the tombstones we laid down: if it finds one of those it tries to GC (should always succeed). This is important to clean up after a crash between WriteBatch application and GC.

To block the command queue of the RHS, we can tamper with its command queue directly (similar to how replR.mu.destroyed is handled). This seems doable. Wonder if I missed anything.


docs/RFCS/20180330_range_merges.md, line 156 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What about alternating splits and merges? We should ensure that there is a gap between the split and merge threshold so that ranges that are just split are unlikely to immediately become mergeable (even if there is a dip in load or some data deleted). We may want a safeguard like "don't auto-merge a range that has split in the last 24h" (in fact, if we made all splits "sticky" for 24h, would we even want a concept of a permanent sticky bit?)

Are you aware of any splits that are carried out by users for reasons other than performance? I'm worried that there is some outlandish reason for wanting something split that I don't anticipate, and that we would violate here.


docs/RFCS/20180330_range_merges.md, line 165 at r1 (raw file):

## Prototype ordering

I'm planning to start by implementing the merge queue with enough smarts to

The merge queue is kind of the easy part. The hard part is to actually make AdminMerge work correctly, and then to test it. If we can't achieve that, then the merge queue (which I'm much more comfortable you can put together) is moot. How do you feel about starting this bottom-up by trying to make AdminMerge correct, while introducing a suite of correctness tests that stress our current AdminMerge implementation plus the fixes suggested in this RFC? In (unit or acceptance or roach) testing, we have good control over lease and replica placement and can force things to happen very fast.

If within this cycle you arrive at an AdminMerge that's trusted, that would be fantastic and I personally wouldn't even care what else you get done. I'd much rather you frantically hack together the merge queue (and we just label it experimental) than the other way around! And you may actually be able to enlist someone else for the merge queue who isn't as familiar with the innards of the core layer.


docs/RFCS/20180330_range_merges.md, line 192 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If we know the range is empty (and will stay empty), we can leave the allocator out of the process entirely. Just change the endpoints of the other range and let the empty one be GC'd wherever it is. "And will stay empty" is the tricky part - aside from the special case of dropped tables, guaranteeing this seems to require a large enough portion of the general solution that the special case may not be worth it.

Having thought about the general case a bit now, it's my impression that tackling it head-on is worth it.


docs/RFCS/20180330_range_merges.md, line 209 at r1 (raw file):

## Unresolved questions

* Most of the implementation.

I've tried to fill in some of that in my comments. Doesn't seem terrible but the devil is in the details.


Comments from Reviewable

@tbg
Member

tbg commented Apr 1, 2018

Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, 18 unresolved discussions, all commit checks successful.


Comments from Reviewable

@petermattis
Collaborator

Review status: all files reviewed at latest revision, 18 unresolved discussions, all commit checks successful.


docs/RFCS/20180330_range_merges.md, line 26 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Currently the biggest impact is actually that we loop over all ranges every 100ms for the raft ticker. I think we can fix this without merges by storing quiesced and unquiesced ranges separately, though.

The memory usage appears to be non-trivial. The kv/split/nodes=3 roachtest manually splits a table into 500k ranges without inserting any data and memory usage grows to something like 12GiB. I haven't investigated why. Note that this is 10x more ranges per node than we're expecting and it isn't clear that the memory usage is a problem in real clusters.


docs/RFCS/20180330_range_merges.md, line 105 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

👍

As a counter-point, for the TPC-C performance work we manually split the items table (100k rows, a few megabytes of data) in order to spread it out across a cluster. It is possible that adjusting the zone config for that table to use a smaller max-range size would have also worked. Note that this table is read-only. A small read-only table is useful to split for improved load distribution. We wouldn't want range merging to work against that, though I'm also not convinced we need manual splits to be sticky.


Comments from Reviewable

@tbg
Member

tbg commented Apr 1, 2018

Review status: all files reviewed at latest revision, 18 unresolved discussions, all commit checks successful.


docs/RFCS/20180330_range_merges.md, line 105 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

As a counter-point, for the TPC-C performance work we manually split the items table (100k rows, a few megabytes of data) in order to spread it out across a cluster. It is possible that adjusting the zone config for that table to use a smaller max-range size would have also worked. Note that this table is read-only. A small read-only table is useful to split for improved load distribution. We wouldn't want range merging to work against that, though I'm also not convinced we need manual splits to be sticky.

But post merging, wouldn't you just set a zone config that mandates ~few mb ranges? Actually you could do that today, but you likely didn't want to bother with querying crdb_internal.ranges to wait until that has gone through.


Comments from Reviewable

@petermattis
Collaborator

Review status: all files reviewed at latest revision, 18 unresolved discussions, all commit checks successful.


docs/RFCS/20180330_range_merges.md, line 105 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

But post merging, wouldn't you just set a zone config that mandates ~few mb ranges? Actually you could do that today, but you likely didn't want to bother with querying crdb_internal.ranges to wait until that has gone through.

Yes, setting a zone config probably would work (I stated as much). So not really a counter-point, but an additional point that we should keep in mind.


Comments from Reviewable

@benesch benesch force-pushed the rfc-range-merges branch from 7d911d3 to db3a865 on April 1, 2018 18:52
@benesch
Contributor Author

benesch commented Apr 1, 2018

Why [wip][wip]? To really drive the wip'ness home 😃?

Yessir. 😜

I wrote a long comment about the colocation of replicas, apologies for the wall of text. In the absence of brain farts on my part, though, I think this is all workable. Let me know what you think (happy to chat offline if it's too confused).

Hah, that was exactly what I was after. I figured if I tossed out a partial RFC you and Ben would jump in and help fill in the rest. You two certainly haven't disappointed. :)


Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions.


docs/RFCS/20180330_range_merges.md, line 21 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

There's also the real point that you can't change the range size and have it be applied, unless it happens to be smaller than before. And then you're locked into that.

Good point. Done.


docs/RFCS/20180330_range_merges.md, line 26 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Currently the biggest impact is actually that we loop over all ranges every 100ms for the raft ticker. I think we can fix this without merges by storing quiesced and unquiesced ranges separately, though.

Fleshed this section out a bit. The Raft ticker problem is now noted in the alternatives section below.


docs/RFCS/20180330_range_merges.md, line 40 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Queue-like workloads (which are common) are also a problem because they leave behind an ever-growing tail of empty ranges and that can really mess with latencies as these empty ranges tend to get queried.

Done.


docs/RFCS/20180330_range_merges.md, line 65 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

MERGE AT returns an error unless x is a split point created via a prior SPLIT AT. Additionally, it does not guarantee that the two ranges on either side of x are actually merged; rather it lifts the restriction that they must be separate. For example, ...

Done.


docs/RFCS/20180330_range_merges.md, line 80 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

A merge of R and its right neighbor S

Thanks, done.


docs/RFCS/20180330_range_merges.md, line 95 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The sticky bit will be inserted into the range split transaction, right? That is you'll plumb it down into AdminSplit(sticky=true). (That makes sense since if you're not doing it atomically, you have to teach the split queue to be sensitive to it, etc).

Ugh, it's not quite so simple though. We'll want to be able to set the bit even if the split already exists.


docs/RFCS/20180330_range_merges.md, line 101 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I think we may want to introduce merges behind a cluster setting which defaults to false for clusters that are not bootstrapped into 2.1 or higher, and have a note in the docs that tells operators to activate it and re-run any manual splits they care about. (And periodically log at Warning until the setting is flipped; nobody should leave it off).
This is a bit of work, but it'd be a shame if we locked existing clusters into their potentially horribly laid out key space. Or, of course, we provide a tool that undoes all automatic splits and we encourage folks to use it after the upgrade. But I'm also worried about the size of the set-sticky migration; it'd have to write a whole boatload of keys and could run for a long time. If it weren't for that concern, the second solution seems better.

Ack, this definitely needs more thought.


docs/RFCS/20180330_range_merges.md, line 129 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I think you mean L_R here, but I'm actually confused about everything after this sentence. It looks like you put R in charge of subsuming its left neighbor, but it works the other way around (so that a range's start key never changes through splits or merges, which is important for the sticky bit). So you want the lease at Q and have it execute an AdminMerge on Q that eats R. So the way I imagine this is:

  1. leaseholder of R sends SuggestMerge to Q
  2. on success, leaseholder of R colocates R with Q and then
  3. transfers its lease to the replica colocated with leaseholder of Q
  4. noticing that the replica+lease colocation is complete, leaseholder of Q invokes AdminMerge

The two phases are distinct, though. There's the colocation phase where R pulls Q over to the same stores, then the merge phase where Q merges into R. So I think what's written here is viable.

I agree it's confusing and stupid, though, for these phases to look in different directions. The colocation phase should just look rightward. I'll rework this.


docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Are you saying we use a txn deadline that is equal to the expiration of the lease on the RHS? That's not how the txn deadline works, the txn deadline compares against the provisional commit timestamp, and that has nothing to do with real time. I hope I'm mistaken about what you're suggesting because I'm always interested in a simpler solution.

By "transaction expiration", do you mean the deadline that SQL table leases use, or is there another txn expiration mechanism?

@tschottdorf beat me to it.


docs/RFCS/20180330_range_merges.md, line 156 at r1 (raw file):

We may want a safeguard like "don't auto-merge a range that has split in the last 24h"

Agreed, though I was thinking a threshold on the order of minutes, not hours.

in fact, if we made all splits "sticky" for 24h, would we even want a concept of a permanent sticky bit?

I think we still would. If I run ALTER TABLE ... SPLIT AT generate_series(1, 100) today, I'm going to be mighty confused when I come back tomorrow and the splits have all disappeared. I'm very willing to be convinced otherwise, though. I expect the sticky bit will be one of the last things I tackle.


docs/RFCS/20180330_range_merges.md, line 160 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

How will this imbalance be detected? Why would merges cause further imbalances?

In the worst case, merges could pull three replicas off underfull nodes and transfer them to overfull nodes. And even if you're doing neutral work (moving a replica from an overfull store to another overfull store, say), you're preventing a preemptive snapshot that could have balanced the cluster.

Doesn't seem like it would be too hard to pause the merge queue when the store pool thinks some stores are overfull/underfull.
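
For illustration, such a pause could be a simple gate on the queue; the inputs and names below are hypothetical, not the actual store pool API:

```go
package mergequeue

// mergeQueueActive reports whether the merge queue should do work right now.
// It skips merges while any store deviates too far from the mean fullness,
// letting the replicate queue rebalance first. fractionsUsed would come from
// the store pool's gossiped capacities (hypothetical input).
func mergeQueueActive(fractionsUsed []float64, rebalanceThreshold float64) bool {
	if len(fractionsUsed) == 0 {
		return true
	}
	var sum float64
	for _, f := range fractionsUsed {
		sum += f
	}
	mean := sum / float64(len(fractionsUsed))
	for _, f := range fractionsUsed {
		if f > mean+rebalanceThreshold || f < mean-rebalanceThreshold {
			return false // some store is overfull or underfull
		}
	}
	return true
}
```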


docs/RFCS/20180330_range_merges.md, line 165 at r1 (raw file):

The merge queue is kind of the easy part.

Agreed. I think I have it half done already. I'm inclined to continue hacking it together before I embark on fixing AdminMerge. I'll be much happier if I can watch ranges disappear in a real cluster instead of just in unit tests.


docs/RFCS/20180330_range_merges.md, line 209 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I've tried to fill in some of that in my comments. Doesn't seem terrible but the devil is in the details.

Yes thank you! 🙇


Comments from Reviewable

@benesch benesch added the do-not-merge label (bors won't merge a PR with this label) on Apr 1, 2018
@tbg
Member

tbg commented Apr 1, 2018

Reviewed 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 12 unresolved discussions, all commit checks successful.


docs/RFCS/20180330_range_merges.md, line 95 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Ugh, it's not quite so simple though. We'll want to be able to set the bit even if the split already exists.

AdminSplit is the thing that runs the distributed split transaction, so the new code in AdminSplit will be something like txn.Get(stickyKey); if !bitIsSet { txn.Put(stickyKey) }, i.e. it's idempotent.
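
Roughly, as a sketch with a stand-in transaction interface rather than the real one (names are made up):

```go
package split

// kvTxn is a hypothetical stand-in for the transaction handle used inside the
// split transaction; it only exists to show that setting the sticky bit is
// idempotent, so the same code path works when the split point already exists.
type kvTxn interface {
	Get(key string) ([]byte, error)
	Put(key string, value []byte) error
}

func maybeSetStickyBit(txn kvTxn, stickyKey string) error {
	val, err := txn.Get(stickyKey)
	if err != nil {
		return err
	}
	if len(val) != 0 {
		return nil // bit already set; nothing to do
	}
	return txn.Put(stickyKey, []byte{1})
}
```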


docs/RFCS/20180330_range_merges.md, line 101 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Ack, this definitely needs more thought.

FWIW, I like Ben's suggestion to just turn it on, though I'm not sure the real world doesn't have a case in which that's a bad idea. We can also introduce the cluster setting to turn it off (but have it on by default) so that folks have a chance to turn it off before they bump the cluster version.


docs/RFCS/20180330_range_merges.md, line 105 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Yes, setting a zone config probably would work (I stated as much). So not really a counter-point, but an additional point that we should keep in mind.

Apologies, reading is not my strong suit today. There was a time in high school where I'd routinely just forget about whole exercises in math exams. Feel reminded of that today.


docs/RFCS/20180330_range_merges.md, line 129 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

The two phases are distinct, though. There's the colocation phase where R pulls Q over to the same ranges, then the merge phase where Q merges into R. So I think what's written here is viable.

I agree it's confusing and stupid, though, for these phases to look in different directions. The colocation phase should just look rightward. I'll rework this.

Ack, I'll hold off until you've done so.


docs/RFCS/20180330_range_merges.md, line 156 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

We may want a safeguard like "don't auto-merge a range that has split in the last 24h"

Agreed, though I was thinking a threshold on the order of minutes, not hours.

in fact, if we made all splits "sticky" for 24h, would we even want a concept of a permanent sticky bit?

I think we still would. If I run ALTER TABLE ... SPLIT AT generate_series(1, 100) today, I'm going to be mighty confused when I come back tomorrow and the splits have all disappeared. I'm very willing to be convinced otherwise, though. I expect the sticky bit will be one of the last things I tackle.

Also, it doesn't seem that the sticky bit informs any of the other mechanisms, right? It's just a take-it-or-leave-it add-on.


Comments from Reviewable

@bdarnell
Contributor

bdarnell commented Apr 1, 2018

Review status: all files reviewed at latest revision, 12 unresolved discussions, all commit checks successful.


docs/RFCS/20180330_range_merges.md, line 101 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

FWIW, I like Ben's suggestion to just turn it on, though I'm not sure the real world doesn't have a case in which that's a bad idea. We can also introduce the cluster setting to turn it off (but have it on by default) so that folks have a chance to turn it off before they bump the cluster version.

Even if we "just turn it on", it will need to be behind a cluster version. I like the idea of having a setting that can be toggled while the upgrade is in its non-finalized state (but be aware of the auto-finalize proposal in #24377)


docs/RFCS/20180330_range_merges.md, line 105 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Apologies, reading is not my strong suit today. There was a time in high school where I'd routinely just forget about whole exercises in math exams. Feel reminded of that today.

Yeah, by saying that no pre-2.1 splits are sticky we might break a few use cases, and they'll need to either set zone configs with small limits or manually re-run their splits to set the sticky bit. I consider this preferable to the alternative of "all pre-2.1 splits are sticky".


docs/RFCS/20180330_range_merges.md, line 129 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Ack, I'll hold off until you've done so.

FWIW, I think things make sense as described here. The queue runs on replicas looking leftward, and if it makes sense from their perspective it hands off to the leftward range for the rest of the process (which can now proceed with input from both ranges).


docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file):

Are you saying we use a txn deadline that is equal to the expiration of the lease on the RHS? That's not how the txn deadline works, the txn deadline compares against the provisional commit timestamp, and that has nothing to do with real time.

Yes, exactly, but I see now that I'd need a new change: EndTransaction needs to consult the timestamp cache, so that a new leaseholder can't commit a transaction started by the previous leaseholder.


docs/RFCS/20180330_range_merges.md, line 156 at r1 (raw file):

Are you aware of any splits that are carried out by users for reasons other than performance? I'm worried that there is some outlandish reason for wanting something split that I don't anticipate, and that we would violate here.

Maybe as a poor-man's partitioning? But without zone constraints I can't see why that would matter for anything but performance.

Agreed, though I was thinking a threshold on the order of minutes, not hours.

For manual splits we'd definitely want hours, if not days. For auto-splits minutes might be OK, but still seems aggressive to me. If we set the limit less than 24h, some clusters will have traffic patterns that split every morning and re-merge at night. (is that OK? Maybe, but it seems better to me to just leave the splits in place)

I think we still would. If I run ALTER TABLE ... SPLIT AT generate_series(1, 100) today, I'm going to be mighty confused when I come back tomorrow and the splits have all disappeared.

I see manual splits as primarily aimed at the cold start problem. As long as I split less than 24h before my launch/announcement/whatever, I'd be fine and the ranges will grow into their unmergeable sizes pretty quickly. The question is whether it's reasonable for admins savvy enough to do manual pre-splitting to also be able to schedule their pre-splitting within 24h of the traffic spike.

I think having the sticky bit is better than not having it, but note that if we drop it, we also don't need the MERGE AT command, so it's heavier than it initially looks.


docs/RFCS/20180330_range_merges.md, line 160 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

In the wost case, merges could pull three replicas off underfull nodes and transfer them to overfull nodes. And even if you're doing neutral work (moving a replica from an overfull store to another overfull store, say), you're preventing a preemptive snapshot that could have balanced the cluster.

Doesn't seem like it would be too hard to pause the merge queue when the store pool thinks some stores are overfull/underfull.

Or conversely when a merge opportunity has been identified, the replication queue should try to make room for it to happen (the ranges to merge are already going to be below-average in size).


docs/RFCS/20180330_range_merges.md, line 192 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Having thought about the general case a bit now, it's my impression that tackling it head-on is worth it.

That's my thought as well.


docs/RFCS/20180330_range_merges.md, line 76 at r2 (raw file):

the split point will remain.

`MERGE AT` returns an error if any of the listed split points were not created

I think MERGE AT should at least attempt a merge if the split points were not created by SPLIT AT. Similarly, even if the split points were created by SPLIT AT, if the merge fails it should return with an error like "ranges too large to merge".

If it only removes the sticky bit without attempting the merge, I'd call it UNSPLIT AT.


Comments from Reviewable

@tbg
Copy link
Member

tbg commented Apr 1, 2018

Review status: all files reviewed at latest revision, 13 unresolved discussions, all commit checks successful.


docs/RFCS/20180330_range_merges.md, line 129 at r1 (raw file):

where Q merges into R

FWIW, this is exactly the opposite of what's happening, no? R merges into Q (i.e., after the merge, Q is still there but larger). This confusion permeates the text below and maybe explains why I'm confused. I understand that there will be a phase where R colocates itself with Q and then hands over to Q for the actual merge.


docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Are you saying we use a txn deadline that is equal to the expiration of the lease on the RHS? That's not how the txn deadline works, the txn deadline compares against the provisional commit timestamp, and that has nothing to do with real time.

Yes, exactly, but I see now that I'd need a new change: EndTransaction needs to consult the timestamp cache, so that a new leaseholder can't commit a transaction started by the previous leaseholder.

I'm still confused about what you're suggesting. We have a transaction that will commit on Q and R is the lease we care about. Both use the same node's epoch, but I'm not sure how that helps. You can't have anything below Raft look into R (or use the current time) because what you'll find is not the same on all replicas (for example a follower may have a far-behind version of R that has a past leaseholder, we can work around that but again, you wanted something simple). Can you ELI5 ("explain like I'm five")?


Comments from Reviewable

@benesch
Contributor Author

benesch commented Apr 2, 2018

[screenshot]

Check out that blip at 4:04. Four range merges in a real cluster! Pay the catastrophic failure that occurs next no mind. That's Raft complaining about a missing log, but it resolves after a restart so how bad can it be? :trollface:

@bdarnell
Contributor

bdarnell commented Apr 2, 2018

Review status: all files reviewed at latest revision, 13 unresolved discussions, all commit checks successful.


docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'm still confused about what you're suggesting. We have a transaction that will commit on Q and R is the lease we care about. Both use the same node's epoch, but I'm not sure how that helps. You can't have anything below Raft look into R (or use the current time) because what you'll find is not the same on all replicas (for example a follower may have a far-behind version of R that has a past leaseholder, we can work around that but again, you wanted something simple). Can you ELI5 ("explain like I'm five")?

Nevermind, you're right. I wasn't thinking enough about the followers and was just trying to synchronize between the two Replicas on the leader. But I don't see how the intent proposal above addresses the problem of lagging followers either.


Comments from Reviewable

@benesch
Contributor Author

benesch commented Apr 3, 2018

Review status: all files reviewed at latest revision, 13 unresolved discussions, all commit checks successful.


docs/RFCS/20180330_range_merges.md, line 129 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

where Q merges into R

FWIW, this is exactly the opposite of what's happening, no? R merges into Q (i.e., after the merge, Q is still there but larger). This confusion permeates the text below and maybe explains why I'm confused. I understand that there will be a phase where R colocates itself with Q and then hands over to Q for the actual merge.

Hmm, I'm confused about why you're confused. I looked at it again and it seems correct. R considers a merge leftward with Q. If things look good, it says, "hey Q, how do you feel about this?" If Q likes it too, Q moves into alignment with R AND cedes the lease to R's leaseholder. Then R's leaseholder, having Q's lease now, merges R into Q.

The benefit here is that you'll never have to roundtrip to find out if the sticky bit is set. (The sticky bit necessarily needs to be set on the RHS of a split because ranges are inclusive on the LHS.)


Comments from Reviewable

@tbg
Member

tbg commented Apr 4, 2018

Review status: all files reviewed at latest revision, 12 unresolved discussions, all commit checks successful.


docs/RFCS/20180330_range_merges.md, line 129 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Hmm, I'm confused about why you're confused. I looked at it again and it seems correct. R considers a merge leftward with Q. If things look good, it says, "hey Q, how do you feel about this?" If Q likes it too, Q moves into alignment with R AND cedes the lease to R's leaseholder. Then R's leaseholder, having Q's lease now, merges R into Q.

The benefit here is that you'll never have to roundtrip to find out if the sticky bit is set. (The sticky bit necessarily needs to be set on the RHS of a split because ranges are inclusive on the LHS.)

The root of my confusion is that AdminMerge runs on Q, the left range. Your description sounds like it wants AdminMerge to run on R, the right range.


docs/RFCS/20180330_range_merges.md, line 139 at r1 (raw file):

But I think @tschottdorf is right and the fact that merges and replica changes use transactions on the same keys saves us here

I don't think that's right, after all. Consider (this is roughly the same problem as with my intent proposal for leaseholder pinning):

  1. previous leaseholder n3 of R calls ChangeReplicas, but goroutine gets preempted before doing anything
  2. ranges Q, R are colocated, so are the leaseholders (now both on n1)
  3. merge of Q into R runs and commits, applies only on n1
  4. ChangeReplicas on n3 gets unstuck, runs the replication change, removes n2 and adds n4
  5. merge trigger applies on n2 and n4, explodes

We can fortify by mandating that ChangeReplicas commit only under a lease that is held by the node running ChangeReplicas, and then the whole burden of correctness is on pinning the leaseholders and flagging the leaseholder so that it promises not to run ChangeReplicas.


docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Nevermind, you're right. I wasn't thinking enough about the followers and was just trying to synchronize between the two Replicas on the leader. But I don't see how the intent proposal above addresses the problem of lagging followers either.

You're right, it's equally broken. For posterity, you mean this scenario:

node 1 is initially leaseholder of r1 and r2, r1 tries to subsume r2

  • proposal applies on n1/r1
  • intent gets resolved on all replicas of r2
  • n2 becomes leaseholder of r2, there's no intent so it does whatever

I think I have salvaged the idea, but I'll polish this a bit and post it later. I'm rediscovering why all of this is so hard.


Comments from Reviewable

@nvanbenschoten
Member

Review status: all files reviewed at latest revision, 12 unresolved discussions, all commit checks successful.


docs/RFCS/20180330_range_merges.md, line 76 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think MERGE AT should at least attempt a merge if the split points were not created by SPLIT AT. Similarly, even if the split points were created by SPLIT AT, if the merge fails it should return with an error like "ranges too large to merge".

If it only removes the sticky bit without attempting the merge, I'd call it UNSPLIT AT.

+1 to both of those points. SPLIT AT has been so useful because it allows us to bypass any magic and performs its operation synchronously. Ideally MERGE AT would have the same properties.


docs/RFCS/20180330_range_merges.md, line 103 at r2 (raw file):

To preserve these user-created split points, we need to store state. I
tentatively propose using a replicated range-ID local key. Every range to the

An alternative approach would be to make this a range local key and put it right next to the split point itself. This would allow the sticky bit to be more easily worked with since it would be addressable. It would also naturally force the key onto the right side of the manual split, which fits with the rest of the proposal here.


docs/RFCS/20180330_range_merges.md, line 137 at r2 (raw file):

Now Q's leaseholder LQ has perfect information. The merge

Perfect information? In a distributed system? That must be nice 😃

Unfortunately, since we're not locking both ranges while we process this suggestion, it will always be racy. For instance, while this series of rebalances and lease transfers is taking place, Q and R could continue to grow. We'll need to consider what happens if a split of Q races with a merge of R into Q.

EDIT: I see you proposed this as a question below and that this is already a lively discussion!

Another race we'll need to explore is a series of ranges each concurrently attempting to merge into the range on their left. Again, I don't think we'll be able to avoid this entirely, but it does introduce an interesting question. Should we favor small ranges merging left into large ranges, large ranges merging left into small ranges, or have no preference? A situation that demonstrates why we should think about this is the case where an empty range is sandwiched between two other ranges. Without any preference for who kicks off the merge, this case could regularly end up in a racing double merge. This could be avoided in the common case by only adding ranges that are less than half the max_range_size to the merge queue.
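
As an illustration of that last suggestion (hypothetical helper, not actual queue code): only ranges below half the max range size suggest a merge, and smaller ranges get higher priority so the empty middle range tends to win the race.

```go
package mergequeue

// shouldQueueForMerge sketches the size-based preference described above.
// maxRangeBytes is assumed to be positive.
func shouldQueueForMerge(rangeBytes, maxRangeBytes int64) (ok bool, priority float64) {
	half := maxRangeBytes / 2
	if rangeBytes >= half {
		return false, 0
	}
	// Emptier ranges sort earlier in the queue.
	return true, 1 - float64(rangeBytes)/float64(half)
}
```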


Comments from Reviewable

@tbg
Member

tbg commented Apr 4, 2018

Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful.


docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file):
I wrote another wall of text. I'm afraid this isn't easily digested and may contain overcomplicated and incorrect solutions. But I think it does a decent job at exposing a number of problems we need to solve. I'm happy to talk these through offline.

The exposition is also lacking, but I've rewritten this numerous times already and I think it's time to involve more eyes in that process.

Making sure the RHS is up to date at the time of merge trigger

First of all, we must also solve the "RHS catch-up problem": when the merge trigger executes, the subsumed replica must be at a log position after we locked down the leaseholder (i.e. it has the latest writes). I think this is most adequately addressed by a round of long-poll RPCs to all the replicas of R after having locked down the leaseholder and proposed a noop write with a trigger in it (that signals the polls). Trying to do this from below Raft is a non-starter. So from now on I will assume that whenever a merge trigger runs and the RHS is present, it will be up to date.
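
A sketch of what that round of RPCs might look like; the RPC and type names here are made up, not an existing interface:

```go
package merge

import "context"

// replicaClient is a hypothetical per-replica RPC client.
type replicaClient interface {
	// WaitForApplication blocks until the remote replica has applied the
	// given log index (or the context is canceled).
	WaitForApplication(ctx context.Context, index uint64) error
}

// waitForRHSCatchUp long-polls every replica of R until it has applied at
// least the no-op write proposed after locking down R's leaseholder.
func waitForRHSCatchUp(ctx context.Context, replicas []replicaClient, noopIndex uint64) error {
	for _, rc := range replicas {
		if err := rc.WaitForApplication(ctx, noopIndex); err != nil {
			return err
		}
	}
	return nil
}
```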

Snapshots

Something else we have to worry about is that we must not send any snapshots that contain a merge trigger in the log entries. The recipient won't be able to apply the merge trigger; we must send a higher-index snapshot that postdates the merge.

AdminMerge Lock-Down

We make sure that the AdminMerge transaction can only commit with the node running the AdminMerge holding the lease for Q (i.e. we pass the lease sequence into the merge trigger).

Leaseholder Lock-Down

So our only problem left (hopefully) is to lock down (lease and data of) R. To lock down the current lease on R, we communicate directly with the lease holder before committing the merge trigger (there are details to be figured out here, but basically this should work).

Now we need to prevent followers from stealing a lease unless the merge is aborted. I still believe that the best way to achieve that is to insert a cheap check that runs after a lease acquisition on R, before using the lease.

Lease Check

The check simply retrieves keys.RangeDescriptorKey(desc.StartKey). If that key is a deletion intent (it'll be anchored on Q), we need to push it before continuing. If the read comes back with a descriptor that's not ours (i.e. empty or changed), we know we've been merged away, reject all requests (with a range key mismatch error), and move ourselves to the mergatory described below. Note that in trying to push the intent, we wait for the merge to commit (unless we decide to push with high priority, in which case we would abort. Since we also want to merge expiration-based ranges, I think waiting makes sense but I could be convinced otherwise).

Note that as far as the data goes, it's OK if R sticks around for a while and uses its Raft machinery and gets a lease, even after QR exists. The lease and Raft state are both keyed by range ID, so there will be no conflict between the post-merge range QR and R:

  .-- keyed by RangeID of R
  |     .-- keyed by RangeID of QR (=post-merge Q)
  |     |          .-- RangeDescriptorKey(R)
  v     |          v
[----)  v          [------R's-data------)
      [----)       [-------QR's-(data)---------------)

There's no requirement to synchronously knock out R as the lease trigger applies. This is helpful because in the general case we can't guarantee that, see the "Handling Q-rebalances" section below.

Note: (*Replica).assertStateLocked will fire if run, since the on-disk descriptor and the in-memory copy have diverged.

Mergatory

The "mergatory" (bad pun on merge+purgatory, not suggested name) is a list of replicas that the store still owns, but which have been merged away and don't own their data keyspace any more. We could avoid having a mergatory if we eagerly ran GC on a replica while moving it into the mergatory, but this can be problematic for performance, so keep that in mind.

Replicas in the mergatory are ignored by LookupReplica and are unable to interfere with the range that subsumed them due to the lease check described above and the disjointness of the rangeID-based keyspace.

The mergatory is visited by the GC queue with high priority.

GCQueue

The GC queue is ultimately responsible for removing the data. Assuming that it finds out (via the meta ranges) that a range has been merged away, it has to distinguish two cases by answering the question:

Is the keyspace that originally belonged to R claimed by another replica on this store?

This is a question that's easy to answer by looking into the local store map (with sufficient synchronization). It can only be true if the range is in the mergatory (which makes for a good assertion). Another good assertion is that if a replica comes from the mergatory, it must definitely be GC'ed.

If the answer to the question above is "yes", only the rangeID-keyed keyspace is wiped, and otherwise the data as well.
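
A sketch of that decision, with made-up store and replica types (the real store index and deletion helpers would take their place):

// Hedged sketch: if another replica on this store claims R's former user
// keyspace (i.e. QR lives here too), only R's rangeID-local keys are wiped;
// otherwise the user data goes as well.
package mergesketch

type span struct{ start, end string }

type store struct {
	// replicasByKeySpan stands in for the store's replica-by-key index and
	// returns the range IDs of replicas overlapping the span.
	replicasByKeySpan func(s span) []int64
}

func (s *store) gcMergedReplica(
	rangeID int64,
	dataSpan span,
	wipeRangeIDKeys func(rangeID int64),
	wipeUserData func(s span),
) {
	claimedByOther := false
	for _, id := range s.replicasByKeySpan(dataSpan) {
		if id != rangeID {
			claimedByOther = true // the subsuming range QR lives on this store
			break
		}
	}
	wipeRangeIDKeys(rangeID)
	if !claimedByOther {
		wipeUserData(dataSpan)
	}
}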

Handling Q-rebalances

Away

Consider the scenario in which a replica of Q gets rebalanced off a follower node before it has any chance to apply the trigger. R will sit around forever wondering what went wrong. The lease check will take care of this when the replica is next woken up by a queue. The replica will move to the mergatory and will be removed.

Back

Now suppose that, in addition to the previous scenario, Q (or parts of Q overlapping R; Q may have split in the meantime) comes back (in its post-merge form QR) and that R has not been removed yet. In that case, the snapshot for Q will intersect R. We simply make sure that this wakes up R, which then moves itself to the mergatory and stops blocking the snapshot (since mergatory replicas are not considered as owning any part of the addressable keyspace). This is similar to a mechanism that exists for splits today.

Note: Eager Intent Resolution

Note that the eager intent resolution performed during GC will resolve the intent on the range descriptor of R. This means that if the node restarts and there is still an intent, the intent can be ignored (because its presence implies that the merge trigger has not been processed locally). This is a subtle invariant; with the possibility of "committed intents", the situation complicates fundamentally. (There is a similar property for splits which, if absent, would make the world more complicated.)

Avoiding RangeID-Keyspace Leaks

The case in which a replica gets put into the mergatory just before the process dies is noteworthy. In that case, when the server starts, it won't instantiate the replica in the first place (as there's no range descriptor). A naive solution is to try to be sensitive to tombstones, but those may have been wiped away when the range QR rebalanced off the node.

Note that making the mergatory synchronous does not solve this problem (see the Q-away case above). We can address this by making sure that any nontrivial rangeID keyspace (nontrivial meaning that it contains more than a replica tombstone) also contains a key that records the range's start key. The server startup process can then iterate through those keys and, for each of them, check whether the corresponding range is initialized. If not, the corresponding rangeID prefix is wiped.
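
Sketched with hypothetical helpers, the startup pass could look like this (the record type and the scan producing it are assumptions for illustration):

// Hedged sketch of the startup scan: any nontrivial rangeID-local keyspace
// whose range never got instantiated has its prefix wiped.
package mergesketch

type startKeyRecord struct {
	rangeID  int64
	startKey string
}

func cleanOrphanedRangeIDKeyspaces(
	records []startKeyRecord, // scanned from the rangeID-local keyspace at startup
	isInitialized func(rangeID int64) bool,
	wipeRangeIDPrefix func(rangeID int64),
) {
	for _, rec := range records {
		if !isInitialized(rec.rangeID) {
			// No replica was instantiated for this range: it was merged away
			// and its leftovers were never GC'ed. Remove them now.
			wipeRangeIDPrefix(rec.rangeID)
		}
	}
}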



@tbg
Copy link
Member

tbg commented Apr 5, 2018



docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I wrote another wall of text. I'm afraid this isn't easily digested and may contain overcomplicated and incorrect solutions. But I think it does a decent job at exposing a number of problems we need to solve. I'm happy to talk these through offline.

The exposition is also lacking, but I've rewritten this numerous times already and I think it's time to involve more eyes in that process.


Eh, already found the first buglet. We have to handle intents on the range descriptors at server start time, but only if the node didn't execute the merge trigger and hasn't received a snapshot of QR in the meantime (both of which would have removed the intent). I think that means that we can instantiate the old version for range descriptors with merge intent. This problem does not exist in the split or rebalance case, where the intent is always synchronously resolved with the commit.



@bdarnell
Copy link
Contributor

bdarnell commented Apr 5, 2018

First of all, we must also solve the "RHS catch-up problem": when the merge trigger executes, the subsumed replica must be at a log position after we locked down the leaseholder (i.e., it has the latest writes). I think this is most adequately addressed by a round of long-poll RPCs to all the replicas of R

The leader knows the last acknowledged log position of each follower. It lags application by a little bit, but I think it's good enough to treat this as a local problem instead of one requiring coordination. The leader waits for all replicas to ack the last log entry in the RHS before sending the EndTransaction, then all the replicas block in the merge trigger until the RHS's applied index has caught up. This is a little hand-wavy (do we need to worry about deadlocks if the raft scheduler runs out of threads?), but I think there's an answer here without adding special long-poll RPCs (it's also possible that long-poll RPCs are a cleaner solution than whatever synchronization we end up needing here).
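
A sketch of that purely local check (Match mirrors etcd/raft's Progress naming; everything else here is illustrative):

// Hedged sketch: the RHS leader inspects its Raft progress map and only lets
// the merge proceed once every follower has acknowledged the last RHS entry.
package mergesketch

type progress struct {
	match uint64 // highest log index known to be durable on the follower
}

func allFollowersCaughtUp(lastRHSIndex uint64, followers map[uint64]progress) bool {
	for _, pr := range followers {
		if pr.match < lastRHSIndex {
			return false
		}
	}
	return true
}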

Something else we have to worry about is that we must not send any snapshots that contain a merge trigger in the log entries. The recipient won't be able to apply the merge trigger; we must send a higher-index snapshot that postdates the merge.

The snapshot must be sent with a higher applied index. The entry with the merge trigger need not be truncated away. I think this is already covered: we don't send unapplied log entries with our snapshots.

We make sure that the AdminMerge transaction can only commit with the node running the AdminMerge holding the lease for Q (i.e. we pass the lease sequence into the merge trigger).

Is it the merge trigger (upstream of raft) or Store.MergeRange (downstream of raft)? Store.MergeRange is the one that worries me more.

To lock down the current lease on L, we communicate directly with the lease holder before committing the merge trigger (there are details to be figured out here, but basically this should work).

Who is "we" here? And what are Q and L here? I'm pretty sure R is the RHS; do Q and L both refer to the LHS or are they something different?

I still believe that the best way to achieve that is to insert a cheap check that runs after a lease acquisition on L before using the lease.

Don't you mean R? I'm still very wary of adding anything that can block lease acquisition.

Consider the scenario in which a replica of Q gets rebalanced off a follower node before it has any chance to apply the trigger. R will sit around forever wondering what went wrong. The lease check will take care of this when the replica is next woken up by a queue.

I don't think the followers of R do anything different here. They will respond to anything the leader sends them. The lockdown only applies to the leader of R, which should be co-located with the leader of Q at the time of the merge. I think the question here is what happens if the leader dies immediately after performing the merge and new leaders of both Q and R get elected.

@tbg
Copy link
Member

tbg commented Apr 5, 2018

The leader knows the last acknowledged log position of each follower. It lags application by a little bit, but I think it's good enough to treat this as a local problem instead of one requiring coordination. The leader waits for all replicas to ack the last log entry in the RHS before sending the EndTransaction, then all the replicas block in the merge trigger until the RHS's applied index has caught up. This is a little hand-wavy (do we need to worry about deadlocks if the raft scheduler runs out of threads?), but I think there's an answer here without adding special long-poll RPCs (it's also possible that long-poll RPCs are a cleaner solution than whatever synchronization we end up needing here).

Yeah, that might be a good alternative.

Is it the merge trigger (upstream of raft) or Store.MergeRange (downstream of raft)? Store.MergeRange is the one that worries me more.

Downstream. I'm suggesting that application of the merge trigger catches a forcedError. By "Is it the merge trigger (upstream of raft)", do you mean evaluation of the merge EndTransaction? We should ideally catch it there too, but if the lockdown of the RHS is properly reversible (which it needs to be anyway) I think it's OK if we didn't do that.

Who is "we" here? And what are Q and L here? I'm pretty sure R is the RHS; do Q and L both refer to the LHS or are they something different?

Sorry, there's only Q and R (I say QR when I mean the post-merge Q). Q is also the LHS and R is the RHS. Would have been too easy had I been anywhere close to consistent.

Don't you mean R? I'm still very wary of adding anything that can block lease acquisition.

Yeah, R, sorry. So don't we absolutely positively have to block lease acquisition? If we don't block lease acquisition, we allow replicas of R to get a lease and "do things", which is 100% incorrect if R has been merged already. Note that the check is "free" and only has an effect when there's a merge going on. And note that we do acquire the lease; we just don't allow it to be used for data that R holds or once held unless we know which of the two it is.

Whatever alternative you have in mind (do you have one? I'd be interested to hear) it needs to work even if R is not colocated with Q and never gets to see the merge trigger. See the Q-Away section.

There are approximately 100 loose ends already in this PR. I think we should get some dedicated real world time to attack some of these between the three of us and decide what can work and what can't. I'm happy to do the prep work and condense the problems and ideas collected so far into something more digestible. WDYT @bdarnell?



@bdarnell
Copy link
Contributor

bdarnell commented Apr 9, 2018

By "Is it the merge trigger (upstream of raft)", do you mean evaluation of the merge EndTransaction?

Yes

Downstream. I'm suggesting that application of the merge trigger catches a forcedError.

This means that some replicas could experience an error while applying this command while others don't. This would have to be something like a replicaCorruptionError (which means we'd need to make sure that mechanism really works, and figure out what it means to make it reversible).

Yeah, R, sorry. So don't we absolutely positively have to block lease acquisition?

We must not allow both a merge and a lease acquisition to succeed. One way is to block lease acquisitions if there is any chance a merge that could succeed is in flight. The other way is to ensure that the merge will fail if a new lease is acquired on R.

My incomplete suggestion for the latter is to use transaction deadlines: Before time T, the merge can commit, but the lease can't change hands. Afterwards, the reverse is true. The problem with this is of course that the EndTransaction can be proposed but not yet applied. If we were dealing with a single range, the EndTransaction would be well-ordered with respect to the lease attempt (and the new lease will invalidate all commands proposed under the old lease). With two ranges, that's trickier if not impossible.

Whatever alternative you have in mind (do you have one? I'd be interested to hear) it needs to work even if R is not colocated with Q and never gets to see the merge trigger. See the Q-Away section.

When we start the merge transaction, we verify that Q and R are co-located, and we write an intent to the range descriptor which ensures that this can't change unless our transaction is aborted. So by the time any rebalancing can occur, we know the disposition of the merge transaction. If the merge succeeded, we shouldn't try to move the dead range R. If Q completes and immediately moves, everything should generally be fine. We just can't copy Q to a store that has a stale replica of R. Am I missing something here? I think the problems with the lease-timestamp proposal mainly have to do with out-of-date replicas instead of non-colocated ones.

I think we should get some dedicated real world time to attack some of these between the three of us and decide what can work and what can't. I'm happy to do the prep work and condense the problems and ideas collected so far into something more digestible. WDY @bdarnell?

SGTM



@tbg
Copy link
Member

tbg commented Apr 10, 2018

This means that some replicas could experience an error while applying this command while others don't. This would have to be something like a replicaCorruptionError (which means we'd need to make sure that mechanism really works, and figure out what it means to make it reversible).

No, then we're talking past each other. I'm saying that in addition to the usual check lease at apply time == lease at propose time, we additionally check that lease at propose time == the leaseholder we wanted, so this would be the same for all the replicas. Maybe we don't need this particular check but we need to pay attention that the node that runs the AdminMerge is actually the leaseholder (as you know our admin commands try to only run on the lease holder, but aren't particularly concerned with guaranteeing it). Without such a provision, we could lose the lease but still commit the merge and cause trouble that way, but I think your next suggestion nicely takes care of that.

My incomplete suggestion for the latter is to use transaction deadlines:

I see where you're going with this, but I don't know how you would actually make it work. As you said, by the time the decision is made on whether the commit happens on Q, you can't guarantee that a lease change won't happen on R.

When we start the merge transaction, we verify that Q and R are co-located, and we write an intent to the range descriptor which ensures that this can't change unless our transaction is aborted. So by the time any rebalancing can occur, we know the disposition of the merge transaction. If the merge succeeded, we shouldn't try to move the dead range R. If Q completes and immediately moves, everything should generally be fine. We just can't copy Q to a store that has a stale replica of R. Am I missing something here? I think the problems with the lease-timestamp proposal mainly have to do with out-of-date replicas instead of non-colocated ones.

See the Q-away example. The merge commits on Q (which means a majority have it in their logs and probably apply it pretty soon thereafter). The majority now rebalances a replica (of the new, wider Q) away from the minority, but that minority replica never actually got the merge. That replica (of Q) never gets to see the merge in the future either, right? Everyone just stops talking to it when the conf change is proposed and it probably GCs itself really soon (or it's just down most of the time and comes back). You're left with a lone replica of R that can cause trouble. So yes, the replicas are colocated as the merge commits, but immediately afterwards things may shift around even though not all followers of Q know about it.

Morally, the way I like to think about my suggestion is to tag the range R with a flag that says "you can only get a lease and remove this flag if you've checked with the meta ranges that you're still a proper range and not merged away" (which also serves to serialize with the merge, if any). But we don't have to check the meta, we can check our local range descriptor (even though it may not belong to us any more after the merge, this happens to still work).



@benesch
Copy link
Contributor Author

benesch commented Jul 21, 2018

Doesn't it need to be declared as a write that touches all keys so that we prevent writes at higher timestamps?

Yep. The code gets this right:

// GetSnapshotForMerge must not run concurrently with any other command. It
// declares that it reads and writes every addressable key in the range; this
// guarantees that it conflicts with any other command because every command
// must declare at least one addressable key. It does not, in fact, write any
// keys.
spans.Add(spanset.SpanReadWrite, roachpb.Span{
	Key:    desc.StartKey.AsRawKey(),
	EndKey: desc.EndKey.AsRawKey(),
})
spans.Add(spanset.SpanReadWrite, roachpb.Span{
	Key:    keys.MakeRangeKeyPrefix(desc.StartKey),
	EndKey: keys.MakeRangeKeyPrefix(desc.EndKey).PrefixEnd(),
})
rangeIDPrefix := keys.MakeRangeIDReplicatedPrefix(desc.RangeID)
spans.Add(spanset.SpanReadWrite, roachpb.Span{
	Key:    rangeIDPrefix,
	EndKey: rangeIDPrefix.PrefixEnd(),
})

Likewise, what happens to reads at lower timestamps? Are they still permitted to run concurrently with the GetSnapshotForMerge req? I don't see any issues with that, but we should clarify here.

They are permitted provided they make it through the mergeCompleteCh before it closes. Hmm. I wonder if there are some clock offset problems to worry about here.

FYI I haven't updated this RFC in a long time and I wasn't planning to until the 2.1 freeze sets in. LMK if you think there's an area in particular that would benefit from more ahead-of-time discussion though.

@nvanbenschoten
Copy link
Member

They are permitted provided they make it through the mergeCompleteCh before it closes. Hmm. I wonder if there are some clock offset problems to worry about here.

We shouldn't have to worry about clock offset, since the merge entails direct communication between the leaseholder nodes, so the new leaseholder's HLC can't have a smaller time than the latest permitted read on the RHS. We do still need to bump the timestamp cache on the new leaseholder over the span of the old RHS, though. Looks like there's a TODO for this.

// TODO(benesch): bump the timestamp cache of the LHS.
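
For illustration only, closing that TODO could look roughly like the sketch below; the timestamp-cache interface is made up, and the freeze time is assumed to come back with the GetSnapshotForMerge response:

// Hedged sketch: when the merge applies on the LHS leaseholder, bump its
// timestamp cache over the old RHS span to the RHS freeze time, so no later
// write can invalidate a read the RHS already served.
package mergesketch

type timestamp struct{ wallNanos int64 }

type timestampCache interface {
	// add records that [start, end) was read at ts.
	add(start, end string, ts timestamp)
}

func bumpLHSOverRHS(cache timestampCache, rhsStart, rhsEnd string, freeze timestamp) {
	cache.add(rhsStart, rhsEnd, freeze)
}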

FYI I haven't updated this RFC in a long time and I wasn't planning to until the 2.1 freeze sets in. LMK if you think there's an area in particular that would benefit from more ahead-of-time discussion though.

No this is fine, no need to clean this up while you're still working on the code.

Release note: None
@benesch benesch removed the do-not-merge bors won't merge a PR with this label. label Jan 7, 2019
@benesch benesch changed the title [wip] RFCs: add range merges RFC RFCs: add range merges RFC Jan 7, 2019
@benesch
Copy link
Contributor Author

benesch commented Jan 7, 2019

Any objections to just merging this as is? Seems useful to have around for posterity, and I've added a disclaimer that will point folks towards the up-to-date description in #33334 once that merges. (I wouldn't merge this PR before #33334.)

@petermattis
Copy link
Collaborator

I'm fine with merging as-is.

@tbg
Copy link
Member

tbg commented Jan 8, 2019 via email

@nvanbenschoten
Copy link
Member

Merging this as-is.

bors r+

@craig
Copy link
Contributor

craig bot commented Apr 9, 2019

👎 Rejected by PR status

@nvanbenschoten
Copy link
Member

bors r+

@craig
Copy link
Contributor

craig bot commented Apr 9, 2019

👎 Rejected by PR status

@nvanbenschoten
Copy link
Member

@benesch it looks like this is a CLA issue. Do you mind force pushing to kick off another CI run?

The bulk of the work is completed. Add a disclaimer that the RFC is
woefully out of date and has been supplanted by a tech note instead.

Release note: None
@benesch
Copy link
Contributor Author

benesch commented Apr 10, 2019

Done, but I don't think it helped anything. Ugh.

@benesch
Copy link
Contributor Author

benesch commented Apr 10, 2019

image

🤦‍♂️

@benesch
Copy link
Contributor Author

benesch commented Apr 10, 2019

please?

bors r+

@benesch
Copy link
Contributor Author

benesch commented Apr 10, 2019

Well, CLA assistant still thinks this isn't signed, but Bors is running now. 🤔

@benesch
Copy link
Contributor Author

benesch commented Apr 10, 2019

Welp, Bors noticed and crashed loudly, but now the CLA is signed, so ¯\_(ツ)_/¯.

bors r+

@benesch
Copy link
Contributor Author

benesch commented Apr 10, 2019

Bors crashed again.

bors r+

craig bot pushed a commit that referenced this pull request Apr 10, 2019
24394: RFCs: add range merges RFC r=benesch a=benesch

This is very much a WIP. I wrote it mostly to collect my own thoughts on range merges. At this stage I'm most interested in getting buy-in on building a general-case prototype, as opposed to the special-cases I lay out in the alternatives section, but I'm of course eager to get feedback on the rest of this design.

33334: tech-notes: add note on range merges r=benesch a=benesch

Very much still in progress. Opening a PR in case folks want to follow along.

Release note: None

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
@craig
Copy link
Contributor

craig bot commented Apr 10, 2019

Build succeeded

@craig craig bot merged commit c49a1e4 into cockroachdb:master Apr 10, 2019
@benesch benesch deleted the rfc-range-merges branch June 21, 2019 22:00