RFCs: add range merges RFC #24394
Conversation
Just stumbled across #2433. Some of the XXXs I left for myself look to be answered there. |
Why Could you discuss the thundering herd? I.e. if we have 1000 adjacent empty ranges and merges suddenly become available, what will happen? I wrote a long comment about the colocation of replicas, apologies for the wall of text. In the absence of brain farts on my part, though, I think this is all workable. Let me know what you think (happy to chat offline if it's too confused). Review status: 0 of 1 files reviewed at latest revision, all discussions resolved, all commit checks successful. docs/RFCS/20180330_range_merges.md, line 21 at r1 (raw file):
There's also the real point that you can't change the range size and have it be applied, unless it happens to be smaller than before. And then you're locked into that. docs/RFCS/20180330_range_merges.md, line 26 at r1 (raw file):
Yeah this is true, though holding all the ranges in memory is likely the bigger deal. docs/RFCS/20180330_range_merges.md, line 40 at r1 (raw file):
Queue-like workloads (which are common) are also a problem because they leave behind an ever-growing tail of empty ranges and that can really mess with latencies as these empty ranges tend to get queried. docs/RFCS/20180330_range_merges.md, line 65 at r1 (raw file):
docs/RFCS/20180330_range_merges.md, line 80 at r1 (raw file):
docs/RFCS/20180330_range_merges.md, line 95 at r1 (raw file):
The sticky bit will be inserted into the range split transaction, right? That is you'll plumb it down into docs/RFCS/20180330_range_merges.md, line 101 at r1 (raw file):
I think we may want to introduce merges behind a cluster setting which defaults to false for clusters that are not bootstrapped into 2.1 or higher, and have a note in the docs that tells operators to activate it and re-run any manual splits they care about. (And periodically log at docs/RFCS/20180330_range_merges.md, line 129 at r1 (raw file):
I think you mean L_R here, but I'm actually confused about everything after this sentence. It looks like you put
docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file): I don't actually think that can happen even today because the merge transaction writes intents on both range descriptors, and any replication change needs to write an intent there as well. So that base is covered: you know that if the replicas are colocated initially, they are colocated when your merge commits. For the leaseholders, it's trickier. If the leaseholders are colocated at the time of the merge, then there can be a correct merge (I'm also pretty sure that when you add copious amounts of testing you'll find a bug there, but that's because it's been a long time since we've even cared about making this correct). But what keeps More precisely, we need
I think something that might work here is that you use an intent in the merge transaction again, but this time it's to collide with anyone trying to change the lease of The logic error here is that to read the key on our own range, someone needs to have the lease in the first place. This is clearly not going to be the case if we're trying to get the lease, so I think we have to flip things around: you get the lease unconditionally, and when you have it, you check the range descriptor key. If there is a (merge) intent, you try to abort the merge transaction (which is not anchored to the local range Note that the mechanism implicitly assumes that the resolution of the intent on If the leaseholder crashes, the commit may still get through, but that's OK because both leaseholders crashed (they were colocated). After the restart, they can't continue using their leases, so now they behave like followers who crashed. It's not a problem if a follower crashes and applies the merge much later than the rest. It can't observe any configuration changes until it has caught up. It will have both ranges for longer, but won't be able to do anything with the right hand side (except catch up with the Raft log, where there shouldn't be much) until it processes the merge on the left hand side which then kills the RHS. It also can't get a lease on the LHS until it has processed the merge, so all is well. There are two minor concerns:
Comments from Reviewable |
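To make the "flip things around" lease-check idea above more concrete, here is a minimal standalone Go sketch of what a post-lease-acquisition check might look like. Everything here (the replica, rangeDescriptor, and tryAbortMergeTxn names, the error handling) is an illustrative stand-in invented for this sketch, not a CockroachDB API.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative stand-ins only; none of these are CockroachDB types.
type rangeDescriptor struct {
	rangeID int64
}

type intent struct {
	txnID string
}

// replica models the right-hand range R: it has a descriptor and, while a
// merge is in flight, an intent on its local copy of the descriptor key.
type replica struct {
	desc       rangeDescriptor
	descIntent *intent
}

// tryAbortMergeTxn stands in for pushing the merge transaction (anchored on
// the left-hand range) to ABORTED. Here it always succeeds.
func tryAbortMergeTxn(txnID string) error {
	fmt.Printf("pushing merge txn %s to ABORTED\n", txnID)
	return nil
}

// postLeaseAcquisitionCheck is the "flip things around" idea: acquire the
// lease unconditionally, then look at the local range descriptor. If a merge
// intent is present, abort the merge before using the lease; if the merge can
// no longer be aborted, the lease must not be used to serve traffic.
func postLeaseAcquisitionCheck(r *replica) error {
	if r.descIntent == nil {
		return nil // no merge in flight; the lease is immediately usable
	}
	if err := tryAbortMergeTxn(r.descIntent.txnID); err != nil {
		return errors.New("merge may have committed; refuse to serve under this lease")
	}
	r.descIntent = nil
	return nil
}

func main() {
	r := &replica{desc: rangeDescriptor{rangeID: 2}, descIntent: &intent{txnID: "merge-txn"}}
	if err := postLeaseAcquisitionCheck(r); err != nil {
		fmt.Println("cannot use lease:", err)
		return
	}
	fmt.Println("lease usable on range", r.desc.rangeID)
}
```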
Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, all commit checks successful. docs/RFCS/20180330_range_merges.md, line 26 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Currently the biggest impact is actually that we loop over all ranges every 100ms for the raft ticker. I think we can fix this without merges by storing quiesced and unquiesced ranges separately, though. docs/RFCS/20180330_range_merges.md, line 105 at r1 (raw file):
I'd go the other way and say that nothing is initially sticky. Users are far more likely to have data that has split too small than to have manual splits which need to persist even though size threshold are no longer met. In my experience, manual splits are used to help with cold starts, but once the system has been running those original split points are no longer special. I don't think there's a need to apply a permanent sticky bit retroactively. docs/RFCS/20180330_range_merges.md, line 139 at r1 (raw file):
One of the tricky parts of this process is that once the replicas are aligned, we must disallow any other replica moves of either range until the merge has completed, but we must also not leave the ranges permanently stuck in place if the node running the AdminMerge dies. (But I think @tschottdorf is right and the fact that merges and replica changes use transactions on the same keys saves us here) docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file):
Right, it was never finished when we realized how hard the remaining bits were going to be and we had more important things to work on at the time.
I really don't want to complicate the lease process if we can help it. I think there's a simpler way here. Once the leases are colocated, we set an in-memory "do not transfer" flag on the Replica. This guarantees that the leases will be colocated until our liveness expires, so we can use the end of our current liveness as the transaction expiration. That way, if the transaction can't complete before our time is up, it will be guaranteed to abort. docs/RFCS/20180330_range_merges.md, line 145 at r1 (raw file):
AdminMerge probably needs the same treatment on Q that AdminSplit does: a command-queue write lock for the EndTransaction that runs the merge trigger to ensure the final stats are accurate. We also need some to-be-developed mechanism to lock down R completely during the merge (while still allowing it to unlock if the merge aborts). docs/RFCS/20180330_range_merges.md, line 148 at r1 (raw file):
There has very likely been rot, in addition to the unfinished parts related to locking down the RHS. We'll need to go through splitTrigger, Store.SplitRange, and splitPostApply to make sure that everything there has a counterpart in the merge path. docs/RFCS/20180330_range_merges.md, line 156 at r1 (raw file):
What about alternating splits and merges? We should ensure that there is a gap between the split and merge threshold so that ranges that are just split are unlikely to immediately become mergeable (even if there is a dip in load or some data is deleted). We may want a safeguard like "don't auto-merge a range that has split in the last 24h" (in fact, if we made all splits "sticky" for 24h, would we even want a concept of a permanent sticky bit?) docs/RFCS/20180330_range_merges.md, line 160 at r1 (raw file):
How will this imbalance be detected? Why would merges cause further imbalances? docs/RFCS/20180330_range_merges.md, line 192 at r1 (raw file):
If we know the range is empty (and will stay empty), we can leave the allocator out of the process entirely. Just change the endpoints of the other range and let the empty one be GC'd wherever it is. "And will stay empty" is the tricky part - aside from the special case of dropped tables, guaranteeing this seems to require a large enough portion of the general solution that the special case may not be worth it. Comments from Reviewable |
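Relating to the split/merge-threshold and 24h-cooldown discussion a few comments above, a merge queue's eligibility check might combine size hysteresis with a recent-split cooldown. This is a hedged sketch: the constants, field names, and standalone structure are assumptions made for illustration, not the actual queue logic.

```go
package main

import (
	"fmt"
	"time"
)

// rangeInfo holds the few facts the merge queue would need about a candidate.
// Names and thresholds are illustrative, not CockroachDB's actual values.
type rangeInfo struct {
	sizeBytes   int64
	lastSplit   time.Time
	manualSplit bool // the proposed "sticky bit"
}

const (
	splitThresholdBytes = 64 << 20 // ranges split above this size (for context)
	mergeThresholdBytes = 16 << 20 // merge only well below the split threshold
	splitCooldown       = 24 * time.Hour
)

// mergeable applies the hysteresis (merge threshold far below the split
// threshold) plus a cooldown so a just-split range doesn't immediately merge
// back, and respects a sticky bit if one exists.
func mergeable(r rangeInfo, now time.Time) bool {
	if r.manualSplit {
		return false
	}
	if now.Sub(r.lastSplit) < splitCooldown {
		return false
	}
	return r.sizeBytes < mergeThresholdBytes
}

func main() {
	now := time.Now()
	fresh := rangeInfo{sizeBytes: 1 << 20, lastSplit: now.Add(-time.Hour)}
	old := rangeInfo{sizeBytes: 1 << 20, lastSplit: now.Add(-48 * time.Hour)}
	fmt.Println(mergeable(fresh, now), mergeable(old, now)) // false true
}
```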
Added some more impl suggestions for the actual commit trigger. And I realized I hadn't actually finished reading the document the last time, so the thrashing question is retracted 🙈 Review status: 0 of 1 files reviewed at latest revision, 16 unresolved discussions, all commit checks successful. docs/RFCS/20180330_range_merges.md, line 105 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
👍 docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Are you saying we use a txn deadline that is equal to the expiration of the lease on the RHS? That's not how the txn deadline works: it compares against the provisional commit timestamp, and that has nothing to do with real time. I hope I'm mistaken about what you're suggesting because I'm always interested in a simpler solution. docs/RFCS/20180330_range_merges.md, line 145 at r1 (raw file):
If we cover all of it via the command queue (the nice thing about merges is that we don't have to recompute anything, though note that we likely want to if
To block the command queue of the RHS, we can tamper with its command queue directly (similar to how docs/RFCS/20180330_range_merges.md, line 156 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Are you aware of any splits that are carried out by users for reasons other than performance? I'm worried that there is some outlandish reason for wanting something split that I don't anticipate, and that we would violate here. docs/RFCS/20180330_range_merges.md, line 165 at r1 (raw file):
The merge queue is kind of the easy part. The hard part is to actually make If within this cycle you arrive at an docs/RFCS/20180330_range_merges.md, line 192 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Having thought about the general case a bit now, it's my impression that tackling it head-on is worth it. docs/RFCS/20180330_range_merges.md, line 209 at r1 (raw file):
I've tried to fill in some of that in my comments. Doesn't seem terrible but the devil is in the details. Comments from Reviewable |
Reviewed 1 of 1 files at r1. Comments from Reviewable |
Review status: all files reviewed at latest revision, 18 unresolved discussions, all commit checks successful. docs/RFCS/20180330_range_merges.md, line 26 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
The memory usage appears to be non-trivial. The docs/RFCS/20180330_range_merges.md, line 105 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
As a counter-point, for the TPC-C performance work we manually split the items table (100k rows, a few megabytes of data) in order to spread it out across a cluster. It is possible that adjusting the zone config for that table to use a smaller max-range size would have also worked. Note that this table is read-only. A small read-only table is useful to split for improved load distribution. We wouldn't want range merging to work against that, though I'm also not convinced we need manual splits to be sticky. Comments from Reviewable |
Review status: all files reviewed at latest revision, 18 unresolved discussions, all commit checks successful. docs/RFCS/20180330_range_merges.md, line 105 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
But post merging, wouldn't you just set a zone config that mandates ~few mb ranges? Actually you could do that today, but you likely didn't want to bother with querying Comments from Reviewable |
Review status: all files reviewed at latest revision, 18 unresolved discussions, all commit checks successful. docs/RFCS/20180330_range_merges.md, line 105 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Yes, setting a zone config probably would work (I stated as much). So not really a counter-point, but an additional point that we should keep in mind. Comments from Reviewable |
Force-pushed from 7d911d3 to db3a865.
Yessir. 😜
Hah, that was exactly what I was after. I figured if I tossed out a partial RFC you and Ben would jump in and help fill in the rest. You two certainly haven't disappointed. :) Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions. docs/RFCS/20180330_range_merges.md, line 21 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Good point. Done. docs/RFCS/20180330_range_merges.md, line 26 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Fleshed this section out a bit. The Raft ticker problem is now noted in the alternatives section below. docs/RFCS/20180330_range_merges.md, line 40 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. docs/RFCS/20180330_range_merges.md, line 65 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. docs/RFCS/20180330_range_merges.md, line 80 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Thanks, done. docs/RFCS/20180330_range_merges.md, line 95 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Ugh, it's not quite so simple though. We'll want to be able to set the bit even if the split already exists. docs/RFCS/20180330_range_merges.md, line 101 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Ack, this definitely needs more thought. docs/RFCS/20180330_range_merges.md, line 129 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
The two phases are distinct, though. There's the colocation phase where I agree it's confusing and stupid, though, for these phases to look in different directions. The colocation phase should just look rightward. I'll rework this. docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
@tschottdorf beat me to it. docs/RFCS/20180330_range_merges.md, line 156 at r1 (raw file):
Agreed, though I was thinking a threshold on the order of minutes, not hours.
I think we still would. If I run docs/RFCS/20180330_range_merges.md, line 160 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
In the worst case, merges could pull three replicas off underfull nodes and transfer them to overfull nodes. And even if you're doing neutral work (moving a replica from an overfull store to another overfull store, say), you're preventing a preemptive snapshot that could have balanced the cluster. Doesn't seem like it would be too hard to pause the merge queue when the store pool thinks some stores are overfull/underfull. docs/RFCS/20180330_range_merges.md, line 165 at r1 (raw file):
Agreed. I think I have it half done already. I'm inclined to continue hacking it together before I embark on fixing docs/RFCS/20180330_range_merges.md, line 209 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Yes thank you! 🙇 Comments from Reviewable |
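The "pause the merge queue when the store pool thinks some stores are overfull/underfull" suggestion above could be pictured with a sketch like the following. The storePool type and the crude range-count imbalance heuristic are invented for illustration; the real allocator signal would be richer.

```go
package main

import "fmt"

// storePool is a toy stand-in for the allocator's view of store fullness.
type storePool struct {
	rangeCounts []int
}

// imbalanced reports whether any store deviates from the mean range count by
// more than tolerance (a crude proxy for "overfull/underfull stores exist").
func (sp storePool) imbalanced(tolerance float64) bool {
	if len(sp.rangeCounts) == 0 {
		return false
	}
	var sum int
	for _, c := range sp.rangeCounts {
		sum += c
	}
	mean := float64(sum) / float64(len(sp.rangeCounts))
	for _, c := range sp.rangeCounts {
		if diff := float64(c) - mean; diff > mean*tolerance || -diff > mean*tolerance {
			return true
		}
	}
	return false
}

// shouldPauseMergeQueue holds off on merges while the cluster is visibly
// unbalanced so rebalancing gets priority, per the suggestion above.
func shouldPauseMergeQueue(sp storePool) bool {
	return sp.imbalanced(0.2)
}

func main() {
	fmt.Println(shouldPauseMergeQueue(storePool{rangeCounts: []int{100, 101, 99}})) // false
	fmt.Println(shouldPauseMergeQueue(storePool{rangeCounts: []int{100, 150, 50}})) // true
}
```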
Reviewed 1 of 1 files at r2. docs/RFCS/20180330_range_merges.md, line 95 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
docs/RFCS/20180330_range_merges.md, line 101 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
FWIW, I like Ben's suggestion to just turn it on, though I'm not sure the real world doesn't have a case in which that's a bad idea. We can also introduce the cluster setting to turn it off (but have it on by default) so that folks have a chance to turn it off before they bump the cluster version. docs/RFCS/20180330_range_merges.md, line 105 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
Apologies, reading is not my strong suit today. There was a time in high school where I'd routinely just forget about whole exercises in math exams. Feel reminded of that today. docs/RFCS/20180330_range_merges.md, line 129 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Ack, I'll hold off until you've done so. docs/RFCS/20180330_range_merges.md, line 156 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Also doesn't seem that the sticky bit informs any of the other mechanisms, right? It's just a take it or leave it addon. Comments from Reviewable |
Review status: all files reviewed at latest revision, 12 unresolved discussions, all commit checks successful. docs/RFCS/20180330_range_merges.md, line 101 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Even if we "just turn it on", it will need to be behind a cluster version. I like the idea of having a setting that can be toggled while the upgrade is in its non-finalized state (but be aware of the auto-finalize proposal in #24377) docs/RFCS/20180330_range_merges.md, line 105 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Yeah, by saying that no pre-2.1 splits are sticky we might break a few use cases, and they'll need to either set zone configs with small limits or manually re-run their splits to set the sticky bit. I consider this preferable to the alternative of "all pre-2.1 splits are sticky". docs/RFCS/20180330_range_merges.md, line 129 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
FWIW, I think things make sense as described here. The queue runs on replicas looking leftward, and if it makes sense from their perspective it hands off to the leftward range for the rest of the process (which can now proceed with input from both ranges). docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file):
Yes, exactly, but I see now that I'd need a new change: EndTransaction needs to consult the timestamp cache, so that a new leaseholder can't commit a transaction started by the previous leaseholder. docs/RFCS/20180330_range_merges.md, line 156 at r1 (raw file):
Maybe as a poor-man's partitioning? But without zone constraints I can't see why that would matter for anything but performance.
For manual splits we'd definitely want hours, if not days. For auto-splits minutes might be OK, but still seems aggressive to me. If we set the limit to less than 24h, some clusters will have traffic patterns that split every morning and re-merge at night. (Is that OK? Maybe, but it seems better to me to just leave the splits in place)
I see manual splits as primarily aimed at the cold start problem. As long as I split less than 24h before my launch/announcement/whatever, I'd be fine and the ranges will grow into their unmergeable sizes pretty quickly. The question is whether it's reasonable for admins savvy enough to do manual pre-splitting to also be able to schedule their pre-splitting within 24h of the traffic spike. I think having the sticky bit is better than not having it, but note that if we drop it, we also don't need the docs/RFCS/20180330_range_merges.md, line 160 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Or conversely when a merge opportunity has been identified, the replication queue should try to make room for it to happen (the ranges to merge are already going to be below-average in size). docs/RFCS/20180330_range_merges.md, line 192 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
That's my thought as well. docs/RFCS/20180330_range_merges.md, line 76 at r2 (raw file):
I think If it only removes the sticky bit without attempting the merge, I'd call it Comments from Reviewable |
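To illustrate the point above about EndTransaction consulting the timestamp cache so that a new leaseholder can't commit a transaction started by the previous leaseholder, here is a toy sketch: a lease transfer bumps a low-water mark, and a commit whose transaction record predates the current lease is refused. The single low-water-mark model and all type names are simplifications for illustration, not the actual CockroachDB mechanism.

```go
package main

import "fmt"

// tsCache is a toy timestamp cache with only a low-water mark; a lease
// transfer bumps the low-water mark to the new lease's start time.
type tsCache struct {
	lowWater int64
}

func (c *tsCache) bumpOnLeaseTransfer(leaseStart int64) {
	if leaseStart > c.lowWater {
		c.lowWater = leaseStart
	}
}

type txn struct {
	id string
	// origTimestamp is the timestamp at which the transaction record was
	// written by the previous leaseholder.
	origTimestamp int64
}

// canCommit sketches the check discussed above: EndTransaction consults the
// timestamp cache and refuses to commit a transaction whose record predates
// the current lease.
func canCommit(t txn, c *tsCache) bool {
	return t.origTimestamp >= c.lowWater
}

func main() {
	cache := &tsCache{}
	merge := txn{id: "merge", origTimestamp: 10}
	fmt.Println(canCommit(merge, cache)) // true: same lease as when it started

	cache.bumpOnLeaseTransfer(20)        // lease moved to a new holder
	fmt.Println(canCommit(merge, cache)) // false: new leaseholder rejects it
}
```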
Review status: all files reviewed at latest revision, 13 unresolved discussions, all commit checks successful. docs/RFCS/20180330_range_merges.md, line 129 at r1 (raw file):
FWIW, this is exactly the opposite of what's happening, no? docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I'm still confused about what you're suggesting. We have a transaction that will commit on Comments from Reviewable |
Review status: all files reviewed at latest revision, 13 unresolved discussions, all commit checks successful. docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Nevermind, you're right. I wasn't thinking enough about the followers and was just trying to synchronize between the two Replicas on the leader. But I don't see how the intent proposal above addresses the problem of lagging followers either. Comments from Reviewable |
Review status: all files reviewed at latest revision, 13 unresolved discussions, all commit checks successful. docs/RFCS/20180330_range_merges.md, line 129 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Hmm, I'm confused about why you're confused. I looked at it again and it seems correct. R considers a merge leftward with Q. If things look good, it says, "hey Q, how do you feel about this?" If Q likes it too, Q moves into alignment with R AND cedes the lease to R's leaseholder. Then R's leaseholder, having Q's lease now, merges R into Q. The benefit here is that you'll never have to roundtrip to find out if the sticky bit is set. (The sticky bit necessarily needs to be set on the RHS of a split because ranges are inclusive on the LHS.) Comments from Reviewable |
Review status: all files reviewed at latest revision, 12 unresolved discussions, all commit checks successful. docs/RFCS/20180330_range_merges.md, line 129 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
The root of my confusion is that docs/RFCS/20180330_range_merges.md, line 139 at r1 (raw file):
I don't think that's right, after all. Consider (this is roughly the same problem as with my intent proposal for leaseholder pinning):
We can fortify by mandating that docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
You're right, it's equally broken. For posterity, you mean this scenario: node 1 is initially leaseholder of r1 and r2, r1 tries to subsume r2
I think I have salvaged the idea, but I'll polish this a bit and post it later. I'm rediscovering why all of this is so hard. Comments from Reviewable |
Review status: all files reviewed at latest revision, 12 unresolved discussions, all commit checks successful. docs/RFCS/20180330_range_merges.md, line 76 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
+1 to both of those points. docs/RFCS/20180330_range_merges.md, line 103 at r2 (raw file):
An alternative approach would be to make this a range local key and put it right next to the split point itself. This would allow the sticky bit to be more easily worked with since it would be addressable. It would also naturally force the key onto the right side of the manual split, which fits with the rest of the proposal here. docs/RFCS/20180330_range_merges.md, line 137 at r2 (raw file):
Perfect information? In a distributed system? That must be nice 😃 Unfortunately, since we're not locking both ranges while we process this suggestion, it will always be racey. For instance, while this series of rebalances and lease transfers is taking place, Q and R could continue to grow. We'll need to consider what happens if a split of Q races with a merge of R into Q. EDIT: I see you proposed this as a question below and that this is already a lively discussion! Another race we'll need to explore is a series of ranges each concurrently attempting to merge into the range on their left. Again, I don't think we'll be able to avoid this entirely, but it does introduce an interesting question. Should we favor small ranges merging left into large ranges, large ranges merging left into small ranges, or have no preference? A situation that demonstrates why we should think about this is the case where an empty range is sandwiched between two other ranges. Without any preference for who kicks off the merge, this case could regularly end up in a racing double merge. This could be avoided in the common case by only adding ranges that are less than half the Comments from Reviewable |
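One way to picture the "only small ranges initiate a leftward merge" preference floated above (ranges less than half the threshold kick off the merge, so a tiny sandwiched range doesn't trigger a racing double merge) is a check like the following. The threshold value and function name are hypothetical, chosen only for illustration.

```go
package main

import "fmt"

const mergeThresholdBytes = int64(16 << 20) // illustrative value only

// shouldInitiateLeftwardMerge sketches the preference suggested above: only a
// range well under half the merge threshold starts a merge into its left
// neighbor, which makes a racing double merge around a small sandwiched range
// less likely in the common case.
func shouldInitiateLeftwardMerge(selfBytes, leftNeighborBytes int64) bool {
	if selfBytes >= mergeThresholdBytes/2 {
		return false // big enough to leave alone; let a smaller neighbor act
	}
	// Only proceed if the combined size still qualifies as mergeable.
	return selfBytes+leftNeighborBytes < mergeThresholdBytes
}

func main() {
	fmt.Println(shouldInitiateLeftwardMerge(0, 4<<20))      // true: empty range merges left
	fmt.Println(shouldInitiateLeftwardMerge(12<<20, 4<<20)) // false: initiator too large
	fmt.Println(shouldInitiateLeftwardMerge(4<<20, 20<<20)) // false: result too large to merge
}
```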
Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful. docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file):

The exposition is also lacking, but I've rewritten this numerous times already and I think it's time to involve more eyes in that process.

Making sure the RHS is up to date at the time of merge trigger

First of all, we must also solve the "RHS catch-up problem": when the merge trigger executes, the subsumed replica must be at a log position after we locked down the leaseholder (ie. it has the latest writes). I think this is most adequately addressed by a round of long-poll RPCs to all the replicas of

Snapshots

Something else we have to worry about is that we must not send any snapshots that contain a merge trigger in the log entries. The recipient won't be able to apply the merge trigger; we must send a higher-index snapshot that postdates the merge.

AdminMerge Lock-Down

We make sure that the

Leaseholder Lock-Down

So our only problem left (hopefully) is to lock down (lease and data of) Now we need to prevent followers from stealing a lease unless the merge is aborted. I still believe that the best way to achieve that is to insert a cheap check that runs after a lease acquisition on

Lease Check

The check simply retrieves Note that as far as the data goes, it's OK if

There's no requirement to synchronously knock out

Note: Mergatory

The "mergatory" (bad pun on merge+purgatory, not suggested name) is a list of replicas that the store still owns, but which have been merged away and don't own their data keyspace any more. We could avoid having a mergatory if we eagerly ran GC on a replica while moving it into the mergatory, but this can be problematic for performance, so keep that in mind. Replicas in the mergatory are ignored by The mergatory is visited by the GC queue with high priority.

GCQueue

The GC queue is ultimately responsible for removing the data. Assuming that it finds out (via the meta ranges) that a range has been merged away, it has to distinguish two cases by answering the question:

This is a question that's easy to answer by looking into the local store map (with sufficient synchronization). It can only be true if the range is in the mergatory (another good assertion). Another good assertion is that if a replica comes from the mergatory, it must definitely be GC'ed. If the answer to the question above is "yes", only the rangeID-keyed keyspace is wiped, and otherwise the data as well.

Handling Q-rebalances

Away

Consider the scenario in which a replica of

Back

Note that in addition to the previous scenario.

Note: Eager Intent Resolution

Note that the eager intent resolution with the GC will resolve the intent on the range descriptor of

Avoiding RangeID-Keyspace Leaks

The case in which a replica gets put into mergatory just before the process dies is noteworthy. In that case, when the server starts, it won't instantiate the replica in the first place (as there's no range descriptor). A naive solution is to try to be sensitive to tombstones, but those may have been wiped away with the range Note that making the mergatory synchronous does not solve this problem (see the Q-away case above). We can address this by making sure that any nontrivial Comments from Reviewable |
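The GC-queue decision described above (wipe only the rangeID-keyed keyspace if this store also holds the subsuming range, otherwise wipe the user data too) could be sketched as follows. All names are placeholders invented for illustration, not the GC queue's real interface.

```go
package main

import "fmt"

// mergedReplica captures the two facts the GC queue needs about a replica
// that has been merged away (per the "mergatory" sketch above).
type mergedReplica struct {
	rangeID int64
	// storeHasSubsumingReplica is true if this store also holds a replica of
	// the left-hand range that absorbed this one, and therefore that range
	// now owns the user data keyspace.
	storeHasSubsumingReplica bool
}

type gcAction struct {
	wipeRangeIDKeys bool // replica-local (rangeID-keyed) state: raft log, etc.
	wipeUserData    bool // the range's addressable keyspace
}

// planGC implements the two-case decision described above: if the subsuming
// range lives on this store, only the rangeID-keyed keyspace may be removed;
// otherwise everything goes.
func planGC(r mergedReplica) gcAction {
	if r.storeHasSubsumingReplica {
		return gcAction{wipeRangeIDKeys: true, wipeUserData: false}
	}
	return gcAction{wipeRangeIDKeys: true, wipeUserData: true}
}

func main() {
	fmt.Printf("%+v\n", planGC(mergedReplica{rangeID: 7, storeHasSubsumingReplica: true}))
	fmt.Printf("%+v\n", planGC(mergedReplica{rangeID: 8, storeHasSubsumingReplica: false}))
}
```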
Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful. docs/RFCS/20180330_range_merges.md, line 144 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Eh, already found the first buglet. We have to handle intents on the range descriptors at server start time, but only if the node didn't execute the merge trigger and hasn't received a snapshot of Comments from Reviewable |
The leader knows the last acknowledged log position of each follower. It lags application by a little bit, but I think it's good enough to treat this as a local problem instead of one requiring coordination. The leader waits for all replicas to ack the last log entry in the RHS before sending the EndTransaction, then all the replicas block in the merge trigger until the RHS's applied index has caught up. This is a little hand-wavy (do we need to worry about deadlocks if the raft scheduler runs out of threads?), but I think there's an answer here without adding special long-poll RPCs (it's also possible that long-poll RPCs are a cleaner solution than whatever synchronization we end up needing here).
The snapshot must be sent with a higher applied index. The entry with the merge trigger need not be truncated away. I think this is already covered: we don't send unapplied log entries with our snapshots.
Is it the merge trigger (upstream of raft) or Store.MergeRange (downstream of raft)? Store.MergeRange is the one that worries me more.
Who is "we" here? And what are Q and L here? I'm pretty sure R is the RHS; do Q and L both refer to the LHS or are they something different?
Don't you mean R? I'm still very wary of adding anything that can block lease acquisition.
I don't think the followers of R do anything different here. They will respond to anything the leader sends them. The lockdown only applies to the leader of R, which should be co-located with the leader of Q at the time of the merge. I think the question here is what happens if the leader dies immediately after performing the merge and new leaders of both Q and R get elected. |
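A rough sketch of the leader-side catch-up wait proposed above: before proposing the EndTransaction carrying the merge trigger, the leader polls the follower progress it already tracks until every follower has acked the last RHS log entry. The followerProgress type, the polling loop, and the timeout are assumptions for illustration, not the eventual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// followerProgress is a toy view of what a Raft leader already tracks: the
// highest log index each follower has acknowledged.
type followerProgress map[string]uint64

// waitForRHSCatchUp polls until every follower has acked at least lastIndex,
// mirroring the suggestion above to treat catch-up as a local decision on the
// leader rather than adding a new round of RPCs.
func waitForRHSCatchUp(progress func() followerProgress, lastIndex uint64, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		caughtUp := true
		for _, idx := range progress() {
			if idx < lastIndex {
				caughtUp = false
				break
			}
		}
		if caughtUp {
			return nil
		}
		time.Sleep(10 * time.Millisecond)
	}
	return errors.New("followers did not catch up in time; abort the merge")
}

func main() {
	p := followerProgress{"n2": 41, "n3": 42}
	err := waitForRHSCatchUp(func() followerProgress {
		p["n2"]++ // simulate the lagging follower acking the last entry
		return p
	}, 42, time.Second)
	fmt.Println("catch-up result:", err)
}
```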
Yeah, that might be a good alternative.
Downstream. I'm suggesting that application of the merge trigger catches a
Sorry, there's only
Yeah, Whatever alternative you have in mind (do you have one? I'd be interested to hear) it needs to work even if There are approximately 100 loose ends already in this PR. I think we should get some dedicated real world time to attack some of these between the three of us and decide what can work and what can't. I'm happy to do the prep work and condense the problems and ideas collected so far into something more digestible. WDYT @bdarnell? Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful. Comments from Reviewable |
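As a rough illustration of the downstream-of-raft option mentioned above, the merge trigger's application could spin until the local RHS replica's applied index reaches the index at which the RHS was frozen. The types, timeout, and busy-wait are placeholders; blocking inside apply (and the raft-scheduler deadlock question) is exactly the open concern flagged earlier.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// rhsState is a toy handle on the local right-hand-side replica; appliedIndex
// would normally be read from that replica's state machine.
type rhsState struct {
	appliedIndex func() uint64
}

// applyMergeTrigger waits until the local RHS replica has applied everything
// up to the frozen index before subsuming its data, per the downstream-of-raft
// suggestion above.
func applyMergeTrigger(rhs rhsState, frozenIndex uint64, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for rhs.appliedIndex() < frozenIndex {
		if time.Now().After(deadline) {
			return errors.New("local RHS replica never caught up")
		}
		time.Sleep(5 * time.Millisecond)
	}
	fmt.Println("RHS caught up; safe to subsume its keyspace")
	return nil
}

func main() {
	idx := uint64(40)
	err := applyMergeTrigger(rhsState{appliedIndex: func() uint64 { idx++; return idx }}, 42, time.Second)
	fmt.Println("merge trigger result:", err)
}
```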
Yes
This means that some replicas could experience an error while applying this command while others don't. This would have to be something like a replicaCorruptionError (which means we'd need to make sure that mechanism really works, and figure out what it means to make it reversible).
We must not allow both a merge and a lease acquisition to succeed. One way is to block lease acquisitions if there is any chance a merge that could succeed is in flight. The other way is to ensure that the merge will fail if a new lease is acquired on R. My incomplete suggestion for the latter is to use transaction deadlines: Before time T, the merge can commit, but the lease can't change hands. Afterwards, the reverse is true. The problem with this is of course that the EndTransaction can be proposed but not yet applied. If we were dealing with a single range, the EndTransaction would be well-ordered with respect to the lease attempt (and the new lease will invalidate all commands proposed under the old lease). With two ranges, that's trickier if not impossible.
When we start the merge transaction, we verify that Q and R are co-located, and we write an intent to the range descriptor which ensures that this can't change unless our transaction is aborted. So by the time any rebalancing can occur, we know the disposition of the merge transaction. If the merge succeeded, we shouldn't try to move the dead range R. If Q completes and immediately moves, everything should generally be fine. We just can't copy Q to a store that has a stale replica of R. Am I missing something here? I think the problems with the lease-timestamp proposal mainly have to do with out-of-date replicas instead of non-colocated ones.
SGTM Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful. Comments from Reviewable |
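The "before time T the merge can commit, after T the lease can change hands" split described above can be pictured with a tiny sketch. Using wall-clock time directly and the mergeWindow type are simplifying assumptions made here; the real system would express this with HLC timestamps and transaction deadlines.

```go
package main

import (
	"fmt"
	"time"
)

// mergeWindow fixes a single instant T (for example, the expiration of the
// current lease/liveness on the RHS). The merge transaction gets a commit
// deadline of T, and new leases on the RHS may only start at or after T, so
// at most one of the two can win.
type mergeWindow struct {
	t time.Time
}

func (w mergeWindow) mergeMayCommit(commitTime time.Time) bool {
	return commitTime.Before(w.t)
}

func (w mergeWindow) leaseMayStart(leaseStart time.Time) bool {
	return !leaseStart.Before(w.t)
}

func main() {
	t := time.Now().Add(5 * time.Second)
	w := mergeWindow{t: t}

	fmt.Println(w.mergeMayCommit(t.Add(-time.Second))) // true: commits before T
	fmt.Println(w.leaseMayStart(t.Add(-time.Second)))  // false: no new lease before T
	fmt.Println(w.mergeMayCommit(t.Add(time.Second)))  // false: too late to merge
	fmt.Println(w.leaseMayStart(t.Add(time.Second)))   // true: lease free to move after T
}
```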
No, then we're talking past each other. I'm saying that in addition to the usual check
I see where you're going with this, but I don't know how you would actually make it work. As you said, by the time the decision is made on whether the commit happens on
See the Q-away example. The merge commits on Morally, the way I like to think about my suggestion is to tag the range Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful. Comments from Reviewable |
Yep. The code gets this right: cockroach/pkg/storage/batcheval/cmd_get_snapshot_for_merge.go, lines 40–57 at 6a69ebd.
They are permitted provided they make it through the mergeCompleteCh before it closes. Hmm. I wonder if there are some clock offset problems to worry about here. FYI I haven't updated this RFC in a long time and I wasn't planning to until the 2.1 freeze sets in. LMK if you think there's an area in particular that would benefit from more ahead-of-time discussion though. |
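The mergeCompleteCh behavior referenced above (requests are permitted provided they make it through before the channel closes) can be pictured as a gate like the one below. This is a simplified standalone illustration, not the code linked above; the type and method names are invented.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// mergeGate is a simplified stand-in for the mergeCompleteCh idea: while a
// merge is in flight, incoming requests on the RHS wait; once the merge
// resolves they either proceed (merge aborted) or get redirected (committed).
type mergeGate struct {
	done      chan struct{}
	committed bool
}

func newMergeGate() *mergeGate { return &mergeGate{done: make(chan struct{})} }

// finish records the merge outcome and releases all waiting requests.
func (g *mergeGate) finish(committed bool) {
	g.committed = committed
	close(g.done)
}

// admit blocks until the merge has resolved one way or the other.
func (g *mergeGate) admit() error {
	<-g.done
	if g.committed {
		return errors.New("range was merged away; retry on the left-hand range")
	}
	return nil // merge aborted, serve the request normally
}

func main() {
	g := newMergeGate()
	go func() {
		time.Sleep(50 * time.Millisecond)
		g.finish(true) // pretend the merge committed
	}()
	fmt.Println("request outcome:", g.admit())
}
```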
We shouldn't have to worry about clock offset since the merge entails direct communication between leaseholder nodes, so the new leaseholder's HLC can't have a smaller time than the latest permitted read on the RHS. We do still need to bump the timestamp cache on the new leaseholder over the span of the old RHS though. Looks like there's a TODO for this: cockroach/pkg/storage/store.go, line 2396 at 54a811d.
No this is fine, no need to clean this up while you're still working on the code. |
Release note: None
Force-pushed from db913f5 to 19d1ee2.
I'm fine with merging as-is. |
Ditto
Merging this as-is. bors r+ |
👎 Rejected by PR status |
bors r+ |
👎 Rejected by PR status |
@benesch it looks like this is a CLA issue. Do you mind force pushing to kick off another CI run? |
The bulk of the work is completed. Add a disclaimer that the RFC is woefully out of date and has been supplanted by a tech note instead. Release note: None
Force-pushed from 19d1ee2 to c49a1e4.
Done, but I don't think it helped anything. Ugh. |
please? bors r+ |
Well, CLA assistant still thinks this isn't signed, but Bors is running now. 🤔 |
Welp, Bors noticed and crashed loudly, but now the CLA is signed, so ¯\_(ツ)_/¯. bors r+ |
Bors crashed again. bors r+ |
24394: RFCs: add range merges RFC r=benesch a=benesch This is very much a WIP. I wrote it mostly to collect my own thoughts on range merges. At this stage I'm most interested in getting buy-in on building a general-case prototype, as opposed to the special-cases I lay out in the alternatives section, but I'm of course eager to get feedback on the rest of this design. 33334: tech-notes: add note on range merges r=benesch a=benesch Very much still in progress. Opening a PR in case folks want to follow along. Release note: None Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
Build succeeded |
This is very much a WIP. I wrote it mostly to collect my own thoughts on range merges. At this stage I'm most interested in getting buy-in on building a general-case prototype, as opposed to the special-cases I lay out in the alternatives section, but I'm of course eager to get feedback on the rest of this design.