stability: Rebalances must be atomic #12768
If we're using flexible paxos, we can make it work as long as we first ensure that the leader is not in the datacenter where the rebalance is taking place. So if we denote datacenters by numbers and replicas by letters, if we are going from {1A, 2B, 3C} to {1A, 2B, 3D}, we just need to ensure that A or B is the leader before upreplicating to {1A, 2B, 3C, 3D}. Now if we have a datacenter failure at 1 or 2, we have enough replicas (three) to elect a leader. If we have a datacenter failure at 3, we still have our leader so we can upreplicate again to a 5-replica range.
Yeah, we'd need to ensure both that A or B is the leader and that C cannot become leader while the upreplication is in flight. There are also performance considerations here (which we do not currently take into account): the new replica gets its snapshot from the leader, so for performance we'd want to have 3C be the leader when 3D is added. (or we introduce some way to send preemptive snapshots from a follower)
We have a real-world instance of this: a node crashed (in a persistent way so it crashes again each time it's restarted) while the range was only configured for two replicas. This required extreme efforts to revive the downed node instead of simply letting the cluster recover normally.
While it was only configured for two replicas? So this was when up-replicating from 1 to 3? I'm drawing a blank on when else a range would be configured to have two replicas during rebalancing (other than when removing a replica from a dead store and adding a new one to replace it, which this doesn't sound like). |
I think this was while removing a replica from a dead store. So technically this was a two-simultaneous-failure event, but the first failure was transient while the second was persistent. If the replica change had been atomic, we'd have been able to recover when the first transient failure resolved. |
Yeah, we need to fix that, although the fix for that will be different from / simpler than the fix for what this issue was originally describing. We really just need to up-replicate before removing the dead replica, not to make anything atomic. For the record, we hit this issue of removing a dead replica before adding its replacement a couple months ago as well - #25392.
I folded #2067 (can't change the membership with three replicas in a three-node cluster) into this issue.
This is a PR just to show some code to the interested parties. The real thing to look at here is the explanation and suggested strategy below. Don't review the code.

----

Learner replicas are full Replicas minus the right to vote (i.e. they don't count for quorum). They are interesting to us because they allow us to phase out preemptive snapshots (via [delayed preemptive snaps] as a migration strategy; see there for all the things we don't like about preemptive snapshots), and because they can help us avoid spending more time in vulnerable configurations than we need to.

To see an example of the latter, assume we're trying to upreplicate from three replicas to five replicas. As is, we need to add a fourth replica, wait for a preemptive snapshot, and add the fifth. We spend approximately the duration of one preemptive snapshot in an even replica configuration. In theory, we could send two preemptive snapshots first and then carry out two replica additions, but with learners, it would be much more straightforward and less error-prone. This doesn't solve the problems in cockroachdb#12768, but it helps avoid them.

This PR shows the bare minimum of code changes to upreplicate using learner replicas and suggests further steps to make them a reality. Principally, to use learner replicas, we barely need to make any changes. Our current up-replication code is this:

1. send a preemptive snapshot
2. run the ChangeReplicas txn, which adds a `ReplicaDescriptor` to the replicated range descriptor and, on commit, induces a Raft configuration change (`raftpb.ConfChangeAddNode`).

The new up-replication code (note that the old one has to stick around, because compatibility):

1. run the ChangeReplicas txn, which adds a `ReplicaDescriptor` to the replicated range descriptor with the `Learner` flag set to true and, on commit, induces a Raft configuration change (`raftpb.ConfChangeAddLearnerNode`).
2. wait for the learner to have caught up, or send it a Raft snapshot proactively (either works, we just have to make sure not to duplicate work).
3. run a ChangeReplicas txn which removes the `Learner` flag from the `ReplicaDescriptor` and induces a Raft conf change of `raftpb.ConfChangeAddNode` (upgrading the learner).

The existence of learners will need updates throughout the allocator so that it realizes that they don't count for quorum and are either upgraded or removed in a timely manner. None of that is in this POC.

[delayed preemptive snaps]: cockroachdb#35786

Release note: None
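For the interested reader, a minimal sketch of the Raft side of the proposed flow, using the etcd raft conf-change types named above. The `go.etcd.io/etcd/raft/raftpb` import path and the helper function are assumptions for illustration, not code from the PR; the two ChangeReplicas transactions that set and clear the Learner flag are elided.

```go
package main

import (
	"fmt"

	"go.etcd.io/etcd/raft/raftpb"
)

// learnerUpreplicationConfChanges lists, in order, the Raft configuration
// changes induced by the learner-based flow sketched above: first add the
// new store as a learner, then (after it has caught up via log replication
// or a Raft snapshot) promote it to a full voter. Hypothetical helper.
func learnerUpreplicationConfChanges(replicaID uint64) []raftpb.ConfChange {
	return []raftpb.ConfChange{
		// Step 1: proposed on commit of the ChangeReplicas txn that adds
		// the descriptor with Learner=true.
		{Type: raftpb.ConfChangeAddLearnerNode, NodeID: replicaID},
		// Step 2 (not shown): wait for the learner to catch up, or send it
		// a Raft snapshot proactively.
		// Step 3: proposed on commit of the ChangeReplicas txn that clears
		// the Learner flag, upgrading the learner to a voter.
		{Type: raftpb.ConfChangeAddNode, NodeID: replicaID},
	}
}

func main() {
	for _, cc := range learnerUpreplicationConfChanges(4) {
		fmt.Printf("%v -> replica %d\n", cc.Type, cc.NodeID)
	}
}
```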
While a complete fix requires the implementation of joint consensus, there may be things we can do to make partial improvements. There are a few factors that may be leaving ranges in this over-replicated, under-diversified state longer than necessary.
I think the last issue is probably most significant - a cluster that experiences one failure to remove an extra replica per ten minutes will usually have an overreplicated range and be vulnerable to region/AZ failure. There are a number of ways this can happen. For example, we try not to remove a node that is "behind", but I think the combination of heuristics here will err on the side of not removing anything (especially considering that the newly-added replica is likely not caught up yet, and if the queue is quick enough we reach this point while only 2/3 of the original replicas have acked the replica change itself).
Over-replicated ranges can indicate problems with the replication queue and in some configurations can lead to unavailability because over-replicated ranges are often under-diversified. Updates cockroachdb#12768 Release note: None
Touches cockroachdb#12768. Release justification: the previous code would enter vulnerable replication configurations when it wasn't necessary, thus undermining what we wanted to achieve in cockroachdb#12768. PS: the newly added test passes when manually overriding useAtomic with `false`, and I verified that it indeed uses no atomic replication changes. Release note: None
41084: storage: use atomic replication changes in RelocateRange r=danhhz a=tbg I wrote this up rather quickly, but wanted to get this out for review sooner rather than later. ---- Touches #12768. Release justification: the previous code would enter vulnerable replication configurations when it wasn't necessary, thus undermining what we wanted to achieve in #12768. Release note: None Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
This is the last known work item for cockroachdb#12768. Previously, the replicate queue would use an add-and-remove strategy to get replicas off decommissioning or dead stores. Now it uses swaps whenever it can, subject to the usual restrictions around single-replica groups and cluster versions. Release justification: feature completes atomic replication changes. Release note: None
41190: storage: move off dead/decommissioning stores via atomic swaps r=danhhz a=tbg This is the last known work item for #12768. Previously, the replicate queue would use an add-and-remove strategy to get replicas off decommissioning or dead stores. Now it uses swaps whenever it can, subject to the usual restrictions around single-replica groups and cluster versions. Release justification: feature completes atomic replication changes. Release note: None Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
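To make the difference concrete, here is a small illustrative model (not the replicate queue's actual code) of the voter sets a range passes through when moving a replica off store 3 and onto store 4 with each strategy. Store IDs and the helper functions are made up for the example.

```go
package main

import "fmt"

// remove returns stores with the first occurrence of s removed.
func remove(stores []int, s int) []int {
	out := make([]int, 0, len(stores))
	removed := false
	for _, cur := range stores {
		if cur == s && !removed {
			removed = true
			continue
		}
		out = append(out, cur)
	}
	return out
}

// addThenRemove models the old strategy: the range spends time as an
// explicit 4-replica group (quorum 3 of 4) before the old store is removed.
func addThenRemove(start []int, from, to int) [][]int {
	withNew := append(append([]int{}, start...), to)
	return [][]int{start, withNew, remove(withNew, from)}
}

// atomicSwap models the new strategy: a single ChangeReplicas issues the
// addition and the removal together. Under the hood Raft passes through a
// joint configuration needing a majority of {1,2,3} and a majority of
// {1,2,4}, so losing the location that holds both the outgoing and the
// incoming replica (e.g. a dead store's DC) still leaves a quorum, unlike
// the explicit 4-replica state above.
func atomicSwap(start []int, from, to int) [][]int {
	return [][]int{start, append(remove(append([]int{}, start...), from), to)}
}

func main() {
	start := []int{1, 2, 3}
	fmt.Println("add-then-remove:", addThenRemove(start, 3, 4)) // [[1 2 3] [1 2 3 4] [1 2 4]]
	fmt.Println("atomic swap:    ", atomicSwap(start, 3, 4))    // [[1 2 3] [1 2 4]]
}
```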
My new god query is
This lists every range that entered a >3x replicated state. I'm running
Good and bad news. The good news is that the above query verifies that we're never upreplicating to four nodes in
We can check whether we ever put two replicas into region 1 with this query:
which returns lots of results. Picking a range at random, I see this wonderful behavior: Note how all of the replicas were put into a single region 💀 I sort of expected the allocator to do stupid stuff like that, but this is hilariously bad; I hope I broke this just recently. On the "plus" side, the allocator is aware that it's breaking the constraints: When the range is split off, with replicas on n3, n4, n8, the constraints should be OK, but then we move the n4 replica to n3; I don't know why, but note the "constraint check fail".
The replicate queue applied to this range at present says
This tells me that the allocator doesn't even think there is a problem, which is bad.
I found out that this range belongs to the partitioned index
That makes it even more mysterious why the allocator thinks the constraints are violated, but at least it's better than I initially thought. Unfortunately it will be hard to programmatically go through all of the ranges and decide which ones are supposed to be on which set of replicas.
For another bit of confusion, look how many "overreplicated ranges" this thing claims we have (it thinks they're basically all overreplicated)
I don't buy that, the `ranges.overreplicated` metric flatlines at zero. cc @andreimatei / @darinpp (cluster is tobias-1570324285-01-n12cpu4-geo)
Looking.
…On Sun, Oct 6, 2019 at 8:55 AM Tobias Grieger ***@***.***> wrote:
For another bit of confusion, look how many "overreplicated ranges" this
thing claims we have:
***@***.***:26257/tpcc> select * from system.replication_stats;
zone_id | subzone_id | report_id | total_ranges | unavailable_ranges | under_replicated_ranges | over_replicated_ranges
+---------+------------+-----------+--------------+--------------------+-------------------------+------------------------+
0 | 0 | 3 | 198 | 0 | 0 | 0
1 | 0 | 3 | 18 | 0 | 0 | 0
15 | 0 | 3 | 1 | 0 | 0 | 0
16 | 0 | 3 | 1 | 0 | 0 | 0
17 | 0 | 3 | 2 | 0 | 0 | 0
22 | 0 | 3 | 1 | 0 | 0 | 0
53 | 1 | 3 | 5 | 0 | 0 | 5
53 | 2 | 3 | 6 | 0 | 0 | 6
53 | 3 | 3 | 4 | 0 | 0 | 4
54 | 1 | 3 | 1 | 0 | 0 | 0
54 | 2 | 3 | 1 | 0 | 0 | 0
54 | 3 | 3 | 1 | 0 | 0 | 0
55 | 1 | 3 | 21 | 0 | 0 | 21
55 | 2 | 3 | 20 | 0 | 0 | 20
55 | 3 | 3 | 22 | 0 | 0 | 22
56 | 1 | 3 | 434 | 0 | 0 | 434
56 | 2 | 3 | 424 | 0 | 0 | 424
56 | 3 | 3 | 428 | 0 | 0 | 428
56 | 4 | 3 | 243 | 0 | 0 | 243
56 | 5 | 3 | 251 | 0 | 0 | 251
56 | 6 | 3 | 251 | 0 | 0 | 251
57 | 1 | 3 | 89 | 0 | 0 | 89
57 | 2 | 3 | 88 | 0 | 0 | 88
57 | 3 | 3 | 87 | 0 | 0 | 87
57 | 4 | 3 | 48 | 0 | 0 | 48
57 | 5 | 3 | 47 | 0 | 0 | 47
57 | 6 | 3 | 47 | 0 | 0 | 47
57 | 7 | 3 | 42 | 0 | 0 | 42
57 | 8 | 3 | 40 | 0 | 0 | 40
57 | 9 | 3 | 41 | 0 | 0 | 41
58 | 1 | 3 | 2 | 0 | 0 | 2
58 | 2 | 3 | 2 | 0 | 0 | 2
58 | 3 | 3 | 2 | 0 | 0 | 2
59 | 1 | 3 | 709 | 0 | 0 | 709
59 | 2 | 3 | 705 | 0 | 0 | 705
59 | 3 | 3 | 692 | 0 | 0 | 692
60 | 1 | 3 | 379 | 0 | 0 | 379
60 | 2 | 3 | 377 | 0 | 0 | 377
60 | 3 | 3 | 377 | 0 | 0 | 377
60 | 4 | 3 | 56 | 0 | 0 | 56
60 | 5 | 3 | 53 | 0 | 0 | 53
60 | 6 | 3 | 56 | 0 | 0 | 56
61 | 1 | 3 | 1 | 0 | 0 | 1
61 | 2 | 3 | 1 | 0 | 0 | 1
61 | 3 | 3 | 1 | 0 | 0 | 1
(45 rows)
Time: 69.719958ms
***@***.***:26257/tpcc>
I don't buy that, the ranges.overreplicated metric flatlines at zero.
cc @andreimatei <https://github.com/andreimatei> / @darinpp
<https://github.com/darinpp> (cluster is tobias-1570324285-01-n12cpu4-geo)
Actually, @darinpp, if you get here before me and are so inclined, feel free to take a look. It'll have to wait a bit on my part.
#41328 shows a case where you have 4 replicas in 19.2. It does seem like a bug however. |
In that screenshot, the
@tbg can we close this?
Closes cockroachdb#6782 This change modifies the replica_queue to allow rebalances between multiple stores within a single node. This is possible because we previously introduced atomic rebalances in cockroachdb#12768. The first step was to remove the constraints in the allocator that prevented same node rebalances and update the validation in the replica_queue to accept these rebalance proposals. There is one caveat that with 1x replication an atomic rebalance is not possible, so we now support adding multiple replicas of the range to the same node under this condition. With the constraints removed there would be nothing in the allocator to prevent it from placing multiple replicas of a range on the same node across multiple stores. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. Allocator uses locality tags to model failure domains, but a node was not considered to be a locality. It is thus natural to extend the failure domain definition to the node and model it as a locality tier. Now stores on the same node would be factored into the diversity_score and repel each other, just like nodes in the same datacenter do in a multi-region setup. Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.
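To illustrate the last point, here is a simplified, hypothetical diversity score (the allocator's real scoring is more elaborate): once the node is modeled as a locality tier, two stores on the same node compare as identical localities and score 0 against each other, so they repel just as two nodes in the same datacenter do.

```go
package main

import "fmt"

// diversityScore is an illustrative locality comparison: it returns the
// fraction of locality tiers, compared from the top, below the first tier
// on which two stores differ. Higher is more diverse; identical localities
// (e.g. two stores on the same node) score 0.
func diversityScore(a, b []string) float64 {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	if n == 0 {
		return 1
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return float64(n-i) / float64(n)
		}
	}
	return 0
}

func main() {
	storeA := []string{"region=us-east", "node=n1"} // store s1
	storeB := []string{"region=us-east", "node=n1"} // store s2, same node as s1
	storeC := []string{"region=us-east", "node=n2"} // store s3, same region
	storeD := []string{"region=us-west", "node=n3"} // store s4, other region
	fmt.Println(diversityScore(storeA, storeB)) // 0: same node, repel strongly
	fmt.Println(diversityScore(storeA, storeC)) // 0.5: same region, different node
	fmt.Println(diversityScore(storeA, storeD)) // 1: different region
}
```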
Closes cockroachdb#6782 This change modifies the replica_queue to allow rebalances between multiple stores within a single node. This is possible because we previously introduced atomic rebalances in cockroachdb#12768. The first step was to remove the constraints in the allocator that prevented same node rebalances and update the validation in the replica_queue to accept these rebalance proposals. There is one caveat that with 1x replication an atomic rebalance is not possible, so we now support adding multiple replicas of the range to the same node under this condition. With the constraints removed there would be nothing in the allocator to prevent it from placing multiple replicas of a range on the same node across multiple stores. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. Allocator uses locality tags to model failure domains, but a node was not considered to be a locality. It is thus natural to extend the failure domain definition to the node and model it as a locality tier. Now stores on the same node would be factored into the diversity_score and repel each other, just like nodes in the same datacenter do in a multi-region setup. Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.
50053: Script for the PublishRelease TC build configuration r=pbardea a=jlinder Before: the script wasn't implemented. Now: Part of the new release process, this script - tags the selected SHA to the provided name - compiles the binaries and archive and uploads them to S3 as the versioned name - uploads the docker image to docker.io/cockroachdb/cockroach - pushes the tag to github.com/cockroachdb/cockroach - push all the artificats to their respective `latest` locations as appropriate Release note: None 51567: kvserver: Allow rebalances between stores on the same nodes. r=lunevalex a=lunevalex Closes #6782 This change modifies the replica_queue to allow rebalances between multiple stores within a single node. This is possible because we previously introduced atomic rebalances in #12768. The first step was to remove the constraints in the allocator that prevented same node rebalances and update the validation in the replica_queue to accept these rebalance proposals. There is one caveat that with 1x replication an atomic rebalance is not possible, so we now support adding multiple replicas of the range to the same node under this condition. With the constraints removed there would be nothing in the allocator to prevent it from placing multiple replicas of a range on the same node across multiple stores. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. Allocator uses locality tags to model failure domains, but a node was not considered to be a locality. It is thus natural to extend the failure domain definition to the node and model it as a locality tier. Now stores on the same node would be factored into the diversity_score and repel each other, just like nodes in the same datacenter do in a multi-region setup. Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node. 52754: importccl: speed up revert of IMPORT INTO empty table r=dt a=dt When IMPORT INTO fails, it reverts the tables to their pre-IMPORT state. Typically this requires running a somewhat expensive RevertRange operation that finds the keys written by the IMPORT in amongst all the table data and deletes just those keys. This is somewhat expensive -- we need to iterate the keys in the target table and check them to see if they need to be reverted. Non-INTO style IMPORTs create the table into which they will IMPORT and thus can just drop it wholesale on failure, instead of doing this expensive revert. However INTO-style IMPORTs could use a similarly fast/cheap wholesale delete *if they knew the table was empty* when the IMPORT was started. This change tracks which tables were empty when the IMPORT started and then deletes, rather than reverts, the table span on failure. Release note (performance improvement): Cleaning up after a failure during IMPORT INTO a table which was empty is now faster. 53023: opt: add index acceleration support for ~ and && bounding box operators r=rytaft a=rytaft This commit adds index acceleration support for the bounding box comparison operators, `~` and `&&`. It maps `~` to Covers and `&&` to Intersects. Release note (performance improvement): The ~ and && geospatial bounding box operations can now benefit from index acceleration if one of the operands is an indexed geometry column. 
53049: bulkio: Fix transaction semantics in job scheduler. r=miretskiy a=miretskiy Fixes #53033 Fixes #52959 Use transaction when querying for the schedules to run. In addition, ensure that a single bad schedule does not cause all of the previous work to be wasted by using transaction savepoints. Release Notes: None 53132: sql/opt: add implicit SELECT FOR UPDATE support for UPSERT statements r=nvanbenschoten a=nvanbenschoten Fixes #50180. This commit adds support for implicit SELECT FOR UPDATE support for UPSERT statements with a VALUES clause. This should improve throughput and latency for contended UPSERT statements in much the same way that 435fa43 did for UPDATE statements. However, this only has an effect on UPSERT statements into tables with multiple indexes because UPSERT statements into single-index tables hit a fast-path where they perform a blind-write without doing an initial row scan. Conceptually, if we picture an UPSERT statement as the composition of a SELECT statement and an INSERT statement (with loosened semantics around existing rows) then this change performs the following transformation: ``` UPSERT t = SELECT FROM t + INSERT INTO t => UPSERT t = SELECT FROM t FOR UPDATE + INSERT INTO t ``` I plan to test this out on a contended `indexes` workload at some point in the future. Release note (sql change): UPSERT statements now acquire locks using the FOR UPDATE locking mode during their initial row scan, which improves performance for contended workloads when UPSERTing into tables with multiple indexes. This behavior is configurable using the enable_implicit_select_for_update session variable and the sql.defaults.implicit_select_for_update.enabled cluster setting. Co-authored-by: James H. Linder <jamesl@cockroachlabs.com> Co-authored-by: Alex Lunev <alexl@cockroachlabs.com> Co-authored-by: David Taylor <tinystatemachine@gmail.com> Co-authored-by: Rebecca Taft <becca@cockroachlabs.com> Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Consider a deployment with three datacenters (and a replication configuration that places one replica of each range in each DC). During a rebalance, a range that normally requires a quorum of 2/3 will briefly change to a configuration with a 3/4 quorum and two replicas in one DC. If that DC (or the network) fails before the fourth replica is removed, the range becomes unavailable. (If rebalancing instead down-replicated before up-replicating, the intermediate state would have a quorum of 2/2, and a failure of either remaining datacenter would lead to unavailability and much harder recovery.) While the window is fairly small, there is a risk that an ill-timed DC failure could make one or more ranges unavailable for the duration.
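A small self-contained check (illustrative only, not CockroachDB code) makes the quorum arithmetic explicit: given the datacenter of each replica, can the range still reach a majority after losing any single DC?

```go
package main

import "fmt"

// survivesAnyOneDCFailure reports whether a range with voters placed in
// the given datacenters (one entry per replica) can still reach a majority
// quorum after losing any single datacenter.
func survivesAnyOneDCFailure(replicaDCs []int) bool {
	n := len(replicaDCs)
	quorum := n/2 + 1
	dcs := map[int]bool{}
	for _, dc := range replicaDCs {
		dcs[dc] = true
	}
	for dc := range dcs {
		surviving := 0
		for _, r := range replicaDCs {
			if r != dc {
				surviving++
			}
		}
		if surviving < quorum {
			return false
		}
	}
	return true
}

func main() {
	// Steady state: one replica per DC, quorum 2/3. Any single DC can fail.
	fmt.Println(survivesAnyOneDCFailure([]int{1, 2, 3})) // true
	// Mid-rebalance: a fourth replica in DC 3, quorum 3/4. Losing DC 3
	// takes out two replicas and the range becomes unavailable.
	fmt.Println(survivesAnyOneDCFailure([]int{1, 2, 3, 3})) // false
	// Down-then-up ordering is worse: quorum 2/2, either DC loss is fatal.
	fmt.Println(survivesAnyOneDCFailure([]int{1, 2})) // false
	// Five replicas in three DCs (2+2+1): losing any DC leaves >= 3 of 5.
	fmt.Println(survivesAnyOneDCFailure([]int{1, 1, 2, 2, 3})) // true
}
```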
The most general solution to this problem is to implement "joint consensus", the original raft membership change proposal (it is described in the original raft paper and in section 4.3 of Diego's dissertation). Etcd implements a much simpler membership change protocol which has the limitation that only one change can be made at a time (leading to the brief exposure of the intermediate 3/4 state). Joint consensus would let us make these rebalancing changes atomically.
We might also be able to mitigate the problem by using ideas from Flexible Paxos. Flexible Paxos shows that committing entries can be done with a bare majority of nodes, so as long as the leader is not in the failed datacenter the removal of the fourth node can be completed while the DC is down and the range is restored to its 2/3 state. However, if the leader were in the failed DC then the two surviving ones would be unable to make progress since they would have to assume that the former leader is still making progress on the other side of its partition. I'm not sure if there's a full solution here or not.
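For reference, the property Flexible Paxos relies on is that every leader-election quorum intersects every replication quorum; for simple counting quorums of sizes q1 and q2 over n replicas this holds exactly when q1 + q2 > n. A minimal sketch of that condition (nothing like this is implemented in CockroachDB):

```go
package main

import "fmt"

// flexiblePaxosOK reports whether counting quorums of size q1 (leader
// election) and q2 (log replication / commit) are safe for n replicas:
// every election quorum must intersect every replication quorum, which
// for counting quorums holds iff q1 + q2 > n.
func flexiblePaxosOK(n, q1, q2 int) bool {
	return q1 >= 1 && q2 >= 1 && q1 <= n && q2 <= n && q1+q2 > n
}

func main() {
	// The intermediate 4-replica configuration from the issue: with a
	// majority election quorum (3 of 4), a commit quorum of 2 is still
	// safe, which is what would let a surviving leader plus one follower
	// finish removing the fourth replica after a DC failure.
	n := 4
	fmt.Println("q1=3, q2=2:", flexiblePaxosOK(n, 3, 2)) // true
	// Plain majorities on both sides (what raft uses today).
	fmt.Println("q1=3, q2=3:", flexiblePaxosOK(n, 3, 3)) // true
	// A commit quorum of 2 with an election quorum of 2 would be unsafe.
	fmt.Println("q1=2, q2=2:", flexiblePaxosOK(n, 2, 2)) // false
}
```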
Running with a higher replication factor (5 replicas in 3 DCs) could also mitigate the problem if the rebalancer were aware of it (so when the range goes from 3/5 to 4/6, those six replicas are guaranteed to be two per DC). This might be a quick fix to prevent this problem from striking a critical system range. Running with more DCs than replicas (3 replicas in 4 DCs, so you never need to place two replicas in the same DC) also avoids this problem without increasing storage costs, but it has the significant downside that all rebalancing must happen over the more expensive inter-DC links.