-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvstorage: complete RaftReplicaID migration #95513
Conversation
6d30d5d
to
16d1a33
Compare
d3e9028
to
d2c0515
Compare
This comment was marked as resolved.
This comment was marked as resolved.
26f4edc
to
18a7ff7
Compare
Now it's good to review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pavelkalinnikov and @tbg)
-- commits
line 36 at r4:
wasn't it v22.1?
There is also a similar comment in
cockroach/pkg/kv/kvserver/store_create_replica.go
Lines 249 to 283 in c9f87cf
// Compatibility: | |
// - v21.2 and v22.1: v22.1 unilaterally introduces RaftReplicaID (an | |
// unreplicated range-id local key). If a v22.1 binary is rolled back at | |
// a node, the fact that RaftReplicaID was written is harmless to a | |
// v21.2 node since it does not read it. When a v21.2 drops an | |
// initialized range, the RaftReplicaID will also be deleted because the | |
// whole range-ID local key space is deleted. | |
// | |
// - v22.2: we will start relying on the presence of RaftReplicaID, and | |
// remove any unitialized replicas that have a HardState but no | |
// RaftReplicaID. This removal will happen in ReplicasStorage.Init and | |
// allow us to tighten invariants. Additionally, knowing the ReplicaID | |
// for an unitialized range could allow a node to somehow contact the | |
// raft group (say by broadcasting to all nodes in the cluster), and if | |
// the ReplicaID is stale, would allow the node to remove the HardState | |
// and RaftReplicaID. See | |
// https://github.com/cockroachdb/cockroach/issues/75740. | |
// | |
// There is a concern that there could be some replica that survived | |
// from v21.2 to v22.1 to v22.2 in unitialized state and will be | |
// incorrectly removed in ReplicasStorage.Init causing the loss of the | |
// HardState.{Term,Vote} and lead to a "split-brain" wrt leader | |
// election. | |
// | |
// Even though this seems theoretically possible, it is considered | |
// practically impossible, and not just because a replica's vote is | |
// unlikely to stay relevant across 2 upgrades. For one, we're always | |
// going through learners and don't promote until caught up, so | |
// uninitialized replicas generally never get to vote. Second, even if | |
// their vote somehow mattered (perhaps we sent a learner a snap which | |
// was not durably persisted - which we also know is impossible, but | |
// let's assume it - and then promoted the node and it immediately | |
// power-cycled, losing the snapshot) the fire-and-forget way in which | |
// raft votes are requested (in the same raft cycle) makes it extremely | |
// unlikely that the restarted node would then receive it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pavelkalinnikov and @sumeerbhola)
wasn't it v22.1?
Yep, thanks.
Updated the comment.
18a7ff7
to
9731f35
Compare
5327822
to
749969b
Compare
Ok, RFAL. |
|
@RaduBerinde requested that we file these concurrency issues inside pebble as issues on the pebble repo. |
Filed cockroachdb/pebble#2301 |
See cockroachdb#93310. This is also the beginning of cockroachdb#93247. Epic: CRDB-220 Release note: None
749969b
to
20685eb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking much clearer now, thanks for implementing the suggestion! LGTM in principle, but left some tips/nits.
Still have to push the updates, please ignore my comment pings until I request. |
20685eb
to
b0cf469
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some final bits and bobs.
As of v22.1[^1], we always write the RaftReplicaID when creating a Replica or updating a snapshot. However, since this is persisted state that could've originated in older versions and not updated yet, we couldn't rely on a persisted ReplicaID yet. This commit adds code to the `(*Store).Start` boot sequence that - persists a RaftReplicaID for all initialized replicas (using the ReplicaID from the descriptor) - deletes all uninitialized replicas that don't have a RaftReplicaID (since we don't know their ReplicaID at this point). The second item in theory violates Raft invariants, as uninitialized Replicas are allowed to vote (though they then cannot accept log entries). So in theory: - an uninitialized replica casts a decisive vote for a leader - it restarts - code in this commit removes the uninited replica (and its vote) - delayed MsgVote from another leader arrives - it casts another vote for the same term for a dueling leader - now there are two leaders in the same term. The above in addition presupposes that the two leaders cannot communicate with each other. Also, even if that is the case, since the two leaders cannot append to the uninitialized replica (it doesn't accept entries), we also need additional voters to return at the exact right time. Since an uninitialized replica without RaftReplicaID in is necessarily at least one release old, this is exceedingly unlikely and we will live with this theoretical risk. This commit also introduces a few assertions that make sure that we don't have overlapping initialized replicas (which would be detected at Store.Start time otherwise while inserting in the btree, but it's nice to catch this earlier) or duplicate RangeIDs. [^1]: cockroachdb#75761 Epic: CRDB-220 Release note: None
402bc02
to
672e8b1
Compare
bors r=pavelkalinnikov |
Build succeeded: |
It now has to be there, so turn this into an assertion failure. See cockroachdb#95513. Epic: CRDB-220 Release note: None
It now has to be there, so turn this into an assertion failure. See cockroachdb#95513. Epic: CRDB-220 Release note: None
115884: kvserver: remove ReplicaID migration r=erikgrinaker a=erikgrinaker This migration ensured every replica had a persisted replica ID. It is no longer needed after `MinSupportedVersion` >= 23.1, since the migration has been applied on every finalized 23.1 node. Instead, we assert during startup that all replicas have a replica ID. Resolves #115869. Touches #95513. Epic: none Release note: None 116443: roachtest: port multitenant/shared-process/basic to new APIs r=srosenberg a=herkolategan Converts multitenant/shared-process/basic to use the new roachprod multitenant APIs. Fixes: #115868 Epic: CRDB-31933 Release Note: None Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com> Co-authored-by: Herko Lategan <herko@cockroachlabs.com>
As of v22.11, we always write the RaftReplicaID when creating a
Replica or updating a snapshot. However, since this is
persisted state that could've originated in older versions and not
updated yet, we couldn't rely on a persisted ReplicaID yet.
This commit adds code to the
(*Store).Start
boot sequence thatReplicaID from the descriptor)
ReplicaID at this point).
The second item in theory violates Raft invariants, as uninitialized
Replicas are allowed to vote (though they then cannot accept log
entries). So in theory:
The above in addition presupposes that the two leaders cannot
communicate with each other. Also, even if that is the case, since the
two leaders cannot append to the uninitialized replica (it doesn't
accept entries), we also need additional voters to return at the exact
right time.
Since an uninitialized replica without RaftReplicaID in is necessarily
at least one release old, this is exceedingly unlikely and we will
live with this theoretical risk.
This PR also adds a first stab at a datadriven test harness for
kvstorage
which is likely to be of use for #93247.Epic: CRDB-220
Release note: None
Footnotes
https://github.com/cockroachdb/cockroach/pull/75761 ↩