kv: don't unquiesce uninitialized replicas #73362
Conversation
I think this is incorrect in two ways which offset each other. Uninitialized replicas don't know that they are learners, because they don't have access to the range descriptor. This means that they aren't disqualified from being "promotable" on that basis, at least from their perspective. However, uninitialized replicas don't know the other members of their range and don't even track themselves in their progress map. This also means that leader elections cannot be used to trigger the cleanup of uninitialized replicas. @tbg's new state machine diagram of replica initialization also hints at there being no path for an uninitialized replica to be removed without going through an initialized replica.
#47982 looks relevant. I guess that confirms the suspicion. So I guess the next steps here are:
@ajwerner do you have interest in pushing that PR over the finish line?
Let's keep Andrew out of it, I'm sure he has enough to do. I might be missing something, but isn't GC'ing uninitialized replicas difficult? We would need an index RangeID->RangeDesc which we don't have (meta ranges are keyed on EndKey, not RangeID) or do full meta scans every now and then.
Also worth pausing to think about the benefit of replicaGC'ing uninit'ed replicas once we don't tick them. Once we no longer tick them, other than a sense of closure the main upshot is removing them from the in-memory maps. The disk state is negligibly different, RangeTombstone key vs HardState key.
We can definitely replicaGC uninitialized replicas that have a trivial HardState.
I lean towards doing a ~daily scan through the meta ranges per node and performing GC of uninitialized replicas based on that, since that is simple and effective and handles the problem fully.
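To make the shape of that idea concrete, here is a minimal sketch, assuming invented types and a stand-in for the meta-range scan (it is not the CockroachDB implementation):

```go
package main

import "fmt"

type RangeID int64

// store is a hypothetical stand-in for the real Store; it only tracks which
// uninitialized replicas it holds.
type store struct {
	uninitialized map[RangeID]bool
}

// gcUninitialized drops uninitialized replicas whose range no longer appears
// in liveRanges, the (hypothetical) result of a scan through the meta ranges.
// The real system would also need to write a RangeTombstone before removal.
func (s *store) gcUninitialized(liveRanges map[RangeID]bool) {
	for id := range s.uninitialized {
		if !liveRanges[id] {
			delete(s.uninitialized, id)
			fmt.Printf("GC'ed uninitialized replica for r%d\n", id)
		}
	}
}

func main() {
	s := &store{uninitialized: map[RangeID]bool{7: true, 9: true}}
	// Pretend the daily meta scan only found r7; r9 was merged away.
	s.gcUninitialized(map[RangeID]bool{7: true})
}
```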
Reviewed 4 of 4 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
-- commits, line 28 at r2:
It's by design in the sense that we can't replicaGC them for the reasons discussed in the main thread. There's definitely no desire to not be able to replicaGC them but it hasn't been a priority.
-- commits, line 31 at r2:
Correct, which might never happen, for example if the range doesn't exist any more (merge).
-- commits, line 35 at r2:
nit: they aren't always learners. You can also get uninit'ed voters with a quick split-rebalance succession. But the uninitialized replica doesn't even know what it is; it will have an empty config, see here:
cockroach/pkg/kv/kvserver/replica_raftstorage.go, lines 66 to 69 at 4c8c0d4:

    // For uninitialized ranges, membership is unknown at this point.
    if raft.IsEmptyHardState(hs) || err != nil {
        return raftpb.HardState{}, raftpb.ConfState{}, err
    }
So they certainly will not call an election, ever, since the progress map is empty. They also can't send a message to anyone in general, since that would require them to know about at least one peer, and peers would have to be in the configuration. The only thing they can do is respond to incoming messages.
This indicates that by not ticking uninit'ed replicas, we're not regressing on their ability to "commonly" replicaGC themselves in response to a rejection of one of their outgoing messages, as this basically never happens in the first place anyway.
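For illustration, here is a small self-contained sketch of that point. It uses the upstream etcd raft types, but the surrounding function is made up: with an empty ConfState there are no peers to campaign against or to message.

```go
package main

import (
	"fmt"

	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

// initialState mimics the quoted logic: for an uninitialized range the
// membership is unknown, so raft is handed an empty ConfState.
func initialState(hs raftpb.HardState, voters []uint64) (raftpb.HardState, raftpb.ConfState) {
	if raft.IsEmptyHardState(hs) {
		return raftpb.HardState{}, raftpb.ConfState{}
	}
	return hs, raftpb.ConfState{Voters: voters}
}

func main() {
	_, cs := initialState(raftpb.HardState{}, []uint64{1, 2, 3})
	// With no voters in the config, the replica has nobody to campaign
	// against and nobody to send messages to; it can only react.
	fmt.Println(len(cs.Voters)) // prints 0
}
```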
pkg/kv/kvserver/store_raft.go, line 638 at r2 (raw file):
    // Filter out uninitialized replicas. We could tick them, but doing so is
    // unnecessary because uninitialized replicas can never win elections, so
Might want to update this to reflect our finding that uninited replicas cannot even send non-reactive messages since they don't know who the peers are.
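As a rough sketch of what such a filter looks like (simplified, invented types rather than the real store_raft.go code), the tick path simply skips replicas that are still uninitialized:

```go
package main

import "fmt"

type rangeID int64

type replica struct {
	initialized bool
}

type store struct {
	unquiesced map[rangeID]*replica
}

// updateIDsToTick appends the IDs of replicas that should receive a Raft tick,
// filtering out uninitialized replicas, which never campaign anyway.
func (s *store) updateIDsToTick(ids []rangeID) []rangeID {
	for id, r := range s.unquiesced {
		if !r.initialized {
			continue // uninitialized: nothing useful for a tick to do
		}
		ids = append(ids, id)
	}
	return ids
}

func main() {
	s := &store{unquiesced: map[rangeID]*replica{
		1: {initialized: true},
		2: {initialized: false},
	}}
	fmt.Println(s.updateIDsToTick(nil)) // only r1 is ticked
}
```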
pkg/kv/kvserver/store_raft.go, line 647 at r2 (raw file):
    // them through the Raft scheduler and ticking them avoids a meaningful
    // amount of periodic work for each uninitialized replica.
    s.mu.RLock()
Reading this makes me wonder why we allow uninited replicas to be unquiesced. Not suggesting you change that, but maybe add a comment saying that it's just the way it is and there is no good reason and we don't like lots of things about uninitialized replicas and see #72374.
isn't GC'ing uninitialized replicas difficult? We would need an index RangeID->RangeDesc which we don't have (meta ranges are keyed on EndKey, not RangeID) or do full meta scans every now and then.
This makes sense to me, thanks for explaining. This must mean that the approach in #47982 is broken. It seems to perform its consistent meta lookup with the replica's in-memory RangeDescriptor here. But if the replica is uninitialized, this descriptor will be empty. So it won't actually look up the right range.
Also worth pausing to think about the benefit of replicaGC'ing uninit'ed replicas once we don't tick them. Once we no longer tick them, other than a sense of closure the main upshot is removing them from the in-memory maps. The disk state is negligibly different, RangeTombstone key vs HardState key.
👍 that matches my understanding as well.
I lean towards doing a ~daily scan through the meta ranges per node and performing GC of uninitialized replicas based on that, since that is simple and effective and handles the problem fully.
This seems nice and simple.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/kv/kvserver/store_raft.go, line 647 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Reading this makes me wonder why we allow uninited replicas to be unquiesced. Not suggesting you change that, but maybe add a comment saying that it's just the way it is and there is no good reason and we don't like lots of things about uninitialized replicas and see #72374.
That's a really interesting idea. If we allowed uninited replicas to quiesce (on their own, with no leader input) then we wouldn't need any of this special handling. These replicas would be removed from the unquiescedReplicas map and stay out of it until they heard from a peer. I'll explore that alternate approach, as it seems pretty clean and conceptually consistent.
Thanks for all your work on this @nvanbenschoten! Wrote up an issue to handle GC in #73424.
Force-pushed from 6a7631b to 32b96f4.
Thanks for opening the issue @erikgrinaker! I updated the code comment to point to that issue.
@tbg I rewrote this PR according to your suggestion. Instead of letting uninitialized replicas unquiesce but then filtering them out in the tick loop, we now no longer let uninitialized replicas unquiesce. This seems cleaner to me, as it introduces a reasonable relationship between state machine initialization and quiescence and uses the existing quiescence logic to prevent these replicas from ticking, instead of introducing an additional orthogonal condition for ticking. PTAL.
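A minimal sketch of that relationship, using hypothetical types rather than the real Replica: Raft traffic is still handled reactively, but only an initialized replica actually leaves the quiesced state.

```go
package main

import "fmt"

type replica struct {
	quiescent   bool
	initialized bool
	msgs        int
}

// handleRaftMessage always does the reactive work, but only unquiesces the
// replica once its state machine has been initialized (e.g. via snapshot).
func (r *replica) handleRaftMessage() {
	r.msgs++ // reactive processing happens regardless
	if r.initialized && r.quiescent {
		r.quiescent = false
	}
}

func main() {
	r := &replica{quiescent: true}
	r.handleRaftMessage()
	fmt.Println(r.quiescent, r.msgs) // true 1: still quiesced, but it responded
	r.initialized = true             // snapshot applied
	r.handleRaftMessage()
	fmt.Println(r.quiescent, r.msgs) // false 2: now it unquiesces
}
```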
Force-pushed from 32b96f4 to 6b66392.
Reviewed 2 of 4 files at r4, 9 of 9 files at r9, 4 of 4 files at r10, 4 of 4 files at r11, 4 of 4 files at r12, 7 of 7 files at r13, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/client_raft_test.go, line 4105 at r13 (raw file):
    defer tc.Stopper().Stop(ctx)
    _, desc, err := tc.Servers[0].ScratchRangeEx()
Add a comment about using an expiration-based lease or, my preference if possible, switch to ScratchRange.
pkg/kv/kvserver/client_raft_test.go, line 4110 at r13 (raw file):
    require.NoError(t, tc.WaitForSplitAndInitialization(key))
    // Block a snapshot.
// Block incoming snapshots on s2 until channel is signaled.
pkg/kv/kvserver/client_raft_test.go, line 4114 at r13 (raw file):
    handlerFuncs := noopRaftHandlerFuncs()
    handlerFuncs.snapErr = func(header *kvserver.SnapshotRequest_Header) error {
        <-blockSnapshot
give up on stopper.ShouldQuiesce()?
pkg/kv/kvserver/client_raft_test.go, line 4117 at r13 (raw file):
        return nil
    }
    s1, err := tc.Server(1).GetStores().(*kvserver.Stores).GetStore(tc.Server(1).GetFirstStoreID())
s2? s1 is on tc.Server(0). Indexing, always a joy.
pkg/kv/kvserver/client_raft_test.go, line 4125 at r13 (raw file):
    })
    // Try to up-replicate to s1. Should block on a learner snapshot after the new
s2 here and below?
pkg/kv/kvserver/client_raft_test.go, line 4130 at r13 (raw file):
    // receiving Raft traffic from the leader.
    replicateErrChan := make(chan error)
    go func() {
nit: I know it doesn't matter here but if we have a stopper might as well use it.
pkg/kv/kvserver/replica.go, line 306 at r13 (raw file):
    //
    // Replica objects always begin life in a quiescent state, as the field is
    // set to true in the Replica constructor newUnloadedReplica. They unquiesce
Ohh, interesting. I had no idea. That explains your earlier commit. I take back my comment about || internalRaftGroup == nil then.
pkg/kv/kvserver/replica.go, line 310 at r13 (raw file):
    // unquiesceWithOptionsLocked, which are called in response to Raft traffic.
    //
    // Only initialized replicas that have a non-nil internalRaftGroup are
This makes it sound like there are initialized replicas with a nil raft group, which raises the question of how these are allowed to unquiesce (since they are not according to this comment). So either init'ed replicas with nil group don't exist, or whether the group is nil or not doesn't matter for purposes of unquiescing?
pkg/kv/kvserver/replica_metrics.go, line 62 at r11 (raw file):
    raftStatus := r.raftStatusRLocked()
    leaseStatus := r.leaseStatusAtRLocked(ctx, now)
    quiescent := r.mu.quiescent || r.mu.internalRaftGroup == nil
Interesting. Have I missed a commit? Because, wouldn't an uninitialized replica start out with quiescent = false now when it was previously true? As is, this seems like a behavior change to me. Not a terribly important one (since this is only for ReplicaMetrics and internalRaftGroup is populated anyway within ~10 minutes (scan interval)) but still, this doesn't quite check out with the commit message.
pkg/kv/kvserver/replica_raft_quiesce.go, line 52 at r13 (raw file):
    }
    func (r *Replica) unquiesceWithOptionsLocked(campaignOnWake bool) {
maybeUnquiesceWithOptionsLocked to indicate that it may not be possible?
pkg/kv/kvserver/replica_raft_quiesce.go, line 116 at r13 (raw file):
    // lazily in response to any Raft traffic (see stepRaftGroup) or KV request
    // traffic (see maybeInitializeRaftGroup). If it has yet to be initialized,
    // let it remain quiesced. The Raft group will be initialized soon enough.
This is pretty annoying; we morally say that a replica unquiesces when it needs to use raft, but then we also say this isn't true, you need some folks bypassing this check and creating the group first? This is pretty complex and I'm not at all clear on why this works out in practice (I assume we have the right calls in the actual critical path).
I think we should initialize the raft group right away. Is now perhaps the right time to make that change? Doing this lazily was a half-baked way to avoid election storms back when this was still a problem and possibly before we even had quiescence. Now that quiescence is the default, what's the benefit of the lazy raft group?
I understand that we're looking to backport this, so maybe a fast-follow on master only?
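For context, a toy sketch of the lazy-creation pattern under discussion, with invented names standing in for withRaftGroupLocked and friends: the group only exists after some caller actually needs Raft, which is why an unquiesce attempt can observe a nil group.

```go
package main

import "fmt"

type raftGroup struct{}

type replica struct {
	group     *raftGroup
	quiescent bool
}

// withRaftGroup lazily creates the Raft group before running f, mirroring the
// "create on first use" behavior described above.
func (r *replica) withRaftGroup(f func(g *raftGroup)) {
	if r.group == nil {
		r.group = &raftGroup{} // created on first use, not at replica creation
	}
	f(r.group)
}

// maybeUnquiesce refuses to unquiesce while the group is still nil, which is
// the dependency this thread is wrestling with.
func (r *replica) maybeUnquiesce() bool {
	if r.group == nil {
		return false
	}
	r.quiescent = false
	return true
}

func main() {
	r := &replica{quiescent: true}
	fmt.Println(r.maybeUnquiesce()) // false: no group has been created yet
	r.withRaftGroup(func(*raftGroup) {})
	fmt.Println(r.maybeUnquiesce()) // true: group exists, replica unquiesces
}
```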
pkg/kv/kvserver/store_create_replica.go, line 351 at r13 (raw file):
    // quiesced, so we don't need to campaign or wake the leader. We just want
    // to start ticking.
    lockedRepl.unquiesceWithOptionsLocked(false /* campaignOnWake */)
Beating the same drum as before, but now we have to know that the internal raft group is no longer nil (probably because we hit withReplicaForRequest?)
Did you have to add this line in response to any failures? It makes sense, but shouldn't be necessary, right?
Force-pushed from 6b66392 to b4f814d.
TFTR! I finally got around to addressing your comments.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/client_raft_test.go, line 4105 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Add a comment about using an expiration-based lease or, my preference if possible, switch to ScratchRange.
ScratchRangeEx isn't about an expiration-based lease; it's the "extended" version of ScratchRange that returns the left and right descriptor instead of a start key. This made more sense to me than the logic I see in some places where we ScratchRange and then take the returned start key and look up the corresponding descriptor that we just dropped in ScratchRange.
pkg/kv/kvserver/client_raft_test.go, line 4110 at r13 (raw file):
// Block incoming snapshots on s2 until channel is signaled.
Done.
pkg/kv/kvserver/client_raft_test.go, line 4114 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
give up on stopper.ShouldQuiesce()?
Done.
pkg/kv/kvserver/client_raft_test.go, line 4117 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
s2? s1 is on tc.Server(0). Indexing, always a joy.
Done.
pkg/kv/kvserver/client_raft_test.go, line 4125 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
s2 here and below?
Done.
pkg/kv/kvserver/client_raft_test.go, line 4130 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: I know it doesn't matter here but if we have a stopper might as well use it.
Done.
pkg/kv/kvserver/replica.go, line 306 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ohh, interesting. I had no idea. That explains your earlier commit. I take back my comment about || internalRaftGroup == nil then.
Me neither until writing this up. But I think it makes sense for things to work like that.
pkg/kv/kvserver/replica.go, line 310 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This makes it sound like there are initialized replicas with a nil raft group, which raises the question of how these are allowed to unquiesce (since they are not according to this comment). So either init'ed replicas with nil group don't exist, or whether the group is nil or not doesn't matter for purposes of unquiescing?
The dependency is the other way around. There are initialized replicas with a nil raft group. They unquiesce by first initializing their raft group in withRaftGroupLocked and then trying to unquiesce. But let's get rid of this craziness in #73715.
pkg/kv/kvserver/replica_metrics.go, line 62 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Interesting. Have I missed a commit? Because, wouldn't an uninitialized replica start out with quiescent = false now when it was previously true? As is, this seems like a behavior change to me. Not a terribly important one (since this is only for ReplicaMetrics and internalRaftGroup is populated anyway within ~10 minutes (scan interval)) but still, this doesn't quite check out with the commit message.
I think this comment came before your "I take back my comment", but in case it wasn't clear, all replicas (initialized or uninitialized) start with quiescent = true. They can then only unquiesce if r.mu.internalRaftGroup != nil, which has always been the case and is not changing in this PR. So !r.mu.quiescent implies r.mu.internalRaftGroup != nil, meaning that this was redundant.
pkg/kv/kvserver/replica_raft_quiesce.go, line 52 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
maybeUnquiesceWithOptionsLocked to indicate that it may not be possible?
Done. I also added a return value that we can use for assertions in a few places. PTAL at the new commit.
pkg/kv/kvserver/replica_raft_quiesce.go, line 116 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is pretty annoying; we morally say that a replica unquiesces when it needs to use raft, but then we also say this isn't true, you need some folks bypassing this check and creating the group first? This is pretty complex and I'm not at all clear on why this works out in practice (I assume we have the right calls in the actual critical path).
I think we should initialize the raft group right away. Is now perhaps the right time to make that change? Doing this lazily was a half-baked way to avoid election storms back when this was still a problem and possibly before we even had quiescence. Now that quiescence is the default, what's the benefit of the lazy raft group?
I understand that we're looking to backport this, so maybe a fast-follow on master only?
I think I see what you're saying. The idea would be that if quiescence is the default and we only unquiesce in all the same places that would otherwise initialize the raft group, then this lazy raft group creation isn't doing anything anymore, and so the dependency from internalRaftGroup != nil -> !quiescent is unnecessary complexity. I think I agree with that in theory, as it simplifies the states that a replica can be in significantly.
But I'd like to avoid expanding the scope of this PR any further, so a fast-follow sounds good. I opened #73715 to track the work.
pkg/kv/kvserver/store_create_replica.go, line 351 at r13 (raw file):
but now we have to know that the internal raft group is no longer nil (probably because we hit withReplicaForRequest?)
Yes, that's correct. If we made it here then it was in response to a raft message and so the internal raft group is no longer nil.
Did you have to add this line in response to any failures? It makes sense, but shouldn't be necessary, right?
No, this wasn't in response to any failures. In fact, it's not strictly necessary because this method is only ever called inside of handleRaftReadyRaftMuLocked in response to snapshots, which also calls withRaftGroupLocked with a true return value for unquiesceAndWakeLeader before completing. So if we ever initialize ourselves inside of handleRaftReadyRaftMuLocked and don't unquiesce before getting to the end, that will take care of unquiescing for us.
But this isn't entirely superfluous; there is a benefit to calling this here instead of waiting. That's because doing so allows us to call unquiesceWithOptionsLocked and to pass !campaignOnWake. So since we know we don't need to campaign or wake the leader, it's better to unquiesce now and prevent the other code path from doing so.
I added both of these points to the comment.
pkg/kv/kvserver/store_raft.go, line 638 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Might want to update this to reflect our finding that uninited replicas cannot even send non-reactive messages since they don't know who the peers are.
Done.
Reviewed 8 of 8 files at r14, 1 of 1 files at r15, 1 of 1 files at r16, 6 of 6 files at r17, 10 of 10 files at r18, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/store_create_replica.go, line 369 at r18 (raw file):
    // the range, so we can unquiesce without campaigning or waking the leader.
    if !lockedRepl.maybeUnquiesceWithOptionsLocked(false /* campaignOnWake */) {
        return errors.Errorf("expected replica %s to unquiesce after initialization", desc)
errors.AssertionFailedf? If we hit this the world will end, right? Can't imagine this will be handled gracefully by the callers.
This is a small refactor that eliminates a call to `HLC.Now` on each replica tick. Due to the issue fixed by the following commit, we were ticking 1000s of uninitialized replicas in fast succession and as a result, these clock accesses were a major source of mutex contention. We shouldn't be ticking uninitialized replicas, but since we now know that this is an expensive part of Replica.tick, we might as well make it cheaper.
Refactor that does not change behavior.
A replica cannot unquiesce if its internalRaftGroup is nil, so this condition was redundant but confusing.
In a [support issue](cockroachlabs/support#1340), we saw that 10s of thousands of uninitialized replicas were being ticked regularly and creating a large amount of background work on a node, driving up CPU. This commit updates the Raft quiescence logic to disallow uninitialized replicas from being unquiesced and Tick()'ing themselves.

Keeping uninitialized replicas quiesced even in the presence of Raft traffic avoids wasted work. We could Tick() these replicas, but doing so is unnecessary because uninitialized replicas can never win elections, so there is no reason for them to ever call an election. In fact, uninitialized replicas do not even know who their peers are, so there would be no way for them to call an election or for them to send any other non-reactive message. As a result, all work performed by an uninitialized replica is reactive and in response to incoming messages (see processRequestQueue).

There are multiple ways for an uninitialized replica to be created and then abandoned, and we don't do a good job garbage collecting them at a later point (see cockroachdb#73424), so it is important that they are cheap. Keeping them quiesced instead of letting them unquiesce and tick every 200ms indefinitely avoids a meaningful amount of periodic work for each uninitialized replica.

Release notes (bug fix): uninitialized replicas that are abandoned after an unsuccessful snapshot no longer perform periodic background work, so they no longer have a non-negligible cost.
This commit renames the `unquiesceXYZ` methods to include a "maybe" prefix to indicate that they may not succeed in unquiescing the replica. The commit also introduces a boolean return value to these methods to inform the caller about whether the unquiesce attempt was successful. Finally, the commit uses this return value in `maybeMarkReplicaInitializedLockedReplLocked` to assert that after initializing a replica, it did unquiesce immediately.
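A condensed sketch of that shape, with a hypothetical replica type standing in for the real one: the maybe-prefixed method reports whether it unquiesced, and the initialization path turns a refusal into an assertion failure via errors.AssertionFailedf, as suggested later in the review.

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/errors"
)

type replica struct {
	quiescent     bool
	hasRaftGroup  bool
	isInitialized bool
}

// maybeUnquiesce reports whether the replica was unquiesced. It refuses if the
// replica is not quiesced, is still uninitialized, or has no Raft group yet.
func (r *replica) maybeUnquiesce() bool {
	if !r.quiescent || !r.isInitialized || !r.hasRaftGroup {
		return false
	}
	r.quiescent = false
	return true
}

// markInitialized asserts that a freshly initialized replica unquiesces.
func markInitialized(r *replica) error {
	r.isInitialized = true
	if !r.maybeUnquiesce() {
		return errors.AssertionFailedf("expected replica to unquiesce after initialization")
	}
	return nil
}

func main() {
	r := &replica{quiescent: true, hasRaftGroup: true}
	fmt.Println(markInitialized(r)) // <nil>: the assertion holds
}
```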
Force-pushed from b4f814d to 95a7671.
TFTR!
bors r=tbg
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/store_create_replica.go, line 369 at r18 (raw file):
Previously, tbg (Tobias Grieger) wrote…
errors.AssertionFailedf? If we hit this the world will end, right? Can't imagine this will be handled gracefully by the callers.
Done.
Build succeeded.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

- error creating merge commit from 95a7671 to blathers/backport-release-21.1-73362: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []; you may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.1.x failed. See errors above.
- error creating merge commit from 95a7671 to blathers/backport-release-21.2-73362: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []; you may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
After cockroachdb#73362, all uninitialized replicas are quiescent. The callers of addToReplicasByRangeIDLocked are either inserting an uninitialized Replica, or a newly created initialized Replica which has not been linked to the Store yet and hence has not been unquiesced either. The quiescence check is thus a no-op, and can be removed from addToReplicasByRangeIDLocked.

As a second "proof" for why it can be removed:
- rangeID is added to unquiescedReplicas in 2 places (Replica.maybeUnquiesce*), both of which also set r.mu.quiescent to false a few lines above.
- rangeID is removed from unquiescedReplicas in 2 places (Replica.quiesceLocked and Replica removal in Store.unlinkReplicaByRangeIDLocked). The first one sets r.mu.quiescent to true a few lines apart, and the second one outright removes the replica.

Thus, it is already true that r.mu.quiescent == false iff the replica's rangeID is in unquiescedReplicas (with a little exception of when r is being removed, which does not impact our check).

Release note: none
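A toy sketch of the invariant argued above, with hypothetical types: the quiescent flag and membership in the store's unquiescedReplicas set are only ever updated together, so checking one is equivalent to checking the other.

```go
package main

import "fmt"

type rangeID int64

type replica struct {
	id        rangeID
	quiescent bool
}

type store struct {
	unquiesced map[rangeID]struct{}
}

// unquiesce and quiesce update the flag and the set in lockstep, which is the
// whole point of the argument quoted above.
func (s *store) unquiesce(r *replica) {
	r.quiescent = false
	s.unquiesced[r.id] = struct{}{}
}

func (s *store) quiesce(r *replica) {
	r.quiescent = true
	delete(s.unquiesced, r.id)
}

func main() {
	s := &store{unquiesced: map[rangeID]struct{}{}}
	r := &replica{id: 1, quiescent: true}
	s.unquiesce(r)
	_, inSet := s.unquiesced[r.id]
	fmt.Println(!r.quiescent == inSet) // true: the two stay in sync
}
```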
In a support issue, we saw that 10s of thousands of uninitialized replicas were being ticked regularly and creating a large amount of background work on a node, driving up CPU. This commit updates the Raft quiescence logic to disallow uninitialized replicas from being unquiesced and Tick()'ing themselves.
Keeping uninitialized replicas quiesced even in the presence of Raft traffic avoids wasted work. We could Tick() these replicas, but doing so is unnecessary because uninitialized replicas can never win elections, so there is no reason for them to ever call an election. In fact, uninitialized replicas do not even know who their peers are, so there would be no way for them to call an election or for them to send any other non-reactive message. As a result, all work performed by an uninitialized replica is reactive and in response to incoming messages (see processRequestQueue).

There are multiple ways for an uninitialized replica to be created and then abandoned, and we don't do a good job garbage collecting them at a later point (see #73424), so it is important that they are cheap. Keeping them quiesced instead of letting them unquiesce and tick every 200ms indefinitely avoids a meaningful amount of periodic work for each uninitialized replica.
Release notes (bug fix): uninitialized replicas that are abandoned after an unsuccessful snapshot no longer perform periodic background work, so they no longer have a non-negligible cost.