
[WIP] storage: detect snapshots intended for a different replica ID #40459

Conversation

@ajwerner (Contributor) commented Sep 4, 2019

Before this change, we'd apply a snapshot over an existing initialized
replica regardless of the replica ID its descriptor contained for the
recipient store. This is problematic if the replica on the current store
was removed and re-added while, in the meantime, the range subsumed one
or more other ranges. It is specifically problematic because
maybeAcquireSnapshotMergeLock assumes that if the range descriptor grows,
there will be collocated replicas to subsume. This invariant doesn't hold
if the replica was removed and then added back after the range was
modified.

Release note: None
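
As an illustration of the check the description calls for, here is a minimal, self-contained sketch in Go: before applying a snapshot over an existing initialized replica, compare the replica ID that the snapshot's descriptor assigns to this store with the ID the local replica has. All type and function names below (replicaDescriptor, rangeDescriptor, checkSnapshotReplicaID, ...) are simplified stand-ins for illustration, not the actual store code.

// Minimal sketch (simplified stand-in types; not the actual CockroachDB code):
// reject a snapshot whose descriptor assigns a different replica ID to this
// store than the one the local, already-initialized replica believes it has.
package main

import (
	"errors"
	"fmt"
)

type storeID int
type replicaID int

type replicaDescriptor struct {
	StoreID   storeID
	ReplicaID replicaID
}

type rangeDescriptor struct {
	Replicas []replicaDescriptor
}

// replicaIDForStore returns the replica ID the descriptor assigns to the
// given store, if any.
func (d rangeDescriptor) replicaIDForStore(s storeID) (replicaID, bool) {
	for _, r := range d.Replicas {
		if r.StoreID == s {
			return r.ReplicaID, true
		}
	}
	return 0, false
}

// checkSnapshotReplicaID returns an error if the snapshot is addressed to a
// different incarnation of the replica on this store. In that case the old
// replica would have to be removed (GC'd) before the snapshot is applied.
func checkSnapshotReplicaID(snapDesc rangeDescriptor, s storeID, localID replicaID) error {
	snapID, ok := snapDesc.replicaIDForStore(s)
	if !ok {
		return errors.New("snapshot descriptor does not mention this store")
	}
	if snapID != localID {
		return fmt.Errorf("snapshot intended for replica %d, but local replica is %d", snapID, localID)
	}
	return nil
}

func main() {
	// The local replica was removed and re-added: it still thinks it is
	// replica 3, but the incoming snapshot addresses replica 5 on store 2.
	desc := rangeDescriptor{Replicas: []replicaDescriptor{{StoreID: 2, ReplicaID: 5}}}
	fmt.Println(checkSnapshotReplicaID(desc, 2, 3))
}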

@cockroach-teamcity (Member)

This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner/debug-subsume-fatal-after-second-snapshot branch from 6c5b159 to 2787eb0 on September 4, 2019 07:00
@ajwerner (Contributor, Author) commented Sep 4, 2019

This decidedly doesn't work.

@tbg (Member) commented Sep 4, 2019

The idea is roughly how I also thought we would solve this, but I don't think we can throw in a replicaGC when applying the snapshot - we need to do it when we first receive a message with a higher replicaID. Otherwise, we're clobbering raft state that the new incarnation of the raft member might already have written.

For example, assume the replica is re-added and there's an election going on. The first incoming request is an MsgVote from some candidate, under a replicaID that's newer than what the replica itself has. If we answer the vote and then later run a replicaGC, we're erasing the vote - no good. Instead we need to run the replicaGC before feeding the MsgVote into the system.
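
As a rough sketch of that ordering (simplified stand-in types such as raftMessage and clearStaleState; not the actual store code): detect the higher replica ID on the incoming message and clear the old incarnation's state before handing the message to raft, so nothing recorded on behalf of the new incarnation (such as a vote) can be erased later.

package main

import "fmt"

type replicaID int

// raftMessage is a stand-in for an incoming raft request addressed to a
// particular replica ID on this store.
type raftMessage struct {
	ToReplicaID replicaID
}

type replica struct {
	id           replicaID
	stateCleared bool
}

// clearStaleState stands in for the replica GC path: it wipes the previous
// incarnation's state and adopts the new replica ID.
func (r *replica) clearStaleState(newID replicaID) {
	r.stateCleared = true
	r.id = newID
}

// handleRaftMessage applies the ordering described above: if the message is
// addressed to a higher replica ID than we currently have, clear the old
// state first, then process the message (here just printed as a stand-in).
func (r *replica) handleRaftMessage(m raftMessage) {
	if m.ToReplicaID > r.id {
		r.clearStaleState(m.ToReplicaID)
	}
	fmt.Printf("processing message as replica %d (old state cleared: %t)\n", r.id, r.stateCleared)
}

func main() {
	r := &replica{id: 3}
	// An MsgVote arrives addressed to our new incarnation, replica 5.
	r.handleRaftMessage(raftMessage{ToReplicaID: 5})
}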

I'm curious what problems you're seeing with the current code; the above issue looked too subtle to me to be noticed in our unit tests.

@ajwerner (Contributor, Author) commented Sep 4, 2019

I'm curious what problems you're seeing with the current code; the above issue looked too subtle to me to be noticed in our unit tests.

After stressing it for longer, it still hit the previous assertion with this code. I'm not clear on how this happened, but I went to bed before digging too deeply.

out.1567577356.txt

we need to do it when we first receive a message with a higher replicaID. Otherwise, we're clobbering raft state that the new incarnation of the raft member might have made.

Makes sense. It seems like in processRequestQueue() we use withReplicaForRequest(), which looks like it should return a ReplicaTooOldError.

cockroach/pkg/storage/store.go, lines 3940 to 3962 at 58d0fc3:

	var replTooOldErr error
	if creatingReplica != nil {
		// Drop messages that come from a node that we believe was once a member of
		// the group but has been removed.
		desc := repl.mu.state.Desc
		_, found := desc.GetReplicaDescriptorByID(creatingReplica.ReplicaID)
		// It's not a current member of the group. Is it from the past?
		if !found && creatingReplica.ReplicaID < desc.NextReplicaID {
			replTooOldErr = roachpb.NewReplicaTooOldError(creatingReplica.ReplicaID)
		}
	}
	var err error
	if replTooOldErr != nil {
		err = replTooOldErr
	} else if ds := repl.mu.destroyStatus; ds.reason == destroyReasonRemoved {
		err = errRetry
	} else {
		err = repl.setReplicaIDRaftMuLockedMuLocked(replicaID)
	}
	if err != nil {
		repl.raftMu.Unlock()
		return nil, false, err

Am I reading that code incorrectly? Would it be reasonable to explicitly do the removal in processRequestQueue if we observe the ReplicaTooOldError? I suspect that would still need to be coupled with logic to detect a snapshot with a later replica ID.

@tbg (Member) commented Sep 4, 2019

I think we need to do it pretty much exactly as I said, which is that whenever you see a raft message in which the replicaID is higher than that of the receiving replica, the receiving replica wipes its state before processing that message.

I think what you're suggesting is that when you receive a ReplicaTooOldError from the raft transport, you consider yourself for GC -- we already do that, but note that at this point you might've been re-added already under a new replicaID and made promises under said ID. You need to react to learning about the new ID by wiping the state right there (well, or dropping these messages and queueing for GC, hoping that the GC queue will do the right thing and realize that it can GC the replica even though it is a member, because the replicaID is lower).

Maybe try this "drop and gc queue" approach first as it's less invasive. It would look something like this:

diff --git a/pkg/storage/store.go b/pkg/storage/store.go
index 418c2cc531..49df70b75b 100644
--- a/pkg/storage/store.go
+++ b/pkg/storage/store.go
@@ -3585,6 +3585,23 @@ func (s *Store) processRequestQueue(ctx context.Context, rangeID roachpb.RangeID
 	var hadError bool
 	for i, info := range infos {
 		last := i == len(infos)-1
+		{
+			hackRepl, err := s.GetReplica(info.req.RangeID)
+			if err != nil {
+				// probably need to handle RangeNotFound by proceeding
+			}
+			// NB: need to do this whole block under a lock so the repl can't be created
+			// under a lower replicaID under us now; it will be hard to do this correctly
+			hackRepl.mu.Lock()
+			replID := hackRepl.mu.replicaID
+			hackRepl.mu.Unlock()
+
+			if replID < info.req.ToReplica.ReplicaID {
+				// drop this message, queue for GC
+				s.replicaGCQueue.MaybeAddAsync(ctx, hackRepl, s.cfg.Clock.Now())
+				continue
+			}
+		}
 		pErr := s.withReplicaForRequest(ctx, info.req, func(ctx context.Context, r *Replica) *roachpb.Error {
 			// Save the last Replica we see, since we don't know in advance which
 			// requests will fail during Replica retrieval. We want this later

@tbg (Member) commented Sep 4, 2019

Now that I look at your post again, I think you weren't suggesting what I thought. Did you mix up the cases? What you're describing looks like the case in which a replica receives a message from a peer that is out of date (and you want to ignore that message and tell the peer it should consider a GC). In the situation we're describing, we are a peer that learns that it is out of date but does so in the same moment that it also learns that it's a peer again under a new ID.
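
To keep the two directions apart, here is a minimal sketch (hypothetical helper names senderIsStale and recipientIsStale; not the actual store code): the existing check reacts to a sender whose replica ID has been removed from the range, while the case under discussion is the recipient learning that its own replica ID is stale because the message addresses a newer incarnation of itself.

package main

import "fmt"

type replicaID int

// senderIsStale mirrors the shape of the existing check in
// tryGetOrCreateReplica: the creating replica is no longer in the range
// descriptor and its ID is below NextReplicaID, so the sender is told it is
// too old (ReplicaTooOldError in the real code).
func senderIsStale(senderID replicaID, foundInDesc bool, nextReplicaID replicaID) bool {
	return !foundInDesc && senderID < nextReplicaID
}

// recipientIsStale covers the other direction: the message is addressed to a
// higher replica ID than the local replica currently has, so the local
// replica was removed and re-added and its old state needs to be cleared (or
// the replica GC'd) before the message is processed.
func recipientIsStale(toReplicaID, localReplicaID replicaID) bool {
	return toReplicaID > localReplicaID
}

func main() {
	// A removed peer (ID 2) sends to a range whose descriptor has already
	// advanced past it: the sender is stale.
	fmt.Println("sender stale:", senderIsStale(2, false, 4))

	// The local replica still has ID 3 but the message addresses its new
	// incarnation, ID 5: the recipient is stale.
	fmt.Println("recipient stale:", recipientIsStale(5, 3))
}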

@ajwerner (Contributor, Author) commented Sep 4, 2019

What you're describing looks like the case in which a replica receives a message from a peer that is out of date (and you want to ignore that message and tell the peer it should consider a GC).

I did indeed mix up which replica we were looking at in tryGetOrCreateReplica. Nevertheless, underneath withReplicaForRequest (as opposed to above it, as you suggest) seems like a reasonable spot to detect the replica ID mismatch.

@tbg (Member) commented Sep 4, 2019 via email

tbg added a commit to tbg/cockroach that referenced this pull request Sep 6, 2019
Without the preceding commit, this fatals as seen in cockroachdb#39796:

> F190821 15:14:28.241623 312191 storage/store.go:2172
> [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
> found in right hand side of split

Note that cockroachdb#40459 will break these tests since it will not allow the
replicaID to change without an intervening GC cycle, but we now rely on
catching up via the raft log across a removal, split, and then re-add.
At that point, the only way in which a split trigger could execute while
in non-member status is if a replica got removed but would still receive
a later split trigger. This could happen as follows:

The log is `[..., remove, split]`

a) the follower receives `split`
b) it learns that `split` is committed so that
c) it will apply both `remove` and then `split`.

a) seems possible; if the removed follower is trailing a bit, the leader
may append both entries at the same time before it (the leader) applies
the replication change that removes the follower from the leader's
config.
b) seems possible, as above, at least once we do async command
application, if said append carries a commit index encompassing the
append itself.
c) will then naturally happen in today's code.

There's no way we can reasonably enact this scenario in our testing at
this point, though. The suggestion for cockroachdb#40459 is to essentially revert
this commit, so that we at least retain the coverage related to the
RHS of this test.

Release note (bug fix): Avoid crashes with message "replica descriptor
of local store not found".
@ajwerner (Contributor, Author)

Superseded by #40751

@ajwerner ajwerner closed this Sep 13, 2019