storage: learner snap can fail when interacting with raft snap #40207

Closed · tbg opened this issue Aug 26, 2019 · 6 comments · Fixed by #40435
Labels: C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

tbg (Member) commented Aug 26, 2019

When we add a learner, we currently send a snapshot that can race with one sent by the raft snapshot queue. Typically we win (the snap queue has code that skips the snapshot if we're already "starting" to send one from the upreplication code), but if the queue wins, the result can be that the explicit snap gets an error due to an overlapping reservation (see the gist below).

This isn't triggered very frequently on master (at the time of writing) due to a buglet in Raft which has since been fixed in etcd-io/etcd#11037. As a result, we need to address this issue before we can bump Raft past that fix. Since we expect to pick up a fix or two in Raft as joint consensus matures, we want to address this sooner rather than later. OTOH, it's such a small commit that it's easy enough to switch to a fork if we need to do so temporarily.

https://gist.github.com/tbg/fd882caa6ed72e7af0cbd7c8fb0c3504

Unfortunately, no obvious/trivial solution to the race presents itself. It's an open question whether we want to start relying on the raft snap queue to catch up the follower (i.e. poll the follower replica manually via a newly added RPC) or whether we want to prevent the snap queue from ever catching up learners (which we used to do for a short amount of time, until some problem with that became apparent).
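For orientation, here is a minimal, self-contained sketch of the kind of "skip if a snapshot is already being sent" check described above. Every type and method name here is hypothetical; this is not CockroachDB's actual code.

```go
// Illustrative sketch only; nothing here is a real CockroachDB identifier.
package main

import (
	"fmt"
	"sync"
)

// inFlightSnapshots tracks, per target store, whether some caller is already
// in the process of sending a snapshot to it.
type inFlightSnapshots struct {
	mu      sync.Mutex
	pending map[int]bool // keyed by a hypothetical target store ID
}

// tryStart marks a snapshot to the given store as in flight. It returns false
// if one is already being sent, in which case the caller should skip.
func (s *inFlightSnapshots) tryStart(storeID int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pending == nil {
		s.pending = map[int]bool{}
	}
	if s.pending[storeID] {
		return false // someone else (e.g. upreplication) got there first
	}
	s.pending[storeID] = true
	return true
}

// finish clears the in-flight marker once the snapshot has been sent.
func (s *inFlightSnapshots) finish(storeID int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.pending, storeID)
}

func main() {
	var s inFlightSnapshots
	fmt.Println(s.tryStart(2)) // true: the upreplication code starts its snapshot
	fmt.Println(s.tryStart(2)) // false: the raft snapshot queue would skip
	s.finish(2)
}
```

The race in this issue is what happens when the queue gets past such a check first: the explicit learner snapshot then fails on the overlapping reservation.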

tbg added the C-bug label on Aug 26, 2019
tbg added a commit to tbg/cockroach that referenced this issue Aug 26, 2019
This picks up improvements related to joint consensus. We're pinning a
SHA that is close to master but not quite there, to avoid tickling a
bug in CRDB:

cockroachdb#40207

Release note: None
danhhz (Contributor) commented Aug 26, 2019

As I mentioned offline, I think it's time to take a step back here and reconsider where we're at. We started with replica addition sending snapshots, then allowed the raft snapshot queue to also send them 1) so an orphaned learner doesn't get stuck with nowhere to get a snapshot from and 2) to work around etcd/raft's behavior of being extremely picky about getting exactly the snapshot it asked for instead of just any snapshot that allows it to catch up via logs. Since then, Tobi has pointed out that we also need to ensure that an ill-timed log truncation doesn't get the learner stuck forever.

This behavior of learner snapshots coming from two places (potentially even sent from different machines) makes for a complicated story. This is evidenced by my process of de-flaking the learner tests when I merged the default=on PR, many of which were flaky because I kept forgetting about rare races between these two snapshot sources. This tells me we should try to simplify the story if we can.

The hack seems to be: when the learner snapshot fails, start polling for the learner to get a snapshot from the raft snapshot queue, then move on to promoting it to a voter. This naturally leads to the idea that we could make this the normal path and always rely on the raft snapshot queue to send the learner snapshot; the replica addition code would always just poll for this (and likely try to speed it along). However, I don't really think either Tobi or I feel like that's a very clean solution. Going to continue letting this one mull for a bit.
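A rough sketch of what that polling fallback might look like, assuming a hypothetical hasSnapshot check (e.g. backed by a newly added RPC); the interval and the check itself are assumptions, not CockroachDB's actual API.

```go
// Sketch only: hasSnapshot is a stand-in for however the replica addition
// code would learn that the learner has received a snapshot.
package sketch

import (
	"context"
	"time"
)

func waitForLearnerSnapshot(
	ctx context.Context, hasSnapshot func(context.Context) (bool, error),
) error {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			ok, err := hasSnapshot(ctx)
			if err != nil {
				return err
			}
			if ok {
				// The raft snapshot queue caught the learner up; the caller
				// can now promote it to a voter.
				return nil
			}
		}
	}
}
```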

danhhz (Contributor) commented Aug 28, 2019

@tbg lots of cleanup left, but I have a WIP commit for the snap stuff: https://github.com/cockroachdb/cockroach/compare/master...danhhz:learner_snap?expand=1. I'm pretty sure (in unit tests) that the queue is usually finishing the snap before the replica addition even starts it, which results in sending 2x the snaps. I guess this won't happen as often in practice because the snaps will take longer and they'll intersect. Thoughts?

I verified that this fixes the make stress invocation in your gist.

tbg (Member, Author) commented Aug 29, 2019 via email

danhhz (Contributor) commented Aug 29, 2019

Realized that, as written, that PR will transfer two snapshots when the raft snapshot queue wins the race. It creates an entry in snapshotLogTruncationConstraints but doesn't care that there's already one from the raft snap queue. I don't know that we want to skip attempting to send the snapshot from replica addition if getAndGCSnapshotLogTruncationConstraints returns something, because whatever is in flight could be for another replica of the same range. I suppose we could plumb the replica id into snapshotLogTruncationConstraints. This is all very unsatisfying.

I tried a prototype that entirely replaces the sendSnapshot call in replica addition with a call to enqueue the replica in the raft snapshot queue, keeping the waitForApplication call (roughly the flow sketched below). The first issue I had is that sometimes it didn't find a replica (so we should be good on the replicaGC issue you mentioned). I got around this by calling waitForReplicasInit before the waitForApplication call. Then I realized that it was still basically always sending the snapshot twice, because it got queued twice. Still working through that, if you have any thoughts. I'm also worried about how long it takes for waitForApplication to give up if the snapshot can't be applied (for example, if the node is down).
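To make the ordering concrete, here is a self-contained sketch of that prototype's flow under stated assumptions: the helpers are plain function values with made-up names (only waitForReplicasInit and waitForApplication correspond to names mentioned above), not CockroachDB's real methods.

```go
package sketch

import "context"

// learnerAddHelpers bundles the steps as plain function values so the sketch
// stands alone; in CockroachDB these are methods on the replica/store, and
// the field names here are illustrative only.
type learnerAddHelpers struct {
	addLearnerConfChange       func(context.Context) error // the txn that adds the learner
	waitForReplicasInit        func(context.Context) error // wait for the replica to exist on the target
	enqueueInRaftSnapshotQueue func(context.Context)       // ask the queue to catch the learner up
	waitForApplication         func(context.Context) error // wait for the snapshot to be applied
}

// addLearnerViaRaftSnapQueue shows the ordering described above: run the conf
// change, wait for the replica to be initialized (so the later wait can find
// it at all), enqueue it in the raft snapshot queue, then wait for the
// snapshot to be applied.
func (h learnerAddHelpers) addLearnerViaRaftSnapQueue(ctx context.Context) error {
	if err := h.addLearnerConfChange(ctx); err != nil {
		return err
	}
	if err := h.waitForReplicasInit(ctx); err != nil {
		return err
	}
	h.enqueueInRaftSnapshotQueue(ctx)
	return h.waitForApplication(ctx)
}
```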

danhhz (Contributor) commented Aug 29, 2019

Crazy idea: what if I threw the following before the txn that creates the learner? Basically to act as a lock on the raft snapshot queue.

snapUUID := uuid.MakeV4()
r.addSnapshotLogTruncationConstraintLocked(ctx, snapUUID, ???)
defer r.completeSnapshotLogTruncationConstraint(ctx, snapUUID, timeutil.Now())

tbg (Member, Author) commented Aug 30, 2019

Crazy idea: what if I threw the following before the txn that creates the learner? Basically to act as a lock on the raft snapshot queue.

We'd need to have the lock "per replica and only if the target is a learner" because we don't want to run any risk of blocking snapshots to a node that actually needs one to make the range available.

because it got queued twice.

Hmm, interesting. I think this means it's sending a snapshot, but after that is done and before the raft status tracked for the follower reflects the snapshot, the queue loops around and looks at it again? Or is this the "raft doesn't like the index" problem again?
After you pick up the etcd/raft bump motivating this work, I think explicitly queueing isn't as useful. Upon applying the conf change, the leader will broadcast an append to the new replica, which it will reject, which will already add it to the queue (caveat: assuming the rate limiting doesn't drop the add). This takes one message roundtrip delay. Hmm, but if it did get dropped, we'd wait for ~minutes to even start the snapshot...

I agree this is all generally unsatisfying, but I actually find keying the snapshot map by replicaID reasonable. It just doesn't have that yet because the log queue didn't care, and if we had it we could use your suggestion to populate it before adding the learner (we'd be better off using StoreID instead of ReplicaID because the ReplicaID isn't known at that point). That way, the behavior is basically "raft snap queue doesn't send snapshots to learners", except we make it "learners that are already being caught up". Remind me what the problem with letting the raft snap queue unconditionally skip learners was? Either way, what you're suggesting seems like it would "mostly" achieve that, but without having to worry about making bigger semantic changes at this point in the cycle.
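A hypothetical sketch of what keying the in-flight-snapshot bookkeeping by store ID could look like, so the raft snapshot queue skips only the specific learner that is already being caught up. None of these names are real CockroachDB identifiers, and the TTL is an arbitrary placeholder.

```go
package sketch

import (
	"sync"
	"time"
)

type storeID int

// snapshotConstraints is a stand-in for a store-ID-keyed version of the
// snapshotLogTruncationConstraints map discussed above.
type snapshotConstraints struct {
	mu       sync.Mutex
	deadline map[storeID]time.Time // entries are ignored once stale
}

// markPending is what the replica-addition code would call before the conf
// change txn that creates the learner, acting as a per-store "lock".
func (c *snapshotConstraints) markPending(id storeID, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.deadline == nil {
		c.deadline = map[storeID]time.Time{}
	}
	c.deadline[id] = time.Now().Add(ttl)
}

// shouldSkip is what the raft snapshot queue would consult: skip only the
// learner that is already being caught up, never other replicas of the range.
func (c *snapshotConstraints) shouldSkip(id storeID) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	dl, ok := c.deadline[id]
	return ok && time.Now().Before(dl)
}
```

With per-store keying, skipping stays scoped to the one learner being caught up, so the queue never blocks a snapshot that some other replica needs to keep the range available.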

tbg added a commit to tbg/cockroach that referenced this issue Aug 30, 2019
We don't support removing multiple learners atomically just yet, though
cockroachdb#40268 will fix this (likely in raft). That PR, though, is obstructed by cockroachdb#40207
because we'll need that first to be able to bump raft again (assuming we
don't want to fork). Instead of dealing with all that upfront, let's
just not remove multiple learners at once right now so that we can flip
the default for atomic replication changes to on.

If anyone is still trying to remove only learners atomically, they will
fail. However, the replicate queue places a high priority on removing
stray learners whenever it finds them, so this wouldn't be a permanent
problem. Besides, we never add multiple learners at once so it's
difficult to get into that state in the first place.

Without this commit, TestLearnerAdminRelocateRange fails once atomic
replication changes are enabled.

Release note: None
danhhz added a commit to danhhz/cockroach that referenced this issue Sep 3, 2019
The replica addition code first adds the new replica as a raft learner, then
hands it a snapshot, then promotes it to a voter. For various unfortunate
reasons described in the code, we have to allow the raft snapshot queue to
_also_ send snapshots to learners. A recent etcd change made the race
dramatically more likely and exposed that this code has always been brittle
to the raft snapshot queue winning the race and starting the snapshot first.

After this commit, learner replica addition grabs a (best effort) lock
before the conf change txn to add the learner is started. This prevents
the race when the raft leader (and thus the raft snapshot queue for that
range) is on the same node.

Closes cockroachdb#40207

Release note: None
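For context, here is a self-contained sketch of the ordering the commit describes, under stated assumptions: the replica type, its methods, and the TTL below are all made up for illustration, not CockroachDB's real code.

```go
package sketch

import (
	"context"
	"time"
)

// replicaStub stands in for the replica-change machinery; every field name
// here is hypothetical.
type replicaStub struct {
	markSnapshotPending  func(target int, ttl time.Duration) // the best-effort "lock"
	clearSnapshotPending func(target int)
	runLearnerConfChange func(ctx context.Context, target int) error
	sendLearnerSnapshot  func(ctx context.Context, target int) error
	promoteToVoter       func(ctx context.Context, target int) error
}

// addReplica registers the in-flight-snapshot marker *before* the conf change
// txn, so that when the raft leader (and hence the raft snapshot queue for
// the range) is on this node, the queue sees the marker and does not race
// with the explicit learner snapshot.
func (r *replicaStub) addReplica(ctx context.Context, target int) error {
	r.markSnapshotPending(target, time.Minute) // TTL is an arbitrary placeholder
	defer r.clearSnapshotPending(target)

	if err := r.runLearnerConfChange(ctx, target); err != nil {
		return err
	}
	if err := r.sendLearnerSnapshot(ctx, target); err != nil {
		return err
	}
	return r.promoteToVoter(ctx, target)
}
```

The lock is best effort by design: if the raft leader lives on a different node, its snapshot queue never sees this node's marker, which is why the commit message scopes the guarantee to the co-located case.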
craig bot pushed a commit that referenced this issue Sep 4, 2019
40435: storage: be more resilient to learner snap conflicts r=tbg a=danhhz

The replica addition code first adds the new replica as a raft learner, then
hands it a snapshot, then promotes it to a voter. For various unfortunate
reasons described in the code, we have to allow the raft snapshot queue to
_also_ send snapshots to learners. A recent etcd change made the race
dramatically more likely and exposed that this code has always been brittle
to the raft snapshot queue winning the race and starting the snapshot first.

After this commit, learner replica addition grabs a (best effort) lock
before the conf change txn to add the learner is started. This prevents
the race when the raft leader (and thus the raft snapshot queue for that
range) is on the same node.

Closes #40207

Release note: None

40454: kv: improve unit testing around txnSpanRefresher r=nvanbenschoten a=nvanbenschoten

This was decently tested through its inclusion in the TxnCoordSender,
but since I'm planning on changing some behavior to address #36431, I
figured we should first build up a test suite in the same style that
we have for other transaction interceptors.

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot closed this as completed in 20ba7bc on Sep 4, 2019