storage: learner snap can fail when interacting with raft snap #40207

Closed · tbg opened this issue Aug 26, 2019 · 6 comments · Fixed by #40435
Labels: C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

tbg (Member) commented Aug 26, 2019

When we add a learner, we currently send a snapshot that can race with one sent by the raft snapshot queue. Typically we win (the snap queue has code that skips the snapshot if we're already "starting" to send one from the upreplication code), but if the queue wins, the result can be that the explicit snap gets an error due to an overlapping reservation (see the gist below).

This isn't triggered very frequently on master (at the time of writing) due to a buglet in Raft which has since been fixed in etcd-io/etcd#11037. As a result, we need to address this issue before we can bump Raft past that fix. Since we expect to pick up a fix or two in Raft as joint consensus matures, we want to address this sooner rather than later. OTOH, it's such a small commit that it's easy enough to switch to a fork if we need to do so temporarily.

https://gist.github.com/tbg/fd882caa6ed72e7af0cbd7c8fb0c3504

Unfortunately, no obvious/trivial solution to the race presents itself. It's an open question whether we want to start relying on the raft snap queue to catch up the follower (i.e. poll the follower replica manually via a newly added RPC) or whether we want to prevent the snap queue from ever catching up learners (which we used to do for a short amount of time, until some problem with that became apparent).
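For orientation, here is a minimal, self-contained sketch of the kind of "skip if a snapshot is already being sent" check described above. Every type and method name here is hypothetical; this is not CockroachDB's actual code.

```go
// Illustrative sketch only; nothing here is a real CockroachDB identifier.
package main

import (
	"fmt"
	"sync"
)

// inFlightSnapshots tracks, per target store, whether some caller is already
// in the process of sending a snapshot to it.
type inFlightSnapshots struct {
	mu      sync.Mutex
	pending map[int]bool // keyed by a hypothetical target store ID
}

// tryStart marks a snapshot to the given store as in flight. It returns false
// if one is already being sent, in which case the caller should skip.
func (s *inFlightSnapshots) tryStart(storeID int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pending == nil {
		s.pending = map[int]bool{}
	}
	if s.pending[storeID] {
		return false // someone else (e.g. upreplication) got there first
	}
	s.pending[storeID] = true
	return true
}

// finish clears the in-flight marker once the snapshot has been sent.
func (s *inFlightSnapshots) finish(storeID int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.pending, storeID)
}

func main() {
	var s inFlightSnapshots
	fmt.Println(s.tryStart(2)) // true: the upreplication code starts its snapshot
	fmt.Println(s.tryStart(2)) // false: the raft snapshot queue would skip
	s.finish(2)
}
```

The race in this issue is what happens when the queue gets past such a check first: the explicit learner snapshot then fails on the overlapping reservation.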

tbg added the C-bug label on Aug 26, 2019
tbg added a commit to tbg/cockroach that referenced this issue Aug 26, 2019
This picks up improvements related to joint consensus. We're pinning a
SHA that is close to master but not quite there, to avoid tickling a
bug in CRDB:

cockroachdb#40207

Release note: None
danhhz (Contributor) commented Aug 26, 2019

As I mentioned offline, I think it's time to take a step back here and reconsider where we're at. We started with replica addition sending snapshots, then allowed the raft snapshot queue to also send them 1) so an orphaned learner doesn't get stuck with nowhere to get a snapshot from and 2) to work around etcd/raft's behavior of being extremely picky about getting exactly the snapshot it asked for instead of just any snapshot that allows it to catch up via logs. Since then, Tobi has pointed out that we also need to ensure that an ill-timed log truncation doesn't get the learner stuck forever.

This behavior of learner snapshots coming from two places (potentially even sent from different machines) makes for a complicated story. This is evidenced by my process of de-flaking the learner tests when I merged the default=on PR, many of which were flaky because I kept forgetting about rare races between these two snapshot sources. This tells me we should try to simplify the story if we can.

The hack seems to be: when the learner snapshot fails, start polling for the learner to get a snapshot from the raft snapshot queue, then move on to promoting it to a voter. This naturally leads to the idea that we could make this the normal path and always rely on the raft snapshot queue to send the learner snapshot; the replica addition code would always just poll for this (and likely try to speed it along). However, I don't really think either Tobi or I feel like that's a very clean solution. Going to continue letting this one mull for a bit.
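A rough sketch of what that polling fallback might look like, assuming a hypothetical hasSnapshot check (e.g. backed by a newly added RPC); the interval and the check itself are assumptions, not CockroachDB's actual API.

```go
// Sketch only: hasSnapshot is a stand-in for however the replica addition
// code would learn that the learner has received a snapshot.
package sketch

import (
	"context"
	"time"
)

func waitForLearnerSnapshot(
	ctx context.Context, hasSnapshot func(context.Context) (bool, error),
) error {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			ok, err := hasSnapshot(ctx)
			if err != nil {
				return err
			}
			if ok {
				// The raft snapshot queue caught the learner up; the caller
				// can now promote it to a voter.
				return nil
			}
		}
	}
}
```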

danhhz (Contributor) commented Aug 28, 2019

@tbg lots of cleanup left, but I have a WIP commit for the snap stuff: https://github.com/cockroachdb/cockroach/compare/master...danhhz:learner_snap?expand=1. I'm pretty sure (in unit tests) that the queue is usually finishing the snap before the replica addition even starts it, which results in sending 2x the snaps. I guess this won't happen as often in practice because the snaps will take longer and they'll intersect. Thoughts?

I verified that this fixes the make stress invocation in your gist.

tbg (Member, Author) commented Aug 29, 2019 via email

danhhz (Contributor) commented Aug 29, 2019

Realized that, as written, that PR will transfer two snapshots when the raft snapshot queue wins the race. It creates an entry in snapshotLogTruncationConstraints but doesn't care that there's already one from the raft snap queue. I don't know that we want to skip attempting to send the snapshot from replica addition if getAndGCSnapshotLogTruncationConstraints returns something, because whatever is in flight could be for another replica of the same range. I suppose we could plumb the replica id into snapshotLogTruncationConstraints. This is all very unsatisfying.

I tried a prototype that entirely replaces the sendSnapshot call in replica addition with a call to enqueue the replica in the raft snapshot queue, keeping the waitForApplication call (roughly the flow sketched below). The first issue I had is that sometimes it didn't find a replica (so we should be good on the replicaGC issue you mentioned). I got around this by calling waitForReplicasInit before the waitForApplication call. Then I realized that it was still basically always sending the snapshot twice, because it got queued twice. Still working through that, if you have any thoughts. I'm also worried about how long it takes for waitForApplication to give up if the snapshot can't be applied (for example, if the node is down).
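To make the ordering concrete, here is a self-contained sketch of that prototype's flow under stated assumptions: the helpers are plain function values with made-up names (only waitForReplicasInit and waitForApplication correspond to names mentioned above), not CockroachDB's real methods.

```go
package sketch

import "context"

// learnerAddHelpers bundles the steps as plain function values so the sketch
// stands alone; in CockroachDB these are methods on the replica/store, and
// the field names here are illustrative only.
type learnerAddHelpers struct {
	addLearnerConfChange       func(context.Context) error // the txn that adds the learner
	waitForReplicasInit        func(context.Context) error // wait for the replica to exist on the target
	enqueueInRaftSnapshotQueue func(context.Context)       // ask the queue to catch the learner up
	waitForApplication         func(context.Context) error // wait for the snapshot to be applied
}

// addLearnerViaRaftSnapQueue shows the ordering described above: run the conf
// change, wait for the replica to be initialized (so the later wait can find
// it at all), enqueue it in the raft snapshot queue, then wait for the
// snapshot to be applied.
func (h learnerAddHelpers) addLearnerViaRaftSnapQueue(ctx context.Context) error {
	if err := h.addLearnerConfChange(ctx); err != nil {
		return err
	}
	if err := h.waitForReplicasInit(ctx); err != nil {
		return err
	}
	h.enqueueInRaftSnapshotQueue(ctx)
	return h.waitForApplication(ctx)
}
```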

danhhz (Contributor) commented Aug 29, 2019

Crazy idea: what if I threw the following before the txn that creates the learner? Basically to act as a lock on the raft snapshot queue.

snapUUID := uuid.MakeV4()
r.addSnapshotLogTruncationConstraintLocked(ctx, snapUUID, ???)
defer r.completeSnapshotLogTruncationConstraint(ctx, snapUUID, timeutil.Now())

tbg (Member, Author) commented Aug 30, 2019

Crazy idea: what if I threw the following before the txn that creates the learner? Basically to act as a lock on the raft snapshot queue.

We'd need to have the lock "per replica and only if the target is a learner" because we don't want to run any risk of blocking snapshots to a node that actually needs one to make the range available.

because it got queued twice.

Hmm, interesting. I think this means it's sending a snapshot, but after that is done and before the raft status tracked for the follower reflects the snapshot, the queue loops around and looks at it again? Or is this the "raft doesn't like the index" problem again?
After you pick up the etcd/raft bump motivating this work, I think explicitly queueing isn't as useful. Upon applying the conf change, the leader will broadcast an append to the new replica, which it will reject, which will already add it to the queue (caveat: assuming the rate limiting doesn't drop the add). This takes one message roundtrip delay. Hmm, but if it did get dropped, we'd wait for ~minutes to even start the snapshot...

I agree this is all generally unsatisfying, but I actually find keying the snapshot map by replicaID reasonable. It just doesn't have that yet because the log queue didn't care, and if we had it we could use your suggestion to populate it before adding the learner (we'd be better off using StoreID instead of ReplicaID because the ReplicaID isn't known at that point). That way, the behavior is basically "raft snap queue doesn't send snapshots to learners", except we make it "learners that are already being caught up". Remind me what the problem with letting the raft snap queue unconditionally skip learners was? Either way, what you're suggesting seems like it would "mostly" achieve that, but without having to worry about making bigger semantic changes at this point in the cycle.
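A hypothetical sketch of what keying the in-flight-snapshot bookkeeping by store ID could look like, so the raft snapshot queue skips only the specific learner that is already being caught up. None of these names are real CockroachDB identifiers, and the TTL is an arbitrary placeholder.

```go
package sketch

import (
	"sync"
	"time"
)

type storeID int

// snapshotConstraints is a stand-in for a store-ID-keyed version of the
// snapshotLogTruncationConstraints map discussed above.
type snapshotConstraints struct {
	mu       sync.Mutex
	deadline map[storeID]time.Time // entries are ignored once stale
}

// markPending is what the replica-addition code would call before the conf
// change txn that creates the learner, acting as a per-store "lock".
func (c *snapshotConstraints) markPending(id storeID, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.deadline == nil {
		c.deadline = map[storeID]time.Time{}
	}
	c.deadline[id] = time.Now().Add(ttl)
}

// shouldSkip is what the raft snapshot queue would consult: skip only the
// learner that is already being caught up, never other replicas of the range.
func (c *snapshotConstraints) shouldSkip(id storeID) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	dl, ok := c.deadline[id]
	return ok && time.Now().Before(dl)
}
```

With per-store keying, skipping stays scoped to the one learner being caught up, so the queue never blocks a snapshot that some other replica needs to keep the range available.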

tbg added a commit to tbg/cockroach that referenced this issue Aug 30, 2019
We don't support removing multiple learners atomically just yet, though
cockroachdb#40268 will fix this (likely in raft). That PR, though, is obstructed by cockroachdb#40207
because we'll need that first to be able to bump raft again (assuming we
don't want to fork). Instead of dealing with all that upfront, let's
just not remove multiple learners at once right now so that we can flip
the default for atomic replication changes to on.

If anyone is still trying to remove only learners atomically, they will
fail. However, the replicate queue places a high priority on removing
stray learners whenever it finds them, so this wouldn't be a permanent
problem. Besides, we never add multiple learners at once so it's
difficult to get into that state in the first place.

Without this commit, TestLearnerAdminRelocateRange fails once atomic
replication changes are enabled.

Release note: None
danhhz added a commit to danhhz/cockroach that referenced this issue Sep 3, 2019
The replica addition code first adds the new replica as a raft learner, then
hands it a snapshot, then promotes it to a voter. For various unfortunate
reasons described in the code, we have to allow the raft snapshot queue to
_also_ send snapshots to learners. A recent etcd change made the race
dramatically more likely and exposed that this code has always been brittle
to the raft snapshot queue winning the race and starting the snapshot first.

After this commit, learner replica addition grabs a (best effort) lock
before the conf change txn to add the learner is started. This prevents
the race when the raft leader (and thus the raft snapshot queue for that
range) is on the same node.

Closes cockroachdb#40207

Release note: None
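For context, here is a self-contained sketch of the ordering the commit describes, under stated assumptions: the replica type, its methods, and the TTL below are all made up for illustration, not CockroachDB's real code.

```go
package sketch

import (
	"context"
	"time"
)

// replicaStub stands in for the replica-change machinery; every field name
// here is hypothetical.
type replicaStub struct {
	markSnapshotPending  func(target int, ttl time.Duration) // the best-effort "lock"
	clearSnapshotPending func(target int)
	runLearnerConfChange func(ctx context.Context, target int) error
	sendLearnerSnapshot  func(ctx context.Context, target int) error
	promoteToVoter       func(ctx context.Context, target int) error
}

// addReplica registers the in-flight-snapshot marker *before* the conf change
// txn, so that when the raft leader (and hence the raft snapshot queue for
// the range) is on this node, the queue sees the marker and does not race
// with the explicit learner snapshot.
func (r *replicaStub) addReplica(ctx context.Context, target int) error {
	r.markSnapshotPending(target, time.Minute) // TTL is an arbitrary placeholder
	defer r.clearSnapshotPending(target)

	if err := r.runLearnerConfChange(ctx, target); err != nil {
		return err
	}
	if err := r.sendLearnerSnapshot(ctx, target); err != nil {
		return err
	}
	return r.promoteToVoter(ctx, target)
}
```

The lock is best effort by design: if the raft leader lives on a different node, its snapshot queue never sees this node's marker, which is why the commit message scopes the guarantee to the co-located case.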
craig bot pushed a commit that referenced this issue Sep 4, 2019
40435: storage: be more resilient to learner snap conflicts r=tbg a=danhhz

The replica addition code first adds the new replica as a raft learner, then
hands it a snapshot, then promotes it to a voter. For various unfortunate
reasons described in the code, we have to allow the raft snapshot queue to
_also_ send snapshots to learners. A recent etcd change made the race
dramatically more likely and exposed that this code has always been brittle
to the raft snapshot queue winning the race and starting the snapshot first.

After this commit, learner replica addition grabs a (best effort) lock
before the conf change txn to add the learner is started. This prevents
the race when the raft leader (and thus the raft snapshot queue for that
range) is on the same node.

Closes #40207

Release note: None

40454: kv: improve unit testing around txnSpanRefresher r=nvanbenschoten a=nvanbenschoten

This was decently tested through its inclusion in the TxnCoordSender,
but since I'm planning on changing some behavior to address #36431, I
figured we should first build up a test suite in the same style that
we have for other transaction interceptors.

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot closed this as completed in 20ba7bc on Sep 4, 2019