storage: learner snap can fail when interacting with raft snap #40207
This picks up improvements related to joint consensus. We're pinning a SHA that is close to master but not quite there to avoid tickling a bug in CRDB: cockroachdb#40207 Release note: None
As I mentioned offline, I think it's time to take a step back here and reconsider where we're at. We started with replica addition sending snapshots, then allowed the raft snapshot queue to also send them 1) so an orphaned learner doesn't get stuck with nowhere to get a snapshot from and 2) to work around etcd/raft's behavior of being extremely picky about getting exactly the snapshot it asked for instead of just any snapshot that allows it to catch up via logs. Since then, Tobi has pointed out that we also need to ensure that an ill-timed log truncation doesn't get the learner stuck forever.

This behavior of learner snapshots coming from two places (potentially even sent from different machines) makes for a complicated story. That was evident when I de-flaked the learner tests while merging the default=on PR: many of them were flaky because I kept forgetting about rare races between these two snapshot sources. This tells me we should try to simplify the story if we can.

The hack seems to be: when the learner snapshot fails, start polling for the learner to get a snapshot from the raft snapshot queue, then move on to promoting it to a voter. This naturally leads to the idea that we could make this the normal path and always rely on the raft snapshot queue to send the learner snapshot; the replica addition code would always just poll for this (and likely try to speed it along). However, I don't really think either Tobi or I feel like that's a very clean solution. Going to continue letting this one mull for a bit.
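A minimal sketch of what that polling path might look like, assuming a hypothetical caughtUp callback (e.g. an RPC that checks the learner's applied state); this is an illustration of the idea, not the actual replica addition code:

```go
package sketch

import (
	"context"
	"time"
)

// waitForLearnerSnapshot sketches the "always rely on the raft snapshot
// queue" idea: instead of sending an explicit learner snapshot, poll until
// the learner has been caught up by the queue, then let the caller promote
// it to a voter. caughtUp is a stand-in for whatever check would be used.
func waitForLearnerSnapshot(
	ctx context.Context, caughtUp func(context.Context) (bool, error),
) error {
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	for {
		// Respect cancellation: snapshots can take a long time, and the
		// learner may be removed out from under us while we wait.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
		ok, err := caughtUp(ctx)
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
	}
}
```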
@tbg lots of cleanup left, but I have a WIP commit for the snap stuff: https://github.com/cockroachdb/cockroach/compare/master...danhhz:learner_snap?expand=1. I'm pretty sure that (in unit tests) the queue is usually finishing the snap before the replica addition even starts it, which results in sending 2x the snaps. I guess this won't happen as often in practice because the snaps will take longer and they'll intersect. Thoughts? I verified that this fixes the make stress invocation in your gist.
Nice. The approach looks good. WaitForApplication needs to wait longer (indefinitely, really, though its context limits it to 1min at the time of writing; make sure context cancellation is taken into account by WaitForApplication), since snapshots can take a long time, especially if we end up upping the range size. We also need to double check that WaitForApplication will fail properly if the polled replica is replicaGC'ed, to make sure we're not wedging ourselves if someone removes the learner out from under us. Pretty sure it will return to us in that case, just wanted to make sure you have this on your radar.
Realized that as written, that PR will transfer two snapshots when the raft snapshot queue wins the race. It creates an entry in […]. I tried a prototype of entirely replacing the sendSnapshot call in replica addition with a call to enqueue the replica in the raft snapshot queue, keeping the waitForApplication call. The first issue I had is that sometimes it didn't find a replica (so we should be good on the replicaGC issue you mentioned). I got around this by calling […]
Crazy idea: what if I threw the following before the txn that creates the learner? Basically to act as a lock on the raft snapshot queue.
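(The original snippet isn't reproduced in this copy of the thread; the sketch below is a rough stand-in for the idea with illustrative names, not the actual CockroachDB code.)

```go
package sketch

import "sync"

// learnerSnapLocks models the idea above: before the conf change txn that
// adds the learner runs, the replica addition code records the target store
// in a map that the raft snapshot queue can consult, so the queue won't race
// it with its own snapshot. Keyed by store ID rather than replica ID because
// the replica ID isn't known before the txn runs.
type learnerSnapLocks struct {
	mu         sync.Mutex
	inProgress map[int64]struct{} // target store IDs being caught up
}

// lock registers the target before the learner is added. The returned unlock
// must be called once the explicit learner snapshot has been sent (or
// abandoned), making this a best-effort lock rather than a hard guarantee.
func (l *learnerSnapLocks) lock(storeID int64) (unlock func()) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inProgress == nil {
		l.inProgress = map[int64]struct{}{}
	}
	l.inProgress[storeID] = struct{}{}
	return func() {
		l.mu.Lock()
		defer l.mu.Unlock()
		delete(l.inProgress, storeID)
	}
}

// held is what the raft snapshot queue would check before sending its own
// snapshot to a learner.
func (l *learnerSnapLocks) held(storeID int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	_, ok := l.inProgress[storeID]
	return ok
}
```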
We'd need to have the lock "per replica and only if the target is a learner" because we don't want to run any risk of blocking snapshots to a node that actually needs one to make the range available.
Hmm, interesting. I think this means it's sending a snapshot, but after that is done and before the raft status tracked for the follower reflects the snapshot, the queue loops around and looks at it again? Or is this the "raft doesn't like the index" problem again?

I agree this is all generally unsatisfying, but I actually find keying the snapshot map by replicaID reasonable. It just doesn't have that yet because the log queue didn't care, and if we had it we could use your suggestion to populate it before adding the learner (we'd be better off using StoreID instead of ReplicaID because the ReplicaID isn't known then). That way, the behavior is basically "the raft snap queue doesn't send snapshots to learners", except we make it "learners that are already being caught up". Remind me what the problem with letting the raft snap queue unconditionally skip learners is? Either way, what you're suggesting seems like it would "mostly" achieve that, but without having to worry about making bigger semantic changes at this point in the cycle.
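To make that framing concrete ("the queue skips learners that are already being caught up"), here is a rough sketch of the queue-side decision; held stands in for a lookup into the per-store map sketched earlier, and none of these names are the actual CockroachDB code:

```go
package sketch

// shouldSendRaftSnapshot sketches the queue-side check: voters that need a
// snapshot always get one, but learners are skipped while the replica
// addition code has claimed them and is sending an explicit snapshot.
func shouldSendRaftSnapshot(isLearner bool, held func(storeID int64) bool, storeID int64) bool {
	if !isLearner {
		// Never block snapshots that a voter needs for range availability.
		return true
	}
	// A learner being caught up by the replica addition code: let that
	// explicit snapshot win rather than racing it.
	return !held(storeID)
}
```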
We don't support removing multiple learners atomically just yet, though cockroachdb#40268 will fix this (likely in raft). That PR, though, is obstructed by cockroachdb#40207 because we'll need that first to be able to bump raft again (assuming we don't want to fork). Instead of dealing with all that upfront, let's just not remove multiple learners at once right now so that we can flip the default for atomic replication changes to on. If anyone is still trying to remove only learners atomically, they will fail. However, the replicate queues place a high priority on removing stray learners whenever they find them, so this wouldn't be a permanent problem. Besides, we never add multiple learners at once, so it's difficult to get into that state in the first place. Without this commit, TestLearnerAdminRelocateRange fails once atomic replication changes are enabled. Release note: None
The replica addition code first adds the new replica as a raft learner, then hands it a snapshot, then promotes it to a voter. For various unfortunate reasons described in the code, we have to allow the raft snapshot queue to _also_ send snapshots to learners. A recent etcd change exposed that this code has always been brittle to the raft snapshot queue winning the race and starting the snapshot first, by making that race dramatically more likely. After this commit, learner replica addition grabs a (best effort) lock before the conf change txn to add the learner is started. This prevents the race when the raft leader (and thus the raft snapshot queue for that range) is on the same node. Closes cockroachdb#40207 Release note: None
40435: storage: be more resilient to learner snap conflicts r=tbg a=danhhz

The replica addition code first adds the new replica as a raft learner, then hands it a snapshot, then promotes it to a voter. For various unfortunate reasons described in the code, we have to allow the raft snapshot queue to _also_ send snapshots to learners. A recent etcd change exposed that this code has always been brittle to the raft snapshot queue winning the race and starting the snapshot first, by making that race dramatically more likely. After this commit, learner replica addition grabs a (best effort) lock before the conf change txn to add the learner is started. This prevents the race when the raft leader (and thus the raft snapshot queue for that range) is on the same node. Closes #40207 Release note: None

40454: kv: improve unit testing around txnSpanRefresher r=nvanbenschoten a=nvanbenschoten

This was decently tested through its inclusion in the TxnCoordSender, but since I'm planning on changing some behavior to address #36431, I figured we should first build up a test suite in the same style that we have for other transaction interceptors. Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
When we add a learner, we currently send a snapshot that can race with the raft snapshot queue's. Typically we win (the snap queue has code that skips the snapshot if we're already "starting" to send one from the upreplication code) but if the queue wins, the result can be that the explicit snap gets an error due to an overlapping reservation (see the gist below).
This isn't triggered very frequently on master (at the time of writing) due to a buglet in Raft which was since fixed in etcd-io/etcd#11037. As a result, to bump Raft past that fix, we need to address this issue first. Since we expect to pick up a fix or two in Raft as joint consensus matures, we want to address this sooner rather than later. OTOH it's such a small commit that it's easy enough to switch to a fork if we need to do so temporarily.
https://gist.github.com/tbg/fd882caa6ed72e7af0cbd7c8fb0c3504
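For context on the "overlapping reservation" error, here is a toy model of the receiver-side conflict, under the assumption that the receiving store admits only one in-flight snapshot per range at a time; the type and error are illustrative, not the actual store code:

```go
package sketch

import (
	"fmt"
	"sync"
)

// snapshotReservations models why the loser of the race fails: the receiving
// store reserves resources per range before accepting a snapshot, so
// whichever of the two senders arrives second is rejected.
type snapshotReservations struct {
	mu       sync.Mutex
	inFlight map[int64]bool // keyed by range ID
}

func (s *snapshotReservations) reserve(rangeID int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.inFlight == nil {
		s.inFlight = map[int64]bool{}
	}
	if s.inFlight[rangeID] {
		return fmt.Errorf("snapshot for r%d already in flight", rangeID)
	}
	s.inFlight[rangeID] = true
	return nil
}

func (s *snapshotReservations) release(rangeID int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.inFlight, rangeID)
}
```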
Unfortunately, no obvious/trivial solution to the race presents itself. It's an open question whether we want to start relying on the raft snap queue to catch up the follower (i.e. poll the follower replica manually via a newly added RPC) or whether we want to prevent the snap queue from ever catching up learners (which we did for a short time until some problem with that became apparent).