kv: immediately push on WriteIntentError when lock-table disabled #46234
Conversation
Force-pushed from e2990c5 to c3bdf45.
No release note? Will review later today.
kv: always push intents above pusher's max timestamp
I find something here really fragile (and the title got me a little scared, until I remembered that we had talked about this and that you know better anyway). If MaxTimestamp were the original MaxTimestamp, we'd routinely be pushing to timestamps in the future of anyone's hlc clock (=no bueno). This doesn't happen because MaxTimestamp is guaranteed to be below the current node's clock's wall time because we've lifted the adjustment above the push (but that's not a great invariant to trust). PushCmd refuses to push in the future, so at worst we'd see spurious errors here, but still this could all be more explicit.
What do you think about explicitly pushing to min(clock.Now(), MaxTimestamp), just to make this explicit (and writing a comment there that MaxTimestamp will always win)? (I would have suggested asserting, but this kind of assertion in a prod system is no good.)
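For concreteness, a minimal sketch of the suggested clamp, under the assumption that hlc.Timestamp.Backward takes the minimum of two timestamps; this helper is hypothetical and not part of the PR:

package concurrency // illustrative only

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// clampedPushTimestamp sketches pushing to min(clock.Now(), MaxTimestamp).
// Because MaxTimestamp has already been limited by a timestamp observed from
// this node, the clamp is expected to be a no-op; MaxTimestamp always wins.
func clampedPushTimestamp(clock *hlc.Clock, txn *roachpb.Transaction) hlc.Timestamp {
	ts := txn.MaxTimestamp
	ts.Backward(clock.Now()) // Backward takes the minimum of the two timestamps.
	return ts
}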
For the title, strong preference for "kv: always push intents above pusher's observed timestamp" which is really what we're after.
Also, the commit message is a bit muddled as far as the reasoning goes -- it's unclear what the lock table does before this commit. (Sounds sort of like it pushes to the actual MaxTimestamp, but that's not true). This is all pretty subtle so I would give that another pass to make sure it's crystal clear.
Reviewed 15 of 15 files at r1, 4 of 4 files at r2, 11 of 11 files at r3, 11 of 11 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/kv/kvserver/closed_timestamp_test.go, line 179 at r4 (raw file):
// TestClosedTimestampCanServeWithConflictingIntent validates that a read served
// from a follower replica will wait on conflicting intents and ensure that they
nit: ensures
pkg/kv/kvserver/closed_timestamp_test.go, line 237 at r4 (raw file):
// Abort the transaction. All pushes should succeed and all intents should
// be resolved, allowing all reads (on the leaseholder and on followers) to
Is something guaranteeing that you are actually seeing follower reads? As far as I can tell, they might all go to the leaseholder?
pkg/kv/kvserver/replica.go, line 1033 at r1 (raw file):
	rSpan roachpb.RSpan, ts hlc.Timestamp,
) error {
	st := &storagepb.LeaseStatus{Timestamp: r.Clock().Now()}
This seems unfortunate. Is there a better option? Can we just get a lease status?
pkg/kv/kvserver/replica_follower_read.go, line 40 at r2 (raw file):
	ctx context.Context, ba *roachpb.BatchRequest, pErr *roachpb.Error,
) *roachpb.Error {
	canServeFollowerRead := false
What was the motivation for this diff? Just trying to understand if there's something I'm missing here because it looks like behavior is unchanged.
pkg/kv/kvserver/replica_read.go, line 47 at r1 (raw file):
	}
} else {
	status.Timestamp = r.Clock().Now() // get a clock reading for checkExecutionCanProceed
Don't like this either, but I guess we previously also passed a mostly zero status in, so maybe it's fine
pkg/kv/kvserver/replica_send.go, line 200 at r2 (raw file):
// range lease or permission to serve via follower reads.
if status, pErr = r.redirectOnOrAcquireLease(ctx); pErr != nil {
	if nErr := r.canServeFollowerRead(ctx, ba, pErr); nErr != nil {
canServeFollowerRead assumes a read-only batch; is this check missing now?
pkg/kv/kvserver/replica_send.go, line 275 at r2 (raw file):
	}
	// Retry...
case *roachpb.NotLeaseHolderError:
Just curious, under what circumstance would you expect to hit this path? I'm worried that we have no test coverage for this (do we?) and that we'd be better off without this.
pkg/kv/kvserver/replica_write.go, line 82 at r1 (raw file):
		return nil, g, pErr
	}
}
From this point on, status is (at least partially) set when it wasn't before. Not an issue, right?
pkg/kv/kvserver/replica_write.go, line 71 at r2 (raw file):
// TODO(nvanbenschoten): unlike on the read-path (executeReadOnlyBatch), we
// don't synchronize with r.readOnlyCmdMu here. Is that ok? What if the
I don't have a good answer here. If everything went through Raft, I think this code:
cockroach/pkg/kv/kvserver/replica_destroy.go, lines 248 to 255 in 6c7b80e:

	for _, p := range r.mu.proposals {
		r.cleanupFailedProposalLocked(p)
		// NB: each proposal needs its own version of the error (i.e. don't try to
		// share the error across proposals).
		p.finishApplication(ctx, proposalResult{
			Err: roachpb.NewError(roachpb.NewAmbiguousResultError("removing replica")),
		})
	}
would do it (assuming we check the destroy status under the same lock as we put new proposals in the map), but the failed CPut would skip the read. I think there's probably at least one bug here.
It's also striking how destroying the replica avoids the latch manager. You would assume that destroying the replica is something like
- set destroy status
- cancel all open proposals
- grab an "everything r/w" latch
- delete things
but this isn't how it works.
I look forward to when we unify the rw and ro paths, at which point we'll be forced to confront these head on.
pkg/kv/kvserver/replica_write.go, line 124 at r2 (raw file):
// manager.
if curLease, _ := r.GetLease(); curLease.Sequence > st.Lease.Sequence {
	curLeaseCpy := curLease // avoid letting curLease escape
Is this just muscle memory or do you routinely check these?
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 391 at r3 (raw file):
// with locks acquired by other transaction. The request must wait for all locks
// acquired by other transactions at or below this timestamp to be released. All
// locks acquired by other transactions above this timestampare ignored.
timestampare
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 396 at r3 (raw file):
if r.Txn != nil {
	ts = r.Txn.ReadTimestamp
	ts.Forward(r.Txn.MaxTimestamp)
this is the line that irks me; a method called readTimestamp shouldn't return fake future timestamps
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 278 at r4 (raw file):
// way.
if wait {
	for i := range t.Intents {
No concerns here about the sequential intents handling, right? You could add a comment as to why it's not needed (or wasn't added in the first pass).
pkg/kv/kvserver/concurrency/lock_table.go, line 1674 at r3 (raw file):
g.txn = &req.Txn.TxnMeta
g.readTS = req.Txn.ReadTimestamp
g.readTS.Forward(req.Txn.MaxTimestamp)
Hmm I guess none of this is new. Do you share my concerns here at all? Both of us won't mess this up, but younger generations will and right now I couldn't blame them. Perhaps we should explicitly start talking about a "read interval" which is [readTS,maxTS] (maybe right-open, don't remember)
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 432 at r3 (raw file):
// after our operation started. This allows us to not have to
// restart for uncertainty as we come back and read.
obsTS, ok := h.Txn.GetObservedTimestamp(w.nodeID)
For my edification, is there functionally a difference between this commit vs its parent? Before, we would populate observed timestamps before entering concurrency control (b/c of your previous commit), and so this code would "work". After, we use the already adjusted MaxTimestamp directly. It's the same, and this is just simplifying things? Or am I missing something?
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 310 at r4 (raw file):
		return roachpb.NewError(err)
	}
	return w.pushLockTxn(ctx, req, waitingState{
@sumeerbhola should sign off on this in particular. It looks... fine to me? But I couldn't tell if there was a problem here.
Force-pushed from c3bdf45 to c57d6d0.
Thanks for the review Tobi. Also, apologies for assigning a review to you while you're off. We had talked about this topic on Slack so I figured you had it all paged in, but this ended up being a larger review than I was intending to ask you for.
What do you think about explicitly pushing to min(clock.Now(), MaxTimestamp) just to make this explicit (and write a comment there that MaxTimestamp will always win). (I would have suggested asserting, but this kind of assertion in a prod system is no good).
I considered that, but doing so breaks the guarantee we need from pushLockTxn that if it returns successfully, it will have triggered a lockTable state transition that allows the request to proceed. I'm more inclined to include an assertion that the txn's MaxTimestamp is below clock.Now() by the time it enters the concurrency manager.
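As a rough illustration of the assertion being proposed (a hypothetical sketch; as noted further down in the thread, this check ultimately wasn't added because many testContext-based unit tests violate it):

package concurrency // illustrative only

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
	"github.com/cockroachdb/errors"
)

// assertMaxTimestampBelowClock checks that the txn's MaxTimestamp has already
// been limited by a timestamp observed from this node, i.e. that it does not
// lead the local HLC clock when the request enters the concurrency manager.
func assertMaxTimestampBelowClock(clock *hlc.Clock, txn *roachpb.Transaction) error {
	if txn == nil {
		return nil
	}
	if now := clock.Now(); now.Less(txn.MaxTimestamp) {
		return errors.AssertionFailedf(
			"txn MaxTimestamp %s leads local clock %s", txn.MaxTimestamp, now)
	}
	return nil
}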
For the title, strong preference for "kv: always push intents above pusher's observed timestamp" which is really what we're after.
Done.
Also, the commit message is a bit muddled as far as the reasoning goes -- it's unclear what the lock table does before this commit. (Sounds sort of like it pushes to the actual MaxTimestamp, but that's not true). This is all pretty subtle so I would give that another pass to make sure it's crystal clear.
Done. Everything you said is valid, and I think you're picking up on the fact that this was written before the second commit was introduced. I tried to adjust the wording to make sense in the updated context, but clearly didn't do a very good job.
No release note?
I was hoping this would land before the next beta so that it wouldn't need a release note because the bug it fixes wouldn't have been in the previous release, but that doesn't look to be the case. Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/kv/kvserver/closed_timestamp_test.go, line 179 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: ensures
This was referring to the read ensuring that the intents are cleaned up, not the test.
pkg/kv/kvserver/closed_timestamp_test.go, line 237 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is something guaranteeing that you are actually seeing follower reads? As far as I can tell, they might all go to the leaseholder?
We send each request to each replica directly (see repl.Send), so there's no intermediate routing, but I guess there's nothing ensuring that the lease doesn't move around to allow for all N reads to be performed on a leaseholder. We don't seem to test against that kind of thing anywhere in this file. Is that the hazard you're referring to, or did you miss the repls[i] part?
pkg/kv/kvserver/replica.go, line 1033 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This seems unfortunate. Is there a better option? Can we just get a lease status?
Done.
pkg/kv/kvserver/replica_follower_read.go, line 40 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
What was the motivation for this diff? Just trying to understand if there's something I'm missing here because it looks like behavior is unchanged.
Yes, the behavior is unchanged and surprisingly tested by TestLearnerAndJointConfigFollowerRead. I didn't want to hit the "replicas cannot serve follower reads" event for writes.
pkg/kv/kvserver/replica_read.go, line 47 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Don't like this either, but I guess we previously also passed a mostly zero status in, so maybe it's fine
I'm a little surprised. I figured you'd like that we were no longer passing around duplicate state - the status of a lease at a particular timestamp and the timestamp that the status was captured at. In the case where we don't require a lease, we set the timestamp of the status to indicate that the "lease is not set, but the current timestamp is ...". I don't feel too strongly about this though. I can revert this commit if you'd prefer, we'd just be passing a larger payload across batchExecutionFn.
pkg/kv/kvserver/replica_send.go, line 200 at r2 (raw file):
canServeFollowerRead assumes a read-only batch
Good catch. Ever since 278a21b#diff-66f78d299d9a44a2e5f43197b799669eR57, this hasn't been true. But the comment wasn't updated to reflect this. Done.
pkg/kv/kvserver/replica_send.go, line 275 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Just curious, under what circumstance would you expect to hit this path? I'm worried that we have no test coverage for this (do we) and that we'd be better off without this.
Yeah, this was added to avoid NotLeaseHolderErrors when a request is delayed indefinitely in the concurrency manager while waiting for locks and emerges to find that the replica's lease has changed. But I guess this would only help in cases where the lease was transferred away and back again, which seems unlikely. Since it's not tested and pretty difficult to test, I'll remove it.
pkg/kv/kvserver/replica_write.go, line 82 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
From this point on, status is (at least partially) set when it wasn't before. Not an issue, right?
Let's discuss above. I was thinking that this was a simplification, but maybe not.
pkg/kv/kvserver/replica_write.go, line 71 at r2 (raw file):
It's also striking how destroying the replica avoids the latch manager.
Yeah, I was wondering about this too. It would be very nice to flush out all requests using the latch manager, but I fear we might have some dependencies where a request can't make progress in raft until a replica is destroyed but the replica can't be destroyed (if it grabs latches) until the request releases its latches. Maybe that's just FUD.
I'm going to leave this TODO here for now and try to dig in more later.
pkg/kv/kvserver/replica_write.go, line 124 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is this just muscle memory or do you routinely check these?
I didn't check in this case, but it looked like the kind of thing that would cause the object to escape on the happy path, which we don't want. I just confirmed that it would have: replica_write.go:124:3: moved to heap: curLeaseCpy.
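To spell out the pattern being referenced, here is a toy sketch (not the replica code; all names are made up) of why the copy keeps the happy path allocation-free:

package main

import "fmt"

// payload stands in for a struct (like a roachpb.Lease) that we'd rather keep
// on the stack on the common path.
type payload struct{ Sequence int64 }

// check illustrates the escape-avoidance pattern: passing &p directly to
// fmt.Errorf would make p escape to the heap on every call, because escape
// analysis is not path-sensitive. Copying into a variable declared inside the
// slow-path branch confines the heap allocation to the error case.
func check(p payload, wantSeq int64) error {
	if p.Sequence > wantSeq {
		pCpy := p // copy so that p itself does not escape
		return fmt.Errorf("sequence mismatch: %+v", &pCpy)
	}
	return nil
}

func main() {
	fmt.Println(check(payload{Sequence: 2}, 1))
	fmt.Println(check(payload{Sequence: 1}, 2))
}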
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 391 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
timestampare
Done.
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 396 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
this is the line that irks me; a method called readTimestamp shouldn't return fake future timestamps
Let's discuss below.
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 278 at r4 (raw file):
sequential intents handling
I don't know that I understand what you mean. Sequential as opposed to handling (pushing + resolving) them all in parallel? We don't do anything like that in the lockTable itself, so I don't think it's worth optimizing out here for followers.
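For reference, the shape of the sequential handling under discussion is roughly the following (a simplified sketch with hypothetical parameters, not the concurrency_manager.go code):

package main

import (
	"context"
	"fmt"
)

// handleDiscoveredIntents sketches the sequential handling: each intent found
// by a follower read is pushed and resolved one at a time; there is no
// batching or parallel push/resolve here.
func handleDiscoveredIntents(
	ctx context.Context,
	intentKeys []string, // stand-in for roachpb.Intent values
	pushAndResolve func(context.Context, string) error,
) error {
	for _, key := range intentKeys {
		if err := pushAndResolve(ctx, key); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	err := handleDiscoveredIntents(context.Background(), []string{"a", "b"},
		func(_ context.Context, key string) error {
			fmt.Println("push+resolve intent on key", key)
			return nil
		})
	fmt.Println("err:", err)
}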
pkg/kv/kvserver/concurrency/lock_table.go, line 1674 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Hmm I guess none of this is new. Do you share my concerns here at all? Both of us won't mess this up, but younger generations will and right now I couldn't blame them. Perhaps we should explicitly start talking about a "read interval" which is [readTS,maxTS] (maybe right-open, don't remember)
Yes, I absolutely understand your concern. We're not really talking about an interval though, right? We're talking about the maximum timestamp at which non-locking reads conflict with exclusive locks. In this case, that timestamp happens to be max(readTS, maxTS). "readTimestamp" is a fairly lazy way to talk about this and introduces room for confusion, but I'm also not sure of anything better. Maybe I need to dig through some literature on MVCC to get some inspiration.
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 432 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
For my edification, is there functionally a difference between this commit vs its parent? Before, we would populate observed timestamps before entering concurrency control (b/c of your previous commit), and so this code would "work". After, we use the already adjusted MaxTimestamp directly. It's the same, and this is just simplifying things? Or am I missing something?
You're not missing anything. I made this commit in isolation at first to avoid the bug described in the commit message. Then I realized I needed the previous commit to avoid the "request timestamp less than PushTo timestamp" error. With that previous change, this commit no longer contains a behavior change like you point out, but it does make me a lot more confident that if this push/resolve returns successfully, we'll trigger a state transition in the lockTable that allows our request to proceed. Without that guarantee, we risk deadlock.
I'm more inclined to include an assertion that the txn's MaxTimestamp is below clock.Now() by the time it enters the concurrency manager.
SGTM
I considered that, but doing so breaks the guarantee we need from pushLockTxn that if it returns successfully, it will have triggered a lockTable state transition that allows the request to proceed.
Hmm.. maybe I'm misunderstanding, but isn't it true that the min() will always result in the MaxTimestamp (since it's already using a timestamp observed from the node)? Were you assuming I wanted you to back out that part?
Also, apologies for assigning a review to you while you're off.
No worries. I'll push back if needed.
No need to wait for turn-around from me until you have something that you're legit unsure about. Thanks for the quick fix!
Reviewed 20 of 20 files at r5, 4 of 4 files at r6, 11 of 11 files at r7, 11 of 11 files at r8.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/kv/kvserver/replica_read.go, line 47 at r1 (raw file):
I figured you'd like that we were no longer passing around duplicate state
I do like that! I'm just generally wary of partially populated structs as this tends to leave them in a weird, sometimes bug-inviting, state. I think this is fine here though, sorry about the noise.
pkg/kv/kvserver/replica_write.go, line 71 at r2 (raw file):
but I fear we might have some dependencies where a request can't make progress in raft until a replica is destroyed but the replica can't be destroyed (if it grabs latches) until the request releases its latches.
I think that's a legitimate concern, but we just have to handle it. Roughly, step 1 above sets a destroy status that also prevents handleRaftReady from carrying out any more cycles. step 2 cancels all proposals including the raft-owned ones (which now makes sense since we know the state machine won't transition any more). This starts looking a lot like the current code, but by grabbing the latches in addition before deleting we serialize cleanly with inflight reads, so we address exactly the open questions we have here.
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 391 at r7 (raw file):
Maybe uncertainReadTimestamp() is a decent enough compromise? You could add this to the comment:
This timestamp is the upper boundary of the uncertainty interval of this request. In particular, it is a timestamp that may not have been generated by a clock yet, and which must not be handed to any clock (and thus, must not enter the timestamp cache).
This value was already reflected by LeaseStatus.Timestamp, so the return value was redundant.
This commit pulls the replica leaseholder check on the read and write evaluation paths above concurrency manager sequencing. This change makes semantic sense because the contents of the concurrency manager and its lockTable are a reflection of the state of the range, so they should only be exposed to requests if the request is evaluating on a leaseholder replica or has otherwise been granted permission to serve follower reads. Up to this point, this hasn't been a real issue because the lockTable was purely a best-effort representation of the state of a range.

There are a few strong reasons to make this change:
- it makes it easier to fix cockroachdb#46148, a current release blocker
- it allows the MaxTimestamp of transactions to be bound before sequencing in the lockTable. This has the effect of ensuring that all transaction pushes are to a timestamp less than or equal to the current node's HLC clock.
- it seems like the order we would want things in if/when we begin to address consistent read replicas (cockroachdb#39758), where the concurrency manager sequencing step becomes a distributed operation that depends on the state of the current lease.

One concern here is that the concurrency manager sequencing can take an arbitrarily long amount of time because it blocks on conflicting locks. It would be unfortunate if this led to an increased number of rejected proposals due to lease sequence mismatches. To address this, the commit performs a best-effort check before evaluating and proposing a write that the lease it will be proposing under is still active. This check is similar to the best-effort context cancellation check we perform before proposing.
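A toy model of that best-effort pre-proposal check (hypothetical types and names, not the kvserver code; the real hunk is the replica_write.go excerpt quoted earlier in the review):

package main

import (
	"errors"
	"fmt"
)

// lease is a stand-in for a range lease with a sequence number.
type lease struct{ Sequence int64 }

var errNotLeaseHolder = errors.New("not lease holder")

// proposeWrite re-reads the current lease after (possibly slow) sequencing and
// rejects early if the lease has changed, instead of paying for a Raft
// proposal that the below-Raft lease check would reject anyway. Best-effort
// only: the lease can still change right after this check.
func proposeWrite(currentLease func() lease, sequencedUnder lease, propose func() error) error {
	if cur := currentLease(); cur.Sequence > sequencedUnder.Sequence {
		return errNotLeaseHolder
	}
	return propose()
}

func main() {
	cur := lease{Sequence: 5}
	ok := func() error { fmt.Println("proposing"); return nil }
	fmt.Println(proposeWrite(func() lease { return cur }, lease{Sequence: 5}, ok))
	fmt.Println(proposeWrite(func() lease { return cur }, lease{Sequence: 4}, ok))
}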
Before the lockTable refactor, we used to push the timestamp of conflicting intents to the observed timestamp from the current leaseholder node (f78c17a). We made an attempt to retain this behavior with the new lockTable. Pushing the intent outside of the pusher's uncertainty window is necessary to ensure that the pusher doesn't continue to conflict with the intent after the push had completed. Pushing to just an observed timestamp instead of the pusher's max timestamp is an optimization to avoid pushing the intent "too far" and being more disruptive to the pushee than is strictly necessary.

However, using the observed timestamp while pushing seems fragile now that we rely on the push triggering a state transition in the lockTable that will allow the current request to proceed. Until the previous commit, the lockTable was consulted before we limited a transaction's max timestamp due to observed timestamps, so this "optimization" could have theoretically (I never actually saw this in practice) prevented a push from clearing an intent out of its way. For instance, if a txn (read ts: 5, observed ts: 10, max ts: 15) pushed an intent to just ts 10.Next(), that wouldn't be enough for the txn to proceed after the push. Now that we bound the max timestamp to the observed timestamp before entering the concurrency manager, this stall should no longer be possible, but it still seems fragile and ripe for error.

This commit fixes this by always pushing intents past the timestamp that the lock table is using to determine conflicts. This makes this all less fragile.
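To make the example in that commit message concrete, here is a small self-contained walk-through using plain integers in place of hlc timestamps (the txn values are the ones from the message; everything else is illustrative):

package main

import "fmt"

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func main() {
	// The scenario from the commit message, before MaxTimestamp was limited by
	// observed timestamps prior to sequencing.
	readTS, observedTS, maxTS := 5, 10, 15

	// The lock table considers the txn to conflict with any intent at or below
	// max(readTS, maxTS).
	conflictTS := maxInt(readTS, maxTS)

	// Old behavior: push the intent to just above the observed timestamp.
	pushedTo := observedTS + 1                                // stand-in for 10.Next()
	fmt.Println("cleared after push:", pushedTo > conflictTS) // false: still conflicts

	// New behavior: push past the timestamp the lock table uses for conflicts.
	pushedTo = conflictTS + 1
	fmt.Println("cleared after push:", pushedTo > conflictTS) // true: push clears the intent
}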
Force-pushed from c57d6d0 to 4b258e5.
TFTR!
I'm more inclined to include an assertion that the txn's MaxTimestamp is below clock.Now() by the time it enters the concurrency manager.
This didn't go very well. It turns out that most unit tests that use a testContext violate this assertion because they don't manually bump the clock before issuing requests. Tests that actually perform pushes already manipulate their clock such that things behave correctly. As you pointed out, nothing will actually go wrong here in practice because "PushCmd refuses to push in the future, so at worst we'd see spurious errors here", but I do still wish this was more explicit.
For now, though, I don't want this to hold up this PR, given that this is blocking a beta and this aspect of the change is nothing new.
Hmm.. maybe I'm misunderstanding, but isn't it true that the min() will always result in the MaxTimestamp (since it's already using a timestamp observed from the node)? Were you assuming I wanted you to back out that part?
No, I think you're understanding correctly. Min should always result in MaxTimestamp if MaxTimestamp is bound to the txn's observed timestamp. But if that was ever violated then I'd rather have things blow up louder than a very rare txn deadlock. Even what we have now, the error returned from the PushTxn request, seems better than that.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/kv/kvserver/replica_read.go, line 47 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I figured you'd like that we were no longer passing around duplicate state
I do like that! I'm just generally wary of partially populated structs as this tends to leave them in a weird, sometimes bug-inviting, state. I think this is fine here though, sorry about the noise.
Ack. Let's see how this ages. We can always go back on this decision.
pkg/kv/kvserver/replica_write.go, line 71 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
but I fear we might have some dependencies where a request can't make progress in raft until a replica is destroyed but the replica can't be destroyed (if it grabs latches) until the request releases its latches.
I think that's a legitimate concern, but we just have to handle it. Roughly, step 1 above sets a destroy status that also prevents handleRaftReady from carrying out any more cycles. step 2 cancels all proposals including the raft-owned ones (which now makes sense since we know the state machine won't transition any more). This starts looking a lot like the current code, but by grabbing the latches in addition before deleting we serialize cleanly with inflight reads, so we address exactly the open questions we have here.
I filed #46329 to continue tracking this.
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 391 at r7 (raw file):
I changed to readConflictTimestamp and writeConflictTimestamp to indicate that these are about the timestamps that the request conflicts with other locks at, not necessarily about the timestamps that the request operates at. I think that addresses the root of the concern.
You could add this to the comment
That doesn't really fit anymore, especially now that we do expect this time to be below the local clock reading.
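For reference, a rough sketch of what the renamed accessors might look like (approximate, extrapolated from the r3 hunk quoted earlier; the writeConflictTimestamp body in particular is an assumption, not an excerpt from the diff):

// readConflictTimestamp returns the maximum timestamp at which the request
// conflicts with locks acquired by other transactions. Non-locking reads
// conflict with exclusive locks at or below this timestamp.
func (r *Request) readConflictTimestamp() hlc.Timestamp {
	ts := r.Timestamp
	if r.Txn != nil {
		ts = r.Txn.ReadTimestamp
		ts.Forward(r.Txn.MaxTimestamp)
	}
	return ts
}

// writeConflictTimestamp returns the timestamp at which the request's writes
// conflict with locks acquired by other transactions.
func (r *Request) writeConflictTimestamp() hlc.Timestamp {
	ts := r.Timestamp
	if r.Txn != nil {
		ts = r.Txn.WriteTimestamp
	}
	return ts
}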
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 310 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
@sumeerbhola should sign off on this in particular. It looks... fine to me? But I couldn't tell if there was a problem here.
I'm going to merge this without @sumeerbhola because it looks like this is holding up the next beta, but I'd be happy to revisit anything in here once Sumeer gets a chance to take a look.
bors r+
Build succeeded
Fixes #46148.
This commit fixes a bug where follower reads that hit intents could get stuck in an indefinite loop of running into the intent during evaluation, not adding the intent to the lock-table because the lock table was disabled, sequencing in the concurrency manager without issue, and repeating. The new TestClosedTimestampCanServeWithConflictingIntent test hits exactly this issue before this commit.
The fix implemented here is to immediately push the transaction responsible for an intent when serving a follower read (i.e. when a replica's lock-table is disabled). This ensures that the intent gets cleaned up if it was abandoned and avoids the busy loop we see today. If/when lockTables are maintained on follower replicas by propagating lockTable state transitions through the Raft log in the ReplicatedEvalResult instead of through the (leaseholder-only) LocalResult, we should be able to remove the lockTable "disabled" state and, in turn, remove this special-case.
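In rough pseudocode, the two paths after discovering an intent look something like the following (a hedged sketch with hypothetical parameter names; the real hook is the pushLockTxn call in the lock_table_waiter.go hunk quoted earlier in the review):

package main

import (
	"context"
	"fmt"
)

// handleWriteIntentError sketches the branch added by this commit. On a
// leaseholder the discovered intent is added to the lock table and the request
// queues behind it; on a follower (lock table disabled) there is nothing to
// queue behind, so without an immediate push the request would rediscover the
// same intent on every re-evaluation.
func handleWriteIntentError(
	ctx context.Context,
	lockTableEnabled bool,
	addToLockTable func(intentKey string),
	pushAndResolve func(ctx context.Context, intentKey string) error,
	intentKey string,
) error {
	if lockTableEnabled {
		addToLockTable(intentKey) // leaseholder path: wait via the lock table
		return nil
	}
	// Follower path: push the intent's txn right away, resolving the intent if
	// it turns out to be abandoned.
	return pushAndResolve(ctx, intentKey)
}

func main() {
	push := func(_ context.Context, k string) error {
		fmt.Println("pushing txn for intent on", k)
		return nil
	}
	add := func(k string) { fmt.Println("queued behind intent on", k) }
	_ = handleWriteIntentError(context.Background(), false, add, push, "a")
	_ = handleWriteIntentError(context.Background(), true, add, push, "b")
}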
The alternative approach floated to address this was to simply pass a NotLeaseHolderError back to the client when an intent is hit on a follower. This would have worked to avoid the infinite loop, but it seems like a short-term patch that doesn't get to the root of the issue. As we push further on follower reads (or even consistent read replicas), we want non-leaseholders to be able to perform conflict resolution. Falling back to the leaseholder works counter to this goal. The approach implemented by this commit works towards this goal, simply falling back to the previous sub-optimal approach of pushing immediately during conflicts.
Release note (bug fix): Follower reads that hit intents no longer have a chance of entering an infinite loop. This bug was present in earlier versions of the v20.1 release.
Release justification: fixes a high-priority bug where follower reads could get stuck indefinitely if they hit an abandoned intent.