
kv: immediately push on WriteIntentError when lock-table disabled #46234

Merged

Conversation

nvanbenschoten
Member

@nvanbenschoten nvanbenschoten commented Mar 18, 2020

Fixes #46148.

This commit fixes a bug where follower reads that hit intents could get stuck in an indefinite loop of running into the intent during evaluation, not adding the intent to the lock-table because the lock table was disabled, sequencing in the concurrency manager without issue, and repeating. The new TestClosedTimestampCanServeWithConflictingIntent test hits exactly this issue before this commit.

The fix implemented here is to immediately push the transaction responsible for an intent when serving a follower read (i.e. when a replica's lock-table is disabled). This ensures that the intent gets cleaned up if it was abandoned and avoids the busy loop we see today. If/when lockTables are maintained on follower replicas by propagating lockTable state transitions through the Raft log in the ReplicatedEvalResult instead of through the (leaseholder-only) LocalResult, we should be able to remove the lockTable "disabled" state and, in turn, remove this special-case.
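For illustration, here is a minimal, self-contained sketch of the two code paths described above. All names (handleIntentConflict, pushAndResolve, lockTableEnabled) are hypothetical stand-ins for this write-up, not the actual CockroachDB API:

```go
package main

import (
	"context"
	"fmt"
)

// intent is a stand-in for an encountered write intent.
type intent struct{ txnID string }

// replica is a stand-in for a replica; in this sketch the lock table is
// disabled on followers.
type replica struct {
	lockTableEnabled bool
}

// pushAndResolve stands in for a synchronous PushTxn followed by resolution
// of an abandoned intent.
func (r *replica) pushAndResolve(ctx context.Context, in intent) error {
	fmt.Printf("pushing txn %s and resolving its intent\n", in.txnID)
	return nil
}

// handleIntentConflict shows the two paths: on the leaseholder the intent is
// added to the lock table and the request queues behind it; on a follower
// (lock table disabled) the conflicting transaction is pushed immediately so
// an abandoned intent cannot cause an evaluate/sequence busy loop.
func (r *replica) handleIntentConflict(ctx context.Context, in intent) error {
	if r.lockTableEnabled {
		// Leaseholder path: add to the lock table and wait (elided here).
		fmt.Println("queueing in lock table")
		return nil
	}
	// Follower path: push right away instead of re-sequencing and
	// rediscovering the same intent forever.
	return r.pushAndResolve(ctx, in)
}

func main() {
	follower := &replica{lockTableEnabled: false}
	_ = follower.handleIntentConflict(context.Background(), intent{txnID: "abc"})
}
```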

The alternative approach floated to address this was to simply pass a NotLeaseHolderError back to the client when an intent is hit on a follower. This would have worked to avoid the infinite loop, but it seems like a short-term patch that doesn't get to the root of the issue. As we push further on follower reads (or even consistent read replicas), we want non-leaseholders to be able to perform conflict resolution. Falling back to the leaseholder works counter to this goal. The approach implemented by this commit works towards this goal, simply falling back to the previous sub-optimal approach of pushing immediately during conflicts.

Release note (bug fix): Follower reads that hit intents no longer have a chance of entering an infinite loop. This bug was present in earlier versions of the v20.1 release.

Release justification: fixes a high-priority bug where follower reads could get stuck indefinitely if they hit an abandoned intent.

@cockroach-teamcity
Member

This change is Reviewable

@tbg
Member

tbg commented Mar 18, 2020

No release note? Will review later today.

Member

@tbg tbg left a comment


kv: always push intents above pusher's max timestamp

I find something here really fragile (and the title got me a little scared, until I remembered that we had talked about this and that you know better anyway). If MaxTimestamp were the original MaxTimestamp, we'd routinely be pushing to timestamps in the future of anyone's hlc clock (=no bueno). This doesn't happen because MaxTimestamp is guaranteed to be below the current node's clock's wall time because we've lifted the adjustment above the push (but that's not a great invariant to trust). PushCmd refuses to push in the future, so at worst we'd see spurious errors here, but still this could all be more explicit.
What do you think about explicitly pushing to min(clock.Now(), MaxTimestamp) just to make this explicit (and write a comment there that MaxTimestamp will always win). (I would have suggested asserting, but this kind of assertion in a prod system is no good).
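As a rough illustration of the suggested clamp (stand-in Timestamp and clock helpers, not the actual hlc package):

```go
package main

import (
	"fmt"
	"time"
)

// Timestamp is a stand-in for an HLC timestamp (wall time only).
type Timestamp struct{ WallTime int64 }

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }

func clockNow() Timestamp { return Timestamp{WallTime: time.Now().UnixNano()} }

// pushTo returns min(clock.Now(), maxTimestamp). If MaxTimestamp has already
// been limited by a timestamp observed on this node, the min always resolves
// to maxTimestamp; the clamp only guards against ever pushing to a timestamp
// ahead of the local clock.
func pushTo(maxTimestamp Timestamp) Timestamp {
	if now := clockNow(); now.Less(maxTimestamp) {
		return now
	}
	return maxTimestamp
}

func main() {
	fmt.Println(pushTo(Timestamp{WallTime: 1}))
}
```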

For the title, strong preference for "kv: always push intents above pusher's observed timestamp" which is really what we're after.

Also, the commit message is a bit muddled as far as the reasoning goes -- it's unclear what the lock table does before this commit. (Sounds sort of like it pushes to the actual MaxTimestamp, but that's not true). This is all pretty subtle so I would give that another pass to make sure it's crystal clear.

Reviewed 15 of 15 files at r1, 4 of 4 files at r2, 11 of 11 files at r3, 11 of 11 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/kv/kvserver/closed_timestamp_test.go, line 179 at r4 (raw file):

// TestClosedTimestampCanServeWithConflictingIntent validates that a read served
// from a follower replica will wait on conflicting intents and ensure that they

nit: ensures


pkg/kv/kvserver/closed_timestamp_test.go, line 237 at r4 (raw file):

	// Abort the transaction. All pushes should succeed and all intents should
	// be resolved, allowing all reads (on the leaseholder and on followers) to

Is something guaranteeing that you are actually seeing follower reads? As far as I can tell, they might all go to the leaseholder?


pkg/kv/kvserver/replica.go, line 1033 at r1 (raw file):

	rSpan roachpb.RSpan, ts hlc.Timestamp,
) error {
	st := &storagepb.LeaseStatus{Timestamp: r.Clock().Now()}

This seems unfortunate. Is there a better option? Can we just get a lease status?


pkg/kv/kvserver/replica_follower_read.go, line 40 at r2 (raw file):

	ctx context.Context, ba *roachpb.BatchRequest, pErr *roachpb.Error,
) *roachpb.Error {
	canServeFollowerRead := false

What was the motivation for this diff? Just trying to understand if there's something I'm missing here because it looks like behavior is unchanged.


pkg/kv/kvserver/replica_read.go, line 47 at r1 (raw file):

		}
	} else {
		status.Timestamp = r.Clock().Now() // get a clock reading for checkExecutionCanProceed

Don't like this either, but I guess we previously also passed a mostly zero status in, so maybe it's fine


pkg/kv/kvserver/replica_send.go, line 200 at r2 (raw file):

			// range lease or permission to serve via follower reads.
			if status, pErr = r.redirectOnOrAcquireLease(ctx); pErr != nil {
				if nErr := r.canServeFollowerRead(ctx, ba, pErr); nErr != nil {

canServeFollowerRead assumes a read-only batch, is this check missing now?


pkg/kv/kvserver/replica_send.go, line 275 at r2 (raw file):

			}
			// Retry...
		case *roachpb.NotLeaseHolderError:

Just curious, under what circumstance would you expect to hit this path? I'm worried that we have no test coverage for this (do we) and that we'd be better off without this.


pkg/kv/kvserver/replica_write.go, line 82 at r1 (raw file):

			return nil, g, pErr
		}
	}

From this point on, status is (at least partially) set when it wasn't before. Not an issue, right?


pkg/kv/kvserver/replica_write.go, line 71 at r2 (raw file):

	// TODO(nvanbenschoten): unlike on the read-path (executeReadOnlyBatch), we
	// don't synchronize with r.readOnlyCmdMu here. Is that ok? What if the

I don't have a good answer here. If everything went through Raft, I think this code:

for _, p := range r.mu.proposals {
	r.cleanupFailedProposalLocked(p)
	// NB: each proposal needs its own version of the error (i.e. don't try to
	// share the error across proposals).
	p.finishApplication(ctx, proposalResult{
		Err: roachpb.NewError(roachpb.NewAmbiguousResultError("removing replica")),
	})
}

would do it (assuming we check the destroy status under the same lock as we put new proposals in the map), but the failed CPut would skip the read. I think there's probably at least one bug here.

It's also striking how destroying the replica avoids the latch manager. You would assume that destroying the replica is something like

  1. set destroy status
  2. cancel all open proposals
  3. grab an "everything r/w" latch
  4. delete things

but this isn't how it works.

I look forward to when we unify the rw and ro paths, at which point we'll be forced to confront these head on.


pkg/kv/kvserver/replica_write.go, line 124 at r2 (raw file):

	// manager.
	if curLease, _ := r.GetLease(); curLease.Sequence > st.Lease.Sequence {
		curLeaseCpy := curLease // avoid letting curLease escape

Is this just muscle memory or do you routinely check these?


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 391 at r3 (raw file):

// with locks acquired by other transaction. The request must wait for all locks
// acquired by other transactions at or below this timestamp to be released. All
// locks acquired by other transactions above this timestampare ignored.

timestampare


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 396 at r3 (raw file):

	if r.Txn != nil {
		ts = r.Txn.ReadTimestamp
		ts.Forward(r.Txn.MaxTimestamp)

this is the line that irks me, a method called readTimestamp shouldn't return fake future timestamps


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 278 at r4 (raw file):

	// way.
	if wait {
		for i := range t.Intents {

No concerns here about the sequential intents handling, right? You could add a comment as to why it's not needed (or wasn't added in the first pass).


pkg/kv/kvserver/concurrency/lock_table.go, line 1674 at r3 (raw file):

			g.txn = &req.Txn.TxnMeta
			g.readTS = req.Txn.ReadTimestamp
			g.readTS.Forward(req.Txn.MaxTimestamp)

Hmm I guess none of this is new. Do you share my concerns here at all? Both of us won't mess this up, but younger generations will and right now I couldn't blame them. Perhaps we should explicitly start talking about a "read interval" which is [readTS,maxTS] (maybe right-open, don't remember)


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 432 at r3 (raw file):

		// after our operation started. This allows us to not have to
		// restart for uncertainty as we come back and read.
		obsTS, ok := h.Txn.GetObservedTimestamp(w.nodeID)

For my edification, is there functionally a difference between this commit vs its parent? Before, we would populate observed timestamps before entering concurrency control (b/c of your previous commit), and so this code would "work". After, we use the already adjusted MaxTimestamp directly. It's the same, and this is just simplifying things? Or am I missing something?


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 310 at r4 (raw file):

		return roachpb.NewError(err)
	}
	return w.pushLockTxn(ctx, req, waitingState{

@sumeerbhola should sign off on this in particular. It looks.. fine to me? But I couldn't tell if there was a problem here.

Member Author

@nvanbenschoten nvanbenschoten left a comment


Thanks for the review Tobi. Also, apologies for assigning a review to you while you're off. We had talked about this topic on Slack so I figured you had it all paged in, but this ended up being a larger review than I was intending to ask you for.

What do you think about explicitly pushing to min(clock.Now(), MaxTimestamp) just to make this explicit (and write a comment there that MaxTimestamp will always win). (I would have suggested asserting, but this kind of assertion in a prod system is no good).

I considered that, but doing so breaks the guarantee we need from pushLockTxn that if it returns successfully, it will have triggered a lockTable state transition that allows the request to proceed. I'm more inclined to include an assertion that the txn's MaxTimestamp is below clock.Now() by the time it enters the concurrency manager.
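For concreteness, the assertion being considered could look roughly like the sketch below; the names and the use of plain nanosecond timestamps are hypothetical simplifications, not the actual code:

```go
package main

import (
	"fmt"
	"time"
)

// assertMaxTimestampBelowClock sketches the assertion discussed above: on
// entry to the concurrency manager, the transaction's MaxTimestamp should not
// lead the local clock reading.
func assertMaxTimestampBelowClock(maxTimestampNanos, nowNanos int64) error {
	if maxTimestampNanos > nowNanos {
		return fmt.Errorf("txn MaxTimestamp %d leads local clock %d", maxTimestampNanos, nowNanos)
	}
	return nil
}

func main() {
	now := time.Now().UnixNano()
	fmt.Println(assertMaxTimestampBelowClock(now-1, now)) // <nil>
}
```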

For the title, strong preference for "kv: always push intents above pusher's observed timestamp" which is really what we're after.

Done.

Also, the commit message is a bit muddled as far as the reasoning goes -- it's unclear what the lock table does before this commit. (Sounds sort of like it pushes to the actual MaxTimestamp, but that's not true). This is all pretty subtle so I would give that another pass to make sure it's crystal clear.

Done. Everything you said is valid, and I think you're picking up on the fact that this was written before the second commit was introduced. I tried to adjust the wording to make sense in the updated context, but clearly didn't do a very good job.

No release note?

I was hoping this would land before the next beta so that it wouldn't need a release note because the bug it fixes wouldn't have been in the previous release, but that doesn't look to be the case. Done.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/closed_timestamp_test.go, line 179 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: ensures

This was referring to the read ensuring that the intents are cleaned up, not the test.


pkg/kv/kvserver/closed_timestamp_test.go, line 237 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is something guaranteeing that you are actually seeing follower reads? As far as I can tell, they might all go to the leaseholder?

We send each request to each replica directly (see repl.Send), so there's no intermediate routing, but I guess there's nothing ensuring that the lease doesn't move around to allow for all N reads to be performed on a leaseholder. We don't seem to test against that kind of thing anywhere in this file. Is that the hazard you're referring to, or did you miss the repls[i] part?


pkg/kv/kvserver/replica.go, line 1033 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This seems unfortunate. Is there a better option? Can we just get a lease status?

Done.


pkg/kv/kvserver/replica_follower_read.go, line 40 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

What was the motivation for this diff? Just trying to understand if there's something I'm missing here because it looks like behavior is unchanged.

Yes, the behavior is unchanged and surprisingly tested by TestLearnerAndJointConfigFollowerRead. I didn't want to hit the "replicas cannot serve follower reads" event for writes.


pkg/kv/kvserver/replica_read.go, line 47 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Don't like this either, but I guess we previously also passed a mostly zero status in, so maybe it's fine

I'm a little surprised. I figured you'd like that we were no longer passing around duplicate state - the status of a lease at a particular timestamp and the timestamp that the status was captured at. In the case where we don't require a lease, we set the timestamp of the status to indicate that the "lease is not set, but the current timestamp is ...". I don't feel too strongly about this though. I can revert this commit if you'd prefer, we'd just be passing a larger payload across batchExecutionFn.


pkg/kv/kvserver/replica_send.go, line 200 at r2 (raw file):

canServeFollowerRead assumes a read-only batch

Good catch. Ever since 278a21b#diff-66f78d299d9a44a2e5f43197b799669eR57, this hasn't been true. But the comment wasn't updated to reflect this. Done.


pkg/kv/kvserver/replica_send.go, line 275 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Just curious, under what circumstance would you expect to hit this path? I'm worried that we have no test coverage for this (do we) and that we'd be better off without this.

Yeah, this was added to avoid NotLeaseHolderErrors when a request is delayed indefinitely in the concurrency manager while waiting for locks and emerges to find that the replica's lease has changed. But I guess this would only help in cases where the lease was transferred away and back again, which seems unlikely. Since it's not tested and pretty difficult to test, I'll remove it.


pkg/kv/kvserver/replica_write.go, line 82 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

From this point on, status is (at least partially) set when it wasn't before. Not an issue, right?

Let's discuss above. I was thinking that this was a simplification, but maybe not.


pkg/kv/kvserver/replica_write.go, line 71 at r2 (raw file):

It's also striking how destroying the replica avoids the latch manager.

Yeah, I was wondering about this too. It would be very nice to flush out all requests using the latch manager, but I fear we might have some dependencies where a request can't make progress in raft until a replica is destroyed but the replica can't be destroyed (if it grabs latches) until the request releases its latches. Maybe that's just FUD.

I'm going to leave this TODO here for now and try to dig in more later.


pkg/kv/kvserver/replica_write.go, line 124 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is this just muscle memory or do you routinely check these?

I didn't check in this case, but it looked like the kind of thing that would cause the object to escape on the happy path, which we don't want. I just confirmed that it would have: replica_write.go:124:3: moved to heap: curLeaseCpy.
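For anyone repeating this kind of check, the compiler's escape analysis output can be inspected with -gcflags='-m'. A toy illustration of the copy-before-taking-the-address pattern (not the replica_write.go code itself; all names here are made up):

```go
// Build with: go build -gcflags='-m' ./...
// and look for "moved to heap" diagnostics.
package main

import "fmt"

type lease struct {
	Sequence int64
	Holder   string
}

func check(cur lease, proposed int64) error {
	if cur.Sequence > proposed {
		// Copy so that only the copy escapes via the error below; taking
		// &cur directly would force a heap allocation on every call, even
		// on the happy path.
		curCpy := cur
		return fmt.Errorf("stale proposal under lease %+v", &curCpy)
	}
	// Happy path: cur never has its address taken, so it stays on the stack.
	return nil
}

func main() {
	fmt.Println(check(lease{Sequence: 5, Holder: "n1"}, 4))
}
```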


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 391 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

timestampare

Done.


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 396 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

this is the line that irks me, a method called readTimestamp shouldn't return fake future timestamps

Let's discuss below.


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 278 at r4 (raw file):

sequential intents handling

I don't know that I understand what you mean. Sequential as opposed to handling (pushing + resolving) them all in parallel? We don't do anything like that in the lockTable itself, so I don't think it's worth optimizing out here for followers.


pkg/kv/kvserver/concurrency/lock_table.go, line 1674 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Hmm I guess none of this is new. Do you share my concerns here at all? Both of us won't mess this up, but younger generations will and right now I couldn't blame them. Perhaps we should explicitly start talking about a "read interval" which is [readTS,maxTS] (maybe right-open, don't remember)

Yes, I absolutely understand your concern. We're not really talking about an interval though, right? We're talking about the maximum timestamp at which non-locking reads conflict with exclusive locks. In this case, that timestamp happens to be max(readTS,maxTS). "readTimestamp" is a fairly lazy way to talk about this and introduces room for confusion, but I'm also not sure of anything better. Maybe I need to dig through some literature on MVCC to get some inspiration.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 432 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

For my edification, is there functionally a difference between this commit vs its parent? Before, we would populate observed timestamps before entering concurrency control (b/c of your previous commit), and so this code would "work". After, we use the already adjusted MaxTimestamp directly. It's the same, and this is just simplifying things? Or am I missing something?

You're not missing anything. I made this commit in isolation at first to avoid the bug described in the commit message. Then I realized I needed the previous commit to avoid the "request timestamp less than PushTo timestamp" error. With that previous change, this commit no longer contains a behavior change like you point out, but it does make me a lot more confident that if this push/resolve returns successfully, we'll trigger a state transition in the lockTable that allows our request to proceed. Without that guarantee, we risk deadlock.

Member

@tbg tbg left a comment


I'm more inclined to include an assertion that the txn's MaxTimestamp is below clock.Now() by the time it enters the concurrency manager.

SGTM

I considered that, but doing so breaks the guarantee we need from pushLockTxn that if it returns successfully, it will have triggered a lockTable state transition that allows the request to proceed.

Hmm.. maybe I'm misunderstanding, but isn't it true that the min() will always result in the MaxTimestamp (since it's already using a timestamp observed from the node)? Were you assuming I wanted you to back out that part?

Also, apologies for assigning a review to you while you're off.

No worries. I'll push back if needed.

No need to wait for turn-around from me until you have something that you're legit unsure about. Thanks for the quick fix!

Reviewed 20 of 20 files at r5, 4 of 4 files at r6, 11 of 11 files at r7, 11 of 11 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/kv/kvserver/replica_read.go, line 47 at r1 (raw file):

I figured you'd like that we were no longer passing around duplicate state

I do like that! I'm just generally wary of partially populated structs as this tends to leave them in a weird, sometimes bug-inviting, state. I think this is fine here though, sorry about the noise.


pkg/kv/kvserver/replica_write.go, line 71 at r2 (raw file):

but I fear we might have some dependencies where a request can't make progress in raft until a replica is destroyed but the replica can't be destroyed (if it grabs latches) until the request releases its latches.

I think that's a legitimate concern, but we just have to handle it. Roughly, step 1 above sets a destroy status that also prevents handleRaftReady from carrying out any more cycles. step 2 cancels all proposals including the raft-owned ones (which now makes sense since we know the state machine won't transition any more). This starts looking a lot like the current code, but by grabbing the latches in addition before deleting we serialize cleanly with inflight reads, so we address exactly the open questions we have here.


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 391 at r7 (raw file):
Maybe uncertainReadTimestamp() is a decent enough compromise? You could add this to the comment

This timestamp is the upper boundary of the uncertainty interval of this request. In particular, it is a timestamp that may not have been generated by a clock yet, and which must not be handed to any clock (and thus, must not enter the timestamp cache).

This value was already reflected by LeaseStatus.Timestamp, so the return
value was redundant.
This commit pulls the replica leaseholder check on the read and write
evaluation paths above concurrency manager sequencing. This change makes
semantic sense because the contents of the concurrency manager and its
lockTable are a reflection of the state of the range, so they should
only be exposed to requests if the request is evaluating on a
leaseholder replica or has otherwise been permitted to serve follower reads. Up to this point, this hasn't been
a real issue because the lockTable was purely a best-effort representation
of the state of a range.

There are a few strong reasons to make this change:
- it makes it easier to fix cockroachdb#46148, a current release blocker
- it allows the MaxTimestamp of transactions to be bound before sequencing
  in the lockTable. This has the effect of ensuring that all transaction
  pushes are to a timestamp less than or equal to the current node's HLC
  clock.
- it seems like the order we would want things in if/when we begin to
  address consistent read replicas (cockroachdb#39758), where the concurrency manager
  sequencing step becomes a distributed operation that depends on the state
  of the current lease.

One concern here is that the concurrency manager sequencing can take an
arbitrarily long amount of time because it blocks on conflicting locks.
It would be unfortunate if this led to an increased number of rejected
proposals due to lease sequence mismatches. To address this, the commit
performs a best-effort check before evaluating and proposing a write
that the lease it will be proposing under is still active. This check is
similar to the best-effort context cancellation check we perform before
proposing.
@danhhz
Contributor

danhhz commented Mar 19, 2020

Subscribe

Before the lockTable refactor, we used to push the timestamp of
conflicting intents to the observed timestamp from the current
leaseholder node (f78c17a). We made an attempt to retain this behavior
with the new lockTable.

Pushing the intent outside of the pusher's uncertainty window is
necessary to ensure that the pusher doesn't continue to conflict with
the intent after the push had completed. Pushing to just an observed
timestamp instead of the pusher's max timestamp is an optimization to
avoid pushing the intent "too far" and being more disruptive to the
pushee than is strictly necessary.

However, using the observed timestamp while pushing seems fragile now
that we rely on the push triggering a state transition in the lockTable
that will allow the current request to proceed. Until the previous
commit, the lockTable was consulted before we limited a transaction's
max timestamp due to observed timestamps, so this "optimization" could
have theoretically (I never actually saw this in practice) prevented a
push from clearing an intent out of its way. For instance, if a txn
(read ts: 5, observed ts: 10, max ts: 15) pushed an intent to just ts
10.Next(), that wouldn't be enough for the txn to proceed after the
push.

Now that we bound the max timestamp to the observed timestamp before
entering the concurrency manager, this stall should no longer be
possible, but the interaction still seems fragile and ripe for error. This
commit therefore always pushes intents past the timestamp that the lock
table uses to determine conflicts, making the whole interaction less fragile.
Fixes cockroachdb#46148.

This commit fixes a bug where follower reads that hit intents could get
stuck in an indefinite loop of running into the intent during evaluation,
not adding the intent to the lock-table because the lock table was
disabled, sequencing in the concurrency manager without issue, and
repeating. The new TestClosedTimestampCanServeWithConflictingIntent test
hits exactly this issue before this commit.

The fix implemented here is to immediately push the transaction
responsible for an intent when serving a follower read (i.e. when a
replica's lock-table is disabled). This ensures that the intent gets
cleaned up if it was abandoned and avoids the busy loop we see today.
If/when lockTables are maintained on follower replicas by propagating
lockTable state transitions through the Raft log in the
ReplicatedEvalResult instead of through the (leaseholder-only)
LocalResult, we should be able to remove the lockTable "disabled" state
and, in turn, remove this special-case.

The alternative approach floated to address this was to simply pass a
NotLeaseHolderError back to the client when an intent is hit on a
follower. This would have worked to avoid the infinite loop, but it
seems like a short-term patch that doesn't get to the root of the issue.
As we push further on follower reads (or even consistent read replicas),
we want non-leaseholders to be able to perform conflict resolution.
Falling back to the leaseholder works counter to this goal. The approach
implemented by this commit works towards this goal, simply falling back
to the previous sub-optimal approach of pushing immediately during
conflicts.

Release note (bug fix): Follower reads that hit intents no longer have
a chance of entering an infinite loop. This bug was present in earlier
versions of the v20.1 release.

Release justification: fixes a high-priority bug where follower reads
could get stuck indefinitely if they hit an abandoned intent.
Member Author

@nvanbenschoten nvanbenschoten left a comment


TFTR!

I'm more inclined to include an assertion that the txn's MaxTimestamp is below clock.Now() by the time it enters the concurrency manager.

This didn't go very well. It turns out that most unit tests that use a testContext violate this assertion because they don't manually bump the clock before issuing requests. Tests that actually perform pushes already manipulate their clock such that things behave correctly. As you pointed out, nothing will actually go wrong here in practice because "PushCmd refuses to push in the future, so at worst we'd see spurious errors here", but I do still wish this was more explicit.

For now, though, I don't want this to hold up this PR, given that this is blocking a beta and this aspect of the change is nothing new.

Hmm.. maybe I'm misunderstanding, but isn't it true that the min() will always result in the MaxTimestamp (since it's already using a timestamp observed from the node)? Were you assuming I wanted you to back out that part?

No, I think you're understanding correctly. Min should always result in MaxTimestamp if MaxTimestamp is bound to the txn's observed timestamp. But if that was ever violated then I'd rather have things blow up louder than a very rare txn deadlock. Even what we have now, the error returned from PushTxn request, seems better than that.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/replica_read.go, line 47 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I figured you'd like that we were no longer passing around duplicate state

I do like that! I'm just generally wary of partially populated structs as this tends to leave them in a weird, sometimes bug-inviting, state. I think this is fine here though, sorry about the noise.

Ack. Let's see how this ages. We can always go back on this decision.


pkg/kv/kvserver/replica_write.go, line 71 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

but I fear we might have some dependencies where a request can't make progress in raft until a replica is destroyed but the replica can't be destroyed (if it grabs latches) until the request releases its latches.

I think that's a legitimate concern, but we just have to handle it. Roughly, step 1 above sets a destroy status that also prevents handleRaftReady from carrying out any more cycles. step 2 cancels all proposals including the raft-owned ones (which now makes sense since we know the state machine won't transition any more). This starts looking a lot like the current code, but by grabbing the latches in addition before deleting we serialize cleanly with inflight reads, so we address exactly the open questions we have here.

I filed #46329 to continue tracking this.


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 391 at r7 (raw file):
I changed to readConflictTimestamp and writeConflictTimestamp to indicate that these are about the timestamps that the request conflicts with other locks at, not necessarily about the timestamps that the request operates at. I think that addresses the root of the concern.

You could add this to the comment

That doesn't really fit anymore, especially now that we do expect this timestamp to be below the local clock reading.
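As a rough sketch of what the renamed helpers convey (hypothetical Timestamp, txn, and request types, not the actual concurrency package):

```go
package main

import "fmt"

// Timestamp is a stand-in for an HLC timestamp.
type Timestamp struct{ WallTime int64 }

func (t *Timestamp) Forward(o Timestamp) {
	if o.WallTime > t.WallTime {
		*t = o
	}
}

// txn is a stand-in for the transaction proto fields used below.
type txn struct {
	ReadTimestamp  Timestamp
	WriteTimestamp Timestamp
	MaxTimestamp   Timestamp
}

// request is a stand-in for a request being sequenced.
type request struct {
	Txn       *txn
	Timestamp Timestamp // used by non-transactional requests
}

// readConflictTimestamp is the maximum timestamp at which the request's
// non-locking reads conflict with locks held by other transactions; for a
// transactional request it includes the uncertainty window.
func (r request) readConflictTimestamp() Timestamp {
	ts := r.Timestamp
	if r.Txn != nil {
		ts = r.Txn.ReadTimestamp
		ts.Forward(r.Txn.MaxTimestamp)
	}
	return ts
}

// writeConflictTimestamp additionally accounts for the transaction's
// provisional write timestamp, which its own writes will be performed at.
func (r request) writeConflictTimestamp() Timestamp {
	ts := r.readConflictTimestamp()
	if r.Txn != nil {
		ts.Forward(r.Txn.WriteTimestamp)
	}
	return ts
}

func main() {
	req := request{Txn: &txn{
		ReadTimestamp:  Timestamp{5},
		WriteTimestamp: Timestamp{7},
		MaxTimestamp:   Timestamp{10},
	}}
	fmt.Println(req.readConflictTimestamp(), req.writeConflictTimestamp())
}
```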


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 310 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

@sumeerbhola should sign off on this in particular. It looks.. fine to me? But I couldn't tell if there was a problem here.

I'm going to merge this without @sumeerbhola because it looks like this is holding up the next beta, but I'd be happy to revisit anything in here once Sumeer gets a chance to take a look.

@nvanbenschoten
Member Author

bors r+

@craig
Contributor

craig bot commented Mar 19, 2020

Build succeeded

Successfully merging this pull request may close these issues.

kv: follower read that hits intent may get stuck in endless loop