storage: allow HeartbeatTxn and EndTxn requests to create txn records #33396

nvanbenschoten · 2018-12-28T19:38:00Z

Informs #25437.
Informs #32971.

This is the first part of addressing #32971. Part two will update concurrent txns to not immediately abort missing txn records and part three will update the txn client to stop sending BeginTxn records.

The change closely resembles what was laid out in the corresponding RFC sections. HeartbeatTxn and EndTxn are both updated to permit the creation of transaction records when they finds that one is missing.

To prevent replays from causing issues, they both check the write timestamp cache when creating a transaction record. This is one area where the change diverges from the RFC. Instead of having EndTxn always check the write timestamp cache, it only does so if it is creating a txn record from scratch. To make this safe, HeartbeatTxn also checks the write timestamp cache if it is creating a txn record from scratch. This prevents a long-running transaction from increasingly running the risk of being aborted by a lease transfer as its EndTxn continues to be delayed. Instead, a lease transfer will only abort a transaction if it comes before the transaction record creation, which should be within a second of the transaction starting.

The change pulls out a new CanCreateTxnRecord method, which has the potential of being useful for detecting eagerly GCed transaction records and solving the major problem from the RFC without an extra QueryIntent. This is what Andrei was pushing for before. More details are included in TODOs within the PR.

Migration Safety

Most of the changes to the transaction state machine are straightforward to validate. Transaction records can still never be created without checking for replays and only an EndTxn from the client's sync path can GC an ABORTED txn record. This means that it is still impossible for a transaction to move between the two finalized statuses (at some point it would be worth it to draw this state machine).

The one tricky part is ensuring that the changes in this PR are safe when run in mixed-version clusters. The two areas of concern are:

lease transfers between a new and an old cluster on a range that
should/does contain a transaction record.
an old txn client that is not expecting HeartbeatTxn and EndTxn requests to create txn records if they are found to be missing.
an old txn client may not expect a HeartbeatTxn to hit the write timestamp cache if run concurrently with an EndTxn.

The first concern is protected by the write timestamp cache. Regardless of which replica creates that transaction record from a BeginTxn req or a HeartbeatTxn req (on the new replica), it will first need to check the timestamp cache. This prevents against any kind of replay that could create a transaction record that the old replica is not prepared to handle.

We can break the second concern into two parts. First, an old txn client will never send an EndTxn without having sent a successful BeginTxn, so the new behavior there is not an issue. Second, an old txn client may send a HeartbeatTxn concurrently with a BeginTxn. If the heartbeat wins, it will create a txn record on a new replica. If the BeginTxn evaluates on a new replica, it will be a no-op. If it evaluates on an old replica, it will result in a retryable error.

The third concern is specific to the implementation of the heartbeat loop itself. If a HeartbeatTxn loses a race with an EndTxn, it may get an aborted error after checking the timestamp cache. This is desirable from the replica-side, but we'd need to ensure that the client will handle it correctly. This PR adds some protection (see sendingEndTxn) to the txn client to prevent this case from causing weirdness on the client, but I don't think this could actually cause issues even without this protection.
The reason is that the txn client might mark itself as aborted due to the heartbeat, but this will be overwritten when the EndTxn returns and the sql client will still hear back from the successful EndTxn. This must have actually always been an issue because it was always possible for a committed txn record to be GCed and then written as aborted later, at which point a concurrent heartbeat could have observed it.

I'd like Andrei to sign off on this last hazard, as he's thought about this kind of thing more than anybody.

cockroach-teamcity · 2018-12-28T19:38:10Z

This change is

tbg

Looks good! Thanks for breaking this into manageable chunks. As far as I'm concerned adding additional commits (as opposed to massaging the existing ones with feedback) is fine.

I think that for doc purposes, an overview over when the write timestamp cache is supposed to be populated would be useful. I just went back to the RFC and didn't find it there, neither. I think what happens right now on master is that EndTransaction updates the write timestamp cache and BeginTransaction checks it. (I think) You're going to make it such that "whoever creates the transaction record for the client has to check it (which excludes pushes)".

Reviewed 2 of 2 files at r1, 12 of 12 files at r2, 5 of 5 files at r3, 2 of 2 files at r4, 14 of 14 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/kv/txn_interceptor_heartbeat.go, line 267 at r5 (raw file):

	if haveEndTxn {
		h.mu.sendingEndTxn = false

Isn't this still racy? The EndTxn might return (and we're setting status to COMMITTED), and then an inflight heartbeat returns and we flip to ABORTED?
The reason for resetting here in the first place is that there could've been a transaction retry error, right?

pkg/kv/txn_interceptor_heartbeat.go, line 497 at r5 (raw file):

// The asyncAbortCallbackLocked callback is also called.
func (h *txnHeartbeat) abortTxnAsyncLocked(ctx context.Context) {
	// Stop the heartbeat loop if it is still running.

What's up with this change?

pkg/roachpb/data.proto, line 380 at r2 (raw file):

  // See comments on Transaction proto.
	storage.engine.enginepb.TxnMeta meta = 1  [(gogoproto.nullable) = false, (gogoproto.embed) = true];
	TransactionStatus status             = 4;

I think we use spaces instead of tabs in proto files. Or so it seems from the diff.

pkg/roachpb/data_test.go, line 566 at r2 (raw file):

	// strips out fields but is lossless for the desired fields.
	txn := nonZeroTxn
	txnRecord := txn.AsRecord()

assert.NoError(t, zerofields.NoZeroField(&txnRecord) to keep test from rotting

pkg/settings/cluster/cockroach_versions.go, line 63 at r3 (raw file):

	VersionImportFormats
	VersionSecondaryLookupJoins
	VersionClientSideWritingFlag // unused

Always nice to see the final payoff of such a migration.

pkg/storage/replica_test.go, line 3683 at r5 (raw file):

	for i, test := range testCases {
		// Establish existing txn state by writing directly to range engine.
		existTxn := txn.Clone()

Make this .AsRecord().

pkg/storage/replica_test.go, line 3799 at r5 (raw file):

// TestRaftReplayProtectionInTxn verifies that transactional batches
// enjoy protection from "Raft retries".
func TestRaftRetryProtectionInTxn(t *testing.T) {

Hmm, I think the explanation here has rotted. We're protected from "Raft retries" thanks to the lease applied index. I think what we're testing here are RPC-level retries (i.e. DistSender delivers a batch twice).

pkg/storage/replica_test.go, line 3801 at r5 (raw file):

func TestRaftRetryProtectionInTxn(t *testing.T) {
	defer leaktest.AfterTest(t)()
	defer setTxnAutoGC(true)()

This line can go, it was introduced when the default was false.

pkg/storage/replica_test.go, line 3831 at r5 (raw file):

	// transaction, which fails the 1PC path and forces the txn to execute
	// normally at which point the WriteTooOld gets indirectly turned into a
	// TransactionAbortedError.

Isn't something fishy here? If this is indeed a real-world RPC retry and the first RPC is the one that gets lost (i.e. the client receives the response from the second one), it'll receive TxnAbortedError even though the first one might've committed the Txn. Is it that DistSender would turn that TxnAbortedError into an ambiguous one?

pkg/storage/replica_test.go, line 3892 at r5 (raw file):

}

// TestRaftRetryCantCommitIntents tests that transactional Raft retries cannot

Ditto about what we're testing.

pkg/storage/replica_test.go, line 3901 at r5 (raw file):

// the reproposal not execute if the original proposal had already been
// applied.
// - in case of Raft *retries*, in most cases we know that the original proposal

Ah, this is what a Raft retry is. I think @andreimatei is just about to remove this in #33381.

pkg/storage/replica_tscache.go, line 76 at r5 (raw file):

				// throughout the transaction.
				// TODO(nvanbenschoten): This assumption wouldn't be needed
				// if we compared against for the OrigTimestamp. See

s/for//

pkg/storage/replica_tscache.go, line 214 at r5 (raw file):
Specify what timestamp for the transaction you're talking about (the provisional commit timestamp):

Either way, transaction records with provisional commit timestamps at or below this timestamp should not be written.

pkg/storage/batcheval/cmd_push_txn.go, line 109 at r2 (raw file):

	// Fetch existing transaction; if missing, we're allowed to abort.
	existTxn := &roachpb.Transaction{}

Shouldn't this unmarshal into a TxnRecord? Making this change throughout would give me confidence that there aren't any stray usages of other fields.
We should also put a stringer on PushTxnResponse.PusheeTxn which prints the result as a TxnRecord (to avoid printing all the other trivial fields which might create confusion during debugging). Or even better, we change the type on PusheeTxn if we know that this is safe.

A similar concern applies to the other methods that currently populate a *Transaction.

pkg/storage/batcheval/eval_context.go, line 89 at r5 (raw file):

	// it is empty then the tombstone is an implicit artifact of the timestamp
	// cache's low-water mark. Either way, transactions should not write records
	// at or below this timestamp.

I think this implicitly relies on the fact that anyone who calls this also declares (at least) a read on the txn id and that those populating it declare a write. Kind of obvious, but worth pointing out.
I wonder why this method takes the whole txn.

pkg/storage/batcheval/transaction.go, line 104 at r5 (raw file):

// cmd_query_txn.go to determine whether a transaction record is allowed to be
// synthesized to get around the complications of eagerly GCed transaction
// records. I believe this is exactly what @andreimatei was suggesting. To do

Link to the suggestion.

pkg/storage/batcheval/transaction.go, line 105 at r5 (raw file):

// synthesized to get around the complications of eagerly GCed transaction
// records. I believe this is exactly what @andreimatei was suggesting. To do
// this, we'll need to address the TODO comment in below. If we do make this the

s/in//?

pkg/storage/batcheval/transaction.go, line 139 at r5 (raw file):

		// definitive decisions about whether a transaction record will *never*
		// be able to be written. We can't allow a concurrent txn to use this to
		// decided that a transaction can never write a txn record, then have

s/decided/decide/

I agree with the TODO.

bdarnell

Reviewed 2 of 2 files at r1, 12 of 12 files at r2, 5 of 5 files at r3, 2 of 2 files at r4, 14 of 14 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/roachpb/data.proto, line 371 at r2 (raw file):

// be persisted in a transaction record. It also serves as a specification
// for the fields that must be present in a transaction record.
message TransactionRecord {

Comment that any changes to this type must be reflected in the AsRecord/AsTransaction methods.

pkg/storage/replica_tscache.go, line 220 at r5 (raw file):

	// Use the transaction key to look for an entry in the write timestamp cache.
	key := keys.TransactionKey(txn.Key, txn.ID)
	return r.store.tsCache.GetMaxWrite(key, nil /* end */)

This feels like a slight abuse of the tscache txnID payload - normally it's used to allow txns to operate on keys they've already written, instead of for a txn to block itself out. I think it works here because we don't use txn records in normal MVCC operations, and I don't have a better idea right now, but it seems like something that might come back to bite us.

Specifically, at least in the read half of the tscache, the txn id gets cleared if two txns attempt to read the same key, so they both perceive the read as being done by a different transaction instead of themselves. If a similar situation happened here it would be interpreted as a low water mark instead of an interaction between two txns.

pkg/storage/batcheval/eval_context.go, line 84 at r5 (raw file):

	GetSplitQPS() float64

	// GetTxnTombstoneFromTimestampCache returns the tombstone timestamp for the

What is the "tombstone timestamp for the transaction"? I don't think that's a term we use/define anywhere else.

pkg/storage/batcheval/eval_context.go, line 88 at r5 (raw file):

	// is not empty then the tombstone was explicitly set by an EndTxn req. If
	// it is empty then the tombstone is an implicit artifact of the timestamp
	// cache's low-water mark. Either way, transactions should not write records

s/records/txn records/

Which of the txn record's several timestamps matters here?

andreimatei

Good stuff!

What do you think about including information in the responses to requests that can now create txn record (heartbeat and EndTxn) about whether a record was created? I think it might prove useful for testing and tracing.

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/kv/integration_test.go, line 71 at r5 (raw file):

func TestDelayedBeginRetryable(t *testing.T) {
	defer leaktest.AfterTest(t)()
	t.Skip(`WIP: @andrei what should we do here? Now that the timestamp

we discussed - the test should prolly be deleted once there's no more BeginTxn.
In the meantime, skip it, or investigate why it's flaky with the updated error.

pkg/kv/txn_interceptor_heartbeat.go, line 105 at r5 (raw file):

The field is not needed for client-perceived correctness,

Say that the flipping should not be observed by clients since they're not supposed to be using the txn concurrently with an EndTxn

pkg/kv/txn_interceptor_heartbeat.go, line 497 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

What's up with this change?

indeed, what's up?

pkg/roachpb/data.proto, line 258 at r2 (raw file):

  // A free-text identifier for debug purposes.
  string name = 2;
  reserved 3;

nit: i'd put this either at the top or at the bottom

pkg/roachpb/data_test.go, line 562 at r2 (raw file):

}

func TestTransactionRecordRoundtrips(t *testing.T) {

please spell out that this test is about Transaction->TransactionRecord->Transaction, and please give a reminder about how TransactionRecord has a subset of fields and how wire compatibility is required because somebody unmarshalls TR as T.

pkg/roachpb/errors.proto, line 209 at r5 (raw file):

  // A possible replay caused by duplicate begin txn or out-of-order
  // txn sequence number.
  // TODO(nvanbenschoten): This is no longer in use. Remove.

remove in 2.4, right?

pkg/settings/cluster/cockroach_versions.go, line 63 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Always nice to see the final payoff of such a migration.

thanks for the cleanup, indeed.
But why not delete this constant? That's the ultimate victory.

pkg/storage/replica_tscache.go, line 215 at r5 (raw file):

// cache's low-water mark. Either way, transactions should not write records
// at or below this timestamp.
func (r *Replica) GetTxnTombstoneFromTimestampCache(

I don't like this interface very much. I'll read this method and its comment in two weeks and I won't understand very much. One problem is that we return a txn id but the caller doesn't actually use the value. Another is that I don't like that the caller is in charge of dealing with TxnSpanGCThreshold separately from dealing with this method. It seems that that check could be encapsulated.

How about GetTxnRecordHistory() (or GetTxnRecordCreationObjection() or IsTxnRecordRejected()) returning an enum { None, - caller go ahead and create the txn rec Finalized, - replay Ambiguous, - ts cache was bumped, somebody might have aborted us TooOld - all the history might have been GC-ed. Why are you even bothering me for this old txn? }
Or as you've suggested, return a TransactionAbortedReason directly.

Also as we discussed, returning a timestamp from this method, and further propagating it through the TransactionAbortedError seems unnecessary, if not plainly bad, so getting rid of that part is good by itself.

pkg/storage/batcheval/cmd_begin_transaction.go, line 139 at r2 (raw file):

	// Write the txn record.
	txnRecord := reply.Txn.AsRecord()
	return result.Result{}, engine.MVCCPutProto(ctx, batch, cArgs.Stats, key, hlc.Timestamp{}, nil, &txnRecord)

long line :)

pkg/storage/batcheval/cmd_end_transaction.go, line 180 at r5 (raw file):

		return result.Result{}, err
	} else if !ok {
		// No existing transaction record was found - create one.

mind giving some hints about how the record is "created" exactly? I don't see a call to the batch to write it. Same in HeartbeatTxn

pkg/storage/batcheval/cmd_end_transaction.go, line 185 at r5 (raw file):

		// Verify that it is safe to create the transaction record.
		if args.Commit {

so what about rollbacks? They don't need replay protection, I guess? Consider adding a comment saying something.

pkg/storage/batcheval/cmd_end_transaction.go, line 349 at r5 (raw file):

	// WriteTooOldError). However, the replayed intent cannot be resolved by a
	// subsequent replay of this EndTransaction call because the txn timestamp
	// will be too old. Replays which include a BeginTransaction never succeed

I think this comment can be updated around BeginTransaction

pkg/storage/batcheval/cmd_push_txn.go, line 109 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Shouldn't this unmarshal into a TxnRecord? Making this change throughout would give me confidence that there aren't any stray usages of other fields.
We should also put a stringer on PushTxnResponse.PusheeTxn which prints the result as a TxnRecord (to avoid printing all the other trivial fields which might create confusion during debugging). Or even better, we change the type on PusheeTxn if we know that this is safe.

A similar concern applies to the other methods that currently populate a *Transaction.

+1 on unmarshalling as TR here

pkg/storage/batcheval/transaction.go, line 97 at r5 (raw file):

}

// CanCreateTxnRecord determines whether a transaction record can be created

please put more words in this comment hinting to what this check is about

pkg/storage/batcheval/transaction.go, line 110 at r5 (raw file):

func CanCreateTxnRecord(
	ctx context.Context, rec EvalContext, txn *roachpb.Transaction,
) (ok bool, errTxn *roachpb.Transaction, err error) {

please document the ret vals - particularly errTxn.
Since the only error this returns is TransactionAbortedErr, would it be better for this method to return a pErr and put the updated txn in it?

And nit: rename ok and err to _ - they're far too common names for introducing them in the namespace incidentally. Ideally we'd only name return vals when the names are needed by defers; in other cases I'd like to hint at what they are by doing (_ /* ok / bool, _ / errTxn */ roachpb.Transaction, ..), but our formatter makes that look pretty ugly.

pkg/storage/batcheval/transaction.go, line 161 at r5 (raw file):

which may have been written before this entry

Can you clarify which "entry" this is talking about?

nvanbenschoten

TFTRs!

I've gotten through the comments about #33257 so far and updated that PR accordingly.

What do you think about including information in the responses to requests that can now create txn record (heartbeat and EndTxn) about whether a record was created? I think it might prove useful for testing and tracing.

It sounds like a decent idea, but I'll defer until we would find it useful.

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/roachpb/data.proto, line 258 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: i'd put this either at the top or at the bottom

Done.

pkg/roachpb/data.proto, line 371 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Comment that any changes to this type must be reflected in the AsRecord/AsTransaction methods.

Done.

pkg/roachpb/data.proto, line 380 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I think we use spaces instead of tabs in proto files. Or so it seems from the diff.

Done.

pkg/roachpb/data_test.go, line 562 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please spell out that this test is about Transaction->TransactionRecord->Transaction, and please give a reminder about how TransactionRecord has a subset of fields and how wire compatibility is required because somebody unmarshalls TR as T.

Done.

pkg/roachpb/data_test.go, line 566 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

assert.NoError(t, zerofields.NoZeroField(&txnRecord) to keep test from rotting

Nice, done.

pkg/storage/batcheval/cmd_push_txn.go, line 109 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

+1 on unmarshalling as TR here

I did this at first and went back and forth on it, but decided that unmarshalling into TxnRecords everywhere added unnecessary noise without much benefit. We'd be converting these to *Transaction objects almost immediately. Pushing the type safety into api.proto messages would make this more of a win, but at the cost of a dramatic proliferation of the type, which has its own downsides (i.e. we now have another variant of Transaction floating around).

My current thinking about this type is to view it as an optional "mask" on the fields in Transaction when writing transaction records (with the associated perf improvement). Readers of transaction records can use the type as documentation about the few fields that will be set in the unmarshaled Transaction. (I added as much to the type's comment)

nvanbenschoten

The rest of this is ready for another look! The biggest change I made was refining the interface added to batcheval.EvalContext. I think @andreimatei will like this better.

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/kv/integration_test.go, line 71 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

we discussed - the test should prolly be deleted once there's no more BeginTxn.
In the meantime, skip it, or investigate why it's flaky with the updated error.

Done.

pkg/kv/txn_interceptor_heartbeat.go, line 105 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

The field is not needed for client-perceived correctness,
Say that the flipping should not be observed by clients since they're not supposed to be using the txn concurrently with an EndTxn

Removed.

pkg/kv/txn_interceptor_heartbeat.go, line 267 at r5 (raw file):
Yeah, you're right, this is still racy. I removed it and added a TODO instead since this is nothing new.

The reason for resetting here in the first place is that there could've been a transaction retry error, right?

Yes. We could conditionally reset it, but I'd like this to be thought out more carefully before doing anything, and I don't want to do that in this change.

pkg/kv/txn_interceptor_heartbeat.go, line 497 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

indeed, what's up?

Drive by clean up of redundant code. This calls asyncAbortCallbackLocked immediately below, which calls TxnCoordSender.cleanupTxnLocked, which calls txnHeartbeat.closeLocked.

pkg/roachpb/errors.proto, line 209 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

remove in 2.4, right?

Done.

pkg/settings/cluster/cockroach_versions.go, line 63 at r3 (raw file):

But why not delete this constant? That's the ultimate victory.

The instructions for removing a version say not to because we only want to remove versions in-order. That idea looks to date back to #26967 (comment). I don't remember the reason for this though. @bdarnell do you?

pkg/storage/replica_test.go, line 3683 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Make this .AsRecord().

Done.

pkg/storage/replica_test.go, line 3799 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Hmm, I think the explanation here has rotted. We're protected from "Raft retries" thanks to the lease applied index. I think what we're testing here are RPC-level retries (i.e. DistSender delivers a batch twice).

Yeah, I suspect that this has been broken ever since command evaluation was pushed into propose. Fixed.

pkg/storage/replica_test.go, line 3801 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This line can go, it was introduced when the default was false.

Done.

pkg/storage/replica_test.go, line 3831 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Isn't something fishy here? If this is indeed a real-world RPC retry and the first RPC is the one that gets lost (i.e. the client receives the response from the second one), it'll receive TxnAbortedError even though the first one might've committed the Txn. Is it that DistSender would turn that TxnAbortedError into an ambiguous one?

Are we working in a model where the RPC could be duplicated and then a response could be returned from the second one without the client knowing or being the cause of the duplication? I don't see how we could avoid issues from cases like that without full idempotency everywhere, so I suspect that that's not actually the case and that a duplicate request requires some buy-in from DistSender, which is where we turn RPC errors into ambiguous commit errors.

pkg/storage/replica_test.go, line 3892 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto about what we're testing.

So this should stay then, right?

pkg/storage/replica_test.go, line 3901 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ah, this is what a Raft retry is. I think @andreimatei is just about to remove this in #33381.

Ack.

pkg/storage/replica_tscache.go, line 76 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

s/for//

Done.

pkg/storage/replica_tscache.go, line 214 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Specify what timestamp for the transaction you're talking about (the provisional commit timestamp):

Either way, transaction records with provisional commit timestamps at or below this timestamp should not be written.

Done.

pkg/storage/replica_tscache.go, line 215 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't like this interface very much. I'll read this method and its comment in two weeks and I won't understand very much. One problem is that we return a txn id but the caller doesn't actually use the value. Another is that I don't like that the caller is in charge of dealing with TxnSpanGCThreshold separately from dealing with this method. It seems that that check could be encapsulated.

How about GetTxnRecordHistory() (or GetTxnRecordCreationObjection() or IsTxnRecordRejected()) returning an enum { None, - caller go ahead and create the txn rec Finalized, - replay Ambiguous, - ts cache was bumped, somebody might have aborted us TooOld - all the history might have been GC-ed. Why are you even bothering me for this old txn? }
Or as you've suggested, return a TransactionAbortedReason directly.

Also as we discussed, returning a timestamp from this method, and further propagating it through the TransactionAbortedError seems unnecessary, if not plainly bad, so getting rid of that part is good by itself.

Done.

pkg/storage/replica_tscache.go, line 220 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This feels like a slight abuse of the tscache txnID payload - normally it's used to allow txns to operate on keys they've already written, instead of for a txn to block itself out. I think it works here because we don't use txn records in normal MVCC operations, and I don't have a better idea right now, but it seems like something that might come back to bite us.

Specifically, at least in the read half of the tscache, the txn id gets cleared if two txns attempt to read the same key, so they both perceive the read as being done by a different transaction instead of themselves. If a similar situation happened here it would be interpreted as a low water mark instead of an interaction between two txns.

We discussed this in person - this use of the tscache isn't new, although I agree it's somewhat strange. The txnID is used only to determine whether a transaction will be rejected with the reason ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY vs. the reason ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY. Either way, things work correctly.

pkg/storage/batcheval/cmd_end_transaction.go, line 180 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

mind giving some hints about how the record is "created" exactly? I don't see a call to the batch to write it. Same in HeartbeatTxn

Done.

pkg/storage/batcheval/cmd_end_transaction.go, line 185 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

so what about rollbacks? They don't need replay protection, I guess? Consider adding a comment saying something.

Done.

pkg/storage/batcheval/cmd_end_transaction.go, line 349 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think this comment can be updated around BeginTransaction

Good catch, how's that?

pkg/storage/batcheval/eval_context.go, line 84 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What is the "tombstone timestamp for the transaction"? I don't think that's a term we use/define anywhere else.

Done.

pkg/storage/batcheval/eval_context.go, line 88 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/records/txn records/

Which of the txn record's several timestamps matters here?

Done.

pkg/storage/batcheval/eval_context.go, line 89 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I think this implicitly relies on the fact that anyone who calls this also declares (at least) a read on the txn id and that those populating it declare a write. Kind of obvious, but worth pointing out.
I wonder why this method takes the whole txn.

Done.

pkg/storage/batcheval/transaction.go, line 97 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please put more words in this comment hinting to what this check is about

Done.

pkg/storage/batcheval/transaction.go, line 104 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Link to the suggestion.

Done.

pkg/storage/batcheval/transaction.go, line 105 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

s/in//?

Done.

pkg/storage/batcheval/transaction.go, line 110 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please document the ret vals - particularly errTxn.
Since the only error this returns is TransactionAbortedErr, would it be better for this method to return a pErr and put the updated txn in it?

And nit: rename ok and err to _ - they're far too common names for introducing them in the namespace incidentally. Ideally we'd only name return vals when the names are needed by defers; in other cases I'd like to hint at what they are by doing (_ /* ok / bool, _ / errTxn */ roachpb.Transaction, ..), but our formatter makes that look pretty ugly.

Done.

pkg/storage/batcheval/transaction.go, line 139 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

s/decided/decide/

I agree with the TODO.

Done.

pkg/storage/batcheval/transaction.go, line 161 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

which may have been written before this entry
Can you clarify which "entry" this is talking about?

Done.

nvanbenschoten · 2019-01-03T17:22:20Z

The fourth commit is also new. It stops explicitly using the timestamp from TransactionAbortedErrors as the minimum timestamp for new transactions. @andreimatei and I decided this was unnecessary because the coordinator's clock should already be forwarded by the timestamp from the corresponding BatchResponse. This permitted a simplification to the EvalContext interface.

bdarnell

Reviewed 1 of 22 files at r6, 1 of 2 files at r7, 2 of 2 files at r9, 14 of 14 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/settings/cluster/cockroach_versions.go, line 63 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

But why not delete this constant? That's the ultimate victory.

The instructions for removing a version say not to because we only want to remove versions in-order. That idea looks to date back to #26967 (comment). I don't remember the reason for this though. @bdarnell do you?

I think it's just because we were using iota and didn't want to fix up the sequence to preserve the old values. At the time of that comment we were confused about whether the old values needed to be preserved; later in the thread we realized that they're never persisted or written over the network so this doesn't matter.

nvanbenschoten · 2019-01-07T17:58:33Z

I think there's one issue here which is that OrigTimestamp can be moved forward between transaction epochs. This could allow a transaction that has incremented its timestamp and moved to a new epoch to recreate an aborted transaction record. We'll need to compare against EpochZeroTimestamp instead in CanCreateTxnRecord.

andreimatei

LGTM

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/roachpb/errors.proto, line 209 at r10 (raw file):

  // A possible replay caused by duplicate begin txn or out-of-order
  // txn sequence number.
  // TODO(nvanbenschoten): This is no longer in use. Remove in 2.4.

👍

pkg/settings/cluster/cockroach_versions.go, line 63 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think it's just because we were using iota and didn't want to fix up the sequence to preserve the old values. At the time of that comment we were confused about whether the old values needed to be preserved; later in the thread we realized that they're never persisted or written over the network so this doesn't matter.

so it sounds like we should just delete it, right?

pkg/storage/replica_eval_context_span.go, line 155 at r10 (raw file):

// record was rejected. If the method ever determines that a transaction record
// must be rejected, it will continue to reject that transaction going forwards.
func (rec SpanSetReplicaEvalContext) CanCreateTxnRecord(

can we centralize the comment for this method in one place (i.e. on the EvalContext interface) and link to that from here and from the replica implementation?

pkg/storage/replica_tscache.go, line 216 at r10 (raw file):

	// finalized. If there is one, then we return a retriable into an ambiguous
	// one higher up. Otherwise, if the client is still waiting for a result,

I think some words are missing here?

pkg/storage/batcheval/cmd_end_transaction.go, line 356 at r10 (raw file):

	// retry error. If the replay didn't attempt to create a txn record, any
	// push will immediately succeed as a missing txn record on push where
	// CanCreateTxnRecord returns false sets the transaction to aborted. In both

s/sets the transaction to aborted/succeeds

The setting is baked into 2.2 binaries. Release note: None

Updating and applying the timestamp cache are two sides of the same coin. It makes sense that these methods should live right next to each other. Release note: None

The global defaults to true. Release note: None

The local hlc should have been advanced to at least the error's timestamp by the time PrepareTransactionForRetry is called. Release note: None

nvanbenschoten

TFTRs!

I think there's one issue here which is that OrigTimestamp can be moved forward between transaction epochs. This could allow a transaction that has incremented its timestamp and moved to a new epoch to recreate an aborted transaction record. We'll need to compare against EpochZeroTimestamp instead in CanCreateTxnRecord.

I addressed this by using txn.EpochZeroTimestamp, which gives the desired semantics. I also added a comprehensive test with 45 separate scenarios that would have caught this issue and makes me feel more confident that the state machine correctly implements the one drawn out in #33523 (comment). @andreimatei you might be interested in taking a quick look at TestCreateTxnRecord. If not, I'll go ahead and merge this.

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/settings/cluster/cockroach_versions.go, line 63 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

so it sounds like we should just delete it, right?

I'll go through and do that in one sweep.

pkg/storage/replica_eval_context_span.go, line 155 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

can we centralize the comment for this method in one place (i.e. on the EvalContext interface) and link to that from here and from the replica implementation?

Sounds good. I'll do that in #33523 though, since this is being reworked there anyway.

pkg/storage/replica_tscache.go, line 216 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

	// finalized. If there is one, then we return a retriable into an ambiguous
	// one higher up. Otherwise, if the client is still waiting for a result,

I think some words are missing here?

Done.

pkg/storage/batcheval/cmd_end_transaction.go, line 356 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

	// retry error. If the replay didn't attempt to create a txn record, any
	// push will immediately succeed as a missing txn record on push where
	// CanCreateTxnRecord returns false sets the transaction to aborted. In both

s/sets the transaction to aborted/succeeds

Done.

Informs cockroachdb#25437. Informs cockroachdb#32971. This is the first part of addressing cockroachdb#32971. Part two will update concurrent txns to not immediately abort missing txn records and part three will update the txn client to stop sending BeginTxn records. The change closely resembles what was laid out in the corresponding RFC sections. `HeartbeatTxn` and `EndTxn` are both updated to permit the creation of transaction records when they finds that one is missing. To prevent replays from causing issues, they *both* check the write timestamp cache when creating a transaction record. This is one area where the change diverges from the RFC. Instead of having `EndTxn` always check the write timestamp cache, it only does so if it is creating a txn record from scratch. To make this safe, `HeartbeatTxn` also checks the write timestamp cache if it is creating a txn record from scratch. This prevents a long running transaction from increasingly running the risk of being aborted by a lease transfer as its `EndTxn` continues to be delayed. Instead, a lease transfer will only abort a transaction if it comes before the transaction record creation, which should be within a second of the transaction starting. The change pulls out a new `CanCreateTxnRecord` method, which has the potential of being useful for detecting eagerly GCed transaction records and solving the major problem from the RFC without an extra QueryIntent. This is what Andrei was pushing for before. More details are included in TODOs within the PR. _### Migration Safety Most of the changes to the transaction state machine are straightforward to validate. Transaction records can still never be created without checking for replays and only an EndTxn from the client's sync path can GC an ABORTED txn record. This means that it is still impossible for a transaction to move between the two finalized statuses (at some point it would be worth it to draw this state machine). The one tricky part is ensuring that the changes in this PR are safe when run in mixed-version clusters. The two areas of concern are: 1. lease transfers between a new and an old cluster on a range that should/does contain a transaction record. 2. an old txn client that is not expecting HeartbeatTxn and EndTxn requests to create txn records if they are found to be missing. 3. an old txn client may not expect a HeartbeatTxn to hit the write timestamp cache if run concurrently with an EndTxn. The first concern is protected by the write timestamp cache. Regardless of which replica creates that transaction record from a BeginTxn req or a HeartbeatTxn req (on the new replica), it will first need to check the timestamp cache. This prevents against any kind of replay that could create a transaction record that the old replica is not prepared to handle. We can break the second concern into two parts. First, an old txn client will never send an EndTxn without having sent a successful BeginTxn, so the new behavior there is not an issue. Second, an old txn client may send a HeartbeatTxn concurrently with a BeginTxn. If the heartbeat wins, it will create a txn record on a new replica. If the BeginTxn evaluates on a new replica, it will be a no-op. If it evaluates on an old replica, it will result in a retryable error. The third concern is specific to the implementation of the heartbeat loop itself. If a HeartbeatTxn loses a race with an EndTxn, it may get an aborted error after checking the timestamp cache. This is desirable from the replica-side, but we'd need to ensure that the client will handle it correctly. This PR adds some protection (see `sendingEndTxn`) to the txn client to prevent this case from causing weirdness on the client, but I don't think this could actually cause issues even without this protection. The reason is that the txn client might mark itself as aborted due to the heartbeat, but this will be overwritten when the EndTxn returns and the sql client will still hear back from the successful EndTxn. This must have actually always been an issue because it was always possible for a committed txn record to be GCed and then written as aborted later, at which point a concurrent heartbeat could have observed it. I'd like Andrei to sign off on this last hazard, as he's thought about this kind of thing more than anybody. Release note: None

nvanbenschoten · 2019-01-07T22:56:43Z

bors r+

33396: storage: allow HeartbeatTxn and EndTxn requests to create txn records r=nvanbenschoten a=nvanbenschoten Informs #25437. Informs #32971. This is the first part of addressing #32971. Part two will update concurrent txns to not immediately abort missing txn records and part three will update the txn client to stop sending BeginTxn records. The change closely resembles what was laid out in the corresponding RFC sections. `HeartbeatTxn` and `EndTxn` are both updated to permit the creation of transaction records when they finds that one is missing. To prevent replays from causing issues, they *both* check the write timestamp cache when creating a transaction record. This is one area where the change diverges from the RFC. Instead of having `EndTxn` always check the write timestamp cache, it only does so if it is creating a txn record from scratch. To make this safe, `HeartbeatTxn` also checks the write timestamp cache if it is creating a txn record from scratch. This prevents a long-running transaction from increasingly running the risk of being aborted by a lease transfer as its `EndTxn` continues to be delayed. Instead, a lease transfer will only abort a transaction if it comes before the transaction record creation, which should be within a second of the transaction starting. The change pulls out a new `CanCreateTxnRecord` method, which has the potential of being useful for detecting eagerly GCed transaction records and solving the major problem from the RFC without an extra QueryIntent. This is what Andrei was pushing for before. More details are included in TODOs within the PR. ### Migration Safety Most of the changes to the transaction state machine are straightforward to validate. Transaction records can still never be created without checking for replays and only an EndTxn from the client's sync path can GC an ABORTED txn record. This means that it is still impossible for a transaction to move between the two finalized statuses (at some point it would be worth it to draw this state machine). The one tricky part is ensuring that the changes in this PR are safe when run in mixed-version clusters. The two areas of concern are: 1. lease transfers between a new and an old cluster on a range that should/does contain a transaction record. 2. an old txn client that is not expecting HeartbeatTxn and EndTxn requests to create txn records if they are found to be missing. 3. an old txn client may not expect a HeartbeatTxn to hit the write timestamp cache if run concurrently with an EndTxn. The first concern is protected by the write timestamp cache. Regardless of which replica creates that transaction record from a BeginTxn req or a HeartbeatTxn req (on the new replica), it will first need to check the timestamp cache. This prevents against any kind of replay that could create a transaction record that the old replica is not prepared to handle. We can break the second concern into two parts. First, an old txn client will never send an EndTxn without having sent a successful BeginTxn, so the new behavior there is not an issue. Second, an old txn client may send a HeartbeatTxn concurrently with a BeginTxn. If the heartbeat wins, it will create a txn record on a new replica. If the BeginTxn evaluates on a new replica, it will be a no-op. If it evaluates on an old replica, it will result in a retryable error. The third concern is specific to the implementation of the heartbeat loop itself. If a HeartbeatTxn loses a race with an EndTxn, it may get an aborted error after checking the timestamp cache. This is desirable from the replica-side, but we'd need to ensure that the client will handle it correctly. This PR adds some protection (see `sendingEndTxn`) to the txn client to prevent this case from causing weirdness on the client, but I don't think this could actually cause issues even without this protection. The reason is that the txn client might mark itself as aborted due to the heartbeat, but this will be overwritten when the EndTxn returns and the sql client will still hear back from the successful EndTxn. This must have actually always been an issue because it was always possible for a committed txn record to be GCed and then written as aborted later, at which point a concurrent heartbeat could have observed it. I'd like Andrei to sign off on this last hazard, as he's thought about this kind of thing more than anybody. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

craig · 2019-01-07T23:12:32Z

Build succeeded

GitHub CI (Cockroach)

The changes in cockroachdb#33396 made it so that a `HeartbeatTxnRequest` that finds a missing transaction record will attempt to create the record. This means that it will discover if a transaction is not "committable" and return a TransactionAbortedError. Unfortunately, after a transaction commits and GCs its transaction record, it will also be considered not "committable". There will always be cases where a heartbeat request races with an EndTransaction request and incorrectly considers the transaction aborted (which is touched upon in a TODO a few lines down). However, in many of these cases, the coordinator already knows that the transaction is committed, so it doesn't need to attempt to roll back the transaction and clean up its intents. This commit checks for these cases and avoids sending useless rollbacks. The intention is to backport this to 19.1. Release note: None

Stumbled upon a function with an error in its return signature that never returns an error. Better to remove it and the stale comment that goes with it. The removal of the code paths which could have returned an error occurred in cockroachdb#33396. Release justification: Low risk, does not change logic. Could also hold off. Release note: None

41300: storage: more aggressively replica GC learner replicas r=ajwerner a=ajwerner This PR fixes a test flake in TestSystemZoneConfig: ``` client_replica_test.go:1753: condition failed to evaluate within 45s: mismatch between r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):7, (n7,s7):5, next=8, gen=14] r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):4, (n7,s7):5, (n3,s3):6LEARNER, next=7, gen=9] ``` The above flake happens because we set the expectation in the map to a descriptor which contains a learner which has since been removed. We shouldn't use a range descriptor which contains learners as the expectation. To avoid that we return an error in the succeeds soon if we come across a descriptor which contains learners. This behavior unvealed another issue, we are way too conservative with replica GC for learners. Most of the time when learners are removed they hear about their own removal, but if they don't we won't consider the Replica for removal for 10 days! This commit changes the replica gc queue behavior to treat learners line candidates. Fixes #40980. Release Justification: bug fixes and low-risk updates to new functionality. Release note: None 41308: storage: remove error from Replica.applyTimestampCache() r=ajwerner a=ajwerner Stumbled upon a function with an error in its return signature that never returns an error. Better to remove it and the stale comment that goes with it. The removal of the code paths which could have returned an error occurred in #33396. Release justification: Low risk, does not change logic. Could also hold off. Release note: None Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>

nvanbenschoten requested review from bdarnell, andreimatei and a team December 28, 2018 19:38

nvanbenschoten force-pushed the nvanbenschoten/lazyPt1 branch 3 times, most recently from 076f1a3 to cae84f1 Compare December 29, 2018 06:54

tbg reviewed Jan 2, 2019

View reviewed changes

bdarnell reviewed Jan 2, 2019

View reviewed changes

andreimatei suggested changes Jan 2, 2019

View reviewed changes

nvanbenschoten commented Jan 2, 2019

View reviewed changes

nvanbenschoten force-pushed the nvanbenschoten/lazyPt1 branch from cae84f1 to 6357047 Compare January 3, 2019 17:16

nvanbenschoten commented Jan 3, 2019

View reviewed changes

nvanbenschoten force-pushed the nvanbenschoten/lazyPt1 branch 2 times, most recently from 05e8a16 to 3567e99 Compare January 3, 2019 22:32

bdarnell reviewed Jan 4, 2019

View reviewed changes

nvanbenschoten mentioned this pull request Jan 5, 2019

storage: tolerate missing transaction records when pushing #33523

Merged

andreimatei approved these changes Jan 7, 2019

View reviewed changes

nvanbenschoten added 4 commits January 7, 2019 16:47

storage: remove VersionClientSideWritingFlag

96b552a

The setting is baked into 2.2 binaries. Release note: None

storage: move tscache-related methods to new file

5110516

Updating and applying the timestamp cache are two sides of the same coin. It makes sense that these methods should live right next to each other. Release note: None

storage: remove setTxnAutoGC(true) in tests

ab4dffb

The global defaults to true. Release note: None

roachpb: ignore error timestamp when restarting aborted txn

7472973

The local hlc should have been advanced to at least the error's timestamp by the time PrepareTransactionForRetry is called. Release note: None

nvanbenschoten mentioned this pull request Jan 7, 2019

storage: BeginTxnRequest can be delayed long enough to expire #23945

Closed

nvanbenschoten force-pushed the nvanbenschoten/lazyPt1 branch from 3567e99 to 2a1ef61 Compare January 7, 2019 22:06

nvanbenschoten commented Jan 7, 2019

View reviewed changes

nvanbenschoten force-pushed the nvanbenschoten/lazyPt1 branch from 2a1ef61 to 8143b45 Compare January 7, 2019 22:09

craig bot merged commit 8143b45 into cockroachdb:master Jan 7, 2019

nvanbenschoten deleted the nvanbenschoten/lazyPt1 branch January 7, 2019 23:20

nvanbenschoten mentioned this pull request Feb 5, 2019

roachtest: tpcc-1000 roachtests have failing checks #34025

Closed

bdarnell mentioned this pull request Mar 4, 2019

kv, storage: reduce frequency of RETRY_ASYNC_WRITE_FAILURE failures #28876

Closed

ajwerner mentioned this pull request Oct 3, 2019

storage: remove error from Replica.applyTimestampCache() #41308

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: allow HeartbeatTxn and EndTxn requests to create txn records #33396

storage: allow HeartbeatTxn and EndTxn requests to create txn records #33396

nvanbenschoten commented Dec 28, 2018 •

edited

Loading

cockroach-teamcity commented Dec 28, 2018

tbg left a comment

bdarnell left a comment

andreimatei left a comment

nvanbenschoten left a comment

nvanbenschoten left a comment

nvanbenschoten commented Jan 3, 2019

bdarnell left a comment

nvanbenschoten commented Jan 7, 2019

andreimatei left a comment

nvanbenschoten left a comment

nvanbenschoten commented Jan 7, 2019

craig bot commented Jan 7, 2019

storage: allow HeartbeatTxn and EndTxn requests to create txn records #33396

storage: allow HeartbeatTxn and EndTxn requests to create txn records #33396

Conversation

nvanbenschoten commented Dec 28, 2018 • edited Loading

Migration Safety

cockroach-teamcity commented Dec 28, 2018

tbg left a comment

Choose a reason for hiding this comment

bdarnell left a comment

Choose a reason for hiding this comment

andreimatei left a comment

Choose a reason for hiding this comment

nvanbenschoten left a comment

Choose a reason for hiding this comment

nvanbenschoten left a comment

Choose a reason for hiding this comment

nvanbenschoten commented Jan 3, 2019

bdarnell left a comment

Choose a reason for hiding this comment

nvanbenschoten commented Jan 7, 2019

andreimatei left a comment

Choose a reason for hiding this comment

nvanbenschoten left a comment

Choose a reason for hiding this comment

nvanbenschoten commented Jan 7, 2019

craig bot commented Jan 7, 2019

Build succeeded

nvanbenschoten commented Dec 28, 2018 •

edited

Loading