Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: stop modifying requests and returning responses from TxnWaitQueue #43383

Merged

Conversation

nvanbenschoten
Copy link
Member

This PR contains three pieces of cleanup that pave the way for pulling the TxnWaitQueue underneath the upcoming pkg/storage/concurrency package. It removes all cases where the queue mandates an update to the PushTxnRequest that entered it. It also removes all cases where the queue could field requests directly and return results for them. This restricts the TxnWaitQueue's interface and allows it to be pushed into the concurrency manager abstraction without bloating this new abstraction.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong: I quite like this change. It simplifies the role of the txnwait queue nicely.

I ran YCSB A at a few concurrency levels and saw no difference in performance.

That's encouraging enough for me.

Reviewed 4 of 4 files at r1, 4 of 4 files at r2, 5 of 5 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/storage/batcheval/cmd_push_txn.go, line 128 at r1 (raw file):

	now := cArgs.EvalCtx.Clock().Now()
	if now.Less(h.Timestamp) {
		// The batch's timestamp should have been used to udpate the clock.

udpate


pkg/storage/txnwait/queue.go, line 653 at r2 (raw file):

					)
					metrics.DeadlocksTotal.Inc(1)
					// TODO(nvanbenschoten): As it, it would make sense to return

If for whatever reason you back off of the last commit: s/As it/As is/

This commit removes the use of a PushRequest's batch header timestamp
to determine whether a pushee has expired or not. This is the first
step in a refactor to remove cases where we mutate requests on the
server before command evaluation.

There is no migration needed for this change because it's impact
is local to a single node - the leaseholder evaluating the request.

Release note: None
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR! I actually decided to peel the entire final commit off for now. I don't think it's too much of an issue for the concurrency package to have the ability to field requests directly with responses where possible, as long as it's never mutating requests.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/storage/batcheval/cmd_push_txn.go, line 128 at r1 (raw file):

Previously, ajwerner wrote…

udpate

Done.


pkg/storage/txnwait/queue.go, line 653 at r2 (raw file):

Previously, ajwerner wrote…

If for whatever reason you back off of the last commit: s/As it/As is/

Removed the entire commit.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 8 files at r4, 4 of 4 files at r5.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/storage/txnwait/queue_test.go, line 31 at r5 (raw file):

func TestShouldPushImmediately(t *testing.T) {
	// TODO WIP

did you want to do anything about this?

This commit removes the interaction between the TxnWaitQueue and the
Replica-level retry loop when transaction deadlocks are detected. Instead
of instructing a waiting PushTxnRequest to change to a force push request,
the queue now launches push abort directly. This is cleaner and avoids
the need to mutate the input batch in Replica.maybeWaitForPushee.

Release note: None
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/storage/txnwait/queue_test.go, line 31 at r5 (raw file):

Previously, ajwerner wrote…

did you want to do anything about this?

Done.

@nvanbenschoten
Copy link
Member Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 2, 2020

Build failed

@nvanbenschoten
Copy link
Member Author

F200102 20:05:04.162081 1531172 sqlmigrations/migrations.go:531  [n1] not enough time left on migration lease, terminating for safety

Huh, looks like a resurfacing of #21444, which we never tracked down but were able to correlate with slow TeamCity workers. We do see some serious slowdown in the logs of this test, like handle raft ready: 16.5s.

I'm not able to reproduce with stress or stressrace, so I'm going to chalk this up to a flake for now.

bors r+

craig bot pushed a commit that referenced this pull request Jan 2, 2020
42765: storage: write to AbortSpan less, clean up AbortSpan more r=nvanbenschoten a=nvanbenschoten

This PR introduces two improvements to our handling of the AbortSpan. It also introduces the first set of comprehensive testing around AbortSpan entry state transitions, which was sorely missing. 

This comes after a few different customer issues that at least at first glance appeared to be AbortSpan leaks. There's still more to do to resolve those, mostly by improving GC, but anything we can do here on the frontend to reduce the number of AbortSpan entries that need to be GCed in the first place helps.

### Clean up span on non-poisoning, aborting EndTransaction request

Fixes #29128.

Before this change, an EndTransaction request sent to rollback a transaction record would not remove any AbortSpan entries, even if its own Poison flag was set to false. This allowed AbortSpan entries to leak. This commit fixes this behavior by removing the AbortSpan entry in this case.

There were concerns raised in #29128 about this being safe. It turns out that we already do this on every Range except the transaction record's Range during normal intent resolution, so this isn't introducing any new concerns.

### Only write AbortSpan entries if intents are removed

This reduces the frequency of AbortSpan entries that can be abandoned even without a transaction coordinator failure. Specifically, it protects against the case where intent resolution races with a transaction coordinator cleaning up its own transaction record and intents. This can happen for both aborted and committed transactions.

In the first case, a pusher might find a transaction's intent and then find its record to be aborted after that transaction had cleanly rolled back its own intents. Even though the transaction's coordinator had already cleaned up and potentially "unpoisoned" AbortSpans, the pusher would happily re-introduce AbortSpan records when it goes to resolve the intents that were already cleaned up. These AbortSpan entries would be fully abandoned and have to wait out the GC.

Similarly, in the second case, the transaction might have committed. Here, the pushee might hit an intent and the txn coordinator might clean up and auto-GC its txn record before the pushee arrives at the txn record. Once the pushee gets there, it would mistake the txn for aborted (by design) and proceed to write an AbortSpan record where the intent it had once observed had been (not by design).

We can tell both of these cases by simply recognizing whether intent resolution actually succeeds. If intent resolution doesn't find an intent, then we might be in either case. That's fine, because we only need to ever poison the abort span if we actually remove an intent that could confuse a zombie transaction.

43383: storage: stop modifying requests and returning responses from TxnWaitQueue r=nvanbenschoten a=nvanbenschoten

This PR contains three pieces of cleanup that pave the way for pulling the TxnWaitQueue underneath the upcoming `pkg/storage/concurrency` package. It removes all cases where the queue mandates an update to the PushTxnRequest that entered it. It also removes all cases where the queue could field requests directly and return results for them. This restricts the TxnWaitQueue's interface and allows it to be pushed into the concurrency manager abstraction without bloating this new abstraction.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@nvanbenschoten
Copy link
Member Author

#41482 (comment) just hit the same issue, so something's up, but it's not due to this PR.

@craig
Copy link
Contributor

craig bot commented Jan 2, 2020

Build succeeded

And happy new year from bors! 🎉

@craig craig bot merged commit a311978 into cockroachdb:master Jan 2, 2020
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/pushTxnNoUpdate branch January 2, 2020 21:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants