txnwait: increase TxnLivenessThreshold significantly, reduce txn aborts under load #36748

nvanbenschoten · 2019-04-11T02:17:47Z

This change increases the duration between transaction heartbeats before a transaction is considered expired from 2 seconds to 5 seconds. This has been found to dramatically reduce the frequency of transaction aborts when a cluster is under significant load. This is not expected to noticeably hurt cluster availability in the presence of dead nodes because we already have availability loss on the order of 9 seconds due to the epoch-based lease duration.

This is especially important now that the hack in #25034 is gone. That hack was hiding some of this badness and giving transactions a bit more room to avoid being aborted.

Here's some testing with TPC-C 2300 on AWS with 3 c5d.4xlarge machines. I first ran with the the 2 second TxnLivenessThreshold and then with the 5 second:

The top graph is transaction aborts and the bottom graph is p99 service latency. During my tests I actually saw an even bigger difference most of the time. In this trial, leases were still moving around because I didn't use a ramp period (which skews the txn aborts when it stops). I ran another 15 minute run again half an hour later once the leases had balanced (again with a 5 second TxnLivenessThreshold) and saw the latencies I was more accustomed to seeing with the change:

cc. @danhhz, who picked this up back in January when monitoring the performance of an alpha release. I dug into the changes made in 8143b45 that seemed to be causing issues with transaction aborts but never developed a good theory for what could be causing them and had trouble reliably reproducing them. Little did I know that it wasn't what was added in 8143b45 that made the difference, it was what was removed.

The first commit is from #36729 and seems like a generally good change to make. There's no reason we should be sending async aborts if we already know that the transaction was finalized. This was probably also made a little worse by 8143b45 because we now check to see whether a transaction is commitable when its record is missing when evaluating a HeartbeatTxn requests. Before we would simply return a TransactionNotFoundStatusError, which was ignored.

The changes in cockroachdb#33396 made it so that a `HeartbeatTxnRequest` that finds a missing transaction record will attempt to create the record. This means that it will discover if a transaction is not "committable" and return a TransactionAbortedError. Unfortunately, after a transaction commits and GCs its transaction record, it will also be considered not "committable". There will always be cases where a heartbeat request races with an EndTransaction request and incorrectly considers the transaction aborted (which is touched upon in a TODO a few lines down). However, in many of these cases, the coordinator already knows that the transaction is committed, so it doesn't need to attempt to roll back the transaction and clean up its intents. This commit checks for these cases and avoids sending useless rollbacks. The intention is to backport this to 19.1. Release note: None

cockroach-teamcity · 2019-04-11T02:17:54Z

This change is

This change increases the duration between transaction heartbeats before a transaction is considered expired from 2 seconds to 5 seconds. This has been found to dramatically reduce the frequency of transaction aborts when a cluster is under significant load. This is not expected to noticeably hurt cluster availability in the presence of dead nodes because we already have availability loss on the order of 9 seconds due to the epoch-based lease duration. This is especially important now that the hack in cockroachdb#25034 is gone. That hack was hiding some of this badness and giving transactions a bit more room to avoid being aborted. Release note: None

nvanbenschoten · 2019-04-15T23:14:16Z

@tbg mind weighing in on this? I'd like to get this baking on master as soon as we feel comfortable. The PR itself is almost all changes to tests outside of switching the constant.

tbg

Reviewed 1 of 1 files at r1, 4 of 4 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

pkg/kv/txn_interceptor_heartbeater.go, line 396 at r1 (raw file):

	// If the txn is no longer pending, ignore the result of the heartbeat.
	if h.mu.txn.Status != roachpb.PENDING {

That's new, right? Just a cleanup?

nvanbenschoten

TFTR!

bors r+

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

pkg/kv/txn_interceptor_heartbeater.go, line 396 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

That's new, right? Just a cleanup?

Yeah, this is just a cleanup. The commit message for it is pretty detailed.

@danhhz

36748: txnwait: increase TxnLivenessThreshold significantly, reduce txn aborts under load r=nvanbenschoten a=nvanbenschoten This change increases the duration between transaction heartbeats before a transaction is considered expired from 2 seconds to 5 seconds. This has been found to dramatically reduce the frequency of transaction aborts when a cluster is under significant load. This is not expected to noticeably hurt cluster availability in the presence of dead nodes because we already have availability loss on the order of 9 seconds due to the epoch-based lease duration. This is especially important now that the hack in #25034 is gone. That hack was hiding some of this badness and giving transactions a bit more room to avoid being aborted. Here's some testing with TPC-C 2300 on AWS with 3 `c5d.4xlarge` machines. I first ran with the the 2 second TxnLivenessThreshold and then with the 5 second: <img width="872" alt="Screen Shot 2019-04-10 at 4 51 28 PM" src="https://user-images.githubusercontent.com/5438456/55926033-ae332c00-5bdd-11e9-8d6d-184ba1d5ffcd.png"> The top graph is transaction aborts and the bottom graph is p99 service latency. During my tests I actually saw an even bigger difference most of the time. In this trial, leases were still moving around because I didn't use a ramp period (which skews the txn aborts when it stops). I ran another 15 minute run again half an hour later once the leases had balanced (again with a 5 second TxnLivenessThreshold) and saw the latencies I was more accustomed to seeing with the change: <img width="869" alt="Screen Shot 2019-04-10 at 5 59 19 PM" src="https://user-images.githubusercontent.com/5438456/55926104-09651e80-5bde-11e9-9540-9d7c697a9181.png"> cc. @danhhz, who picked this up back in January when monitoring the performance of an alpha release. I dug into the changes made in 8143b45 that seemed to be causing issues with transaction aborts but never developed a good theory for what could be causing them and had trouble reliably reproducing them. Little did I know that it wasn't what was added in 8143b45 that made the difference, it was what [was removed](8143b45#diff-3c31a6497771c33db423a4cead092979L130). The first commit is from #36729 and seems like a generally good change to make. There's no reason we should be sending async aborts if we already know that the transaction was finalized. This was probably also made a little worse by 8143b45 because we now check to see whether a transaction is commitable when its record is missing when evaluating a `HeartbeatTxn` requests. Before we would simply return a `TransactionNotFoundStatusError`, which was ignored. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

craig · 2019-04-16T22:23:40Z

Build succeeded

GitHub CI (Cockroach)

@solongordon

37199: storage: propagate errors from contentionQueue, catch stalls in roachtest r=nvanbenschoten a=nvanbenschoten Informs #36089. The PR is split into a series of commits. The first fixes part of a bug that was causing #36089 to fail (thanks to #36748) and the second improves the test to have a more obvious failure condition for this class of bug in the future. The third, fifth, and sixth clean up code. Finally, the fourth fixes another bug that could cause issues with #36089. Before the first commit, requests could get stuck repeatedly attempting to push a transaction only to repeatedly find that they themselves were already aborted. The error would not propagate up to the transaction coordinator and the request would get stuck. This commit fixes this behavior by correctly propagating errors observed by the `contentionQueue`. The second commit bumps the TxnLivenessThreshold for clusters running `kv/contention/nodes=4` to 10 minutes. This is sufficiently large such that if at any point a transaction is abandoned then all other transactions will begin waiting for it, throughput will drop to 0 for 10 straight minutes, and the test will fail to achieve its minimum QPS requirement. The fourth commit instructs pushers in the `txnwait.Queue` to inform all other pushers that are waiting for the same transaction when it observes an ABORTED transaction. I never saw this cause issues with #36089, but it seems very possible that it could given frequent tscache rotations. 38397: sql: deflake TestLogic//crdb_internal/max_retry_counter r=knz a=knz Fixes #38062. Release note: None 38654: exec: Handle NULLS in TopK sorter r=rohany a=rohany This commit fixes NULLs in the TopK sorter by avoiding use of the vec copy method, which has a bug. Instead, we add a set method to the vec comparator, and use the templatized comparator to perform the sets that the TopK sorter needs. To facilitate this, we add an UnsetNull method to the Nulls object. However, use of this method results in HasNull() maybe returning true even if the vector doesn't have nulls. This behavior already occurs when selection vectors are used. Based on discussions with @solongordon and @asubiotto, this behavior is OK, and future PR's will attempt to make this behavior better, and address the bugs within the Vec Copy method. 38725: cli/dump: more clearly inform the user upon tables with no visible columns r=knz a=knz Informs #37768. Informs #28948. This is coming up quite often on support, lately again on gitter and forum https://forum.cockroachlabs.com/t/error-while-dumping-core-backup/2901/3. This PR aims to lessen the burden on support and propose a clear "next action" for the user. Before: ``` kena@kenax ~/cockroach % ./cockroach dump --insecure defaultdb CREATE TABLE t (, FAMILY "primary" (rowid) ); Error: pq: at or near "from": syntax error Failed running "dump" ``` After: ``` kena@kenax ~/cockroach % ./cockroach dump --insecure defaultdb CREATE TABLE t (, FAMILY "primary" (rowid) ); Error: table "defaultdb.public.t" has no visible columns HINT: To proceed with the dump, either omit this table from the list of tables to dump, drop the table, or add some visible columns. -- See: #37768 Failed running "dump" ``` Release note (cli change): `cockroach dump` will now more clearly refer to issue #37768 when it encounters a table with no visible columns, which (currently) cannot be dumped successfully. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com> Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>

nvanbenschoten requested review from tbg and a team April 11, 2019 02:17

nvanbenschoten force-pushed the nvanbenschoten/reduceTxnAborts2 branch 2 times, most recently from 1410680 to 385e082 Compare April 14, 2019 22:14

nvanbenschoten force-pushed the nvanbenschoten/reduceTxnAborts2 branch from 385e082 to 9147c58 Compare April 15, 2019 14:38

ajwerner mentioned this pull request Apr 16, 2019

high latencies when running tpc-c 2100 on the same cluster as previous schema change testing #36856

Closed

tbg approved these changes Apr 16, 2019

View reviewed changes

nvanbenschoten commented Apr 16, 2019

View reviewed changes

craig bot merged commit 9147c58 into cockroachdb:master Apr 16, 2019

nvanbenschoten deleted the nvanbenschoten/reduceTxnAborts2 branch April 17, 2019 17:51

This was referenced Apr 29, 2019

roachtest: kv/contention/nodes=4 failed #36089

Closed

storage: propagate errors from contentionQueue, catch stalls in roachtest #37199

Merged

ajwerner mentioned this pull request May 9, 2019

Large number of KV transaction restarts + 15s KV transaction duration lead to poor performance #35290

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

txnwait: increase TxnLivenessThreshold significantly, reduce txn aborts under load #36748

txnwait: increase TxnLivenessThreshold significantly, reduce txn aborts under load #36748

nvanbenschoten commented Apr 11, 2019

cockroach-teamcity commented Apr 11, 2019

nvanbenschoten commented Apr 15, 2019

tbg left a comment

nvanbenschoten left a comment

craig bot commented Apr 16, 2019

txnwait: increase TxnLivenessThreshold significantly, reduce txn aborts under load #36748

txnwait: increase TxnLivenessThreshold significantly, reduce txn aborts under load #36748

Conversation

nvanbenschoten commented Apr 11, 2019

cockroach-teamcity commented Apr 11, 2019

nvanbenschoten commented Apr 15, 2019

tbg left a comment

Choose a reason for hiding this comment

nvanbenschoten left a comment

Choose a reason for hiding this comment

craig bot commented Apr 16, 2019

Build succeeded