Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

txnwait: increase TxnLivenessThreshold significantly, reduce txn aborts under load #36748

Merged

Conversation

nvanbenschoten
Copy link
Member

This change increases the duration between transaction heartbeats before a transaction is considered expired from 2 seconds to 5 seconds. This has been found to dramatically reduce the frequency of transaction aborts when a cluster is under significant load. This is not expected to noticeably hurt cluster availability in the presence of dead nodes because we already have availability loss on the order of 9 seconds due to the epoch-based lease duration.

This is especially important now that the hack in #25034 is gone. That hack was hiding some of this badness and giving transactions a bit more room to avoid being aborted.

Here's some testing with TPC-C 2300 on AWS with 3 c5d.4xlarge machines. I first ran with the the 2 second TxnLivenessThreshold and then with the 5 second:

Screen Shot 2019-04-10 at 4 51 28 PM

The top graph is transaction aborts and the bottom graph is p99 service latency. During my tests I actually saw an even bigger difference most of the time. In this trial, leases were still moving around because I didn't use a ramp period (which skews the txn aborts when it stops). I ran another 15 minute run again half an hour later once the leases had balanced (again with a 5 second TxnLivenessThreshold) and saw the latencies I was more accustomed to seeing with the change:

Screen Shot 2019-04-10 at 5 59 19 PM

cc. @danhhz, who picked this up back in January when monitoring the performance of an alpha release. I dug into the changes made in 8143b45 that seemed to be causing issues with transaction aborts but never developed a good theory for what could be causing them and had trouble reliably reproducing them. Little did I know that it wasn't what was added in 8143b45 that made the difference, it was what was removed.

The first commit is from #36729 and seems like a generally good change to make. There's no reason we should be sending async aborts if we already know that the transaction was finalized. This was probably also made a little worse by 8143b45 because we now check to see whether a transaction is commitable when its record is missing when evaluating a HeartbeatTxn requests. Before we would simply return a TransactionNotFoundStatusError, which was ignored.

The changes in cockroachdb#33396 made it so that a `HeartbeatTxnRequest` that finds
a missing transaction record will attempt to create the record. This
means that it will discover if a transaction is not "committable" and
return a TransactionAbortedError. Unfortunately, after a transaction
commits and GCs its transaction record, it will also be considered not
"committable".

There will always be cases where a heartbeat request races with an
EndTransaction request and incorrectly considers the transaction
aborted (which is touched upon in a TODO a few lines down). However,
in many of these cases, the coordinator already knows that the transaction
is committed, so it doesn't need to attempt to roll back the transaction
and clean up its intents.

This commit checks for these cases and avoids sending useless rollbacks.
The intention is to backport this to 19.1.

Release note: None
@nvanbenschoten nvanbenschoten requested review from tbg and a team April 11, 2019 02:17
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/reduceTxnAborts2 branch 2 times, most recently from 1410680 to 385e082 Compare April 14, 2019 22:14
This change increases the duration between transaction heartbeats
before a transaction is considered expired from 2 seconds to 5 seconds.
This has been found to dramatically reduce the frequency of transaction
aborts when a cluster is under significant load. This is not expected
to noticeably hurt cluster availability in the presence of dead nodes
because we already have availability loss on the order of 9 seconds due
to the epoch-based lease duration.

This is especially important now that the hack in cockroachdb#25034 is gone. That
hack was hiding some of this badness and giving transactions a bit more
room to avoid being aborted.

Release note: None
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/reduceTxnAborts2 branch from 385e082 to 9147c58 Compare April 15, 2019 14:38
@nvanbenschoten
Copy link
Member Author

@tbg mind weighing in on this? I'd like to get this baking on master as soon as we feel comfortable. The PR itself is almost all changes to tests outside of switching the constant.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1, 4 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/txn_interceptor_heartbeater.go, line 396 at r1 (raw file):

	// If the txn is no longer pending, ignore the result of the heartbeat.
	if h.mu.txn.Status != roachpb.PENDING {

That's new, right? Just a cleanup?

Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/txn_interceptor_heartbeater.go, line 396 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

That's new, right? Just a cleanup?

Yeah, this is just a cleanup. The commit message for it is pretty detailed.

craig bot pushed a commit that referenced this pull request Apr 16, 2019
36748: txnwait: increase TxnLivenessThreshold significantly, reduce txn aborts under load r=nvanbenschoten a=nvanbenschoten

This change increases the duration between transaction heartbeats before a transaction is considered expired from 2 seconds to 5 seconds. This has been found to dramatically reduce the frequency of transaction aborts when a cluster is under significant load. This is not expected to noticeably hurt cluster availability in the presence of dead nodes because we already have availability loss on the order of 9 seconds due to the epoch-based lease duration.

This is especially important now that the hack in #25034 is gone. That hack was hiding some of this badness and giving transactions a bit more room to avoid being aborted.

Here's some testing with TPC-C 2300 on AWS with 3 `c5d.4xlarge` machines. I first ran with the the 2 second TxnLivenessThreshold and then with the 5 second:

<img width="872" alt="Screen Shot 2019-04-10 at 4 51 28 PM" src="https://user-images.githubusercontent.com/5438456/55926033-ae332c00-5bdd-11e9-8d6d-184ba1d5ffcd.png">

The top graph is transaction aborts and the bottom graph is p99 service latency. During my tests I actually saw an even bigger difference most of the time. In this trial, leases were still moving around because I didn't use a ramp period (which skews the txn aborts when it stops). I ran another 15 minute run again half an hour later once the leases had balanced (again with a 5 second TxnLivenessThreshold) and saw the latencies I was more accustomed to seeing with the change:

<img width="869" alt="Screen Shot 2019-04-10 at 5 59 19 PM" src="https://user-images.githubusercontent.com/5438456/55926104-09651e80-5bde-11e9-9540-9d7c697a9181.png">

cc. @danhhz, who picked this up back in January when monitoring the performance of an alpha release. I dug into the changes made in 8143b45 that seemed to be causing issues with transaction aborts but never developed a good theory for what could be causing them and had trouble reliably reproducing them. Little did I know that it wasn't what was added in 8143b45 that made the difference, it was what [was removed](8143b45#diff-3c31a6497771c33db423a4cead092979L130).

The first commit is from #36729 and seems like a generally good change to make. There's no reason we should be sending async aborts if we already know that the transaction was finalized. This was probably also made a little worse by 8143b45 because we now check to see whether a transaction is commitable when its record is missing when evaluating a `HeartbeatTxn` requests. Before we would simply return a `TransactionNotFoundStatusError`, which was ignored.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig
Copy link
Contributor

craig bot commented Apr 16, 2019

Build succeeded

@craig craig bot merged commit 9147c58 into cockroachdb:master Apr 16, 2019
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/reduceTxnAborts2 branch April 17, 2019 17:51
craig bot pushed a commit that referenced this pull request Jul 8, 2019
37199: storage: propagate errors from contentionQueue, catch stalls in roachtest r=nvanbenschoten a=nvanbenschoten

Informs #36089.

The PR is split into a series of commits. The first fixes part of a bug that was causing #36089  to fail (thanks to #36748) and the second improves the test to have a more obvious failure condition for this class of bug in the future. The third, fifth, and sixth clean up code. Finally, the fourth fixes another bug that could cause issues with #36089.

Before the first commit, requests could get stuck repeatedly attempting to push a transaction only to repeatedly find that they themselves were already aborted. The error would not propagate up to the transaction coordinator and the request would get stuck. This commit fixes this behavior by correctly propagating errors observed by the `contentionQueue`.

The second commit bumps the TxnLivenessThreshold for clusters running `kv/contention/nodes=4` to 10 minutes. This is sufficiently large such that if at any point a transaction is abandoned then all other transactions will begin waiting for it, throughput will drop to 0 for 10 straight minutes, and the test will fail to achieve its minimum QPS requirement.

The fourth commit instructs pushers in the `txnwait.Queue` to inform all other pushers that are waiting for the same transaction when it observes an ABORTED transaction. I never saw this cause issues with #36089, but it seems very possible that it could given frequent tscache rotations.

38397: sql: deflake TestLogic//crdb_internal/max_retry_counter r=knz a=knz

Fixes #38062.

Release note: None

38654: exec: Handle NULLS in TopK sorter r=rohany a=rohany

This commit fixes NULLs in the TopK sorter by avoiding use
of the vec copy method, which has a bug. Instead, we add
a set method to the vec comparator, and use the templatized
comparator to perform the sets that the TopK sorter needs.

To facilitate this, we add an UnsetNull method to the Nulls
object. However, use of this method results in HasNull()
maybe returning true even if the vector doesn't have nulls.
This behavior already occurs when selection vectors are used.
Based on discussions with @solongordon and @asubiotto, this behavior
is OK, and future PR's will attempt to make this behavior better, and address
the bugs within the Vec Copy method.

38725: cli/dump: more clearly inform the user upon tables with no visible columns r=knz a=knz

Informs #37768.
Informs #28948.
This is coming up quite often on support, lately again on gitter and forum https://forum.cockroachlabs.com/t/error-while-dumping-core-backup/2901/3.

This PR aims to lessen the burden on support and propose a clear "next action" for the user.

Before:

```
kena@kenax ~/cockroach % ./cockroach dump --insecure defaultdb
CREATE TABLE t (,
        FAMILY "primary" (rowid)
);
Error: pq: at or near "from": syntax error
Failed running "dump"
```

After:

```
kena@kenax ~/cockroach % ./cockroach dump --insecure defaultdb
CREATE TABLE t (,
        FAMILY "primary" (rowid)
);
Error: table "defaultdb.public.t" has no visible columns
HINT: To proceed with the dump, either omit this table from the list of tables to dump, drop the table, or add some visible columns.
--
See: #37768
Failed running "dump"
```

Release note (cli change): `cockroach dump` will now more clearly
refer to issue #37768 when it encounters a table with no visible
columns, which (currently) cannot be dumped successfully.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants