[WIP] kv: improve transaction heartbeats' interaction with concurrent commits #36729

nvanbenschoten · 2019-04-10T19:26:10Z

This change includes a series of improvements to txnHeartbeater that make it more effectively handle cases where a HeartbeatTxn request races with a committing EndTransaction request. There's more I'd like to do here, including:

unit test the logic in txnHeartbeater.heartbeat directly
make the transition from "probably ABORTED" to "COMMITTED" more explicit in the txn coordinator

I'm mostly leaving this as a WIP because it can wait on #35224, at which point I'll extend it with tests.

The changes in cockroachdb#33396 made it so that a `HeartbeatTxnRequest` that finds a missing transaction record will attempt to create the record. This means that it will discover if a transaction is not "committable" and return a TransactionAbortedError. Unfortunately, after a transaction commits and GCs its transaction record, it will also be considered not "committable". There will always be cases where a heartbeat request races with an EndTransaction request and incorrectly considers the transaction aborted (which is touched upon in a TODO a few lines down). However, in many of these cases, the coordinator already knows that the transaction is committed, so it doesn't need to attempt to roll back the transaction and clean up its intents. This commit checks for these cases and avoids sending useless rollbacks. The intention is to backport this to 19.1. Release note: None

This was not used in 19.1, so we can delete it in 19.2. The plan is not to backport this change. Release note: None

This commit addresses a TODO to return TransactionAbortedErrors from HeartbeatTxnRequests when an ABORTED transaction record is found. This allows us to unify the logic in txnHeartbeater about dealing with aborted transactions. Release note: None

This commit adapts the `txnHeartbeater` to look at the reason of a `TransactionAbortedError` to optimistically detect cases where a HeartbeatTxn request races with a committing EndTransaction request that GCs its transaction record. This check has false negatives because its using the write timestamp cache, which is why we can't rely on it for correctness. All of these changes are revealing that we need a better way to test the `txnHeartbeater`. Ideally we'd be able to mock out the response it gets from HeartbeatTxn requests and assert its behavior. Release note: None

Release note: None

cockroach-teamcity · 2019-04-10T19:26:20Z

This change is

@danhhz

36748: txnwait: increase TxnLivenessThreshold significantly, reduce txn aborts under load r=nvanbenschoten a=nvanbenschoten This change increases the duration between transaction heartbeats before a transaction is considered expired from 2 seconds to 5 seconds. This has been found to dramatically reduce the frequency of transaction aborts when a cluster is under significant load. This is not expected to noticeably hurt cluster availability in the presence of dead nodes because we already have availability loss on the order of 9 seconds due to the epoch-based lease duration. This is especially important now that the hack in #25034 is gone. That hack was hiding some of this badness and giving transactions a bit more room to avoid being aborted. Here's some testing with TPC-C 2300 on AWS with 3 `c5d.4xlarge` machines. I first ran with the the 2 second TxnLivenessThreshold and then with the 5 second: <img width="872" alt="Screen Shot 2019-04-10 at 4 51 28 PM" src="https://user-images.githubusercontent.com/5438456/55926033-ae332c00-5bdd-11e9-8d6d-184ba1d5ffcd.png"> The top graph is transaction aborts and the bottom graph is p99 service latency. During my tests I actually saw an even bigger difference most of the time. In this trial, leases were still moving around because I didn't use a ramp period (which skews the txn aborts when it stops). I ran another 15 minute run again half an hour later once the leases had balanced (again with a 5 second TxnLivenessThreshold) and saw the latencies I was more accustomed to seeing with the change: <img width="869" alt="Screen Shot 2019-04-10 at 5 59 19 PM" src="https://user-images.githubusercontent.com/5438456/55926104-09651e80-5bde-11e9-9540-9d7c697a9181.png"> cc. @danhhz, who picked this up back in January when monitoring the performance of an alpha release. I dug into the changes made in 8143b45 that seemed to be causing issues with transaction aborts but never developed a good theory for what could be causing them and had trouble reliably reproducing them. Little did I know that it wasn't what was added in 8143b45 that made the difference, it was what [was removed](8143b45#diff-3c31a6497771c33db423a4cead092979L130). The first commit is from #36729 and seems like a generally good change to make. There's no reason we should be sending async aborts if we already know that the transaction was finalized. This was probably also made a little worse by 8143b45 because we now check to see whether a transaction is commitable when its record is missing when evaluating a `HeartbeatTxn` requests. Before we would simply return a `TransactionNotFoundStatusError`, which was ignored. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

nvanbenschoten added 5 commits April 10, 2019 13:52

kv: remove TransactionStatusError_REASON_TXN_NOT_FOUND

bfe0324

This was not used in 19.1, so we can delete it in 19.2. The plan is not to backport this change. Release note: None

kv: add TODO to fix ABORTED -> COMMITTED transition after heartbeat

1ac47ea

Release note: None

nvanbenschoten requested review from andreimatei and a team April 10, 2019 19:26

nvanbenschoten mentioned this pull request Apr 11, 2019

txnwait: increase TxnLivenessThreshold significantly, reduce txn aborts under load #36748

Merged

tbg added the X-noremind Bots won't notify about PRs with X-noremind label Jun 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] kv: improve transaction heartbeats' interaction with concurrent commits #36729

[WIP] kv: improve transaction heartbeats' interaction with concurrent commits #36729

nvanbenschoten commented Apr 10, 2019

cockroach-teamcity commented Apr 10, 2019

[WIP] kv: improve transaction heartbeats' interaction with concurrent commits #36729

Are you sure you want to change the base?

[WIP] kv: improve transaction heartbeats' interaction with concurrent commits #36729

Conversation

nvanbenschoten commented Apr 10, 2019

cockroach-teamcity commented Apr 10, 2019