release-19.2: storage/txnwait: terminate push when pusher aborted at lower epoch #45664

nvanbenschoten · 2020-03-03T20:26:51Z

Backport 1/1 commits from #45603.

/cc @cockroachdb/release

This commit resolves a bug in distributed deadlock detection that would
allow a deadlock between transactions to go undetected, stalling the
workload indefinitely.

The issue materialized as follows:

1. two transactions would deadlock and each enter a txnwait queue
2. they would poll their pushee's record along with their own
3. deadlock detection would eventually pick this up and abort one of the txns
   using the pusher's copy of the txn's proto
4. however, the aborted txn has since restarted and bumped its epoch
5. the aborted txn continued to query its record, but failed to ingest any
   updates from it because the record was at a lower epoch than its own
   copy of its txn proto. So it never noticed that it was ABORTED
6. all other txns in the system, including the original contending txn,
   piled up behind the aborted txn in the contention queue, waiting for
   it to notice it was aborted and exit the queue
7. deadlock!
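
The epoch comparison at the heart of the sequence above can be sketched in a few lines of Go. This is only an illustration under stated assumptions, not the code changed by this PR: `Txn`, `TxnStatus`, `ingestIfNotOlder`, and `pusherAborted` are hypothetical names, and the real txn proto and txnwait queue carry far more state. The sketch contrasts an update path that drops a lower-epoch record wholesale with a check that looks at the record's status regardless of epoch, which is the behavior the PR title describes.

```go
package main

import "fmt"

// TxnStatus is a simplified stand-in for a transaction record's status.
type TxnStatus int

const (
	Pending TxnStatus = iota
	Aborted
)

// Txn is a hypothetical, pared-down txn proto holding only the fields
// needed to illustrate the epoch comparison.
type Txn struct {
	Epoch  int32
	Status TxnStatus
}

// ingestIfNotOlder models the problematic path: a record at a lower epoch
// than the local copy is ignored entirely, including its ABORTED status.
func ingestIfNotOlder(local *Txn, record Txn) {
	if record.Epoch < local.Epoch {
		return // the abort recorded at the older epoch is never seen
	}
	local.Epoch = record.Epoch
	local.Status = record.Status
}

// pusherAborted models the intended check: the pusher stops waiting as soon
// as its queried record is ABORTED, even if that record is at a lower epoch.
func pusherAborted(record Txn) bool {
	return record.Status == Aborted
}

func main() {
	// The pusher restarted and now holds epoch 2 locally, while its record
	// was aborted using a copy of the proto from epoch 1.
	local := Txn{Epoch: 2, Status: Pending}
	record := Txn{Epoch: 1, Status: Aborted}

	ingestIfNotOlder(&local, record)
	fmt.Println("epoch-gated ingest notices abort:", local.Status == Aborted)
	fmt.Println("status check notices abort:", pusherAborted(record))
}
```

With the epoch-gated ingest alone, the local copy stays PENDING and the pusher keeps waiting in the queue; checking the queried record's status directly lets it terminate the push.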

I'm optimistically closing the two kv/contention/nodes=4 issues both
because I hope this is the cause of their recent troubles and also because
I've been spending a lot of time with the test recently in light of #45482
and plan to stabilize it fully.

I plan to backport this to release-19.2. This doesn't need to go all the
way back to release-19.1 because the bug was introduced in aed892a.

Release note (bug fix): A bug causing distributed deadlock detection between transactions to stall and fail to resolve a deadlock was addressed.

nvanbenschoten requested a review from tbg March 3, 2020 20:26

tbg approved these changes Mar 4, 2020

nvanbenschoten merged commit 65de26e into cockroachdb:release-19.2 Mar 4, 2020

nvanbenschoten deleted the backport19.2-45603 branch March 4, 2020 19:03
