release-19.2: storage/txnwait: terminate push when pusher aborted at lower epoch #45664

Merged

nvanbenschoten

Backport 1/1 commits from #45603.

/cc @cockroachdb/release


Closes #40786.
Closes #44336.

This commit resolves a bug in distributed deadlock detection that would
allow a deadlock between transactions to go undetected, stalling the
workload indefinitely.

The issue materialized as follows:

  1. two transactions would deadlock and each enter a txnwait queue
  2. they would poll their pushee's record along with their own
  3. deadlock detection would eventually pick this up and abort one of the txns
    using the pusher's copy of the txn's proto
  4. however, the aborted txn had since restarted and bumped its epoch
  5. the aborted txn continued to query its record, but failed to ingest any
    updates from it because the record was at a lower epoch than its own
    copy of its txn proto. So it never noticed that it was ABORTED
  6. all other txns in the system, including the original contending txn,
    piled up behind the aborted txn in the contention queue, waiting for
    it to notice it was aborted and exit the queue
  7. deadlock!
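
The epoch mismatch in steps 4 and 5 can be sketched in a few lines of Go. This is only an illustration under loud assumptions: `Txn`, `updateNaive`, and `pusherAborted` are hypothetical names invented here, not CockroachDB's types or the code changed by this PR. The sketch shows how an update rule that ignores lower-epoch records hides an ABORTED status, and how checking the queried record's status directly (one plausible reading of "terminate push when pusher aborted at lower epoch") avoids that.

```go
package main

import "fmt"

// Status is a hypothetical, simplified transaction status.
type Status int

const (
	Pending Status = iota
	Aborted
)

// Txn is a hypothetical stand-in for a transaction proto. It is not
// CockroachDB's actual type; it carries only what the sketch needs.
type Txn struct {
	ID     string
	Epoch  int
	Status Status
}

// updateNaive models the problematic behavior from step 5: a record at a
// lower epoch than the local copy is ignored entirely, so its ABORTED
// status is never ingested.
func (t *Txn) updateNaive(rec Txn) {
	if rec.ID != t.ID || rec.Epoch < t.Epoch {
		return
	}
	t.Epoch = rec.Epoch
	t.Status = rec.Status
}

// pusherAborted models the assumed remedy: look at the status of the
// queried record itself, regardless of how its epoch compares to the
// local copy.
func pusherAborted(local, rec Txn) bool {
	return rec.ID == local.ID && rec.Status == Aborted
}

func main() {
	// The pusher restarted and bumped its epoch to 2 locally, while its
	// record was aborted at epoch 1 by deadlock detection.
	local := Txn{ID: "txn-a", Epoch: 2, Status: Pending}
	record := Txn{ID: "txn-a", Epoch: 1, Status: Aborted}

	local.updateNaive(record)
	fmt.Println("noticed abort via update:", local.Status == Aborted)
	fmt.Println("noticed abort via record status:", pusherAborted(local, record))
}
```

In this sketch the local copy stays PENDING after the naive update, which corresponds to the pusher never leaving the queue, while the direct status check reports the abort.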

I'm optimistically closing the two kv/contention/nodes=4 issues both
because I hope this is the cause of their recent troubles and also because
I've been spending a lot of time with the test recently in light of #45482
and plan to stabilize it fully.

I plan to backport this to release-19.2. This doesn't need to go all the
way back to release-19.1 because the bug was introduced in aed892a.

Release note (bug fix): A bug causing distributed deadlock detection
between transactions to stall and fail to resolve a deadlock was
addressed.
@nvanbenschoten nvanbenschoten requested a review from tbg March 3, 2020 20:26
@nvanbenschoten nvanbenschoten merged commit 65de26e into cockroachdb:release-19.2 Mar 4, 2020
@nvanbenschoten nvanbenschoten deleted the backport19.2-45603 branch March 4, 2020 19:03