release-19.2: storage/txnwait: terminate push when pusher aborted at lower epoch #45664
Backport 1/1 commits from #45603.
/cc @cockroachdb/release
Closes #40786.
Closes #44336.
This commit resolves a bug in distributed deadlock detection that would
allow a deadlock between transactions to go undetected, stalling the
workload indefinitely.
The issue materialized as follows:
using the pusher's copy of the txn's proto
updates from it because the record was at a lower epoch than its own
copy of its txn proto. So it never noticed that it was ABORTED
piled up behind the aborted txn in the contention queue, waiting for
it to notice it was aborted and exit the queue
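As a rough illustration of the epoch check at the heart of this bug, here is a minimal sketch in Go. It is not the actual txnwait code: `Status`, `TxnRecord`, `buggyShouldTerminate`, and `fixedShouldTerminate` are hypothetical names invented for this example. It only assumes what the title and description state, namely that the pusher ignored a record at a lower epoch than its own proto, and that the fix is to terminate the push when the pusher is seen to be ABORTED even at a lower epoch.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for a transaction status.
type Status int

const (
	Pending Status = iota
	Aborted
)

// TxnRecord is a hypothetical, simplified view of a transaction:
// only the fields needed to illustrate the epoch comparison.
type TxnRecord struct {
	Epoch  int
	Status Status
}

// buggyShouldTerminate models the problematic behavior: an observed
// record is ignored whenever its epoch is lower than the pusher's own
// copy, so an ABORTED status at a lower epoch is never noticed.
func buggyShouldTerminate(own, observed TxnRecord) bool {
	if observed.Epoch < own.Epoch {
		return false // stale epoch: update dropped, abort missed
	}
	return observed.Status == Aborted
}

// fixedShouldTerminate models the intended behavior: ABORTED is
// checked before any epoch comparison, so the push ends regardless
// of the epoch at which the abort was recorded.
func fixedShouldTerminate(own, observed TxnRecord) bool {
	return observed.Status == Aborted
}

func main() {
	own := TxnRecord{Epoch: 2, Status: Pending}      // pusher's local proto
	observed := TxnRecord{Epoch: 1, Status: Aborted} // record seen while pushing

	fmt.Println("buggy terminates:", buggyShouldTerminate(own, observed))
	fmt.Println("fixed terminates:", fixedShouldTerminate(own, observed))
}
```

In the buggy variant the pusher keeps waiting in the queue, which is what lets other transactions pile up behind it.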
I'm optimistically closing the two kv/contention/nodes=4 issues both because
I hope this is the cause of their recent troubles and also because
I've been spending a lot of time with the test recently in light of #45482
and plan to stabilize it fully.
I plan to backport this to release-19.2. This doesn't need to go all the
way back to release-19.1 because this was introduced in aed892a.