[DocDB] Update blocking transaction info for waiting requests without re-running conflict resolution #16286

robertsami · 2023-03-02T19:17:50Z

Jira Link: DB-5712

Description

We can currently run into a case like the following:

- txn1 acquires an exclusive lock on key k1 at tablet A
- txn2 requests an exclusive lock on k1/k2 at tablet A, blocks on txn1, and enters the wait queue with only txn1 as a blocker
- txn3 acquires an exclusive lock on k2 at tablet A
- txn3 requests an exclusive lock on k3 at tablet B and blocks on txn2

At this point we have a deadlock, but the coordinator for txn2 is not aware that txn2 is also blocked on txn3, so the deadlock will not be detected.
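To make the missed deadlock concrete, here is a minimal sketch that models the wait-for relationships from the scenario as a plain graph and checks it for a cycle. It is a centralized toy model, not the probe-based detector that YugabyteDB actually runs; the function `has_deadlock` and the dictionaries `known` and `actual` are hypothetical names used only for illustration.

```python
from typing import Dict, Set


def has_deadlock(wait_for: Dict[str, Set[str]]) -> bool:
    """Return True if the wait-for graph (waiter -> blockers) contains a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color: Dict[str, int] = {}

    def visit(txn: str) -> bool:
        color[txn] = GRAY
        for blocker in wait_for.get(txn, ()):
            state = color.get(blocker, WHITE)
            if state == GRAY:  # edge back into the current path: a cycle
                return True
            if state == WHITE and visit(blocker):
                return True
        color[txn] = BLACK
        return False

    return any(color.get(t, WHITE) == WHITE and visit(t) for t in list(wait_for))


# What the coordinators know: txn2 recorded only txn1 when it entered the queue.
known = {"txn2": {"txn1"}, "txn3": {"txn2"}}

# What is true once txn3 holds k2: txn2 is blocked on txn3 as well.
actual = {"txn2": {"txn1", "txn3"}, "txn3": {"txn2"}}

print(has_deadlock(known))   # the stale view has no cycle
print(has_deadlock(actual))  # the up-to-date view has txn2 -> txn3 -> txn2
```

The only difference between the two graphs is the missing txn2 -> txn3 edge, which is exactly the piece of blocker information this issue is about.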

In bdec10c, we introduced a change to mitigate this issue by periodically re-running conflict resolution for waiting transactions, so that their blocker list is eventually up-to-date at the coordinator.

A better solution would be to keep the blocker list of each waiter up-to-date by forcing incoming requests to update waiters in the wait queue in real time.
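One way to picture the proposed behavior is the sketch below: when an incoming request is granted a lock on a key that some waiter has also requested, the waiter's blocker list is extended immediately and the change is reported. This is a single-tablet toy under stated assumptions (exclusive locks only, no lock release, no real coordinator RPC); `ToyWaitQueue`, `request`, and the `on_blockers_changed` callback are hypothetical names, not YugabyteDB APIs.

```python
from collections import defaultdict


class ToyWaitQueue:
    """Toy model: exclusive locks on one tablet, locks are never released."""

    def __init__(self, on_blockers_changed):
        self._holders = {}                # key -> txn holding the exclusive lock
        self._waiters = defaultdict(set)  # key -> txns waiting on a request that includes key
        self.blockers = defaultdict(set)  # waiter txn -> blockers known so far
        self._notify = on_blockers_changed

    def request(self, txn, keys):
        """Try to lock all keys for txn; return True if granted, False if queued."""
        conflicts = {
            self._holders[k] for k in keys
            if k in self._holders and self._holders[k] != txn
        }
        if conflicts:
            for k in keys:
                self._waiters[k].add(txn)
            self.blockers[txn] |= conflicts
            self._notify(txn, set(self.blockers[txn]))
            return False

        for k in keys:
            self._holders[k] = txn
            # Real-time update: every waiter that wants this key is now also
            # blocked on txn, so extend its blocker list right away.
            for waiter in self._waiters[k]:
                if waiter != txn and txn not in self.blockers[waiter]:
                    self.blockers[waiter].add(txn)
                    self._notify(waiter, set(self.blockers[waiter]))
        return True


reports = []
queue = ToyWaitQueue(lambda waiter, blockers: reports.append((waiter, blockers)))
queue.request("txn1", ["k1"])        # granted
queue.request("txn2", ["k1", "k2"])  # queued behind txn1
queue.request("txn3", ["k2"])        # granted, and txn2's blockers are updated
print(reports[-1])
```

In this model the last report carries both txn1 and txn3 as blockers of txn2, without txn2 re-running conflict resolution.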

…oned workloads in use of WaitQueue and DeadlockDetector

Summary: This diff enables wait-queues and deadlock detection for cross-region transactions. When wait-queues are enabled and the wait-on-conflict policy is set, each transaction enters the wait queue if it finds blocker transactions after undergoing conflict resolution. The waiter txn records blocker_info (txn id, status tablet, conflicting subtxns, etc.) for each blocking transaction and registers itself with the deadlock detector. All this information is fetched from the tablet's transaction participant.

In case of transaction promotion, all the involved transaction participants are notified of the updated status tablet location. The transaction is promoted to global successfully only when all of the txn participants acknowledge this update; otherwise the transaction is aborted. If promotion succeeds, the old status tablet might be in kPending state until commit time, at which point it changes to kAborted state.

In this diff, a `TransactionStausListerner` interface is introduced, and `WaitQueue` implements this interface. The wait-queue is notified on each transaction promotion. When notified, the wait queue submits a task of type `UpdateWaitersOnBlockerPromotion` to a background threadpool. When a thread gets a chance to execute the submitted task, it does the following:

- It checks whether the promoted transaction was a waiter txn. If so, the transaction is made to re-enter the wait queue by re-running conflict resolution. This ensures that the waiter transaction fetches the updated status tablet and that the same is registered with the deadlock detector.
- It forces all waiter transactions blocked on this promoted transaction to re-enter the wait queue. This ensures that the wait-for dependency is updated with the latest status tablet of the blocker txn and that the wait-for probes are forwarded to the right txn coordinator.

There might be a race between a waiter txn entering the queue with the blocker's old status tablet and the wait queue processing the promotion signal. The waiter wouldn't re-enter the queue if the signal is processed before the waiter enters the queue for the first time, and it follows that the deadlock detector wouldn't be aware of the latest wait-for dependencies. To prevent this, we check with the transaction participant for the latest status tablet of the blocker transaction on inserts to the wait-queue. This is done while holding a mutex, which is also acquired while processing the promotion signal.

Summarizing the changes, txn promotion could result in either a success (which leads to an updated status tablet location) or a failure (aborted state). In either case, the wait queue receives a signal and the waiter transactions waiting on the promoted/aborted blocker re-run conflict resolution. (Currently the wait-queue periodically polls for txn status, and hence aborted/committed cases are taken care of. Rob is working on a parallel diff where this too is being changed to a notification mechanism.)

Additionally, there is a backup mechanism in place where we force all waiter transactions in the wait-queue to re-run conflict resolution periodically. It follows that the deadlock detector would be updated with the latest wait-for dependencies periodically.

User facing aspects: Since we don't execute the above transaction promotion logic in-line with the promotion request, there shouldn't be any noticeable increase in latency for transaction promotion. The only concerning aspect would be that we re-run conflict resolution for the affected transactions when one of their blockers gets promoted. But this is necessary for maintaining up-to-date wait-for dependencies. And since the size of intentsdb should be considerably small, this shouldn't be much of a concern.

Test Plan: Jenkins

```
./yb_build.sh --cxx-test pgwrapper_geo_transactions_promotion-test --gtest_filter DeadlockDetectionWithTxnPromotionTest.TestBlockerPromotionWithDeadlock
./yb_build.sh --cxx-test pgwrapper_geo_transactions_promotion-test --gtest_filter DeadlockDetectionWithTxnPromotionTest.TestBlockerPromotionWithoutDeadlock
./yb_build.sh --cxx-test pgwrapper_geo_transactions_promotion-test --gtest_filter DeadlockDetectionWithTxnPromotionTest.TestDeadlockAmongstGlobalTransactions
./yb_build.sh --cxx-test pgwrapper_geo_transactions_promotion-test --gtest_filter DeadlockDetectionWithTxnPromotionTest.TestWaiterPromotionWithoutDeadlock
./yb_build.sh --cxx-test pgwrapper_geo_transactions_promotion-test --gtest_filter DeadlockDetectionWithTxnPromotionTest.TestWaiterPromotionWithDeadlock
./yb_build.sh --cxx-test pgwrapper_geo_transactions_promotion-test --gtest_filter DeadlockDetectionWithTxnPromotionTest.TestDelayedWaiterRegistrationInWaitQeue
./yb_build.sh --cxx-test pgwrapper_pg_wait_on_conflict-test --gtest_filter PgWaitQueuesTest.TestWaiterTxnReRunConflictResolution
```

Reviewers: esheng, sergei, rsami

Reviewed By: rsami

Subscribers: pjain, jenkins-bot, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D23215
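The promotion race described in the commit message can be illustrated with a small sketch: both the waiter insert and the promotion handler take the same mutex, and the insert consults the participant for the blocker's latest status tablet. This is a simplified model, not the actual implementation. In particular, it overwrites the recorded status tablet directly, whereas the commit message says affected waiters re-run conflict resolution. `ToyParticipant`, `ToyPromotionAwareQueue`, and their methods are hypothetical names.

```python
import threading


class ToyParticipant:
    """Hypothetical stand-in for the tablet's transaction participant."""

    def __init__(self):
        self._status_tablet = {}

    def set_status_tablet(self, txn, tablet):
        self._status_tablet[txn] = tablet

    def latest_status_tablet(self, txn):
        return self._status_tablet.get(txn)


class ToyPromotionAwareQueue:
    def __init__(self, participant):
        self._mutex = threading.Lock()
        self._participant = participant
        # (waiter, blocker) -> blocker status tablet reported to the detector
        self.blocker_tablet = {}

    def insert_waiter(self, waiter, blocker, seen_tablet):
        """Insert a waiter; seen_tablet is what conflict resolution observed."""
        with self._mutex:
            # Re-check under the mutex so that a promotion processed before
            # this insert is not missed.
            latest = self._participant.latest_status_tablet(blocker)
            chosen = latest if latest is not None else seen_tablet
            self.blocker_tablet[(waiter, blocker)] = chosen

    def on_blocker_promoted(self, blocker):
        """Handle the promotion signal for waiters already in the queue."""
        with self._mutex:
            latest = self._participant.latest_status_tablet(blocker)
            for waiter, b in list(self.blocker_tablet):
                if b == blocker:
                    self.blocker_tablet[(waiter, b)] = latest


def run(promote_first):
    participant = ToyParticipant()
    participant.set_status_tablet("txn1", "local")
    queue = ToyPromotionAwareQueue(participant)
    if promote_first:
        participant.set_status_tablet("txn1", "global")
        queue.on_blocker_promoted("txn1")             # no waiters yet
        queue.insert_waiter("txn2", "txn1", "local")  # stale view, corrected on insert
    else:
        queue.insert_waiter("txn2", "txn1", "local")
        participant.set_status_tablet("txn1", "global")
        queue.on_blocker_promoted("txn1")             # existing waiter updated
    return queue.blocker_tablet[("txn2", "txn1")]


print(run(promote_first=True), run(promote_first=False))
```

In both orderings the waiter ends up associated with the blocker's promoted status tablet, which is the property the mutex and the re-check on insert are meant to provide.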

robertsami · 2023-03-14T13:37:26Z

We can mitigate this issue by periodically re-running conflict resolution for waiting transactions, so that their blocker list is eventually up-to-date at the coordinator.

This was completed in bdec10c.

This GitHub issue now tracks a better solution:

A better solution could be to keep the blocker list of each waiter up-to-date by forcing incoming requests to update waiters in the wait queue in real time.

rthallamko3 · 2024-03-19T19:19:40Z

The impact is not clear, for example whether it affects fairness.

rthallamko3 · 2024-05-06T20:48:32Z

It would be good to prioritize this above the other backlog items.

robertsami added the area/docdb and status/awaiting-triage labels Mar 2, 2023

robertsami self-assigned this Mar 2, 2023

robertsami assigned basavaraj29 Mar 2, 2023

yugabyte-ci added the kind/bug and priority/medium labels Mar 2, 2023

yugabyte-ci removed the status/awaiting-triage label Mar 7, 2023

rthallamko3 unassigned robertsami Mar 13, 2023

robertsami assigned robertsami and unassigned basavaraj29 Mar 14, 2023

robertsami changed the title ~~[DocDB] Ensure blocking transaction info is updated at deadlock detectors~~ [DocDB] Update blocking transaction info for waiting requests without re-running conflict resolution Apr 4, 2023

jharveysmith added this to Wait-Queue Based Locking Aug 16, 2023

jharveysmith moved this to Backlog in Wait-Queue Based Locking Aug 16, 2023

robertsami moved this from Backlog to Pending in Wait-Queue Based Locking Aug 21, 2023

This was referenced Dec 6, 2023

[DocDB] Use RPC start time of waiters to generate pg_locks wait_time #20120

Closed

[DocDB] Use statement start time to compute wait time of waiters in pg_locks #20288

Closed

rthallamko3 assigned basavaraj29 and unassigned robertsami May 6, 2024
