-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YSQL] Throw the same error message as Pg when a deadlock is detected #14114
Labels
Comments
pkj415
added
area/ysql
Yugabyte SQL (YSQL)
status/awaiting-triage
Issue awaiting triage
labels
Sep 21, 2022
yugabyte-ci
added
kind/bug
This issue is a bug
priority/medium
Medium priority issue
labels
Sep 21, 2022
Ideally we would also display the node_id of clients involved in the deadlock along with the transaction_id in the error message |
As a first cut, even simply returning the fact that a deadlock was detected, rather than a generic error message as we display today, would be a useful improvement |
basavaraj29
added a commit
that referenced
this issue
Jul 12, 2023
Summary: In the exisiting implementation, we return a generic error message when transactions are aborted due to a deadlock. This diff addresses the issue by returning a detailed error info when transactions are aborted due to a deadlock. Introduced a new flag `clear_deadlocked_txns_info_older_than_seconds`. The deadlock detector now stores transactions aborted due to a deadlock, and we prune the list every `transaction_deadlock_detection_interval_usec`. The info gets removed if it is older than `clear_deadlocked_txns_info_older_than_seconds`. The transaction coordinator now checks with the deadlock detector before reporting `ABORTED` for a txn that it is unable to find. If the txn is found at the deadlock detector, `GetTransactionStatusResponsePB` resp is populated with an newly added field `AppStatusPB reason`, which contains detailed info on the deadlock error. Deadlocked transactions waiting in the wait-queue are aborted in 2 ways: 1. When the txn participant isn't able to find the txn, we directly poll the status tablet for the waiter's status. 2. The txn participant responds with the status it is last aware of (in this case, `ABORTED`). Since both these codepaths use `GetTransactionStatusResponsePB`, we explicitly check if `reason` is populated and return back the detailed error, if any. Note: In some cases, it is still possible that we return a generic error. When a waiter txn is notified that all of its blockers have been aborted, it re-runs conflict resolution. When `PrepareMetadata` is executed, the participant returns a generic error message if the txn is not found. This can happen as the waiter txn too could been aborted as part of the deadlock, but that signal got delayed. If we deterministically abort just one txn when we find a deadlock (#14165), it should probably solve this issue. Jira: DB-3597 Test Plan: Reverting back the newly added test because of another issue #18100 Steps to manually test the feature: Create a cluster using ``` ./bin/yb-ctl create --rf=3 --data_dir ~/yugabyte-data --tserver_flags 'ysql_num_shards_per_tserver=1,enable_wait_queues=true,enable_deadlock_detection=true,enable_intentsdb_page=true' ``` Create a table using ``` create table test (k int primary key, v int); insert into test select generate_series(1,11), 0; ``` Spawn transactions and create deadlocks. Should see error similar to the below ``` ERROR: Transaction 6640330e-d7b6-4c6e-8fec-8d8e2b3311c3 aborted due to deadlock. 6640330e-d7b6-4c6e-8fec-8d8e2b3311c3 -> 12673b67-3ccb-4a90-8a3c-7ed72d92b8c8 -> 6640330e-d7b6-4c6e-8fec-8d8e2b3311c3 : 40P01 ``` Reviewers: rsami, pjain, sergei Reviewed By: rsami Subscribers: esheng, ybase, bogdan Differential Revision: https://phorge.dev.yugabyte.com/D26099
basavaraj29
added a commit
that referenced
this issue
Jul 18, 2023
…sponsePB proto message due to recent regression Summary: Commit d6d0cf6 introduced a new field `deadlock_reason` in the proto message `GetTransactionStatusResponsePB`. The commit didn't handle rolling upgrade scenarios gracefully. It expects that `resp.deadlock_reason().size() == resp.status().size()` will always be true, which isn't the case. If the transaction coordinator is still on an older version, `deadlock_reason` field isn't populated. Accessing the field without checking the size is leading to a segv fault. This diff fixes the issue by explicitly check the size of `deadlock_reason` before accessing the field, ensuring that we handle upgrade scenarios gracefully. Jira: DB-3597, DB-7266 Test Plan: Jenkins Reviewers: rsami, arybochkin, sergei Reviewed By: rsami, arybochkin Subscribers: ybase Differential Revision: https://phorge.dev.yugabyte.com/D27023
basavaraj29
added a commit
that referenced
this issue
Jul 31, 2023
…t in reported deadlocks Summary: Commit d6d0cf6 caused a regression as we were setting `YB_PG_T_R_DEADLOCK_DETECTED` error code for deadlocks. For READ COMMITTED isolation, Pg retries the whole statement with a new read point when it faces a kConflict or kReadRestart error. In case of SERIALIZABLE and REPEATABLE READ isolation levels, Pg retries the whole transaction with a new read point on kConflict or kReadRestart errors, only when no data has been sent yet to the external client as part of the transaction (i.e., no prior statement has been executed and ysql_output_buffer_size hasn't yet been used by the first statement). The above commit changed the error code, which led to increased failures in a conflicting workload as transactions weren't being retired. This diff addresses the issue by modifying the retry logic to also check for `kDeadlock` and act accordingly. - For READ COMMITTED, we have the ability to retry a statement in case of `kConflict` or `kReadRestart`, and we do so indefinitely to honor read committed semantics unless a filled buffer of response data has already been sent to the client for the statement (see `YBIsDataSentForCurrQuery()`). But for deadlock errors, there is no point in retrying the statement because the deadlock cycle might involve granted locks from previous statement(s). But we can still retry if it is the first statement in the transaction and no response data has been sent back to the client (retrying in this case will result in releasing the granted locks as well and hence allows progress). We check for this condition using `YBIsDataSent()`. - For other isolation levels (SERIALIZABLE & REPEATABLE READ), we retry the transaction for at most `ysql_max_write_restart_attempts` only when no data has been sent so far (i.e., `YBIsDataSent()` is false). Here's the log when a transaction encounters a deadlock (with the changes) ``` ERROR: deadlock detected DETAIL: Transaction ebf3d40f-0d39-4f26-b969-4d515043b3c5 aborted due to a deadlock. ebf3d40f-0d39-4f26-b969-4d515043b3c5 -> d8beba29-3ca5-4292-bd2b-24d878d381d4 -> ebf3d40f-0d39-4f26-b969-4d515043b3c5 : kDeadlock ``` Note: the test change is required as we previously used to set 40001, but now since we also return 40P01, we need to check for both. Without this change, the test fails, which further asserts that we return 40P01 now. Jira: DB-7281, DB-3597, DB-7420 Test Plan: ./yb_build.sh asan --java-test 'org.yb.pgsql.TestPgTransactions#testSerializableReadDelayWrite' ./yb_build.sh asan --cxx-test pgwrapper_pg_libpq-test --gtest_filter PgLibPqTest.TestLockedConcurrentCounterSerializable ./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter PgLibPqTest.TestConcurrentCounterReadCommitted ./yb_build.sh --cxx-test pgwrapper_pg_wait_on_conflict-test --gtest_filter PgWaitQueuesReadCommittedTest.TestDeadlockSimple Reviewers: rsami, pjain, smishra, mihnea, tvesely Reviewed By: pjain Subscribers: myang, yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D27243
basavaraj29
added a commit
that referenced
this issue
Jul 31, 2023
… by error code set in reported deadlocks Summary: Original commit: d380779 / D27243 Commit d6d0cf6 caused a regression as we were setting `YB_PG_T_R_DEADLOCK_DETECTED` error code for deadlocks. For READ COMMITTED isolation, Pg retries the whole statement with a new read point when it faces a kConflict or kReadRestart error. In case of SERIALIZABLE and REPEATABLE READ isolation levels, Pg retries the whole transaction with a new read point on kConflict or kReadRestart errors, only when no data has been sent yet to the external client as part of the transaction (i.e., no prior statement has been executed and ysql_output_buffer_size hasn't yet been used by the first statement). The above commit changed the error code, which led to increased failures in a conflicting workload as transactions weren't being retired. This diff addresses the issue by modifying the retry logic to also check for `kDeadlock` and act accordingly. - For READ COMMITTED, we have the ability to retry a statement in case of `kConflict` or `kReadRestart`, and we do so indefinitely to honor read committed semantics unless a filled buffer of response data has already been sent to the client for the statement (see `YBIsDataSentForCurrQuery()`). But for deadlock errors, there is no point in retrying the statement because the deadlock cycle might involve granted locks from previous statement(s). But we can still retry if it is the first statement in the transaction and no response data has been sent back to the client (retrying in this case will result in releasing the granted locks as well and hence allows progress). We check for this condition using `YBIsDataSent()`. - For other isolation levels (SERIALIZABLE & REPEATABLE READ), we retry the transaction for at most `ysql_max_write_restart_attempts` only when no data has been sent so far (i.e., `YBIsDataSent()` is false). Here's the log when a transaction encounters a deadlock (with the changes) ``` ERROR: deadlock detected DETAIL: Transaction ebf3d40f-0d39-4f26-b969-4d515043b3c5 aborted due to a deadlock. ebf3d40f-0d39-4f26-b969-4d515043b3c5 -> d8beba29-3ca5-4292-bd2b-24d878d381d4 -> ebf3d40f-0d39-4f26-b969-4d515043b3c5 : kDeadlock ``` Note: the test change is required as we previously used to set 40001, but now since we also return 40P01, we need to check for both. Without this change, the test fails, which further asserts that we return 40P01 now. Jira: DB-7281, DB-3597, DB-7420 Test Plan: ./yb_build.sh asan --java-test 'org.yb.pgsql.TestPgTransactions#testSerializableReadDelayWrite' ./yb_build.sh asan --cxx-test pgwrapper_pg_libpq-test --gtest_filter PgLibPqTest.TestLockedConcurrentCounterSerializable ./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter PgLibPqTest.TestConcurrentCounterReadCommitted ./yb_build.sh --cxx-test pgwrapper_pg_wait_on_conflict-test --gtest_filter PgWaitQueuesReadCommittedTest.TestDeadlockSimple Reviewers: rsami, pjain, smishra, mihnea, tvesely Reviewed By: rsami Subscribers: ybase, yql, myang Differential Revision: https://phorge.dev.yugabyte.com/D27410
pkj415
added a commit
that referenced
this issue
Aug 9, 2023
…tion Summary: As part of d380779, we introduced retries for deadlock errors in the query layer. The retries for read committed isolation have some bugs and need more work (will be tracked as part of #18616). This diff removes the deadlock error retries for read committed until then. Jira: DB-3597 Test Plan: Jenkins: skip ./yb_build.sh --cxx-test pgwrapper_pg_wait_on_conflict-test --gtest_filter PgWaitQueuesReadCommittedTest.TestDeadlockSimple Reviewers: bkolagani, rsami Reviewed By: bkolagani Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D27642
basavaraj29
added a commit
that referenced
this issue
Sep 28, 2023
Summary: In d6d0cf6, we made a first step towards reporting a deadlock specific error message. This diff is an enhancement on top of the original one, where we try to be one step closer to throwing a consistent deadlock specific error immaterial of where the waiter transactions realizes of its aborted state. In this diff, we **additionally make transaction participant store recently deadlocked transactions info** in-memory after it requests the status from the coordinator. Fixes the following scenario: On detection of deadlocks, multiple involved transactions could be aborted. due to the uncertain control race, a waiter could resume from the wait-queue and undergo conflict resolution, and realize that it has been aborted (when it calls `PrepareMetadata` of the transaction participant). Since we make the participant store the recently deadlocked transactions info, we report the deadlock specific error if any. Planning to take the following up in a subsequent diff, if required - 1. The query layer (client/transaction.cc) proactively sends clean up requests for expired transactions. So it could happen that a txn participant sees that request first and cleans up the txn. And when an aborted waiter re-does conflict resolution, the participant might return a generic error from `::PrepareMetadata` func. Either the deadlock specific error from the query layer needs to be streamed back to the client, or the participant somehow needs to be made aware that the cleanup is due to a deadlock. 2. include the deadlock status/reason as part of replicating the abort op at the coordinator. Jira: DB-7377, DB-3597 Test Plan: Reverting the test new test since it would be flaky without addressing the status tablet leadership change and transaction clean up cases. for stats, the following test (where we explicitly disable proactive cleanup from query layer) fails with a different non-deadlock specific error (`message: ERROR: Transaction aborted: kAborted (pgsql error 40001)`) about 10% times when the test is run for 100 iters. When the enable the query layer cleanup, we can additionally see errors of type `Transaction was recently aborted...`. ``` TEST_F(PgWaitQueuesTest, TestDeadlockErrorMessage) { ANNOTATE_UNPROTECTED_WRITE(FLAGS_TEST_disable_proactive_txn_cleanup_on_abort) = true; ANNOTATE_UNPROTECTED_WRITE(FLAGS_clear_deadlocked_txns_info_older_than_heartbeats) = 40; constexpr int kNumTxns = 5; auto setup_conn = ASSERT_RESULT(Connect()); ASSERT_OK(setup_conn.Execute("CREATE TABLE foo (k INT PRIMARY KEY, v INT)")); ASSERT_OK(setup_conn.ExecuteFormat( "INSERT INTO foo SELECT generate_series(0, $0), 0", kNumTxns)); CountDownLatch num_keys_locked(kNumTxns); CountDownLatch did_deadlock(kNumTxns - 1); TestThreadHolder thread_holder; for (auto i = 0; i < kNumTxns ; i++) { thread_holder.AddThreadFunctor([this, &num_keys_locked, &did_deadlock, i] { auto conn = ASSERT_RESULT(Connect()); ASSERT_OK(conn.StartTransaction(IsolationLevel::SNAPSHOT_ISOLATION)); ASSERT_OK(conn.ExecuteFormat("UPDATE foo SET v=v+10 WHERE k=$0", i)); num_keys_locked.CountDown(); num_keys_locked.WaitFor(3s * kTimeMultiplier); for (auto j = 1 ; j <= kNumTxns ; j++) { auto s = conn.ExecuteFormat("UPDATE foo SET v=v+20 WHERE k=$0", (i + j) % kNumTxns); if (!s.ok()) { LOG(INFO) << "Thread " << i << " update failed with status " << s; ASSERT_STR_CONTAINS(s.ToString(), "aborted due to a deadlock"); did_deadlock.CountDown(); return; } } ASSERT_TRUE(did_deadlock.WaitFor(25s * kTimeMultiplier)); }); } ASSERT_TRUE(did_deadlock.WaitFor(40s * kTimeMultiplier)); thread_holder.WaitAndStop(15s * kTimeMultiplier); } ``` Reviewers: rsami, sergei Reviewed By: rsami Subscribers: pjain, mbautin, bogdan, yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D27782
basavaraj29
added a commit
that referenced
this issue
Oct 3, 2023
…cted deadlocks Summary: Original commit: 798fc27 / D27782 In d6d0cf6, we made a first step towards reporting a deadlock specific error message. This diff is an enhancement on top of the original one, where we try to be one step closer to throwing a consistent deadlock specific error immaterial of where the waiter transactions realizes of its aborted state. In this diff, we **additionally make transaction participant store recently deadlocked transactions info** in-memory after it requests the status from the coordinator. Fixes the following scenario: On detection of deadlocks, multiple involved transactions could be aborted. due to the uncertain control race, a waiter could resume from the wait-queue and undergo conflict resolution, and realize that it has been aborted (when it calls `PrepareMetadata` of the transaction participant). Since we make the participant store the recently deadlocked transactions info, we report the deadlock specific error if any. Planning to take the following up in a subsequent diff, if required - 1. The query layer (client/transaction.cc) proactively sends clean up requests for expired transactions. So it could happen that a txn participant sees that request first and cleans up the txn. And when an aborted waiter re-does conflict resolution, the participant might return a generic error from `::PrepareMetadata` func. Either the deadlock specific error from the query layer needs to be streamed back to the client, or the participant somehow needs to be made aware that the cleanup is due to a deadlock. 2. include the deadlock status/reason as part of replicating the abort op at the coordinator. Jira: DB-7377, DB-3597 Test Plan: Reverting the test new test since it would be flaky without addressing the status tablet leadership change and transaction clean up cases. for stats, the following test (where we explicitly disable proactive cleanup from query layer) fails with a different non-deadlock specific error (`message: ERROR: Transaction aborted: kAborted (pgsql error 40001)`) about 10% times when the test is run for 100 iters. When the enable the query layer cleanup, we can additionally see errors of type `Transaction was recently aborted...`. ``` TEST_F(PgWaitQueuesTest, TestDeadlockErrorMessage) { ANNOTATE_UNPROTECTED_WRITE(FLAGS_TEST_disable_proactive_txn_cleanup_on_abort) = true; ANNOTATE_UNPROTECTED_WRITE(FLAGS_clear_deadlocked_txns_info_older_than_heartbeats) = 40; constexpr int kNumTxns = 5; auto setup_conn = ASSERT_RESULT(Connect()); ASSERT_OK(setup_conn.Execute("CREATE TABLE foo (k INT PRIMARY KEY, v INT)")); ASSERT_OK(setup_conn.ExecuteFormat( "INSERT INTO foo SELECT generate_series(0, $0), 0", kNumTxns)); CountDownLatch num_keys_locked(kNumTxns); CountDownLatch did_deadlock(kNumTxns - 1); TestThreadHolder thread_holder; for (auto i = 0; i < kNumTxns ; i++) { thread_holder.AddThreadFunctor([this, &num_keys_locked, &did_deadlock, i] { auto conn = ASSERT_RESULT(Connect()); ASSERT_OK(conn.StartTransaction(IsolationLevel::SNAPSHOT_ISOLATION)); ASSERT_OK(conn.ExecuteFormat("UPDATE foo SET v=v+10 WHERE k=$0", i)); num_keys_locked.CountDown(); num_keys_locked.WaitFor(3s * kTimeMultiplier); for (auto j = 1 ; j <= kNumTxns ; j++) { auto s = conn.ExecuteFormat("UPDATE foo SET v=v+20 WHERE k=$0", (i + j) % kNumTxns); if (!s.ok()) { LOG(INFO) << "Thread " << i << " update failed with status " << s; ASSERT_STR_CONTAINS(s.ToString(), "aborted due to a deadlock"); did_deadlock.CountDown(); return; } } ASSERT_TRUE(did_deadlock.WaitFor(25s * kTimeMultiplier)); }); } ASSERT_TRUE(did_deadlock.WaitFor(40s * kTimeMultiplier)); thread_holder.WaitAndStop(15s * kTimeMultiplier); } ``` Reviewers: rsami, sergei, rthallam Reviewed By: rthallam Subscribers: ybase, yql, bogdan, mbautin, pjain Differential Revision: https://phorge.dev.yugabyte.com/D28904
This was referenced Oct 3, 2023
jharveysmith
pushed a commit
that referenced
this issue
May 24, 2024
…cted deadlocks Summary: Original commit: 74ba77a4676a1ca08a97e339b66be98c64dd508b / D27782 In 5af785f, we made a first step towards reporting a deadlock specific error message. This diff is an enhancement on top of the original one, where we try to be one step closer to throwing a consistent deadlock specific error immaterial of where the waiter transactions realizes of its aborted state. In this diff, we **additionally make transaction participant store recently deadlocked transactions info** in-memory after it requests the status from the coordinator. Fixes the following scenario: On detection of deadlocks, multiple involved transactions could be aborted. due to the uncertain control race, a waiter could resume from the wait-queue and undergo conflict resolution, and realize that it has been aborted (when it calls `PrepareMetadata` of the transaction participant). Since we make the participant store the recently deadlocked transactions info, we report the deadlock specific error if any. Planning to take the following up in a subsequent diff, if required - 1. The query layer (client/transaction.cc) proactively sends clean up requests for expired transactions. So it could happen that a txn participant sees that request first and cleans up the txn. And when an aborted waiter re-does conflict resolution, the participant might return a generic error from `::PrepareMetadata` func. Either the deadlock specific error from the query layer needs to be streamed back to the client, or the participant somehow needs to be made aware that the cleanup is due to a deadlock. 2. include the deadlock status/reason as part of replicating the abort op at the coordinator. Jira: DB-7377, DB-3597 Test Plan: Reverting the test new test since it would be flaky without addressing the status tablet leadership change and transaction clean up cases. for stats, the following test (where we explicitly disable proactive cleanup from query layer) fails with a different non-deadlock specific error (`message: ERROR: Transaction aborted: kAborted (pgsql error 40001)`) about 10% times when the test is run for 100 iters. When the enable the query layer cleanup, we can additionally see errors of type `Transaction was recently aborted...`. ``` TEST_F(PgWaitQueuesTest, TestDeadlockErrorMessage) { ANNOTATE_UNPROTECTED_WRITE(FLAGS_TEST_disable_proactive_txn_cleanup_on_abort) = true; ANNOTATE_UNPROTECTED_WRITE(FLAGS_clear_deadlocked_txns_info_older_than_heartbeats) = 40; constexpr int kNumTxns = 5; auto setup_conn = ASSERT_RESULT(Connect()); ASSERT_OK(setup_conn.Execute("CREATE TABLE foo (k INT PRIMARY KEY, v INT)")); ASSERT_OK(setup_conn.ExecuteFormat( "INSERT INTO foo SELECT generate_series(0, $0), 0", kNumTxns)); CountDownLatch num_keys_locked(kNumTxns); CountDownLatch did_deadlock(kNumTxns - 1); TestThreadHolder thread_holder; for (auto i = 0; i < kNumTxns ; i++) { thread_holder.AddThreadFunctor([this, &num_keys_locked, &did_deadlock, i] { auto conn = ASSERT_RESULT(Connect()); ASSERT_OK(conn.StartTransaction(IsolationLevel::SNAPSHOT_ISOLATION)); ASSERT_OK(conn.ExecuteFormat("UPDATE foo SET v=v+10 WHERE k=$0", i)); num_keys_locked.CountDown(); num_keys_locked.WaitFor(3s * kTimeMultiplier); for (auto j = 1 ; j <= kNumTxns ; j++) { auto s = conn.ExecuteFormat("UPDATE foo SET v=v+20 WHERE k=$0", (i + j) % kNumTxns); if (!s.ok()) { LOG(INFO) << "Thread " << i << " update failed with status " << s; ASSERT_STR_CONTAINS(s.ToString(), "aborted due to a deadlock"); did_deadlock.CountDown(); return; } } ASSERT_TRUE(did_deadlock.WaitFor(25s * kTimeMultiplier)); }); } ASSERT_TRUE(did_deadlock.WaitFor(40s * kTimeMultiplier)); thread_holder.WaitAndStop(15s * kTimeMultiplier); } ``` Reviewers: rsami, sergei, rthallam Reviewed By: rthallam Subscribers: ybase, yql, bogdan, mbautin, pjain Differential Revision: https://phorge.dev.yugabyte.com/D28904
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
Jira Link: DB-3597
Description
Pg throws a detailed error message when a deadlock is detected as shown below:
We should improve our error message to similarly convey that a transaction was cancelled due to deadlock. We can also include a list of transactions involved and IDs of nodes serving those transactions
The text was updated successfully, but these errors were encountered: