Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DocDB] Some client connections hang after master leader is restarted #13659

Closed
rthallamko3 opened this issue Aug 17, 2022 · 1 comment
Closed
Assignees
Labels
2.14 Backport Required area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue

Comments

@rthallamko3
Copy link
Contributor

rthallamko3 commented Aug 17, 2022

Jira Link: DB-3224

Description

Steps to repro (From Stepan):
run 20 nodes cluster with RF=5 (i think we can reproduce with less RF and nodes amount)
start sample-apps load - just inserts in SQL table with 20 columns, batch_size=700
java -jar yb-sample-apps-1.8.15-b3.jar --workload SqlDataLoad --num_writes -1 --num_threads_write 5 --num_threads_read 0 --num_reads 0 --num_unique_keys 100000000000 --batch_size 774 --num_value_columns 8
after 10 min stop AWS instance manually in AWS console
wait several minutes and then start it back
go in live queries tab on portal and see stuck query
Here is running universes with CDC configured (connector is killed right now, load is going) - http://10.9.66.35/universes/45a79116-0925-4e80-ac2a-f43c9ef254a0
Here is running universe without CDC - http://10.9.66.35/universes/38479ee9-3a1f-45fb-af1f-432be47fdf75
creds: demo@yugabyte.com Password#123

The backtrace of the hang:

Thread 1 (LWP 23845):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007fc045ac6748 in std::__atomic_futex_unsigned_base::_M_futex_wait_until(unsigned int*, unsigned int, bool, std::chrono::duration<long, std::ratio<1l, 1l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) ()
from /home/yugabyte/yb-software/yugabyte-2.12.9.1-b1-centos-x86_64/linuxbrew/lib/libstdc++.so.6
#2 0x00007fc046265d42 in _M_load_and_test_until (__ns=..., __s=..., __has_timeout=, __mo=, __equal=, __operand=, __assumed=, this=)
at /tmp/core/yugabyte-2.12.9.1/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/bits/atomic_futex.h:104
#3 _M_load_and_test (__mo=, __equal=, __operand=, __assumed=, this=)
at /tmp/core/yugabyte-2.12.9.1/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/bits/atomic_futex.h:122
#4 _M_load_when_equal (__mo=std::memory_order_acquire, __val=1, this=0x2bf3c10) at /tmp/core/yugabyte-2.12.9.1/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/bits/atomic_futex.h:162
#5 wait (this=0x2bf3c00) at /tmp/core/yugabyte-2.12.9.1/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/future:322
#6 _M_get_result (this=0x7fff096b71d0) at /tmp/core/yugabyte-2.12.9.1/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/future:681
#7 get (this=0x7fff096b71d0) at /tmp/core/yugabyte-2.12.9.1/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/future:760
#8 yb::pggate::PgSession::FlushOperations (this=0x2422c40, ops=..., transactional=...) at ../../src/yb/yql/pggate/pg_session.cc:987
#9 0x00007fc046266209 in operator()<std::vectoryb::pggate::BufferableOperation, yb::StronglyTypedBoolyb::pggate::IsTransactionalSession_Tag > (txn=...,
ops=, __closure=) at ../../src/yb/yql/pggate/pg_session.cc:870
#10 std::_Function_handler<yb::Status(std::vector<yb::pggate::BufferableOperation, std::allocatoryb::pggate::BufferableOperation >, yb::StronglyTypedBoolyb::pggate::IsTransactionalSession_Tag), yb::pggate::PgSession::FlushBufferedOperations()::<lambda(auto:6, auto:7)> >::_M_invoke(const std::_Any_data &, <unknown type in /home/yugabyte/yb-software/yugabyte-2.12.9.1-b1-centos-x86_64/lib/yb/libyb_pggate.so, CU 0x39f473, DIE 0x47889b>, <unknown type in /home/yugabyte/yb-software/yugabyte-2.12.9.1-b1-centos-x86_64/lib/yb/libyb_pggate.so, CU 0x39f473, DIE 0x4788a0>) (__functor=..., __args#0=, __args#1=)
at /tmp/core/yugabyte-2.12.9.1/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:1857
#11 0x00007fc0462621ba in operator() (__args#1=..., __args#0=..., this=0x7fff096b7370) at /tmp/core/yugabyte-2.12.9.1/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:2267
#12 yb::pggate::PgSession::FlushBufferedOperationsImpl(std::function<yb::Status (std::vector<yb::pggate::BufferableOperation, std::allocatoryb::pggate::BufferableOperation >, yb::StronglyTypedBoolyb::pggate::IsTransactionalSession_Tag)> const&) (
this=0x2422c40, flusher=...) at ../../src/yb/yql/pggate/pg_session.cc:898
#13 0x00007fc046262454 in yb::pggate::PgSession::FlushBufferedOperations (this=) at ../../src/yb/yql/pggate/pg_session.cc:870
#14 0x00007fc0462624c1 in yb::pggate::PgSession::StopOperationsBuffering (this=) at ../../src/yb/yql/pggate/pg_session.cc:860
#15 0x00007fc04624b218 in yb::pggate::PgApiImpl::StopOperationsBuffering (this=) at ../../src/yb/yql/pggate/pggate.cc:1078
#16 0x00007fc04623e63c in yb::pggate::YBCPgStopOperationsBuffering () at ../../src/yb/yql/pggate/ybc_pggate.cc:652
#17 0x0000000000a3b7d9 in YBEndOperationsBuffering () at ../../../../../../../src/postgres/src/backend/utils/misc/pg_yb_utils.c:1409
#18 0x00000000006d03c9 in standard_ExecutorFinish (queryDesc=0x34fdc38) at ../../../../../../src/postgres/src/backend/executor/execMain.c:447
#19 0x00007fc03a589875 in pgss_ExecutorFinish (queryDesc=0x34fdc38) at ../../../../../src/postgres/contrib/pg_stat_statements/pg_stat_statements.c:970
#20 0x00007fc03a3822b3 in ybpgm_ExecutorFinish (queryDesc=0x34fdc38) at ../../../../../src/postgres/contrib/yb_pg_metrics/yb_pg_metrics.c:519
#21 0x0000000000888550 in ProcessQuery (isSingleRowModifyTxn=, completionTag=, dest=, queryEnv=, params=, sourceText=, plan=)
at ../../../../../../src/postgres/src/backend/tcop/pquery.c:217
#22 PortalRunMulti (portal=portal@entry=0x26e4118, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=0xed2760 , dest@entry=0x25dc558, altdest=0xed2760 , altdest@entry=0x25dc558,
completionTag=completionTag@entry=0x7fff096b78c0 "INSERT 0 1") at ../../../../../../src/postgres/src/backend/tcop/pquery.c:1331
#23 0x00000000008893f2 in PortalRun (portal=portal@entry=0x26e4118, count=count@entry=1, isTopLevel=isTopLevel@entry=true, run_once=, dest=dest@entry=0x25dc558, altdest=altdest@entry=0x25dc558,
completionTag=0x7fff096b78c0 "INSERT 0 1") at ../../../../../../src/postgres/src/backend/tcop/pquery.c:811
#24 0x0000000000882a3c in exec_execute_message (max_rows=1, portal_name=0x25dc118 "") at ../../../../../../src/postgres/src/backend/tcop/postgres.c:2066
#25 yb_exec_execute_message_impl (raw_ctx=raw_ctx@entry=0x7fff096b7c10) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:4564
#26 0x0000000000881875 in yb_exec_query_wrapper (exec_context=exec_context@entry=0x25dc000, restart_data=restart_data@entry=0x25dc528, functor=functor@entry=0x882760 <yb_exec_execute_message_impl>,
---Type to continue, or q to quit---
functor_context=functor_context@entry=0x7fff096b7c10) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:4517
#27 0x0000000000886536 in yb_exec_execute_message (exec_context=0x25dc000, restart_data=0x25dc528, max_rows=, portal_name=) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:4581

@rthallamko3 rthallamko3 added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Aug 17, 2022
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Aug 17, 2022
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Aug 17, 2022
@rthallamko3 rthallamko3 changed the title [DocDB] Client connections hang after master leader is restarted [DocDB] Some client connections hang after master leader is restarted Aug 17, 2022
@rthallamko3 rthallamko3 modified the milestone: 2.4.0.0 Aug 17, 2022
samiahmedsiddiqui pushed a commit to samiahmedsiddiqui/yugabyte-db that referenced this issue Aug 19, 2022
Summary:
YBTransaction starts with looking up for status tablet.
When this lookup is failed the transaction is marked as failed.

We lack failure state check at Prepare, so calling Prepare after transaction failure will result in adding callback to waiters list.
But no more action will be taken, so callback will never be notified.

The side effect is master leader restarts can cause prepared statement connections to hang.

This diff fixed the issue by checking failure status in Prepare.

Test Plan:
Launch AWS cluster.
Run the following workload against it:
java  -jar yb-sample-apps.jar --workload SqlDataLoad  --num_writes -1 --num_threads_write 5 --num_threads_read 0 --num_reads 0 --num_unique_keys 100000000000 --batch_size 774 --num_value_columns 8 --nodes $NODES
Wait 10 minutes.
Stop node, that runs master leader, using AWS console.

W/o the fix there will be stuck queries.

Reviewers: esheng, rthallam

Reviewed By: esheng, rthallam

Subscribers: rthallam, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D18994
spolitov added a commit that referenced this issue Aug 19, 2022
Summary:
YBTransaction starts with looking up for status tablet.
When this lookup is failed the transaction is marked as failed.

We lack failure state check at Prepare, so calling Prepare after transaction failure will result in adding callback to waiters list.
But no more action will be taken, so callback will never be notified.

The side effect is master leader restarts can cause prepared statement connections to hang.

This diff fixed the issue by checking failure status in Prepare.

Original commit: 11cdcc6/D18994

Test Plan:
Launch AWS cluster.
Run the following workload against it:
java  -jar yb-sample-apps.jar --workload SqlDataLoad  --num_writes -1 --num_threads_write 5 --num_threads_read 0 --num_reads 0 --num_unique_keys 100000000000 --batch_size 774 --num_value_columns 8 --nodes $NODES
Wait 10 minutes.
Stop node, that runs master leader, using AWS console.

W/o the fix there will be stuck queries.

Reviewers: esheng, rthallam

Reviewed By: rthallam

Subscribers: rthallam, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D19043
spolitov added a commit that referenced this issue Aug 21, 2022
Summary:
YBTransaction starts with looking up for status tablet.
When this lookup is failed the transaction is marked as failed.

We lack failure state check at Prepare, so calling Prepare after transaction failure will result in adding callback to waiters list.
But no more action will be taken, so callback will never be notified.

The side effect is master leader restarts can cause prepared statement connections to hang.

This diff fixed the issue by checking failure status in Prepare.

Test Plan:
Launch AWS cluster.
Run the following workload against it:
java  -jar yb-sample-apps.jar --workload SqlDataLoad  --num_writes -1 --num_threads_write 5 --num_threads_read 0 --num_reads 0 --num_unique_keys 100000000000 --batch_size 774 --num_value_columns 8 --nodes $NODES
Wait 10 minutes.
Stop node, that runs master leader, using AWS console.

W/o the fix there will be stuck queries.

Original commit: 11cdcc6/D18994

Reviewers: esheng, rthallam

Reviewed By: rthallam

Subscribers: ybase, rthallam

Differential Revision: https://phabricator.dev.yugabyte.com/D19064
@bmatican
Copy link
Contributor

Seems like the 2.12 base backport also landed 0a9521a, so good to close this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.14 Backport Required area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue
Projects
None yet
Development

No branches or pull requests

4 participants