CQL Delete with just the partition key is not supported #6
Comments
For now I've commented out the delete queries, and I'm able to get it running. Heck, who wants to delete data anyway ;)
@brianhks: that's what @rkarthik007 temporarily had to do to move forward :). We noticed that the code prepares the "delete" statements upfront, but their actual "execute" step isn't exercised in the common flows.
@brianhks - I hit a couple of other issues as well.
Yes, I was going to mention the compression issue as well but forgot. In my previous performance tests with C* (Cassandra), compression made a big difference.
Yes, indeed. Just to note, we recently added client-to-server compression support. We'll make sure to cut a newer build and make it available for download soon.
@kmuthukk, yes, it's done, and that commit is now included in the latest release (https://docs.yugabyte.com/quick-start/install/). Closing.
… memtable

Summary: There was a crash during one of our performance integration tests that was caused by Frontiers() not being set on a memtable. That could only possibly happen if the memtable is empty, and it is still not clear how an empty memtable could get into the list of immutable memtables. Regardless of that, instead of crashing, we should just flush that memtable and log an error message.

```
#0 operator() (memtable=..., __closure=0x7f2e454b67b0) at ../../../../../src/yb/tablet/tablet_peer.cc:178
#1 std::_Function_handler<bool(const rocksdb::MemTable&), yb::tablet::TabletPeer::InitTabletPeer(const std::shared_ptr<yb::tablet::enterprise::Tablet>&, const std::shared_future<std::shared_ptr<yb::client::YBClient> >&, const scoped_refptr<yb::server::Clock>&, const std::shared_ptr<yb::rpc::Messenger>&, const scoped_refptr<yb::log::Log>&, const scoped_refptr<yb::MetricEntity>&, yb::ThreadPool*)::<lambda()>::<lambda(const rocksdb::MemTable&)> >::_M_invoke(const std::_Any_data &, const rocksdb::MemTable &) (__functor=..., __args#0=...) at /n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:1857
#2 0x00007f2f7346a70e in operator() (__args#0=..., this=0x7f2e454b67b0) at /n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:2267
#3 rocksdb::MemTableList::PickMemtablesToFlush(rocksdb::autovector<rocksdb::MemTable*, 8ul>*, std::function<bool (rocksdb::MemTable const&)> const&) (this=0x7d02978, ret=ret@entry=0x7f2e454b6370, filter=...) at ../../../../../src/yb/rocksdb/db/memtable_list.cc:259
#4 0x00007f2f7345517f in rocksdb::FlushJob::Run (this=this@entry=0x7f2e454b6750, file_meta=file_meta@entry=0x7f2e454b68d0) at ../../../../../src/yb/rocksdb/db/flush_job.cc:143
#5 0x00007f2f7341b7c3 in rocksdb::DBImpl::FlushMemTableToOutputFile (this=this@entry=0x89d2400, cfd=cfd@entry=0x7d02300, mutable_cf_options=..., made_progress=made_progress@entry=0x7f2e454b709e, job_context=job_context@entry=0x7f2e454b70b0, log_buffer=0x7f2e454b7280) at ../../../../../src/yb/rocksdb/db/db_impl.cc:1586
#6 0x00007f2f7341c19f in rocksdb::DBImpl::BackgroundFlush (this=this@entry=0x89d2400, made_progress=made_progress@entry=0x7f2e454b709e, job_context=job_context@entry=0x7f2e454b70b0, log_buffer=log_buffer@entry=0x7f2e454b7280) at ../../../../../src/yb/rocksdb/db/db_impl.cc:2816
#7 0x00007f2f7342539b in rocksdb::DBImpl::BackgroundCallFlush (this=0x89d2400) at ../../../../../src/yb/rocksdb/db/db_impl.cc:2838
#8 0x00007f2f735154c3 in rocksdb::ThreadPool::BGThread (this=0x3b0bb20, thread_id=0) at ../../../../../src/yb/rocksdb/util/thread_posix.cc:133
#9 0x00007f2f73515558 in rocksdb::BGThreadWrapper (arg=0xd970a20) at ../../../../../src/yb/rocksdb/util/thread_posix.cc:157
#10 0x00007f2f6c964694 in start_thread (arg=0x7f2e454b8700) at pthread_create.c:333
```

Test Plan: Jenkins

Reviewers: hector, sergei

Reviewed By: hector, sergei

Subscribers: sergei, bogdan, bharat, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D4044
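A minimal sketch of the behavior described in this summary: a memtable without frontiers is logged and treated as flushable instead of crashing. `FakeMemTable`, `ShouldFlushMemTable`, and the `largest_op_id <= flushable_op_id` criterion are hypothetical stand-ins for the real filter lambda that `TabletPeer::InitTabletPeer` passes to `MemTableList::PickMemtablesToFlush` (per the stack trace); only the handling of the missing-frontier case is the point here.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>

// Hypothetical, simplified memtable: `largest_op_id` plays the role of the
// frontier and is empty if the memtable never received a write.
struct FakeMemTable {
  std::string name;
  std::optional<std::uint64_t> largest_op_id;
};

// Filter in the spirit of the one passed to PickMemtablesToFlush: returns true
// if `mem` may be flushed. The op-id comparison is an assumed criterion.
bool ShouldFlushMemTable(const FakeMemTable& mem, std::uint64_t flushable_op_id) {
  if (!mem.largest_op_id.has_value()) {
    // Missing frontiers: log an error and allow the flush instead of crashing.
    std::cerr << "ERROR: memtable " << mem.name
              << " has no frontiers set; flushing it anyway\n";
    return true;
  }
  return *mem.largest_op_id <= flushable_op_id;
}
```

Flushing an empty memtable is harmless, so choosing "flush and log" over an assertion trades a hard crash for an error message that can still be investigated later.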
Summary: Here is an excerpt from the RocksDB wiki describing the "Version" data structure:

> The list of files in an LSM tree is kept in a data structure called version. In the end of a compaction or a mem table flush, a new version is created for the updated LSM tree. At one time, there is only one "current" version that represents the files in the up-to-date LSM tree. New get requests or new iterators will use the current version through the whole read process or life cycle of iterator. All versions used by get or iterators need to be kept. An out-of-date version that is not used by any get or iterator needs to be dropped. All files not used by any other version need to be deleted.
> ...
> Both of an SST file and a version have a reference count. While we create a version, we incremented the reference counts for all files. If a version is not needed, all files’ of the version have their reference counts decremented. If a file’s reference count drops to 0, the file can be deleted.
>
> In a similar way, each version has a reference count. When a version is created, it is an up-to-date one, so it has reference count 1. If the version is not up-to-date anymore, its reference count is decremented. Anyone who needs to work on the version has its reference count incremented by 1, and decremented by 1 when finishing using it. When a version’s reference count is 0, it should be removed. Either a version is up-to-date or someone is using it, its reference count is not 0, so it will be kept.

A compaction job doesn't simply delete its input files. Instead, it finds obsolete files (ignoring the list of input files) and deletes them. When deleting obsolete files, it doesn't delete live SST files or pending output files.

There were several cases in which deletion of compacted files was delayed:
1) A concurrent flush job holds its input version and therefore all files from this version.
2) At the end of a flush job, RocksDB can schedule a compaction, which starts holding its input version together with all files from this version (not only the input files of the scheduled compaction).
3) The `DBImpl::FindObsoleteFiles` and `DBImpl::PurgeObsoleteFiles` functions don't delete unreferenced SST files whose number is greater than or equal to `min_pending_output`. This means that if some job is still writing file #4, the already compacted and unused files #5, #6, #7 can't be deleted until the next compaction triggers deletion of obsolete files.

This diff includes the following changes to address the issue:
1) Don't hold a version during flush.
2) In the case of universal compaction, we don't actually need to hold the whole input version, so we only hold the input files and store some minimal information from the input version.
3) Instead of relying on `min_pending_output`, the utility classes `FileNumbersHolder` and `FileNumbersProvider` were implemented to track the exact set of pending output files, so that deletion of other unreferenced SST files is not blocked.

Test Plan:
- Jenkins.
- Long-running test with CassandraKeyValue workload.
- Use debug check and logs to make sure SST files are deleted no later than 1 second after they were compacted.
- Added unit tests for all 3 cases.

Reviewers: mikhail, venkatesh, amitanand, sergei

Reviewed By: sergei

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D5526
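The difference between the old `min_pending_output` rule and exact tracking of pending output files can be sketched as two filters over the unreferenced file numbers. The function names and the plain `std::set` are hypothetical; they model the idea behind `FileNumbersHolder`/`FileNumbersProvider`, not those classes' actual interface.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Old rule: an unreferenced file is kept if its number is >= min_pending_output,
// even when no job is actually writing it.
std::vector<std::uint64_t> DeletableWithMinPendingOutput(
    const std::vector<std::uint64_t>& unreferenced,
    std::uint64_t min_pending_output) {
  std::vector<std::uint64_t> result;
  for (std::uint64_t f : unreferenced) {
    if (f < min_pending_output) result.push_back(f);
  }
  return result;
}

// Exact-set rule (hypothetical model): only files that some job is still
// writing are kept; every other unreferenced file may be deleted.
std::vector<std::uint64_t> DeletableWithExactPendingSet(
    const std::vector<std::uint64_t>& unreferenced,
    const std::set<std::uint64_t>& pending_outputs) {
  std::vector<std::uint64_t> result;
  for (std::uint64_t f : unreferenced) {
    if (pending_outputs.count(f) == 0) result.push_back(f);
  }
  return result;
}
```

With a job still writing file #4 and files #5, #6, #7 already compacted and unreferenced (case 3 above), the first rule deletes nothing, while the second deletes #5, #6, #7 and keeps only #4.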
Bump up the cassandra-loader version to v0.0.27-yb-2 to support json
…ed to the earlier commit 864e72b. Original commit message: ENG-2793 Do not fail when deciding if we can flush an empty immutable memtable (summary, stack trace, and test plan identical to D4044 above).
… memtable (summary, stack trace, and test plan identical to D4044 above). Note: This commit provides additional functionality that is logically related to the earlier commit yugabyte@864e72b and supersedes the commit yugabyte@2932b0a
…up on macOS

Summary: Add a DNS lookup of the local hostname to postmaster startup to force the macOS network libraries to be initialized before any fork() calls happen. This fixes failures of ~20 tests in macOS debug mode. Without this, PostgreSQL backends would frequently crash with SIGSEGV and dump cores when trying to do the same DNS lookup. Here is a SIGSEGV stack trace that we previously got without this fix:

```
frame #0: 0x00007fff7bd53e34 libsystem_trace.dylib`_os_log_cmp_key + 4
frame #1: 0x00007fff7bbfcb74 libsystem_c.dylib`rb_tree_find_node + 53
frame #2: 0x00007fff7bd52021 libsystem_trace.dylib`os_log_create + 368
frame #3: 0x00007fff7bc5b127 libsystem_info.dylib`gai_log_init + 23
frame #4: 0x00007fff7bd37ce3 libsystem_pthread.dylib`__pthread_once_handler + 65
frame #5: 0x00007fff7bd2daab libsystem_platform.dylib`_os_once_callout + 18
frame #6: 0x00007fff7bd37c7f libsystem_pthread.dylib`pthread_once + 56
frame #7: 0x00007fff7bc5a4ab libsystem_info.dylib`gai_log + 27
frame #8: 0x00007fff7bc5b33f libsystem_info.dylib`_gai_load_libnetwork_once + 63
frame #9: 0x00007fff7bd37ce3 libsystem_pthread.dylib`__pthread_once_handler + 65
frame #10: 0x00007fff7bd2daab libsystem_platform.dylib`_os_once_callout + 18
frame #11: 0x00007fff7bd37c7f libsystem_pthread.dylib`pthread_once + 56
frame #12: 0x00007fff7bc5b29b libsystem_info.dylib`_gai_load_libnetwork + 27
frame #13: 0x00007fff7bc5b64f libsystem_info.dylib`_gai_nat64_v4_address_requires_synthesis + 31
frame #14: 0x00007fff7bc5aaa0 libsystem_info.dylib`_gai_nat64_second_pass + 512
frame #15: 0x00007fff7bc39847 libsystem_info.dylib`si_addrinfo + 1959
frame #16: 0x00007fff7bc38f77 libsystem_info.dylib`_getaddrinfo_internal + 231
frame #17: 0x00007fff7bc38e7d libsystem_info.dylib`getaddrinfo + 61
frame #18: 0x000000011512f8e5 libyb_util.dylib`yb::GetFQDN(hostname="...") at net_util.cc:371:20
```

Test Plan: Jenkins

Reviewers: mihnea, dmitry

Reviewed By: dmitry

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D7757
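A minimal POSIX sketch of the idea: resolve the local hostname once in the parent process, before any `fork()`, and discard the result. `WarmUpHostnameResolution` is a hypothetical helper; the backends above crashed inside `yb::GetFQDN`, but how the actual postmaster startup lookup is implemented is not shown here.

```cpp
#include <cassert>
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: resolve the local hostname once so the resolver
// libraries are initialized in the parent before any fork(). Returns false
// only if the hostname itself cannot be read.
bool WarmUpHostnameResolution() {
  char hostname[256] = {};
  // Leave room for a terminating NUL in case the name is truncated.
  if (gethostname(hostname, sizeof(hostname) - 1) != 0) return false;

  addrinfo hints = {};
  hints.ai_family = AF_UNSPEC;
  hints.ai_flags = AI_CANONNAME;  // also request the canonical (FQDN) name
  addrinfo* result = nullptr;
  if (getaddrinfo(hostname, nullptr, &hints, &result) == 0) {
    freeaddrinfo(result);
  }
  // A failed lookup is ignored: the goal is library initialization, not the answer.
  return true;
}
```

Such a call would go early in startup, before the server loop begins forking backends; later lookups in the children then find the libraries already initialized.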
…condition between insert and truncate; disable rocksdb flush on truncate

Summary:
- Prevent the RocksDB instance from calling the `ListenFilesChanged` callback after it has been detached from `Tablet`. Full stack available at #3288 (comment). Follow Up Work: #3476
- Prevent a race between `Tablet::AcquireLocksAndPerformDocOperations() -> Tablet::StartDocWriteOperation()` and `Tablet::Truncate()` by incrementing `pending_op_counter_`. Full stack available at #3288 (comment).
- Do not flush the RocksDB memtable when a user truncates a table, to prevent the following crash on flush:
  ```
  #0 0x000056371c706b00 in ?? ()
  #1 0x00007f5229291099 in std::__invoke_impl<void, void (yb::tablet::Tablet::*&)(), yb::tablet::Tablet*&> (__t=<optimized out>, __f=<optimized out>) at /usr/include/c++/7/bits/invoke.h:73
  #2 std::__invoke<void (yb::tablet::Tablet::*&)(), yb::tablet::Tablet*&> (__fn=<optimized out>) at /usr/include/c++/7/bits/invoke.h:95
  #3 std::_Bind<void (yb::tablet::Tablet::*(yb::tablet::Tablet*))()>::__call<void, , 0ul>(std::tuple<>&&, std::_Index_tuple<0ul>) (__args=..., this=<optimized out>) at /usr/include/c++/7/functional:467
  #4 std::_Bind<void (yb::tablet::Tablet::*(yb::tablet::Tablet*))()>::operator()<, void>() (this=<optimized out>) at /usr/include/c++/7/functional:551
  #5 std::_Function_handler<void (), std::_Bind<void (yb::tablet::Tablet::*(yb::tablet::Tablet*))()> >::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/7/bits/std_function.h:316
  #6 0x00007f5225aee92c in std::function<void ()>::operator()() const (this=0x56371c5534f0) at /usr/include/c++/7/bits/std_function.h:706
  #7 rocksdb::DBImpl::FilesChanged (this=this@entry=0x56371c552b00) at ../../src/yb/rocksdb/db/db_impl.cc:4359
  #8 0x00007f5225b14b1b in rocksdb::DBImpl::BackgroundCallFlush (this=this@entry=0x56371c552b00, cfd=cfd@entry=0x0) at ../../src/yb/rocksdb/db/db_impl.cc:3285
  ```
  Full stack available at #3288 (comment).
- WORKAROUND for a race condition (#3288 (comment)): use `ScopedPendingOperation` in `Tablet::ShouldApplyWrite()` to prevent it from seeing a null `regular_db_` due to a concurrent `Tablet::Truncate()`. Full stack available at #3288 (comment). Follow Up Work: #3477

Test Plan:

./yb_build.sh debug --cxx-test pg_libpq-test --gtest_filter PgLibPqTest.ConcurrentInsertTruncateForeignKey

./yb_build.sh debug --cxx-test snapshot-txn-test --gtest_filter SnapshotTxnTest.MultiWriteWithRestart

Reviewers: bogdan, sergei, mikhail

Reviewed By: mikhail

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7808
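The `pending_op_counter_` / `ScopedPendingOperation` fixes follow a common pattern: each operation registers itself through an RAII guard, and truncation first stops admitting new operations, then waits for in-flight ones to drain before touching the RocksDB instance. The sketch below is a simplified, hypothetical model of that pattern; the class names are illustrative rather than YugabyteDB's API, and whether the real counter rejects or blocks new operations is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical counter of in-flight operations on a tablet.
class PendingOpCounter {
 public:
  bool TryStart() {
    std::lock_guard<std::mutex> lock(mu_);
    if (disabled_) return false;  // truncate in progress: reject new work
    ++in_flight_;
    return true;
  }
  void Finish() {
    std::lock_guard<std::mutex> lock(mu_);
    assert(in_flight_ > 0);
    if (--in_flight_ == 0) drained_.notify_all();
  }
  // Reject new operations, then block until in-flight ones complete.
  void DisableAndWait() {
    std::unique_lock<std::mutex> lock(mu_);
    disabled_ = true;
    drained_.wait(lock, [this] { return in_flight_ == 0; });
  }
  void Enable() {
    std::lock_guard<std::mutex> lock(mu_);
    disabled_ = false;
  }

 private:
  std::mutex mu_;
  std::condition_variable drained_;
  int in_flight_ = 0;
  bool disabled_ = false;
};

// RAII guard in the spirit of ScopedPendingOperation.
class ScopedPendingOp {
 public:
  explicit ScopedPendingOp(PendingOpCounter& counter)
      : counter_(counter), ok_(counter.TryStart()) {}
  ~ScopedPendingOp() {
    if (ok_) counter_.Finish();
  }
  ScopedPendingOp(const ScopedPendingOp&) = delete;
  ScopedPendingOp& operator=(const ScopedPendingOp&) = delete;
  bool ok() const { return ok_; }

 private:
  PendingOpCounter& counter_;
  bool ok_;
};

// Demo: a "write" holds an operation for 50 ms; "truncate" must wait for it.
bool TruncateWaitsForInFlightWrite() {
  PendingOpCounter counter;
  std::atomic<bool> started{false};
  std::atomic<bool> write_done{false};
  std::thread writer([&] {
    ScopedPendingOp op(counter);
    started = true;
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    write_done = true;
  });
  while (!started) std::this_thread::yield();
  counter.DisableAndWait();  // a real Truncate() would run after this returns
  bool waited = write_done.load();
  writer.join();
  return waited;
}
```

A mutex plus condition variable keeps the sketch easy to follow; a production counter might instead use an atomic count with a separate wait mechanism to keep the hot path cheap.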
Summary: Fix a lock-order inversion between building the YQL system.partitions vtable and deleting tables.

One thread deleting tables in the catalog manager (CM) holds the CM's `mutex_` and is blocked acquiring the YQL partitions vtable `mutex_`:

```
#5 std::lock_guard<std::shared_timed_mutex>::lock_guard (__m=..., this=<synthetic pointer>) at /cases/home/yugabyte/yb-software/yugabyte-2.12.5.0/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/mutex:386
#6 yb::master::YQLPartitionsVTable::RemoveFromCache (this=0x182091e0, table_id=...) at ../../src/yb/master/yql_partitions_vtable.cc:262
#7 0x00007f384c0931e5 in yb::master::CatalogManager::DeleteTableInternal (this=this@entry=0x275a000, req=req@entry=0x225a2138, resp=resp@entry=0x225a2160, rpc=rpc@entry=0x7f3815561cb0) at ../../src/yb/master/catalog_manager.cc:4831
```

Meanwhile, another thread rebuilding the vtable holds the YQL partitions vtable lock and is waiting on the CM `mutex_`:

```
#4 yb::NonRecursiveSharedLock<yb::rw_spinlock>::NonRecursiveSharedLock (this=0x7f3767405da0, mutex=...) at ../../src/yb/util/debug/lock_debug.h:41
#5 0x00007f384c04a4b1 in yb::master::CatalogManager::GetTables (this=0x275a000, mode=yb::master::GetTablesMode::kVisibleToClient) at ../../src/yb/master/catalog_manager.cc:5885
#6 0x00007f384c2148f2 in yb::master::YQLPartitionsVTable::GenerateAndCacheData (this=0x182091e0) at ../../src/yb/master/yql_partitions_vtable.cc:140
#7 0x00007f384c03c35d in yb::master::CatalogManager::RebuildYQLSystemPartitions (this=0x275a000) at ../../src/yb/master/catalog_manager.cc:10486
```

For now, we keep the default of generating the partitions cache on changes, because of CREATE TABLE performance concerns with the purely background-thread approach.

Test Plan:
```
ybd tsan --gtest_filter CppCassandraDriverTest.YQLPartitionsVtableCacheRefresh
```

Reviewers: bogdan, sergei, asrivastava

Reviewed By: asrivastava

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D22943
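One generic way to break this kind of cycle is to make the lock order one-directional: the catalog lock may be held while taking the cache lock, but never the reverse. In the hypothetical model below, the rebuild copies the table list under the catalog lock, releases it, and only then locks the cache. This illustrates the technique; it is not necessarily how this commit fixed the inversion, and the class names are stand-ins for `CatalogManager` and `YQLPartitionsVTable`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <vector>

// Hypothetical cache standing in for the system.partitions vtable cache.
class PartitionsCache {
 public:
  void Remove(const std::string& table_id) {
    std::lock_guard<std::shared_mutex> lock(mutex_);
    entries_.erase(table_id);
  }
  // Rebuild from a snapshot the caller took *before* this lock is acquired.
  void Rebuild(const std::vector<std::string>& tables) {
    std::lock_guard<std::shared_mutex> lock(mutex_);
    entries_.clear();
    for (const auto& t : tables) entries_[t] = "partitions of " + t;
  }
  std::size_t Size() {
    std::shared_lock<std::shared_mutex> lock(mutex_);
    return entries_.size();
  }

 private:
  std::shared_mutex mutex_;
  std::map<std::string, std::string> entries_;
};

// Hypothetical catalog. The only permitted lock order: catalog -> cache.
class Catalog {
 public:
  explicit Catalog(PartitionsCache* cache) : cache_(cache) {}

  void AddTable(const std::string& id) {
    std::lock_guard<std::shared_mutex> lock(mutex_);
    tables_.push_back(id);
  }
  void DeleteTable(const std::string& id) {
    std::lock_guard<std::shared_mutex> lock(mutex_);
    tables_.erase(std::remove(tables_.begin(), tables_.end(), id), tables_.end());
    cache_->Remove(id);  // catalog -> cache: the permitted direction
  }
  std::vector<std::string> GetTables() {
    std::shared_lock<std::shared_mutex> lock(mutex_);
    return tables_;
  }
  // Snapshot under the catalog lock, release it, then lock the cache, so the
  // cache lock is never held while waiting for the catalog lock.
  void RebuildPartitionsCache() {
    std::vector<std::string> snapshot = GetTables();
    cache_->Rebuild(snapshot);
  }

 private:
  std::shared_mutex mutex_;
  std::vector<std::string> tables_;
  PartitionsCache* cache_;
};
```

The trade-off is staleness: a table deleted between the snapshot and `Rebuild()` can briefly reappear in the cache, so a real implementation needs some way to detect or repair that.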
Summary: This commit fixes a memory leak in the pg_isolation_regress test suite detected by ASAN:

```
==18113==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 2048 byte(s) in 1 object(s) allocated from:
    #0 0x55f21c2846d6 in realloc /opt/yb-build/llvm/yb-llvm-v15.0.3-yb-1-1667030060-0b8d1183-almalinux8-x86_64-build/src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:85:3
    #1 0x55f21c2cb0a9 in pg_realloc ${YB_SRC_ROOT}/src/postgres/src/common/../../../../../src/postgres/src/common/fe_memutils.c:72:8
    #2 0x55f21c2c27b2 in addlitchar ${YB_SRC_ROOT}/src/postgres/src/test/isolation/specscanner.l:116:12
    #3 0x55f21c2c27b2 in spec_yylex ${YB_SRC_ROOT}/src/postgres/src/test/isolation/specscanner.l:90:6
    #4 0x55f21c2be95f in spec_yyparse ${YB_SRC_ROOT}/src/postgres/src/test/isolation/specparse.c:1190:16
    #5 0x55f21c2c5492 in main ${YB_SRC_ROOT}/src/postgres/src/test/isolation/../../../../../../src/postgres/src/test/isolation/isolationtester.c:116:2
    #6 0x7f027726ed84 in __libc_start_main (/lib64/libc.so.6+0x3ad84) (BuildId: d18afae5244bc9c85026bd7d64b276d51b452d93)

Objects leaked above:
0x61d000000080 (2048 bytes)

SUMMARY: AddressSanitizer: 2048 byte(s) leaked in 1 allocation(s).
```

The leak is caused by the litbuf variable in specscanner.l, which is reallocated in the addlitchar function but never freed. It leaks in two ways:
1. litbuf is allocated multiple times without the previous allocation being freed.
2. litbuf is not released after spec_yyparse() has run.

To resolve these issues, the following changes have been made:
1. litbuf is now allocated only when it is NULL, which prevents repeated allocations and lets the buffer be reused.
2. A new spec_scanner_finish() function frees litbuf after spec_yyparse() has run.

Test Plan: Run pg_isolation_regress to confirm that the memory leak is resolved:

./yb_build.sh asan --java-test 'org.yb.pgsql.TestPgWithoutWaitQueuesIsolationRegress' -n 100 --tp 1

Reviewers: bogdan, pjain

Reviewed By: pjain

Subscribers: smishra, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D23912
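The two changes can be sketched as follows, written as C-style C++ to mirror the C scanner. `lit_buf`, `lit_buf_init`, `add_lit_char`, and `lit_buf_finish` are hypothetical, simplified stand-ins for litbuf, addlitchar, and spec_scanner_finish() in specscanner.l (the real scanner is generated by flex and differs in detail): the buffer is allocated only when it is NULL and reused afterwards, and a finish function frees it once parsing is done.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-ins for the scanner's literal buffer state.
static char* lit_buf = nullptr;
static std::size_t lit_buf_size = 0;
static std::size_t lit_buf_len = 0;

// Called before each scan: allocate only if the buffer does not exist yet,
// otherwise just reset the length so the existing memory is reused.
void lit_buf_init() {
  if (lit_buf == nullptr) {
    lit_buf_size = 1024;
    lit_buf = static_cast<char*>(std::malloc(lit_buf_size));
    if (lit_buf == nullptr) std::abort();  // treat out-of-memory as fatal
  }
  lit_buf_len = 0;
  lit_buf[0] = '\0';
}

// Append one character, doubling the buffer when it is full.
void add_lit_char(char c) {
  assert(lit_buf != nullptr);  // lit_buf_init() must have been called
  if (lit_buf_len + 1 >= lit_buf_size) {  // keep room for the trailing NUL
    char* grown = static_cast<char*>(std::realloc(lit_buf, lit_buf_size * 2));
    if (grown == nullptr) std::abort();
    lit_buf = grown;
    lit_buf_size *= 2;
  }
  lit_buf[lit_buf_len++] = c;
  lit_buf[lit_buf_len] = '\0';
}

// Counterpart of spec_scanner_finish(): free the buffer once parsing is done.
void lit_buf_finish() {
  std::free(lit_buf);
  lit_buf = nullptr;
  lit_buf_size = 0;
  lit_buf_len = 0;
}
```

Calling `lit_buf_init()` a second time keeps the same allocation, and `lit_buf_finish()` after the parse leaves nothing for LeakSanitizer to report.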
Summary: TSAN detects this possible deadlock in PgWrapper::Supervisor. Minimal stack trace:

```
WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=8513)
[ts-2] Cycle in lock order graph: M0 (0x7b5400002a48) => M1 (0x7b1800051f18) => M0
Mutex M1 acquired here while holding mutex M0 in main thread:
[ts-2] #0 pthread_mutex_lock /opt/yb-build/llvm/yb-llvm-v15.0.3-yb-1-1667341687-0b8d1183-centos7-x86_64-build/src/llvm-project/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_common_interceptors.inc:4457:3 (yb-tserver+0xa4cfa)
[ts-2] #3 yb::FlagsCallbackRegistry::RegisterCallback(void const*, string const&, std::function<void ()>) ${BUILD_ROOT}/../../src/yb/util/flags/flags_callback.cc:84:19 (libyb_util.so+0x24dcd9)
[ts-2] #4 yb::RegisterFlagUpdateCallback(void const*, string const&, std::function<void ()>) ${BUILD_ROOT}/../../src/yb/util/flags/flags_callback.cc:184:24 (libyb_util.so+0x24f8c3)
[ts-2] #5 yb::pgwrapper::PgSupervisor::RegisterReloadPgConfigCallback(void const*) ${BUILD_ROOT}/../../src/yb/yql/pgwrapper/pg_wrapper.cc:904:32 (libyb_pgwrapper.so+0x2f2f4)
[ts-2] #6 yb::pgwrapper::PgSupervisor::RegisterPgFlagChangeNotifications() ${BUILD_ROOT}/../../src/yb/yql/pgwrapper/pg_wrapper.cc:919:7 (libyb_pgwrapper.so+0x2e69e)
[ts-2] #7 yb::pgwrapper::PgSupervisor::Start() ${BUILD_ROOT}/../../src/yb/yql/pgwrapper/pg_wrapper.cc:749:3 (libyb_pgwrapper.so+0x2ce05)
Mutex M0 previously acquired by the same thread here:
[ts-2] #0 pthread_mutex_lock /opt/yb-build/llvm/yb-llvm-v15.0.3-yb-1-1667341687-0b8d1183-centos7-x86_64-build/src/llvm-project/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_common_interceptors.inc:4457:3 (yb-tserver+0xa4cfa)
[ts-2] #3 yb::pgwrapper::PgSupervisor::Start() ${BUILD_ROOT}/../../src/yb/yql/pgwrapper/pg_wrapper.cc:746:31 (libyb_pgwrapper.so+0x2cdaf)
Mutex M0 acquired here while holding mutex M1 in thread T66:
[ts-2] #0 pthread_mutex_lock /opt/yb-build/llvm/yb-llvm-v15.0.3-yb-1-1667341687-0b8d1183-centos7-x86_64-build/src/llvm-project/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_common_interceptors.inc:4457:3 (yb-tserver+0xa4cfa)
[ts-2] #3 yb::pgwrapper::PgSupervisor::UpdateAndReloadConfig() ${BUILD_ROOT}/../../src/yb/yql/pgwrapper/pg_wrapper.cc:894:31 (libyb_pgwrapper.so+0x2f1b7)
```

TSAN detects that two functions, Start() and UpdateAndReloadConfig(), acquire M0 and M1 in opposite orders, which could lead to a deadlock. However, Start() is always called first and acquires M0 and M1 before it registers the callback that invokes UpdateAndReloadConfig(), and Start() is never called again. Thus the deadlock reported by TSAN is not possible.

Jira: DB-5450

Test Plan: Tested with TSAN build in Detective runs of D22913

Reviewers: hsunder, mbautin

Reviewed By: hsunder

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D25311
Summary: The YB Seq Scan code path is not hit because Foreign Scan is the default and pg_hint_plan does not work. The upcoming merge with YB master will bring in master commit 465ee2c, which changes the default to YB Seq Scan. To test YB Seq Scan, a temporary patch is needed (see the test plan). With that patch applied, two bugs surface; this commit fixes them.

1. FailedAssertion("TTS_IS_VIRTUAL(slot)")

On the simple test case

create table t (i int primary key, j int);
select * from t;

we get:

TRAP: FailedAssertion("TTS_IS_VIRTUAL(slot)", File: "../../../../../../../src/postgres/src/backend/access/yb_access/yb_scan.c", Line: 3473, PID: 2774450)

Details:

#0 0x00007fd52616eacf in raise () from /lib64/libc.so.6
#1 0x00007fd526141ea5 in abort () from /lib64/libc.so.6
#2 0x0000000000af33ad in ExceptionalCondition (conditionName=conditionName@entry=0xc2938d "TTS_IS_VIRTUAL(slot)", errorType=errorType@entry=0xc01498 "FailedAssertion", fileName=fileName@entry=0xc28f18 "../../../../../../../src/postgres/src/backend/access/yb_access/yb_scan.c", lineNumber=lineNumber@entry=3473) at ../../../../../../../src/postgres/src/backend/utils/error/assert.c:69
#3 0x00000000005c26bd in ybFetchNext (handle=0x2600ffc43680, slot=slot@entry=0x2600ff6c2980, relid=16384) at ../../../../../../../src/postgres/src/backend/access/yb_access/yb_scan.c:3473
#4 0x00000000007de444 in YbSeqNext (node=0x2600ff6c2778) at ../../../../../../src/postgres/src/backend/executor/nodeYbSeqscan.c:156
#5 0x000000000078b3c6 in ExecScanFetch (node=node@entry=0x2600ff6c2778, accessMtd=accessMtd@entry=0x7de2b9 <YbSeqNext>, recheckMtd=recheckMtd@entry=0x7de26e <YbSeqRecheck>) at ../../../../../../src/postgres/src/backend/executor/execScan.c:133
#6 0x000000000078b44e in ExecScan (node=0x2600ff6c2778, accessMtd=accessMtd@entry=0x7de2b9 <YbSeqNext>, recheckMtd=recheckMtd@entry=0x7de26e <YbSeqRecheck>) at ../../../../../../src/postgres/src/backend/executor/execScan.c:182
#7 0x00000000007de298 in ExecYbSeqScan (pstate=<optimized out>) at ../../../../../../src/postgres/src/backend/executor/nodeYbSeqscan.c:191
#8 0x00000000007871ef in ExecProcNodeFirst (node=0x2600ff6c2778) at ../../../../../../src/postgres/src/backend/executor/execProcnode.c:480
#9 0x000000000077db0e in ExecProcNode (node=0x2600ff6c2778) at ../../../../../../src/postgres/src/include/executor/executor.h:285
#10 ExecutePlan (execute_once=<optimized out>, dest=0x2600ff6b1a10, direction=<optimized out>, numberTuples=0, sendTuples=true, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2600ff6c2778, estate=0x2600ff6c2128) at ../../../../../../src/postgres/src/backend/executor/execMain.c:1650
#11 standard_ExecutorRun (queryDesc=0x2600ff675128, direction=<optimized out>, count=0, execute_once=<optimized out>) at ../../../../../../src/postgres/src/backend/executor/execMain.c:367
#12 0x000000000077dbfe in ExecutorRun (queryDesc=queryDesc@entry=0x2600ff675128, direction=direction@entry=ForwardScanDirection, count=count@entry=0, execute_once=<optimized out>) at ../../../../../../src/postgres/src/backend/executor/execMain.c:308
#13 0x0000000000982617 in PortalRunSelect (portal=portal@entry=0x2600ff90e128, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x2600ff6b1a10) at ../../../../../../src/postgres/src/backend/tcop/pquery.c:954
#14 0x000000000098433c in PortalRun (portal=portal@entry=0x2600ff90e128, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x2600ff6b1a10, altdest=altdest@entry=0x2600ff6b1a10, qc=0x7fffc14a13c0) at ../../../../../../src/postgres/src/backend/tcop/pquery.c:786
#15 0x000000000097e65b in exec_simple_query (query_string=0x2600ffdc6128 "select * from t;") at ../../../../../../src/postgres/src/backend/tcop/postgres.c:1321
#16 yb_exec_simple_query_impl (query_string=query_string@entry=0x2600ffdc6128) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5060
#17 0x000000000097b7a5 in yb_exec_query_wrapper_one_attempt (exec_context=exec_context@entry=0x2600ffdc6000, restart_data=restart_data@entry=0x7fffc14a1640, functor=functor@entry=0x97e033 <yb_exec_simple_query_impl>, functor_context=functor_context@entry=0x2600ffdc6128, attempt=attempt@entry=0, retry=retry@entry=0x7fffc14a15ff) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5028
#18 0x000000000097d077 in yb_exec_query_wrapper (exec_context=exec_context@entry=0x2600ffdc6000, restart_data=restart_data@entry=0x7fffc14a1640, functor=functor@entry=0x97e033 <yb_exec_simple_query_impl>, functor_context=functor_context@entry=0x2600ffdc6128) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5052
#19 0x000000000097d0ca in yb_exec_simple_query (query_string=query_string@entry=0x2600ffdc6128 "select * from t;", exec_context=exec_context@entry=0x2600ffdc6000) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5075
#20 0x000000000097fe8a in PostgresMain (dbname=<optimized out>, username=<optimized out>) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5794
#21 0x00000000008c8354 in BackendRun (port=0x2600ff8423c0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4791
#22 BackendStartup (port=0x2600ff8423c0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4491
#23 ServerLoop () at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1878
#24 0x00000000008caa55 in PostmasterMain (argc=argc@entry=25, argv=argv@entry=0x2600ffdc01a0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1533
#25 0x0000000000804ba8 in PostgresServerProcessMain (argc=25, argv=0x2600ffdc01a0) at ../../../../../../src/postgres/src/backend/main/main.c:208
#26 0x0000000000804bc8 in main ()
3469 ybFetchNext(YBCPgStatement handle,
3470 TupleTableSlot *slot, Oid relid)
3471 {
3472 Assert(slot != NULL);
3473 Assert(TTS_IS_VIRTUAL(slot));
(gdb) p *slot
$2 = {type = T_TupleTableSlot, tts_flags = 18, tts_nvalid = 0,
tts_ops = 0xeaf5e0 <TTSOpsHeapTuple>, tts_tupleDescriptor = 0x2600ff6416c0, tts_values = 0x2600ff6c2a00, tts_isnull = 0x2600ff6c2a10, tts_mcxt = 0x2600ff6c2000, tts_tid = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 0, yb_item = {ybctid = 0}}, tts_tableOid = 0, tts_yb_insert_oid = 0}

Fix this by making YB Seq Scan always use a virtual slot, similar to what is done for YB Foreign Scan.

2. Segfault when ending the scan

The same simple test case segfaults at a later stage. Details: #0 0x00000000007de762 in table_endscan (scan=0x3debfe3ab88) at ../../../../../../src/postgres/src/include/access/tableam.h:997 #1 ExecEndYbSeqScan (node=node@entry=0x3debfe3a778) at ../../../../../../src/postgres/src/backend/executor/nodeYbSeqscan.c:298 #2 0x0000000000787a75 in ExecEndNode (node=0x3debfe3a778) at ../../../../../../src/postgres/src/backend/executor/execProcnode.c:649 #3 0x000000000077ffaf in ExecEndPlan (estate=0x3debfe3a128, planstate=<optimized out>) at ../../../../../../src/postgres/src/backend/executor/execMain.c:1489 #4 standard_ExecutorEnd (queryDesc=0x2582fdc88928) at ../../../../../../src/postgres/src/backend/executor/execMain.c:503 #5 0x00000000007800f8 in ExecutorEnd (queryDesc=queryDesc@entry=0x2582fdc88928) at ../../../../../../src/postgres/src/backend/executor/execMain.c:474 #6 0x00000000006f140c in PortalCleanup (portal=0x2582ff900128) at ../../../../../../src/postgres/src/backend/commands/portalcmds.c:305 #7 0x0000000000b3c36a in PortalDrop (portal=portal@entry=0x2582ff900128, isTopCommit=isTopCommit@entry=false) at ../../../../../../../src/postgres/src/backend/utils/mmgr/portalmem.c:514 #8 0x000000000097e667 in exec_simple_query (query_string=0x2582ffdc6128 "select * from t;") at ../../../../../../src/postgres/src/backend/tcop/postgres.c:1331 #9 yb_exec_simple_query_impl (query_string=query_string@entry=0x2582ffdc6128) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5060 #10 0x000000000097b79a in yb_exec_query_wrapper_one_attempt
(exec_context=exec_context@entry=0x2582ffdc6000, restart_data=restart_data@entry=0x7ffc81c0e7d0, functor=functor@entry=0x97e028 <yb_exec_simple_query_impl>, functor_context=functor_context@entry=0x2582ffdc6128, attempt=attempt@entry=0, retry=retry@entry=0x7ffc81c0e78f) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5028 #11 0x000000000097d06c in yb_exec_query_wrapper (exec_context=exec_context@entry=0x2582ffdc6000, restart_data=restart_data@entry=0x7ffc81c0e7d0, functor=functor@entry=0x97e028 <yb_exec_simple_query_impl>, functor_context=functor_context@entry=0x2582ffdc6128) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5052 #12 0x000000000097d0bf in yb_exec_simple_query (query_string=query_string@entry=0x2582ffdc6128 "select * from t;", exec_context=exec_context@entry=0x2582ffdc6000) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5075 #13 0x000000000097fe7f in PostgresMain (dbname=<optimized out>, username=<optimized out>) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5794 #14 0x00000000008c8349 in BackendRun (port=0x2582ff8403c0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4791 #15 BackendStartup (port=0x2582ff8403c0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4491 #16 ServerLoop () at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1878 #17 0x00000000008caa4a in PostmasterMain (argc=argc@entry=25, argv=argv@entry=0x2582ffdc01a0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1533 #18 0x0000000000804b9d in PostgresServerProcessMain (argc=25, argv=0x2582ffdc01a0) at ../../../../../../src/postgres/src/backend/main/main.c:208 #19 0x0000000000804bbd in main ()
294 /*
295 * close heap scan
296 */
297 if (tsdesc != NULL)
298 table_endscan(tsdesc);
The reason is that the initial merge 55782d5 incorrectly merged the end of ExecEndYbSeqScan.
Upstream PG commit 9ddef36278a9f676c07d0b4d9f33fa22e48ce3b5 removes that code, but the initial merge duplicated the lines. Remove those lines.

Test Plan: Apply the following patch to activate YB Seq Scan:

diff --git a/src/postgres/src/backend/optimizer/path/allpaths.c b/src/postgres/src/backend/optimizer/path/allpaths.c
index 8a4c38a965..854d84a648 100644
--- a/src/postgres/src/backend/optimizer/path/allpaths.c
+++ b/src/postgres/src/backend/optimizer/path/allpaths.c
@@ -576,7 +576,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
     else
     {
         /* Plain relation */
-        if (IsYBRelationById(rte->relid))
+        if (false)
         {
             /*
              * Using a foreign scan which will use the YB FDW by

On almalinux 8,

./yb_build.sh fastdebug --gcc11
pg15_tests/run_all_tests.sh fastdebug --gcc11 --sj --sp --scb

fails the following tests:

- test_D29546
- test_pg15_regress: yb_pg15
- test_types_geo: yb_pg_box
- test_hash_in_queries: yb_hash_in_queries

Manually check that these failures are due to YB Seq Scan EXPLAIN output differences.

Reviewers: aagrawal, tfoucher
Reviewed By: tfoucher
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D31139
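To make the first failure concrete: in PostgreSQL, `TTS_IS_VIRTUAL(slot)` compares `slot->tts_ops` against `&TTSOpsVirtual` by pointer identity, and the gdb dump of `*slot` shows the YB Seq Scan slot was created with `TTSOpsHeapTuple`, so the assertion in `ybFetchNext` cannot hold. The toy model below reproduces only that check; `ToySlotOps`, `ToySlot`, `ToyIsVirtual` and `MakeScanSlot` are hypothetical stand-ins, not PostgreSQL's API, and the sketch assumes the fix amounts to requesting virtual ops when the scan slot is initialized.

```cpp
#include <cassert>

// Toy stand-ins for PostgreSQL's TupleTableSlotOps tables. These names are
// hypothetical; only the shape of the check mirrors TTS_IS_VIRTUAL.
struct ToySlotOps {
  const char* name;
};
const ToySlotOps kToyOpsVirtual{"virtual"};
const ToySlotOps kToyOpsHeapTuple{"heap tuple"};

// A slot only needs its ops pointer for this illustration.
struct ToySlot {
  const ToySlotOps* tts_ops;
};

// Like TTS_IS_VIRTUAL: an identity comparison of the ops pointer, so a slot
// built with any other ops table fails the check.
bool ToyIsVirtual(const ToySlot& slot) {
  return slot.tts_ops == &kToyOpsVirtual;
}

// Hypothetical helper modelling slot creation for the scan node: before the
// fix the slot got heap-tuple ops (as in the gdb dump); after the fix YB Seq
// Scan always requests virtual ops.
ToySlot MakeScanSlot(bool always_virtual) {
  return ToySlot{always_virtual ? &kToyOpsVirtual : &kToyOpsHeapTuple};
}
```

In this model, `MakeScanSlot(false)` fails `ToyIsVirtual` just as the TRAP did, while `MakeScanSlot(true)` passes.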
…ction Summary: There are several unit tests that suffer from a TSAN data race warning with the following stack: ``` WARNING: ThreadSanitizer: data race (pid=38656) Read of size 8 at 0x7f6f2a44b038 by thread T21: #0 memcpy /opt/yb-build/llvm/yb-llvm-v17.0.2-yb-1-1696896765-6a83e4b2-almalinux8-x86_64-build/src/llvm-project/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:115:5 (pg_ddl_concurrency-test+0x9e197) #1 <null> <null> (libnss_sss.so.2+0x72ef) (BuildId: a17afeaa37369696ec2457ab7a311139707fca9b) #2 pqGetpwuid ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/thread.c:99:9 (libpq.so.5+0x4a8c9) #3 pqGetHomeDirectory ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:6674:9 (libpq.so.5+0x2d3c7) #4 connectOptions2 ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:1150:8 (libpq.so.5+0x2d3c7) #5 PQconnectStart ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:791:7 (libpq.so.5+0x2c2fe) #6 PQconnectdb ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:647:20 (libpq.so.5+0x2c279) #7 yb::pgwrapper::PGConn::Connect(string const&, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>, bool, string const&) ${BUILD_ROOT}/../../src/yb/yql/pgwrapper/libpq_utils.cc:278:24 (libpq_utils.so+0x11d6b) ...
Previous write of size 8 at 0x7f6f2a44b038 by thread T20 (mutexes: write M0): #0 mmap64 /opt/yb-build/llvm/yb-llvm-v17.0.2-yb-1-1696896765-6a83e4b2-almalinux8-x86_64-build/src/llvm-project/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_common_interceptors.inc:7485:3 (pg_ddl_concurrency-test+0xda204) #1 <null> <null> (libnss_sss.so.2+0x7169) (BuildId: a17afeaa37369696ec2457ab7a311139707fca9b) #2 pqGetpwuid ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/thread.c:99:9 (libpq.so.5+0x4a8c9) #3 pqGetHomeDirectory ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:6674:9 (libpq.so.5+0x2d3c7) #4 connectOptions2 ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:1150:8 (libpq.so.5+0x2d3c7) #5 PQconnectStart ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:791:7 (libpq.so.5+0x2c2fe) #6 PQconnectdb ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:647:20 (libpq.so.5+0x2c279) #7 yb::pgwrapper::PGConn::Connect(string const&, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>, bool, string const&) ${BUILD_ROOT}/../../src/yb/yql/pgwrapper/libpq_utils.cc:278:24 (libpq_utils.so+0x11d6b) ... Location is global '??' 
at 0x7f6f2a44b000 (passwd+0x38) Mutex M0 (0x7f6f2af29380) created at: #0 pthread_mutex_lock /opt/yb-build/llvm/yb-llvm-v17.0.2-yb-1-1696896765-6a83e4b2-almalinux8-x86_64-build/src/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1339:3 (pg_ddl_concurrency-test+0xa464b) #1 <null> <null> (libnss_sss.so.2+0x70d6) (BuildId: a17afeaa37369696ec2457ab7a311139707fca9b) #2 pqGetpwuid ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/thread.c:99:9 (libpq.so.5+0x4a8c9) #3 pqGetHomeDirectory ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:6674:9 (libpq.so.5+0x2d3c7) #4 connectOptions2 ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:1150:8 (libpq.so.5+0x2d3c7) #5 PQconnectStart ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:791:7 (libpq.so.5+0x2c2fe) ... ``` All of the failing tests have a common feature: each of them creates connections to postgres from multiple threads at the same time. When creating a new connection, the `libpq` library internally calls the standard `getpwuid_r` function. This function is thread safe, so a TSAN warning is not expected there. The solution is to suppress the warning in the `getpwuid_r` function. **Note:** because the `getpwuid_r` function name does not appear in the TSAN warning stack, the warning is suppressed for its caller, `pqGetpwuid`. Jira: DB-9523 Test Plan: Jenkins Reviewers: sergei, bogdan Reviewed By: sergei Subscribers: yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D31646
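One way to express such a suppression is ThreadSanitizer's `__tsan_default_suppressions` hook, which returns rules in the same `race:<function>` format as a TSAN suppressions file. This is a sketch under an assumption: the summary does not say whether the diff uses this hook or the project's suppressions file. The rule names `pqGetpwuid` because, as noted above, `getpwuid_r` itself is absent from the reported stack.

```cpp
#include <cassert>
#include <cstring>

// ThreadSanitizer looks up this weak hook when the program is built with
// -fsanitize=thread and parses the returned text like a suppressions file.
// In a non-TSAN build it is just an ordinary function that nothing calls.
extern "C" const char* __tsan_default_suppressions() {
  // Suppress race reports whose stack includes pqGetpwuid, the libpq caller
  // of the thread-safe getpwuid_r (which does not appear in the stack).
  return "race:pqGetpwuid\n";
}
```

Keeping the rule scoped to a single libpq function, rather than to the whole `libnss_sss.so.2` module, limits how many unrelated races the suppression could hide.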
…retained for CDC" Summary: D33131 introduced a segmentation fault which was identified in multiple tests. ``` * thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV * frame #0: 0x00007f4d2b6f3a84 libpthread.so.0`__pthread_mutex_lock + 4 frame #1: 0x000055d6d1e1190b yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>) const [inlined] std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:v170002](this=0x00007f4ccb6feaa0, __m=0x0000000000000110) at unique_lock.h:41:11 frame #2: 0x000055d6d1e118f5 yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(this=0x00000000000000f0, min_allowed=<unavailable>, deadline=yb::CoarseTimePoint @ 0x00007f4ccb6feb08) const at mvcc.cc:500:32 frame #3: 0x000055d6d1ef58e3 yb-tserver`yb::tablet::TransactionParticipant::Impl::ProcessRemoveQueueUnlocked(this=0x000037e27d26fb00, min_running_notifier=0x00007f4ccb6fef28) at transaction_participant.cc:1537:45 frame #4: 0x000055d6d1efc11a yb-tserver`yb::tablet::TransactionParticipant::Impl::EnqueueRemoveUnlocked(this=0x000037e27d26fb00, id=<unavailable>, reason=<unavailable>, min_running_notifier=0x00007f4ccb6fef28, expected_deadlock_status=<unavailable>) at transaction_participant.cc:1516:5 frame #5: 0x000055d6d1e3afbe yb-tserver`yb::tablet::RunningTransaction::DoStatusReceived(this=0x000037e2679b5218, status_tablet="d5922c26c9704f298d6812aff8f615f6", status=<unavailable>, response=<unavailable>, serial_no=56986, shared_self=std::__1::shared_ptr<yb::tablet::RunningTransaction>::element_type @ 0x000037e2679b5218) at running_transaction.cc:424:16 frame #6: 0x000055d6d0d7db5f yb-tserver`yb::client::(anonymous namespace)::TransactionRpcBase::Finished(this=0x000037e29c80b420, status=<unavailable>) at transaction_rpc.cc:67:7 ``` This diff reverts the change to unblock the tests. 
The proper fix for this problem is a work in progress. Jira: DB-10780, DB-10466 Test Plan: Jenkins: urgent Reviewers: rthallam Reviewed By: rthallam Subscribers: ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D34245
…23065) * initial commit for logical replication docs * title changes * changes to view table * fixed line break * fixed line break * added content for delete and update * added more content * replaced hyperlink todos with reminders * added snapshot metrics * added more content * added more config properties to docs * added more config properties to docs * added more config properties to docs * replaced postgresql instances with yugabytedb * added properties * added complete properties * changed postgresql to yugabytedb * added example for all record types * fixed highlighting of table header * added type representations * added type representations * full content in now; * full content in now; * changed postgres references appropriately * added a missing keyword * changed name * self review comments * self review comments * added section for logical replication * added section for logical replication * modified content for monitor page * added content for monitoring * rebased to master; * CDC logical replication overview (#3) Co-authored-by: Vaibhav Kushwaha <34186745+vaibhav-yb@users.noreply.github.com> * advanced-topic (#5) Co-authored-by: Vaibhav Kushwaha <34186745+vaibhav-yb@users.noreply.github.com> * removed references to incremental and ad-hoc snapshots * replaced index page with an empty one * addressed review comments * added getting started section * added section for get started * self review comments * self review comments * group review comments * added hstore and domain type docs * Advance configurations for CDC using logical replication (#2) * Fix overview section (#7) * Monitor section (#4) Co-authored-by: Vaibhav Kushwaha <34186745+vaibhav-yb@users.noreply.github.com> * Initial Snapshot content (#6) * Add getting started (#1) * Fix for broken note (#9) * Fix the issue yaml parsing Summary: Fixes the issue yaml parsing. We changed the formatting for yaml list. This diff fixes the usage for the same. Test Plan: Prepared alma9 node using ynp. 
Verified universe creation. Reviewers: vbansal, asharma Reviewed By: asharma Subscribers: yugaware Differential Revision: https://phorge.dev.yugabyte.com/D36711 * [PLAT-14534] Add regex match for GCP Instance template Summary: Added regex match for the GCP instance template. The regex is taken from the GCP documentation [[https://cloud.google.com/compute/docs/reference/rest/v1/instanceTemplates | here]]. Test Plan: Tested manually that validation fails with invalid characters. Reviewers: #yba-api-review!, svarshney Reviewed By: svarshney Subscribers: yugaware Differential Revision: https://phorge.dev.yugabyte.com/D36543 * update diagram (#23245) * [/PLAT-14708] Fix JSON field name in TaskInfo query Summary: This was missed when task params were moved out of the details field. Test Plan: Trivial - existing tests should succeed. Reviewers: vbansal, cwang Reviewed By: vbansal Subscribers: yugaware Differential Revision: https://phorge.dev.yugabyte.com/D36705 * [#23173] DocDB: Allow large bytes to be passed to RateLimiter Summary: RateLimiter has a debug assert that prevents a `Request` of more than `GetSingleBurstBytes`. In release mode this check is not performed, and any such call gets stuck forever. This change allows large byte counts to be requested from RateLimiter. It does so by breaking requests larger than `GetSingleBurstBytes` into multiple smaller requests. This change is a temporary fix to allow xCluster to operate without any issues. RocksDB's RateLimiter has received multiple enhancements over the years that would help avoid this and other starvation issues (for example, facebook/rocksdb@cb2476a). We should consider pulling in those changes.
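The request splitting described for #23173 can be sketched as a loop that never asks the limiter for more than one burst at a time. This is a minimal sketch, not the actual RocksDB or YugabyteDB code: `RequestLarge` and the `request_chunk` callback (standing in for a single-burst `Request` call) are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical wrapper: splits a request larger than the limiter's single
// burst into burst-sized pieces. `request_chunk` stands in for a rate
// limiter call that must never be asked for more than `single_burst_bytes`.
void RequestLarge(int64_t bytes, int64_t single_burst_bytes,
                  const std::function<void(int64_t)>& request_chunk) {
  assert(single_burst_bytes > 0);  // otherwise the loop could never finish
  while (bytes > 0) {
    const int64_t chunk = std::min(bytes, single_burst_bytes);
    request_chunk(chunk);  // each piece waits for its own share of the rate
    bytes -= chunk;
  }
}
```

With `single_burst_bytes = 4`, a 10-byte request becomes three calls of 4, 4 and 2 bytes, so the caller is throttled to the configured rate instead of waiting on a request the limiter can never grant.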
Fixes #23173 Jira: DB-12112 Test Plan: RateLimiterTest.LargeRequests Reviewers: slingam Reviewed By: slingam Subscribers: ybase Differential Revision: https://phorge.dev.yugabyte.com/D36703 * [#23179] CDCSDK: Support data types with dynamically allotted oids in CDC Summary: This diff adds support for data types with dynamically allotted oids in CDC (for example, hstore, enum array, etc.). Such types have an invalid pg_type_oid for the corresponding columns in the DocDB schema. In the current implementation, in `ybc_pggate`, while decoding CDC records we look up `type_map_` to obtain the YBCPgTypeEntity, which is then used for decoding. However, `type_map_` does not contain any entries for data types with dynamically allotted oids, which causes a segmentation fault. To prevent such crashes, CDC currently prevents tables with such columns from being added to the stream. This diff removes the filtering logic and adds such tables to the stream even if they have a column of such a type. A function pointer is now passed to `YBCPgGetCDCConsistentChanges`; the function takes the attribute number and the table_oid and returns the appropriate type entity by querying the `pg_type` catalog table. While decoding, if a column with an invalid pg_type_oid is encountered, the passed function is invoked to obtain the type entity for decoding. **Upgrade/Rollback safety:** This diff adds a field `optional int32 attr_num` to DatumMessagePB. These changes are protected by the autoflag `ysql_yb_enable_replication_slot_consumption`, which already exists but has not yet been released.
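The decoding fallback described for #23179 might look roughly like the sketch below. Every name here (`TypeEntity`, `TypeLookupFn`, `ResolveTypeEntity`, the map type) is a hypothetical stand-in for the `ybc_pggate` internals; `kInvalidOid = 0` mirrors PostgreSQL's `InvalidOid`, and the callback, which in the real change queries the `pg_type` catalog, is simply whatever function the caller passes in.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using Oid = uint32_t;
constexpr Oid kInvalidOid = 0;  // same value as PostgreSQL's InvalidOid

// Hypothetical stand-in for YBCPgTypeEntity.
struct TypeEntity {
  Oid type_oid;
};

// Callback supplied by the caller: (attribute number, table oid) -> entity.
using TypeLookupFn = const TypeEntity* (*)(int attr_num, Oid table_oid);

// Statically known types resolve through the type map. Columns whose schema
// carries an invalid pg_type_oid (dynamically allotted types such as hstore)
// go through the callback instead of failing the map lookup. Returns nullptr
// for an unknown valid oid; the caller must handle that case.
const TypeEntity* ResolveTypeEntity(
    const std::unordered_map<Oid, TypeEntity>& type_map, Oid pg_type_oid,
    int attr_num, Oid table_oid, TypeLookupFn lookup) {
  if (pg_type_oid == kInvalidOid) {
    return lookup(attr_num, table_oid);
  }
  auto it = type_map.find(pg_type_oid);
  return it == type_map.end() ? nullptr : &it->second;
}
```

Passing a plain function pointer (rather than, say, a `std::function`) keeps the callback usable across the C boundary between the YSQL backend and `ybc_pggate`, which is presumably why the summary describes it as a function pointer.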
Jira: DB-12118 Test Plan: Jenkins: urgent All the existing cdc tests ./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#replicationConnectionConsumptionAllDataTypesWithYbOutput' Reviewers: skumar, stiwary, asrinivasan, dmitry Reviewed By: stiwary, dmitry Subscribers: steve.varnau, skarri, yql, ybase, ycdcxcluster Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D36689 * [PLAT-14710] Do not return apiToken in response to getSessionInfo Summary: **Context** The GET /session_info YBA API returns: { "authToken": "…", "apiToken": "….", "apiTokenVersion": "….", "customerUUID": "uuid1", "userUUID": "useruuid1" } The apiToken and apiTokenVersion are supposed to be those of the last generated token that is still valid. We had the following sequence of changes to this API. https://yugabyte.atlassian.net/browse/PLAT-8028 - Do not store YBA token in YBA. After the above fix, YBA no longer stores the apiToken, so it cannot return it as part of /session_info. The change for this ticket returned the hashed apiToken instead. https://yugabyte.atlassian.net/browse/PLAT-14672 - getSessionInfo should generate and return api key in response Since the hashed apiToken value is not useful to any client, and it broke YBM create cluster (https://yugabyte.atlassian.net/browse/CLOUDGA-22117), the first change for this ticket returned a new apiToken instead. Note that GET /session_info is meant to get customer and user information for the currently authenticated session. This is useful for automation that starts an authenticated session from an existing or cached API token. It is not necessary for the /session_info API to return the authToken and apiToken: the client already has either the authToken or the apiToken with which it invoked the /session_info API. In fact, generating a new apiToken whenever /session_info is called invalidates the previous apiToken, which the client would not expect.
There is a different API, /api_token, to regenerate the apiToken explicitly. **Fix in this change** So the right behaviour is for /session_info to stop sending the apiToken in the response. In fact, the current behaviour of generating a new apiToken every time will break a client (for example, node-agent's usage of /session_info here: https://github.com/yugabyte/yugabyte-db/blob/4ca56cfe27d1cae64e0e61a1bde22406e003ec04/managed/node-agent/app/server/handler.go#L19). **Client impact of not returning apiToken in response of /session_info** This should not impact any normal client that was using /session_info only to get the user uuid and customer uuid. However, there might be a few clients (like YBM, for example) that invoked /session_info to get the last generated apiToken from YBA. Unfortunately, this was a misuse of this API. YBA generates the apiToken in response to a few entry-point APIs like /register, /api_login and /api_token. The apiToken is long lived. YBA could choose to expire these apiTokens after a fixed (long) amount of time, but for now there is no expiration. Clients are expected to store the apiToken on their end and use the token to reestablish a session with YBA whenever needed. After establishing a new session, clients would call GET /session_info to get the user uuid and customer uuid. This is getting fixed in YBM with https://yugabyte.atlassian.net/browse/CLOUDGA-22117, so this PLAT change should be taken up by YBM only after CLOUDGA-22117 is fixed. Test Plan: * Manually verified that session_info does not return authToken * Shubham verified that node-agent works with this fix. Thanks Shubham! Reviewers: svarshney, dkumar, tbedi, #yba-api-review!
Reviewed By: svarshney Subscribers: yugaware Differential Revision: https://phorge.dev.yugabyte.com/D36712 * [docs] updates to CVE table status column (#23225) * updates to status column * review comment * format --------- Co-authored-by: Dwight Hodge <ghodge@yugabyte.com> * [docs] Fix load balance keyword in drivers page (#23253) [docs] Fix `load_balance` -> `load-balance` in jdbc driver [docs] Fix `load_balance` -> `loadBalance` in nodejs driver * fixed compilation * fix link, format * format, links * links, format * format * format * minor edit * best practice (#8) * moved sections * moved pages * added key concepts page * added link to getting started * Dynamic table doc changes (#11) * icons * added box for lead link * revert ybclient change * revert accidental change * revert accidental change * revert accidental change * fix link block for getting started page * format * minor edit * links, format * format * links * format * remove reminder references * Modified output plugin docs (#12) * Naming edits * format * review comments * diagram * review comment * fix links * format * format * link * review comments * copy to stable * link --------- Co-authored-by: siddharth2411 <43139012+siddharth2411@users.noreply.github.com> Co-authored-by: Shubham <svarshney@yugabyte.com> Co-authored-by: asharma-yb <asharma@yugabyte.com> Co-authored-by: Dwight Hodge <79169168+ddhodge@users.noreply.github.com> Co-authored-by: Naorem Khogendro Singh <nsingh@yugabyte.com> Co-authored-by: Hari Krishna Sunder <hari90@users.noreply.github.com> Co-authored-by: Sumukh-Phalgaonkar <sumukhphalgaonkar@gmail.com> Co-authored-by: Subramanian Neelakantan <sneelakantan@yugabyte.com> Co-authored-by: Aishwarya Chakravarthy <ashchakravarthy@gmail.com> Co-authored-by: Dwight Hodge <ghodge@yugabyte.com> Co-authored-by: ddorian <dorian.hoxha@gmail.com> Co-authored-by: Sumukh-Phalgaonkar <61342752+Sumukh-Phalgaonkar@users.noreply.github.com>
Summary: The DDL atomicity stress tests failed more often on the pg15 branch with an error like: ``` WARNING: ThreadSanitizer: data race (pid=180911) Write of size 8 at 0x7b2c000257b8 by thread T17 (mutexes: write M0): #0 profile_open_file prof_file.c (libkrb5.so.3+0xf45b3) #1 profile_init_flags <null> (libkrb5.so.3+0xfb056) #2 k5_os_init_context <null> (libkrb5.so.3+0xe5546) #3 krb5_init_context_profile <null> (libkrb5.so.3+0xabc90) #4 krb5_init_context <null> (libkrb5.so.3+0xabbd5) #5 krb5_gss_init_context init_sec_context.c (libgssapi_krb5.so.2+0x448da) #6 acquire_cred_from acquire_cred.c (libgssapi_krb5.so.2+0x39159) #7 krb5_gss_acquire_cred_from acquire_cred.c (libgssapi_krb5.so.2+0x39072) #8 gss_add_cred_from <null> (libgssapi_krb5.so.2+0x1fcd3) #9 gss_acquire_cred_from <null> (libgssapi_krb5.so.2+0x1f69d) #10 gss_acquire_cred <null> (libgssapi_krb5.so.2+0x1f431) #11 pg_GSS_have_cred_cache ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-gssapi-common.c:68:10 (libpq.so.5+0x543fe) #12 PQconnectPoll ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:2909:22 (libpq.so.5+0x359ca) #13 connectDBComplete ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:2241:10 (libpq.so.5+0x30807) #14 PQconnectdb ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:719:10 (libpq.so.5+0x30af1) #15 yb::pgwrapper::PGConn::Connect(string const&, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>, bool, string const&) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.cc:348:24 (libpq_utils.so+0x13c5b) #16 yb::pgwrapper::PGConn::Connect(string const&, bool, string const&) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.h:254:12 (libpq_utils.so+0x1a77e) #17 yb::pgwrapper::PGConnBuilder::Connect(bool) const
${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.cc:743:10 (libpq_utils.so+0x1a77e)
#18 yb::pgwrapper::LibPqTestBase::ConnectToDBAsUser(string const&, string const&, bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:54:6 (libpg_wrapper_test_base.so+0x26f34)
#19 yb::pgwrapper::LibPqTestBase::ConnectToDB(string const&, bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:44:10 (libpg_wrapper_test_base.so+0x26b1e)
#20 yb::pgwrapper::LibPqTestBase::Connect(bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:40:10 (libpg_wrapper_test_base.so+0x26b1e)
#21 yb::pgwrapper::PgDdlAtomicityStressTest::Connect() ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:147:25 (pg_ddl_atomicity_stress-test+0x136d6c)
#22 yb::pgwrapper::PgDdlAtomicityStressTest::TestDdl(std::vector<string, std::allocator<string>> const&, int) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:165:15 (pg_ddl_atomicity_stress-test+0x136df5)
#23 yb::pgwrapper::PgDdlAtomicityStressTest_StressTest_Test::TestBody()::$_2::operator()() const ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:316:5 (pg_ddl_atomicity_stress-test+0x13d2eb)
```
It appears that `yb::pgwrapper::LibPqTestBase::Connect` is not thread-safe. I restructured the code to create the connections in a single thread and then pass them to the concurrent threads for testing.
Jira: DB-2996
Test Plan:
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/0 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/1 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/2 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/3 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/4 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/5 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/6 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/7 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/8 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/9 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/10 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/11 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/12 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/13 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/14 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/15 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/16 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/17 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/18 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/19 --clang17
Verified that there are no more TSAN errors.
Reviewers: fizaa
Reviewed By: fizaa
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D37111
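The restructuring described above can be sketched as follows. This is a minimal illustration, not the actual test change. `FakeConn`, `ConnectSerially()`, and `RunConcurrentWorkers()` are hypothetical stand-ins for `PGConn`, `LibPqTestBase::Connect()`, and the stress-test driver. All connections are opened on the setup thread. Only already-established connections are shared with the concurrent workers, so the connect path (where the TSAN report shows `PQconnectdb` initializing Kerberos state) never runs on more than one thread at a time.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a connection wrapper such as PGConn.
struct FakeConn {
  int id = 0;
  int statements_run = 0;
  void Execute(const std::string& /*sql*/) { ++statements_run; }
};

// Hypothetical stand-in for LibPqTestBase::Connect(). In this sketch it is
// only ever called from the setup thread, never from the workers.
std::unique_ptr<FakeConn> ConnectSerially(int id) {
  auto conn = std::make_unique<FakeConn>();
  conn->id = id;
  return conn;
}

// Phase 1 opens every connection on the calling thread. Phase 2 gives each
// worker thread exclusive use of one already-open connection.
std::vector<std::unique_ptr<FakeConn>> RunConcurrentWorkers(
    int num_workers, int statements_per_worker) {
  std::vector<std::unique_ptr<FakeConn>> conns;
  for (int i = 0; i < num_workers; ++i) {
    conns.push_back(ConnectSerially(i));  // Single-threaded connect.
  }

  std::vector<std::thread> workers;
  for (auto& conn : conns) {
    FakeConn* c = conn.get();  // Ownership stays with `conns`.
    assert(c != nullptr);
    workers.emplace_back([c, statements_per_worker] {
      for (int j = 0; j < statements_per_worker; ++j) {
        c->Execute("CREATE TABLE ...");  // Concurrent work, no connecting.
      }
    });
  }
  for (auto& t : workers) t.join();  // Join before handing results back.
  return conns;
}
```

Because no two workers share a connection and no worker ever connects, the only code that runs concurrently is the per-connection work itself.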
… in tsan build
Summary: The DDL atomicity stress tests failed more often on the pg15 branch with an error like:
```
WARNING: ThreadSanitizer: data race (pid=180911)
Write of size 8 at 0x7b2c000257b8 by thread T17 (mutexes: write M0):
#0 profile_open_file prof_file.c (libkrb5.so.3+0xf45b3)
#1 profile_init_flags <null> (libkrb5.so.3+0xfb056)
#2 k5_os_init_context <null> (libkrb5.so.3+0xe5546)
#3 krb5_init_context_profile <null> (libkrb5.so.3+0xabc90)
#4 krb5_init_context <null> (libkrb5.so.3+0xabbd5)
#5 krb5_gss_init_context init_sec_context.c (libgssapi_krb5.so.2+0x448da)
#6 acquire_cred_from acquire_cred.c (libgssapi_krb5.so.2+0x39159)
#7 krb5_gss_acquire_cred_from acquire_cred.c (libgssapi_krb5.so.2+0x39072)
#8 gss_add_cred_from <null> (libgssapi_krb5.so.2+0x1fcd3)
#9 gss_acquire_cred_from <null> (libgssapi_krb5.so.2+0x1f69d)
#10 gss_acquire_cred <null> (libgssapi_krb5.so.2+0x1f431)
#11 pg_GSS_have_cred_cache ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-gssapi-common.c:68:10 (libpq.so.5+0x543fe)
#12 PQconnectPoll ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:2909:22 (libpq.so.5+0x359ca)
#13 connectDBComplete ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:2241:10 (libpq.so.5+0x30807)
#14 PQconnectdb ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:719:10 (libpq.so.5+0x30af1)
#15 yb::pgwrapper::PGConn::Connect(string const&, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>, bool, string const&) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.cc:348:24 (libpq_utils.so+0x13c5b)
#16 yb::pgwrapper::PGConn::Connect(string const&, bool, string const&) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.h:254:12 (libpq_utils.so+0x1a77e)
#17 yb::pgwrapper::PGConnBuilder::Connect(bool) const ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.cc:743:10 (libpq_utils.so+0x1a77e)
#18 yb::pgwrapper::LibPqTestBase::ConnectToDBAsUser(string const&, string const&, bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:54:6 (libpg_wrapper_test_base.so+0x26f34)
#19 yb::pgwrapper::LibPqTestBase::ConnectToDB(string const&, bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:44:10 (libpg_wrapper_test_base.so+0x26b1e)
#20 yb::pgwrapper::LibPqTestBase::Connect(bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:40:10 (libpg_wrapper_test_base.so+0x26b1e)
#21 yb::pgwrapper::PgDdlAtomicityStressTest::Connect() ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:147:25 (pg_ddl_atomicity_stress-test+0x136d6c)
#22 yb::pgwrapper::PgDdlAtomicityStressTest::TestDdl(std::vector<string, std::allocator<string>> const&, int) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:165:15 (pg_ddl_atomicity_stress-test+0x136df5)
#23 yb::pgwrapper::PgDdlAtomicityStressTest_StressTest_Test::TestBody()::$_2::operator()() const ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:316:5 (pg_ddl_atomicity_stress-test+0x13d2eb)
```
It appears that `yb::pgwrapper::LibPqTestBase::Connect` is not thread-safe. I restructured the code to create the connections in a single thread and then pass them to the concurrent threads for testing.
Jira: DB-2996
Original commit: bd4874b / D37111
Test Plan:
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/0 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/1 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/2 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/3 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/4 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/5 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/6 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/7 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/8 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/9 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/10 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/11 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/12 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/13 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/14 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/15 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/16 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/17 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/18 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/19 --clang17
Verified that there are no more TSAN errors.
Reviewers: fizaa
Reviewed By: fizaa
Subscribers: yql
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D37167
Summary: Invoke the callback only in the ScopeExit block, not while holding the lock.
Without this fix, a thread can deadlock by requesting a shared lock on a mutex while it already holds an exclusive lock on the same mutex.
This deadlock can be triggered when more than one thread issues read/write requests to a table right after the table has undergone a tablet split. With only one thread, the deadlock is unlikely, because the thread notices, as part of the callback, that the table's partition info is stale. A different thread refreshing the partition version before the main thread checks whether the table version is stale is likely necessary to trigger the stack trace below, e.g.:
```
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00005640c3eb441b in std::__1::shared_timed_mutex::lock_shared() ()
#2 0x00005640c3ffcbff in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>, yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
#3 0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#4 0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#5 0x00005640c401aa76 in yb::client::(anonymous namespace)::BatcherFlushDone(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig) ()
#6 0x00005640c401b371 in boost::detail::function::void_function_obj_invoker1<std::__1::__bind<void (*)(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig), std::__1::shared_ptr<yb::client::internal::Batcher> const&, std::__1::placeholders::__ph<1> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig&>, void, yb::Status const&>::invoke(boost::detail::function::function_buffer&, yb::Status const&) ()
#7 0x00005640c3f70398 in yb::client::internal::Batcher::Run() ()
#8 0x00005640c3f72656 in yb::client::internal::Batcher::FlushFinished() ()
#9 0x00005640c3f74a4d in yb::client::internal::Batcher::TabletLookupFinished(yb::client::internal::InFlightOp*, yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> >) ()
#10 0x00005640c3f759bc in std::__1::__function::__func<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0, std::__1::allocator<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0>, void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>::operator()(yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&) ()
#11 0x00005640c3fff05d in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>, yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
** Is holding an exclusive lock in MetaCache::LookupTabletByKey/DoLookupTabletByKey **
#12 0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#13 0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#14 0x00005640c4017130 in yb::client::YBSession::FlushAsync(boost::function<void (yb::client::FlushStatus*)>) ()
#15 0x00005640c5225a0c in yb::tserver::PgClientServiceImpl::Perform(yb::tserver::PgPerformRequestPB const*, yb::tserver::PgPerformResponsePB*, yb::rpc::RpcContext) ()
#16 0x00005640c51c4487 in std::__1::__function::__func<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20, std::__1::allocator<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) ()
#17 0x00005640c51d374f in yb::tserver::PgClientServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#18 0x00005640c4f5f420 in yb::rpc::ServicePoolImpl::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#19 0x00005640c4e845af in yb::rpc::InboundCall::InboundCallTask::Run() ()
#20 0x00005640c4f6e243 in yb::rpc::(anonymous namespace)::Worker::Execute() ()
#21 0x00005640c570ecb4 in yb::Thread::SuperviseThread(void*) ()
#22 0x00007f808b7c6694 in start_thread (arg=0x7f76d8caf700) at pthread_create.c:333
#23 0x00007f808bac341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```
Jira: DB-12651
Test Plan: Jenkins
yb_build.sh --cxx-test ql-stress-test QLStressTest.ReproMetaCacheDeadlock
Reviewers: rthallam, hsunder, qhu, timur
Reviewed By: hsunder
Subscribers: svc_phabricator, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D37706
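The fix pattern described in this commit (defer the callback until the lock is released) can be illustrated with a minimal, self-contained sketch. `TabletCache`, `Lookup()`, and the local `ScopeExit` class are hypothetical stand-ins, not YugabyteDB's `MetaCache` or its own ScopeExit utility, and this is not the actual diff. The key detail is declaration order. The deferred-callback guard is declared before the exclusive lock, so C++'s reverse destruction order releases the lock first and only then runs the callback. A callback that re-enters `Lookup()`, as `Batcher::LookupTabletFor` does between frames #11 and #2 of the trace, then only needs a shared lock that nobody holds exclusively.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>

// Minimal stand-in for a ScopeExit helper: runs `fn` when it goes out of scope.
class ScopeExit {
 public:
  explicit ScopeExit(std::function<void()> fn) : fn_(std::move(fn)) {}
  ~ScopeExit() { if (fn_) fn_(); }
  ScopeExit(const ScopeExit&) = delete;
  ScopeExit& operator=(const ScopeExit&) = delete;

 private:
  std::function<void()> fn_;
};

// Hypothetical, heavily simplified cache; names are illustrative only.
class TabletCache {
 public:
  using Callback = std::function<void(const std::string&)>;

  // Looks up `key`, populating the cache on a miss, then invokes `callback`
  // with the result. The callback is allowed to call Lookup() again.
  void Lookup(const std::string& key, Callback callback) {
    {
      std::shared_lock<std::shared_timed_mutex> read_lock(mutex_);
      auto it = cache_.find(key);
      if (it != cache_.end()) {
        std::string value = it->second;
        read_lock.unlock();  // Release before running caller code.
        callback(value);
        return;
      }
    }
    std::string result;
    // Declared BEFORE write_lock, so it is destroyed AFTER write_lock: the
    // callback never runs while mutex_ is held exclusively.
    ScopeExit run_callback([&] { callback(result); });
    std::lock_guard<std::shared_timed_mutex> write_lock(mutex_);
    auto& slot = cache_[key];
    if (slot.empty()) slot = "tablet-for-" + key;  // Simulated metadata fetch.
    result = slot;
  }

 private:
  std::shared_timed_mutex mutex_;
  std::map<std::string, std::string> cache_;
};
```

If the callback were instead invoked inside the `write_lock` scope, the nested `Lookup()` would call `lock_shared()` on a mutex the same thread already owns. That is undefined behavior for `std::shared_timed_mutex` and in practice hangs, as frame #1 shows.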
…ile holding the lock
Summary: Original commit: c770d79 / D37706
Invoke the callback only in the ScopeExit block, not while holding the lock.
Without this fix, a thread can deadlock by requesting a shared lock on a mutex while it already holds an exclusive lock on the same mutex.
This deadlock can be triggered when more than one thread issues read/write requests to a table right after the table has undergone a tablet split. With only one thread, the deadlock is unlikely, because the thread notices, as part of the callback, that the table's partition info is stale. A different thread refreshing the partition version before the main thread checks whether the table version is stale is likely necessary to trigger the stack trace below, e.g.:
```
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00005640c3eb441b in std::__1::shared_timed_mutex::lock_shared() ()
#2 0x00005640c3ffcbff in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>, yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
#3 0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#4 0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#5 0x00005640c401aa76 in yb::client::(anonymous namespace)::BatcherFlushDone(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig) ()
#6 0x00005640c401b371 in boost::detail::function::void_function_obj_invoker1<std::__1::__bind<void (*)(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig), std::__1::shared_ptr<yb::client::internal::Batcher> const&, std::__1::placeholders::__ph<1> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig&>, void, yb::Status const&>::invoke(boost::detail::function::function_buffer&, yb::Status const&) ()
#7 0x00005640c3f70398 in yb::client::internal::Batcher::Run() ()
#8 0x00005640c3f72656 in yb::client::internal::Batcher::FlushFinished() ()
#9 0x00005640c3f74a4d in yb::client::internal::Batcher::TabletLookupFinished(yb::client::internal::InFlightOp*, yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> >) ()
#10 0x00005640c3f759bc in std::__1::__function::__func<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0, std::__1::allocator<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0>, void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>::operator()(yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&) ()
#11 0x00005640c3fff05d in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>, yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
** Is holding an exclusive lock in MetaCache::LookupTabletByKey/DoLookupTabletByKey **
#12 0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#13 0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#14 0x00005640c4017130 in yb::client::YBSession::FlushAsync(boost::function<void (yb::client::FlushStatus*)>) ()
#15 0x00005640c5225a0c in yb::tserver::PgClientServiceImpl::Perform(yb::tserver::PgPerformRequestPB const*, yb::tserver::PgPerformResponsePB*, yb::rpc::RpcContext) ()
#16 0x00005640c51c4487 in std::__1::__function::__func<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20, std::__1::allocator<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) ()
#17 0x00005640c51d374f in yb::tserver::PgClientServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#18 0x00005640c4f5f420 in yb::rpc::ServicePoolImpl::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#19 0x00005640c4e845af in yb::rpc::InboundCall::InboundCallTask::Run() ()
#20 0x00005640c4f6e243 in yb::rpc::(anonymous namespace)::Worker::Execute() ()
#21 0x00005640c570ecb4 in yb::Thread::SuperviseThread(void*) ()
#22 0x00007f808b7c6694 in start_thread (arg=0x7f76d8caf700) at pthread_create.c:333
#23 0x00007f808bac341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```
Jira: DB-12651
Test Plan: Jenkins
yb_build.sh --cxx-test ql-stress-test QLStressTest.ReproMetaCacheDeadlock
Reviewers: rthallam, hsunder, qhu, timur
Reviewed By: rthallam
Subscribers: ybase, svc_phabricator
Differential Revision: https://phorge.dev.yugabyte.com/D37788
…e holding the lock
Summary: Original commit: c770d79 / D37706
Invoke the callback only in the ScopeExit block, not while holding the lock.
Without this fix, a thread can deadlock by requesting a shared lock on a mutex while it already holds an exclusive lock on the same mutex.
This deadlock can be triggered when more than one thread issues read/write requests to a table right after the table has undergone a tablet split. With only one thread, the deadlock is unlikely, because the thread notices, as part of the callback, that the table's partition info is stale. A different thread refreshing the partition version before the main thread checks whether the table version is stale is likely necessary to trigger the stack trace below, e.g.:
```
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00005640c3eb441b in std::__1::shared_timed_mutex::lock_shared() ()
#2 0x00005640c3ffcbff in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>, yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
#3 0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#4 0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#5 0x00005640c401aa76 in yb::client::(anonymous namespace)::BatcherFlushDone(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig) ()
#6 0x00005640c401b371 in boost::detail::function::void_function_obj_invoker1<std::__1::__bind<void (*)(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig), std::__1::shared_ptr<yb::client::internal::Batcher> const&, std::__1::placeholders::__ph<1> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig&>, void, yb::Status const&>::invoke(boost::detail::function::function_buffer&, yb::Status const&) ()
#7 0x00005640c3f70398 in yb::client::internal::Batcher::Run() ()
#8 0x00005640c3f72656 in yb::client::internal::Batcher::FlushFinished() ()
#9 0x00005640c3f74a4d in yb::client::internal::Batcher::TabletLookupFinished(yb::client::internal::InFlightOp*, yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> >) ()
#10 0x00005640c3f759bc in std::__1::__function::__func<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0, std::__1::allocator<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0>, void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>::operator()(yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&) ()
#11 0x00005640c3fff05d in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>, yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
** Is holding an exclusive lock in MetaCache::LookupTabletByKey/DoLookupTabletByKey **
#12 0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#13 0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#14 0x00005640c4017130 in yb::client::YBSession::FlushAsync(boost::function<void (yb::client::FlushStatus*)>) ()
#15 0x00005640c5225a0c in yb::tserver::PgClientServiceImpl::Perform(yb::tserver::PgPerformRequestPB const*, yb::tserver::PgPerformResponsePB*, yb::rpc::RpcContext) ()
#16 0x00005640c51c4487 in std::__1::__function::__func<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20, std::__1::allocator<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) ()
#17 0x00005640c51d374f in yb::tserver::PgClientServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#18 0x00005640c4f5f420 in yb::rpc::ServicePoolImpl::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#19 0x00005640c4e845af in yb::rpc::InboundCall::InboundCallTask::Run() ()
#20 0x00005640c4f6e243 in yb::rpc::(anonymous namespace)::Worker::Execute() ()
#21 0x00005640c570ecb4 in yb::Thread::SuperviseThread(void*) ()
#22 0x00007f808b7c6694 in start_thread (arg=0x7f76d8caf700) at pthread_create.c:333
#23 0x00007f808bac341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```
Jira: DB-12651
Test Plan: Jenkins
yb_build.sh --cxx-test ql-stress-test QLStressTest.ReproMetaCacheDeadlock
Reviewers: rthallam, hsunder, qhu, timur
Reviewed By: rthallam
Subscribers: svc_phabricator, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D37789
…ile holding the lock
Summary: Original commit: c770d79 / D37706
Invoke the callback only in the ScopeExit block, not while holding the lock.
Without this fix, a thread can deadlock by requesting a shared lock on a mutex while it already holds an exclusive lock on the same mutex.
This deadlock can be triggered when more than one thread issues read/write requests to a table right after the table has undergone a tablet split. With only one thread, the deadlock is unlikely, because the thread notices, as part of the callback, that the table's partition info is stale. A different thread refreshing the partition version before the main thread checks whether the table version is stale is likely necessary to trigger the stack trace below, e.g.:
```
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00005640c3eb441b in std::__1::shared_timed_mutex::lock_shared() ()
#2 0x00005640c3ffcbff in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>, yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
#3 0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#4 0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#5 0x00005640c401aa76 in yb::client::(anonymous namespace)::BatcherFlushDone(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig) ()
#6 0x00005640c401b371 in boost::detail::function::void_function_obj_invoker1<std::__1::__bind<void (*)(std::__1::shared_ptr<yb::client::internal::Batcher> const&, yb::Status const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig), std::__1::shared_ptr<yb::client::internal::Batcher> const&, std::__1::placeholders::__ph<1> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig&>, void, yb::Status const&>::invoke(boost::detail::function::function_buffer&, yb::Status const&) ()
#7 0x00005640c3f70398 in yb::client::internal::Batcher::Run() ()
#8 0x00005640c3f72656 in yb::client::internal::Batcher::FlushFinished() ()
#9 0x00005640c3f74a4d in yb::client::internal::Batcher::TabletLookupFinished(yb::client::internal::InFlightOp*, yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> >) ()
#10 0x00005640c3f759bc in std::__1::__function::__func<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0, std::__1::allocator<yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*)::$_0>, void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>::operator()(yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&) ()
#11 0x00005640c3fff05d in yb::client::internal::MetaCache::LookupTabletByKey(std::__1::shared_ptr<yb::client::YBTable> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >, std::__1::function<void (yb::Result<scoped_refptr<yb::client::internal::RemoteTablet> > const&)>, yb::StronglyTypedBool<yb::client::internal::FailOnPartitionListRefreshed_Tag>) ()
** Is holding an exclusive lock in MetaCache::LookupTabletByKey/DoLookupTabletByKey **
#12 0x00005640c3f7549a in yb::client::internal::Batcher::LookupTabletFor(yb::client::internal::InFlightOp*) ()
#13 0x00005640c401855e in yb::client::(anonymous namespace)::FlushBatcherAsync(std::__1::shared_ptr<yb::client::internal::Batcher> const&, boost::function<void (yb::client::FlushStatus*)>, yb::client::YBSession::BatcherConfig, yb::StronglyTypedBool<yb::client::internal::IsWithinTransactionRetry_Tag>) ()
#14 0x00005640c4017130 in yb::client::YBSession::FlushAsync(boost::function<void (yb::client::FlushStatus*)>) ()
#15 0x00005640c5225a0c in yb::tserver::PgClientServiceImpl::Perform(yb::tserver::PgPerformRequestPB const*, yb::tserver::PgPerformResponsePB*, yb::rpc::RpcContext) ()
#16 0x00005640c51c4487 in std::__1::__function::__func<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20, std::__1::allocator<yb::tserver::PgClientServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_20>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) ()
#17 0x00005640c51d374f in yb::tserver::PgClientServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#18 0x00005640c4f5f420 in yb::rpc::ServicePoolImpl::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#19 0x00005640c4e845af in yb::rpc::InboundCall::InboundCallTask::Run() ()
#20 0x00005640c4f6e243 in yb::rpc::(anonymous namespace)::Worker::Execute() ()
#21 0x00005640c570ecb4 in yb::Thread::SuperviseThread(void*) ()
#22 0x00007f808b7c6694 in start_thread (arg=0x7f76d8caf700) at pthread_create.c:333
#23 0x00007f808bac341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```
Jira: DB-12651
Test Plan: Jenkins
yb_build.sh --cxx-test ql-stress-test QLStressTest.ReproMetaCacheDeadlock
Reviewers: rthallam, hsunder, qhu, timur
Reviewed By: rthallam
Subscribers: ybase, svc_phabricator
Differential Revision: https://phorge.dev.yugabyte.com/D37831
Summary: It is possible for tablet peer's `tablet_` to be null when a rocksdb flush finishes. We call `tablet_->MaxPersistentOpId()` after flush to clean up recently applied transaction state, and this causes a SIGSEGV:
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x000055885b97311d yb-tserver`yb::ScopedRWOperation::ScopedRWOperation(yb::RWOperationCounter*, yb::StatusHolder const*, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>> const&) [inlined] std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>::basic_string(this="", __str=<unavailable>) at string:898:9
    frame #1: 0x000055885b97311d yb-tserver`yb::ScopedRWOperation::ScopedRWOperation(yb::RWOperationCounter*, yb::StatusHolder const*, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>> const&) [inlined] yb::RWOperationCounter::resource_name(this=0x0000000000000378) const at operation_counter.h:95:12
    frame #2: 0x000055885b97311d yb-tserver`yb::ScopedRWOperation::ScopedRWOperation(this=0x00007f9455305d58, counter=0x0000000000000378, abort_status_holder=<unavailable>, deadline=0x00007f9455305d98) at operation_counter.cc:190:62
    frame #3: 0x000055885b247ea6 yb-tserver`yb::tablet::Tablet::MaxPersistentOpId(bool) const [inlined] yb::ScopedRWOperation::ScopedRWOperation(this=0x00007f9455305d58, counter=<unavailable>, deadline=0x00007f9455305d98) at operation_counter.h:140:9
    frame #4: 0x000055885b247e9f yb-tserver`yb::tablet::Tablet::MaxPersistentOpId(bool) const [inlined] yb::tablet::Tablet::CreateScopedRWOperationBlockingRocksDbShutdownStart(this=0x0000000000000000, deadline=yb::CoarseTimePoint @ 0x00007f9455305d98) const at tablet.cc:3375:10
    frame #5: 0x000055885b247e90 yb-tserver`yb::tablet::Tablet::MaxPersistentOpId(this=0x0000000000000000, invalid_if_no_new_data=<unavailable>) const at tablet.cc:3540:32
    frame #6: 0x000055885b277f5e yb-tserver`yb::tablet::TabletPeer::MaxPersistentOpId(this=<unavailable>) const at tablet_peer.cc:946:23
    frame #7: 0x000055885b278e52 yb-tserver`non-virtual thunk to yb::tablet::TabletPeer::MaxPersistentOpId() const at tablet_peer.cc:0
    frame #8: 0x000055885b2dec44 yb-tserver`yb::tablet::TransactionParticipant::Impl::DoProcessRecentlyAppliedTransactions(this=0x0000153123151500, retryable_requests_flushed_op_id=<unavailable>, persist=<unavailable>) at transaction_participant.cc:2186:22
    frame #9: 0x000055885b2e0a8e yb-tserver`yb::tablet::TransactionParticipant::ProcessRecentlyAppliedTransactions() [inlined] yb::tablet::TransactionParticipant::Impl::ProcessRecentlyAppliedTransactions(this=0x0000153123151500) at transaction_participant.cc:1440:27
    frame #10: 0x000055885b2e0a63 yb-tserver`yb::tablet::TransactionParticipant::ProcessRecentlyAppliedTransactions(this=<unavailable>) at transaction_participant.cc:2629:17
    frame #11: 0x000055885b226093 yb-tserver`yb::tablet::Tablet::RocksDbListener::OnFlushCompleted(this=0x0000153110c2da58, (null)=<unavailable>, (null)=<unavailable>) at tablet.cc:503:34
    frame #12: 0x000055885af0e507 yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(rocksdb::ColumnFamilyData*) at db_impl.cc:2121:19
    frame #13: 0x000055885af0e275 yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(rocksdb::ColumnFamilyData*) [inlined] rocksdb::DBImpl::FlushMemTableToOutputFile(this=0x0000153123150a80, cfd=0x000015317d651600, mutable_cf_options=0x00007f94553077d8, made_progress=<unavailable>, job_context=0x00007f9455306938, log_buffer=0x00007f9455306048) at db_impl.cc:2008:3
    frame #14: 0x000055885af0d859 yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(rocksdb::ColumnFamilyData*) [inlined] rocksdb::DBImpl::BackgroundFlush(this=0x0000153123150a80, made_progress=<unavailable>, job_context=0x00007f9455306938, log_buffer=0x00007f9455306048, cfd=0x000015317d651600) at db_impl.cc:3399:10
    frame #15: 0x000055885af0d21f yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(this=0x0000153123150a80, cfd=<unavailable>) at db_impl.cc:3470:31
    frame #16: 0x000055885b024a53 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() at thread_posix.cc:133:5
    frame #17: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] rocksdb::ThreadPool::StartBGThreads(this=<unavailable>)::$_0::operator()() const at thread_posix.cc:172:5
    frame #18: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] decltype(__f=<unavailable>)::$_0&>()()) std::__1::__invoke[abi:ue170006]<rocksdb::ThreadPool::StartBGThreads()::$_0&>(rocksdb::ThreadPool::StartBGThreads()::$_0&) at invoke.h:340:25
    frame #19: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] void std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:ue170006]<rocksdb::ThreadPool::StartBGThreads(__args=<unavailable>)::$_0&>(rocksdb::ThreadPool::StartBGThreads()::$_0&) at invoke.h:415:5
    frame #20: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] std::__1::__function::__alloc_func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator(this=<unavailable>)[abi:ue170006]() at function.h:192:16
    frame #21: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator(this=<unavailable>)() at function.h:363:12
    frame #22: 0x000055885b9c1543 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator(this=0x000015313de3b380)[abi:ue170006]() const at function.h:517:16
    frame #23: 0x000055885b9c152d yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator(this=0x000015313de3b380)() const at function.h:1168:12
    frame #24: 0x000055885b9c152d yb-tserver`yb::Thread::SuperviseThread(arg=0x000015313de3b320) at thread.cc:866:3
    frame #25: 0x00007f94994d81ca libpthread.so.0`start_thread + 234
    frame #26: 0x00007f9499729e73 libc.so.6`__clone + 67
```
This diff adds a null check and returns `OpId::Min()` (i.e. don't clean anything up) if `tablet_` is null and we cannot call `MaxPersistentOpId`.

Jira: DB-12915

Test Plan: Jenkins

Reviewers: sergei, rthallam

Reviewed By: sergei, rthallam

Subscribers: rthallam, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38323
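The null-check fix described in this commit can be illustrated with a minimal sketch. It assumes simplified, hypothetical stand-ins for `OpId`, `Tablet`, and `TabletPeer`; the `SetTablet` method and the mutex-plus-`shared_ptr` ownership are illustrative choices, not YugabyteDB's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for yb::OpId; only what the sketch needs is modeled.
struct OpId {
  int64_t term = 0;
  int64_t index = 0;

  static OpId Min() { return OpId{0, 0}; }

  bool operator==(const OpId& other) const {
    return term == other.term && index == other.index;
  }
};

// Hypothetical stand-in for yb::tablet::Tablet.
class Tablet {
 public:
  explicit Tablet(OpId persisted) : persisted_(persisted) {}
  OpId MaxPersistentOpId() const { return persisted_; }

 private:
  OpId persisted_;
};

// Hypothetical stand-in for yb::tablet::TabletPeer. tablet_ may still be null
// (or already reset) when a RocksDB flush-completed callback runs.
class TabletPeer {
 public:
  void SetTablet(std::shared_ptr<Tablet> tablet) {
    std::lock_guard<std::mutex> lock(mutex_);
    tablet_ = std::move(tablet);
  }

  OpId MaxPersistentOpId() const {
    std::shared_ptr<Tablet> tablet;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tablet = tablet_;  // Copy so the tablet stays alive during the call below.
    }
    if (!tablet) {
      // The null check: report the minimum OpId so that callers such as the
      // recently-applied-transaction cleanup treat nothing as persisted.
      return OpId::Min();
    }
    return tablet->MaxPersistentOpId();
  }

 private:
  mutable std::mutex mutex_;
  std::shared_ptr<Tablet> tablet_;
};
```

Copying the `shared_ptr` under the lock and calling through the copy keeps the tablet alive for the duration of the call even if another thread resets `tablet_` concurrently. Returning `OpId::Min()` means the caller sees nothing as persisted, so it cleans nothing up.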
…fter flush

Summary: Original commit: 250a4d5 / D38323

It is possible for tablet peer's `tablet_` to be null when a rocksdb flush finishes. We call `tablet_->MaxPersistentOpId()` after flush to clean up recently applied transaction state, and this causes a SIGSEGV:
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x000055885b97311d yb-tserver`yb::ScopedRWOperation::ScopedRWOperation(yb::RWOperationCounter*, yb::StatusHolder const*, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>> const&) [inlined] std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>::basic_string(this="", __str=<unavailable>) at string:898:9
    frame #1: 0x000055885b97311d yb-tserver`yb::ScopedRWOperation::ScopedRWOperation(yb::RWOperationCounter*, yb::StatusHolder const*, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>> const&) [inlined] yb::RWOperationCounter::resource_name(this=0x0000000000000378) const at operation_counter.h:95:12
    frame #2: 0x000055885b97311d yb-tserver`yb::ScopedRWOperation::ScopedRWOperation(this=0x00007f9455305d58, counter=0x0000000000000378, abort_status_holder=<unavailable>, deadline=0x00007f9455305d98) at operation_counter.cc:190:62
    frame #3: 0x000055885b247ea6 yb-tserver`yb::tablet::Tablet::MaxPersistentOpId(bool) const [inlined] yb::ScopedRWOperation::ScopedRWOperation(this=0x00007f9455305d58, counter=<unavailable>, deadline=0x00007f9455305d98) at operation_counter.h:140:9
    frame #4: 0x000055885b247e9f yb-tserver`yb::tablet::Tablet::MaxPersistentOpId(bool) const [inlined] yb::tablet::Tablet::CreateScopedRWOperationBlockingRocksDbShutdownStart(this=0x0000000000000000, deadline=yb::CoarseTimePoint @ 0x00007f9455305d98) const at tablet.cc:3375:10
    frame #5: 0x000055885b247e90 yb-tserver`yb::tablet::Tablet::MaxPersistentOpId(this=0x0000000000000000, invalid_if_no_new_data=<unavailable>) const at tablet.cc:3540:32
    frame #6: 0x000055885b277f5e yb-tserver`yb::tablet::TabletPeer::MaxPersistentOpId(this=<unavailable>) const at tablet_peer.cc:946:23
    frame #7: 0x000055885b278e52 yb-tserver`non-virtual thunk to yb::tablet::TabletPeer::MaxPersistentOpId() const at tablet_peer.cc:0
    frame #8: 0x000055885b2dec44 yb-tserver`yb::tablet::TransactionParticipant::Impl::DoProcessRecentlyAppliedTransactions(this=0x0000153123151500, retryable_requests_flushed_op_id=<unavailable>, persist=<unavailable>) at transaction_participant.cc:2186:22
    frame #9: 0x000055885b2e0a8e yb-tserver`yb::tablet::TransactionParticipant::ProcessRecentlyAppliedTransactions() [inlined] yb::tablet::TransactionParticipant::Impl::ProcessRecentlyAppliedTransactions(this=0x0000153123151500) at transaction_participant.cc:1440:27
    frame #10: 0x000055885b2e0a63 yb-tserver`yb::tablet::TransactionParticipant::ProcessRecentlyAppliedTransactions(this=<unavailable>) at transaction_participant.cc:2629:17
    frame #11: 0x000055885b226093 yb-tserver`yb::tablet::Tablet::RocksDbListener::OnFlushCompleted(this=0x0000153110c2da58, (null)=<unavailable>, (null)=<unavailable>) at tablet.cc:503:34
    frame #12: 0x000055885af0e507 yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(rocksdb::ColumnFamilyData*) at db_impl.cc:2121:19
    frame #13: 0x000055885af0e275 yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(rocksdb::ColumnFamilyData*) [inlined] rocksdb::DBImpl::FlushMemTableToOutputFile(this=0x0000153123150a80, cfd=0x000015317d651600, mutable_cf_options=0x00007f94553077d8, made_progress=<unavailable>, job_context=0x00007f9455306938, log_buffer=0x00007f9455306048) at db_impl.cc:2008:3
    frame #14: 0x000055885af0d859 yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(rocksdb::ColumnFamilyData*) [inlined] rocksdb::DBImpl::BackgroundFlush(this=0x0000153123150a80, made_progress=<unavailable>, job_context=0x00007f9455306938, log_buffer=0x00007f9455306048, cfd=0x000015317d651600) at db_impl.cc:3399:10
    frame #15: 0x000055885af0d21f yb-tserver`rocksdb::DBImpl::BackgroundCallFlush(this=0x0000153123150a80, cfd=<unavailable>) at db_impl.cc:3470:31
    frame #16: 0x000055885b024a53 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() at thread_posix.cc:133:5
    frame #17: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] rocksdb::ThreadPool::StartBGThreads(this=<unavailable>)::$_0::operator()() const at thread_posix.cc:172:5
    frame #18: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] decltype(__f=<unavailable>)::$_0&>()()) std::__1::__invoke[abi:ue170006]<rocksdb::ThreadPool::StartBGThreads()::$_0&>(rocksdb::ThreadPool::StartBGThreads()::$_0&) at invoke.h:340:25
    frame #19: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] void std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:ue170006]<rocksdb::ThreadPool::StartBGThreads(__args=<unavailable>)::$_0&>(rocksdb::ThreadPool::StartBGThreads()::$_0&) at invoke.h:415:5
    frame #20: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator()() [inlined] std::__1::__function::__alloc_func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator(this=<unavailable>)[abi:ue170006]() at function.h:192:16
    frame #21: 0x000055885b024900 yb-tserver`std::__1::__function::__func<rocksdb::ThreadPool::StartBGThreads()::$_0, std::__1::allocator<rocksdb::ThreadPool::StartBGThreads()::$_0>, void ()>::operator(this=<unavailable>)() at function.h:363:12
    frame #22: 0x000055885b9c1543 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator(this=0x000015313de3b380)[abi:ue170006]() const at function.h:517:16
    frame #23: 0x000055885b9c152d yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator(this=0x000015313de3b380)() const at function.h:1168:12
    frame #24: 0x000055885b9c152d yb-tserver`yb::Thread::SuperviseThread(arg=0x000015313de3b320) at thread.cc:866:3
    frame #25: 0x00007f94994d81ca libpthread.so.0`start_thread + 234
    frame #26: 0x00007f9499729e73 libc.so.6`__clone + 67
```
This diff adds a null check and returns `OpId::Min()` (i.e. don't clean anything up) if `tablet_` is null and we cannot call `MaxPersistentOpId`.

Jira: DB-12915

Test Plan: Jenkins

Reviewers: sergei, rthallam

Reviewed By: rthallam

Subscribers: ybase, rthallam

Differential Revision: https://phorge.dev.yugabyte.com/D38431
Summary:

### Issue
Test ClockSynchronizationTest.TestClockSkewError fails with tsan failure
```
WARNING: ThreadSanitizer: data race (pid=226462)
  Read of size 8 at 0x7b4000000bf0 by thread T82:
    #0 boost::intrusive_ptr<yb::Status::State>::get() const ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:181:16 (libyb_util.so+0x3c5994)
    #1 bool boost::operator==<yb::Status::State>(boost::intrusive_ptr<yb::Status::State> const&, std::nullptr_t) ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:263:14 (libyb_util.so+0x3c5994)
    #2 yb::Status::ok() const ${YB_SRC_ROOT}/src/yb/util/status.h:120:51 (libyb_util.so+0x3c5994)
    #3 yb::MockClock::Now() ${YB_SRC_ROOT}/src/yb/util/physical_time.cc:141:3 (libyb_util.so+0x3c5994)
    #4 yb::server::HybridClock::NowWithError(yb::HybridTime*, unsigned long*) ${YB_SRC_ROOT}/src/yb/server/hybrid_clock.cc:155:22 (libserver_common.so+0xa5e12)
    #5 yb::server::HybridClock::NowRange() ${YB_SRC_ROOT}/src/yb/server/hybrid_clock.cc:144:3 (libserver_common.so+0xa5ceb)
    #6 yb::ClockBase::Now() ${YB_SRC_ROOT}/src/yb/common/clock.h:26:29 (libtserver.so+0x23a77a)
    #7 yb::tserver::Heartbeater::Thread::TryHeartbeat() ${YB_SRC_ROOT}/src/yb/tserver/heartbeater.cc:437:41 (libtserver.so+0x23a77a)
    #8 yb::tserver::Heartbeater::Thread::DoHeartbeat() ${YB_SRC_ROOT}/src/yb/tserver/heartbeater.cc:650:19 (libtserver.so+0x23d05f)
    #9 yb::tserver::Heartbeater::Thread::RunThread() ${YB_SRC_ROOT}/src/yb/tserver/heartbeater.cc:697:16 (libtserver.so+0x23d74d)
    #10 decltype(*std::declval<yb::tserver::Heartbeater::Thread*&>().*std::declval<void (yb::tserver::Heartbeater::Thread::*&)()>()()) std::__invoke[abi:ue170006]<void (yb::tserver::Heartbeater::Thread::*&)(), yb::tserver::Heartbeater::Thread*&, void>(void (yb::tserver::Heartbeater::Thread::*&)(), yb::tserver::Heartbeater::Thread*&) ${YB_THIRDPARTY_DIR}/installed/tsan/libcxx/include/c++/v1/__type_traits/invoke.h:308:25 (libtserver.so+0x24206b)
    ...

  Previous write of size 8 at 0x7b4000000bf0 by main thread:
    #0 boost::intrusive_ptr<yb::Status::State>::swap(boost::intrusive_ptr<yb::Status::State>&) ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:210:16 (libyb_util.so+0x3c5c54)
    #1 boost::intrusive_ptr<yb::Status::State>::operator=(boost::intrusive_ptr<yb::Status::State>&&) ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:122:61 (libyb_util.so+0x3c5c54)
    #2 yb::Status::operator=(yb::Status&&) ${YB_SRC_ROOT}/src/yb/util/status.h:98:7 (libyb_util.so+0x3c5c54)
    #3 yb::MockClock::Set(yb::PhysicalTime const&) ${YB_SRC_ROOT}/src/yb/util/physical_time.cc:147:16 (libyb_util.so+0x3c5c54)
    #4 yb::ClockSynchronizationTest_TestClockSkewError_Test::TestBody() ${YB_SRC_ROOT}/src/yb/integration-tests/clock_synchronization-itest.cc:131:15 (clock_synchronization-itest+0x12e3ca)
    #5 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2599:10 (libgtest.so.1.12.1+0x894f9)
    #6 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2635:14 (libgtest.so.1.12.1+0x894f9)
    #7 testing::Test::Run() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2674:5 (libgtest.so.1.12.1+0x6123f)
    #8 testing::TestInfo::Run() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2853:11 (libgtest.so.1.12.1+0x62a05)
    #9 testing::TestSuite::Run() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:3012:30 (libgtest.so.1.12.1+0x63f04)
    #10 testing::internal::UnitTestImpl::RunAllTests() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:5870:44 (libgtest.so.1.12.1+0x7be3d)
    ...

SUMMARY: ThreadSanitizer: data race ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:181:16 in boost::intrusive_ptr<yb::Status::State>::get() const
```

### Fix
Do what value_ does => wrap mock_status_ in boost::atomic.

Jira: DB-13604

Test Plan:
Jenkins
Ran
```
./yb_build.sh tsan --cxx-test integration-tests_clock_synchronization-itest --gtest_filter ClockSynchronizationTest.TestClockSkewError -n 50
```

Reviewers: asrivastava

Reviewed By: asrivastava

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D39315
Summary: Original commit: 21d7ad3 / D39315

### Issue
Test ClockSynchronizationTest.TestClockSkewError fails with tsan failure
```
WARNING: ThreadSanitizer: data race (pid=226462)
  Read of size 8 at 0x7b4000000bf0 by thread T82:
    #0 boost::intrusive_ptr<yb::Status::State>::get() const ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:181:16 (libyb_util.so+0x3c5994)
    #1 bool boost::operator==<yb::Status::State>(boost::intrusive_ptr<yb::Status::State> const&, std::nullptr_t) ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:263:14 (libyb_util.so+0x3c5994)
    #2 yb::Status::ok() const ${YB_SRC_ROOT}/src/yb/util/status.h:120:51 (libyb_util.so+0x3c5994)
    #3 yb::MockClock::Now() ${YB_SRC_ROOT}/src/yb/util/physical_time.cc:141:3 (libyb_util.so+0x3c5994)
    #4 yb::server::HybridClock::NowWithError(yb::HybridTime*, unsigned long*) ${YB_SRC_ROOT}/src/yb/server/hybrid_clock.cc:155:22 (libserver_common.so+0xa5e12)
    #5 yb::server::HybridClock::NowRange() ${YB_SRC_ROOT}/src/yb/server/hybrid_clock.cc:144:3 (libserver_common.so+0xa5ceb)
    #6 yb::ClockBase::Now() ${YB_SRC_ROOT}/src/yb/common/clock.h:26:29 (libtserver.so+0x23a77a)
    #7 yb::tserver::Heartbeater::Thread::TryHeartbeat() ${YB_SRC_ROOT}/src/yb/tserver/heartbeater.cc:437:41 (libtserver.so+0x23a77a)
    #8 yb::tserver::Heartbeater::Thread::DoHeartbeat() ${YB_SRC_ROOT}/src/yb/tserver/heartbeater.cc:650:19 (libtserver.so+0x23d05f)
    #9 yb::tserver::Heartbeater::Thread::RunThread() ${YB_SRC_ROOT}/src/yb/tserver/heartbeater.cc:697:16 (libtserver.so+0x23d74d)
    #10 decltype(*std::declval<yb::tserver::Heartbeater::Thread*&>().*std::declval<void (yb::tserver::Heartbeater::Thread::*&)()>()()) std::__invoke[abi:ue170006]<void (yb::tserver::Heartbeater::Thread::*&)(), yb::tserver::Heartbeater::Thread*&, void>(void (yb::tserver::Heartbeater::Thread::*&)(), yb::tserver::Heartbeater::Thread*&) ${YB_THIRDPARTY_DIR}/installed/tsan/libcxx/include/c++/v1/__type_traits/invoke.h:308:25 (libtserver.so+0x24206b)
    ...

  Previous write of size 8 at 0x7b4000000bf0 by main thread:
    #0 boost::intrusive_ptr<yb::Status::State>::swap(boost::intrusive_ptr<yb::Status::State>&) ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:210:16 (libyb_util.so+0x3c5c54)
    #1 boost::intrusive_ptr<yb::Status::State>::operator=(boost::intrusive_ptr<yb::Status::State>&&) ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:122:61 (libyb_util.so+0x3c5c54)
    #2 yb::Status::operator=(yb::Status&&) ${YB_SRC_ROOT}/src/yb/util/status.h:98:7 (libyb_util.so+0x3c5c54)
    #3 yb::MockClock::Set(yb::PhysicalTime const&) ${YB_SRC_ROOT}/src/yb/util/physical_time.cc:147:16 (libyb_util.so+0x3c5c54)
    #4 yb::ClockSynchronizationTest_TestClockSkewError_Test::TestBody() ${YB_SRC_ROOT}/src/yb/integration-tests/clock_synchronization-itest.cc:131:15 (clock_synchronization-itest+0x12e3ca)
    #5 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2599:10 (libgtest.so.1.12.1+0x894f9)
    #6 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2635:14 (libgtest.so.1.12.1+0x894f9)
    #7 testing::Test::Run() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2674:5 (libgtest.so.1.12.1+0x6123f)
    #8 testing::TestInfo::Run() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:2853:11 (libgtest.so.1.12.1+0x62a05)
    #9 testing::TestSuite::Run() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:3012:30 (libgtest.so.1.12.1+0x63f04)
    #10 testing::internal::UnitTestImpl::RunAllTests() ${YB_THIRDPARTY_DIR}/src/googletest-1.12.1/googletest/src/gtest.cc:5870:44 (libgtest.so.1.12.1+0x7be3d)
    ...

SUMMARY: ThreadSanitizer: data race ${YB_THIRDPARTY_DIR}/installed/tsan/include/boost/smart_ptr/intrusive_ptr.hpp:181:16 in boost::intrusive_ptr<yb::Status::State>::get() const
```

### Fix
Use a mutex to prevent data race on mock_status_.

Jira: DB-13604

Test Plan:
Jenkins
Ran
```
./yb_build.sh tsan --cxx-test integration-tests_clock_synchronization-itest --gtest_filter ClockSynchronizationTest.TestClockSkewError -n 50
```

Backport-through: 2024.2

Reviewers: asrivastava

Reviewed By: asrivastava

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D39359
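The mutex fix in this backport can be sketched as follows. `Status`, `MockClock`, and the `Set`/`Now` signatures here are simplified, hypothetical stand-ins (the real `MockClock::Set` takes a `yb::PhysicalTime`, and the real `Now()` does not use out-parameters). The point is only that every read and write of `mock_status_` happens under the same lock.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for yb::Status: an empty message means OK. Like the
// real Status, it is a plain (non-atomic) object, so an unguarded concurrent
// read and write of it is a data race.
class Status {
 public:
  Status() = default;
  explicit Status(std::string message) : message_(std::move(message)) {}
  bool ok() const { return message_.empty(); }
  const std::string& message() const { return message_; }

 private:
  std::string message_;
};

// Hypothetical MockClock sketch. The test thread calls Set() while a
// heartbeater thread calls Now(), so both touch mock_status_ under mutex_.
class MockClock {
 public:
  void Set(int64_t time_micros, Status status) {
    std::lock_guard<std::mutex> lock(mutex_);
    time_micros_ = time_micros;
    mock_status_ = std::move(status);
  }

  // Returns true and fills *time_micros if the mocked status is OK;
  // otherwise copies the mocked error into *error and returns false.
  bool Now(int64_t* time_micros, Status* error) const {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!mock_status_.ok()) {
      *error = mock_status_;
      return false;
    }
    *time_micros = time_micros_;
    return true;
  }

 private:
  mutable std::mutex mutex_;
  int64_t time_micros_ = 0;
  Status mock_status_;
};
```

Because `Set()` and `Now()` both take `mutex_`, the `ok()` check on the heartbeater thread can no longer overlap the move-assignment on the test's main thread. Those are the two accesses the ThreadSanitizer report above flags.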
Here is my table schema:
CREATE TABLE kairosdb.row_keys (
  metric text,
  row_time timestamp,
  data_type text,
  tags frozen<map<text, text>>,
  value text,
  PRIMARY KEY ((metric, row_time), data_type, tags)
)
Here is the delete query:
DELETE FROM row_keys WHERE metric = ? AND row_time = ?
Here is the error I get:
I should think this would be a supported feature.
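In this schema, `(metric, row_time)` is the partition key and `data_type, tags` are clustering columns, so the DELETE above names the complete partition key and no clustering columns. In Cassandra's CQL, such a statement removes every row in that partition. The following minimal C++ sketch models that expected behavior in memory. It is not YugabyteDB code: `RowKeysTable`, `DeletePartition`, the millisecond `row_time`, and the simplification of `tags` to a string are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical model of kairosdb.row_keys. row_time is simplified to
// milliseconds and the frozen<map<text, text>> tags column to a string.
using PartitionKey = std::pair<std::string, long long>;     // (metric, row_time)
using ClusteringKey = std::pair<std::string, std::string>;  // (data_type, tags)

class RowKeysTable {
 public:
  void Insert(const PartitionKey& pk, const ClusteringKey& ck, std::string value) {
    partitions_[pk][ck] = std::move(value);
  }

  // Models: DELETE FROM row_keys WHERE metric = ? AND row_time = ?
  // Restricting only the full partition key removes every clustering row in
  // that partition. Returns the number of rows removed.
  std::size_t DeletePartition(const PartitionKey& pk) {
    auto it = partitions_.find(pk);
    if (it == partitions_.end()) return 0;
    std::size_t removed = it->second.size();
    partitions_.erase(it);
    return removed;
  }

  std::size_t RowCount(const PartitionKey& pk) const {
    auto it = partitions_.find(pk);
    return it == partitions_.end() ? 0 : it->second.size();
  }

 private:
  // Partition key -> (clustering key -> value); clustering rows stay sorted.
  std::map<PartitionKey, std::map<ClusteringKey, std::string>> partitions_;
};
```

Rows in other partitions, even ones with the same `metric` but a different `row_time`, are untouched. This is why a single partition-key DELETE is the natural way to drop one row-key bucket.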