Camunda docs #2

Merged
merged 32 commits into from
May 19, 2022

Conversation

jayant07-yb
Owner

Added documentation for integrating YugabyteDB (YSQL) with Camunda.

Review comments on docs/content/preview/integrations/camunda.md (outdated, resolved)
@jayant07-yb jayant07-yb merged this pull request into master May 19, 2022
jayant07-yb pushed a commit that referenced this pull request Sep 26, 2022
…t.cc

Summary:
Addresses the following errors:

ASAN error:
```
==47493==ERROR: AddressSanitizer: heap-use-after-free on address 0x6150014911e8 at pc 0x00000046ce8e bp 0x7f84d2a01ee0 sp 0x7f84d2a01ed8
READ of size 4 at 0x6150014911e8 thread T561 (rpc_tp_CDCConsu)
    #0 0x46ce8d in google::protobuf::internal::RepeatedPtrFieldBase::size() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20220322021123-e488f7fa5b-almalinux8-x86_64-clang12/installed/asan/include/google/protobuf/repeated_field.h:1515:10
    #1 0x7f864b3e0118 in google::protobuf::RepeatedPtrField<yb::cdc::CDCRecordPB>::size() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20220322021123-e488f7fa5b-almalinux8-x86_64-clang12/installed/asan/include/google/protobuf/repeated_field.h:1984:32
    #2 0x7f864b3dd9fc in yb::cdc::GetChangesResponsePB::records_size() const $YB_SRC_ROOT/build/asan-clang12-dynamic-ninja/src/yb/cdc/cdc_service.pb.h:9334:19
    #3 0x7f864b3da477 in yb::tserver::enterprise::TwoDCOutputClient::ProcessChangesStartingFromIndex(int) $YB_SRC_ROOT/build/asan-clang12-dynamic-ninja/../../ent/src/yb/tserver/twodc_output_client.cc:213:44
    #4 0x7f864b3dd131 in yb::tserver::enterprise::TwoDCOutputClient::WriteCDCRecordDone(yb::Status const&, yb::tserver::WriteResponsePB const&) $YB_SRC_ROOT/build/asan-clang12-dynamic-ninja/../../ent/src/yb/tserver/twodc_output_client.cc:408:18
```
```
0x6150014911e8 is located 360 bytes inside of 512-byte region [0x615001491080,0x615001491280)
freed by thread T772 (CDCConsumerHand) here:
    #0 0x7f8657e7465d in operator delete(void*) /opt/yb-build/llvm/yb-llvm-v12.0.1-yb-1-1633143152-bdb147e6-almalinux8-x86_64-build/src/llvm-project/compiler-rt/lib/asan/asan_new_delete.cpp:160:3
    #1 0x7f864b3e00f5 in yb::tserver::enterprise::TwoDCOutputClient::~TwoDCOutputClient() $YB_SRC_ROOT/build/asan-clang12-dynamic-ninja/../../ent/src/yb/tserver/twodc_output_client.cc:77:24
```
Fixed by adding a `shutdown_` variable that stops the processing of further records once the
client is being shut down.
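
A minimal sketch of the shutdown-flag idea, assuming a hypothetical `OutputClientSketch` class (not the real `TwoDCOutputClient`). The flag only stops further processing; real code would also have to make sure in-flight callbacks have finished before the object is freed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for an output client; all names are illustrative.
class OutputClientSketch {
 public:
  explicit OutputClientSketch(std::vector<int> records)
      : records_(std::move(records)) {}

  // Called by the owner before the client is destroyed.
  void Shutdown() { shutdown_.store(true); }

  // Processes records starting at `start` and returns how many were handled.
  // Stops as soon as the shutdown flag is observed.
  std::size_t ProcessFrom(std::size_t start) {
    assert(start <= records_.size());
    std::size_t handled = 0;
    for (std::size_t i = start; i < records_.size(); ++i) {
      if (shutdown_.load()) break;  // do not touch further records
      ++handled;
    }
    return handled;
  }

 private:
  std::atomic<bool> shutdown_{false};
  std::vector<int> records_;
};
```

After `Shutdown()` is called, `ProcessFrom` handles no further records, regardless of how many remain.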

---
TSAN error:
```
WARNING: ThreadSanitizer: data race (pid=430)
  Write of size 4 at 0x7b5000b81168 by thread T306 (mutexes: write M1014289907445001104):
    #0 void google::protobuf::internal::RepeatedPtrFieldBase::Clear<google::protobuf::RepeatedPtrField<yb::cdc::CDCRecordPB>::TypeHandler>() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20220420172553-91c632476c-centos7-x86_64-clang12/installed/tsan/include/google/protobuf/repeated_field.h:1593:19 (libcdc_service_proto.so+0xf1df8)
    #1 google::protobuf::RepeatedPtrField<yb::cdc::CDCRecordPB>::Clear() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20220420172553-91c632476c-centos7-x86_64-clang12/installed/tsan/include/google/protobuf/repeated_field.h:2102:25 (libcdc_service_proto.so+0xe5ce9)
    #2 yb::cdc::GetChangesResponsePB::Clear() /nfusr/centos-gcp-cloud/jenkins-worker-2lx048/jenkins/jenkins-github-yugabyte-db-centos-master-clang12-tsan-175/build/tsan-clang12-dynamic-ninja/src/yb/cdc/cdc_service.pb.cc:10118:12 (libcdc_service_proto.so+0xcdaa9)
```
```
 Previous read of size 4 at 0x7b5000b81168 by thread T352:
    #0 google::protobuf::internal::RepeatedPtrFieldBase::size() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20220420172553-91c632476c-centos7-x86_64-clang12/installed/tsan/include/google/protobuf/repeated_field.h:1515:10 (xcluster-tablet-split-itest+0x35072a)
    #1 google::protobuf::RepeatedPtrField<yb::cdc::CDCRecordPB>::size() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20220420172553-91c632476c-centos7-x86_64-clang12/installed/tsan/include/google/protobuf/repeated_field.h:1984:32 (libtserver.so+0x591ff9)
    #2 yb::cdc::GetChangesResponsePB::records_size() const /nfusr/centos-gcp-cloud/jenkins-worker-2lx048/jenkins/jenkins-github-yugabyte-db-centos-master-clang12-tsan-175/build/tsan-clang12-dynamic-ninja/src/yb/cdc/cdc_service.pb.h:9367:19 (libtserver.so+0x59028d)
    #3 yb::tserver::enterprise::TwoDCOutputClient::ProcessChangesStartingFromIndex(int) /nfusr/centos-gcp-cloud/jenkins-worker-2lx048/jenkins/jenkins-github-yugabyte-db-centos-master-clang12-tsan-175/build/tsan-clang12-dynamic-ninja/../../ent/src/yb/tserver/twodc_output_client.cc:213:44 (libtserver.so+0x58e8ee)
    #4 yb::tserver::enterprise::TwoDCOutputClient::WriteCDCRecordDone(yb::Status const&, yb::tserver::WriteResponsePB const&) /nfusr/centos-gcp-cloud/jenkins-worker-2lx048/jenkins/jenkins-github-yugabyte-db-centos-master-clang12-tsan-175/build/tsan-clang12-dynamic-ninja/../../ent/src/yb/tserver/twodc_output_client.cc:408:18 (libtserver.so+0x58fa0e)
```
Fixed by returning immediately after ProcessSplitOp completes when there are no more records to
process. Previously we would only `continue` the loop, and re-evaluating the loop condition
would still access the `twodc_resp_copy_` object.
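
The difference between `continue` and `return` can be sketched as follows. `CountingResponse` and `ProcessFrom` are hypothetical names for illustration, and the string `"split"` stands in for a split op; the counter makes the extra read of the response visible.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical response wrapper that counts how often its size is read.
struct CountingResponse {
  std::vector<std::string> records;
  mutable int size_reads = 0;
  int records_size() const {
    ++size_reads;
    return static_cast<int>(records.size());
  }
};

// Walks the records from `start`. With `return_after_last_split` set, the
// function returns right after a trailing split op, so the loop condition
// (and therefore records_size()) is not evaluated again.
void ProcessFrom(const CountingResponse& resp, int start,
                 bool return_after_last_split) {
  assert(start >= 0);
  for (int i = start; i < resp.records_size(); ++i) {
    if (resp.records[i] == "split") {
      const bool is_last = (i + 1 == static_cast<int>(resp.records.size()));
      // ... handle the split op here ...
      if (is_last && return_after_last_split) return;
      continue;  // re-evaluates the loop condition, reading the response
    }
    // ... apply a regular record here ...
  }
}
```

With two records ending in a split op, the early return reads the size twice, while `continue` reads it a third time after the split op has been handled.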

Test Plan:
```
ybd tsan --cxx-test integration-tests_xcluster-tablet-split-itest --gtest_filter XClusterTabletSplitITest.SplittingOnProducerAndConsumer
ybd asan --cxx-test integration-tests_xcluster-tablet-split-itest --gtest_filter XClusterTabletSplitITest.SplittingOnProducerAndConsumer
```

Reviewers: rahuldesirazu, nicolas

Reviewed By: nicolas

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D16955
jayant07-yb pushed a commit that referenced this pull request Sep 26, 2022
… behavior on postgres shutdown

Summary:
The diff consists of 3 parts.

**Part #1:** Multiple calls of the `PerformFuture::Get` method
When a session is closed while temp tables exist, the following stack is possible:

```
std::__1::future<yb::pggate::PerformResult>::get()
yb::pggate::PerformFuture::Get()                   // Second call of the PerformFuture::Get on same object
yb::pggate::PgDocResponse::Get()
yb::pggate::PgDocOp::~PgDocOp()
yb::pggate::PgDocReadOp::~PgDocReadOp()
...
yb::pggate::PgDml::~PgDml()
yb::pggate::PgDmlRead::~PgDmlRead()
yb::pggate::PgSelect::~PgSelect()
...
yb::pggate::PgMemctx::Clear()
yb::pggate::PgMemctx::~PgMemctx()
...
yb::pggate::ClearGlobalPgMemctxMap()
YBCDestroyPgGate
YBOnPostgresBackendShutdown
quickdie                                           // Caused by interruption by the SIGQUIT signal
...
std::__1::future<yb::pggate::PerformResult>::get()
yb::pggate::PerformFuture::Get()                   // First call of the PerformFuture::Get
yb::pggate::PgDocResponse::Get()
yb::pggate::PgDocOp::GetResult(std::__1::list<yb::pggate::PgDocResult, std::__1::allocator<yb::pggate::PgDocResult> >*)
yb::pggate::PgDml::FetchDataFromServer()
yb::pggate::PgDml::Fetch(int, unsigned long*, bool*,
yb::pggate::PgApiImpl::DmlFetch(yb::pggate::PgStatement*, int, unsigned long*,
YBCPgDmlFetch
ybcFetchNextHeapTuple
ybc_getnext_heaptuple
ybc_systable_getnext
systable_getnext
findDependentObjects
performDeletion
RemoveTempRelations
RemoveTempRelationsCallback
shmem_exit
proc_exit_prepare
proc_exit
PostgresMain
BackendRun
BackendStartup
ServerLoop
PostmasterMain
PostgresServerProcessMain
main
```

In this stack, the `PerformFuture::Get` method is called twice on the same object. However, the `PerformFuture` object is designed so that `Get` is called only once: after the first call, `PerformFuture::Valid` should return `false` and `Get` should not be called again. In the stack above, the second call of `PerformFuture::Get` happens before the first one has returned (due to interruption by the `SIGQUIT` signal).

To handle this situation, the `session_` field is set to `null` at the very beginning of `PerformFuture::Get`. When `session_` is `null`, `PerformFuture::Valid` returns `false`.
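
A minimal sketch of this invalidate-first pattern, assuming a hypothetical `PerformFutureSketch` class in which a `std::function` stands in for the blocking wait on the real future. It is not the actual pggate implementation.

```cpp
#include <cassert>
#include <functional>
#include <utility>

struct Session {};  // hypothetical placeholder for the real session type

// Illustrative single-use future: Get() may be called at most once.
class PerformFutureSketch {
 public:
  PerformFutureSketch(std::function<int()> wait, Session* session)
      : wait_(std::move(wait)), session_(session) {}

  bool Valid() const { return session_ != nullptr; }

  int Get() {
    assert(Valid());
    // Invalidate before blocking: if shutdown code runs while wait_() is
    // still in progress, it sees Valid() == false and does not call Get()
    // a second time.
    session_ = nullptr;
    return wait_();
  }

 private:
  std::function<int()> wait_;
  Session* session_;
};
```

Code that runs while the wait is still in progress (for example, destructors triggered from a signal handler) can check `Valid()` and skip the second `Get()`.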

**Part #2:** Access to the deleted `PgApiImpl` object
The `YBCDestroyPgGate` function destroys the `PgApiImpl` object and then calls the `ClearGlobalPgMemctxMap` function. But that function may run code that accesses fields of the `PgApiImpl` object. Here is the stack trace:

```
std::__1::unique_ptr<yb::pggate::PgClient::Impl, std::__1::default_delete<yb::pggate::PgClient::Impl> >::operator->() const
yb::pggate::PgClient::FinishTransaction(yb::StronglyTypedBool<yb::pggate::Commit_Tag>, yb::StronglyTypedBool<yb::pggate::DdlMode_Tag>)
yb::pggate::PgTxnManager::ExitSeparateDdlTxnMode(yb::StronglyTypedBool<yb::pggate::Commit_Tag>)
yb::pggate::PgTxnManager::FinishTransaction(yb::StronglyTypedBool<yb::pggate::Commit_Tag>)
yb::pggate::PgTxnManager::AbortTransaction()
yb::pggate::PgTxnManager::~PgTxnManager()
...
yb::pggate::PgSession::~PgSession()
...
yb::pggate::PgStatement::~PgStatement()
yb::pggate::PgDml::~PgDml()
yb::pggate::PgDmlRead::~PgDmlRead()
yb::pggate::PgSelect::~PgSelect()
...
yb::pggate::PgMemctx::Clear()
yb::pggate::PgMemctx::~PgMemctx()
...
yb::pggate::ClearGlobalPgMemctxMap()
...
```

The solution is to move the PgMemctx map into the `PgApiImpl` object (this also fixes yugabyte#7216).

**Part #3:** Access to the shut-down `PgClient` object

The `~PgApiImpl()` destructor explicitly shuts down the `pg_client_` object while `pg_session_` is still alive. When `pg_session_` is later destroyed, it may still use the `pg_client_` object.

The stack is:

```
yb::pggate::PgClient::Impl::FinishTransaction(yb::StronglyTypedBool<yb::pggate::Commit_Tag>, yb::StronglyTypedBool<yb::pggate::DdlMode_Tag>)
yb::pggate::PgClient::FinishTransaction(yb::StronglyTypedBool<yb::pggate::Commit_Tag>, yb::StronglyTypedBool<yb::pggate::DdlMode_Tag>)
yb::pggate::PgTxnManager::ExitSeparateDdlTxnMode(yb::StronglyTypedBool<yb::pggate::Commit_Tag>)
yb::pggate::PgTxnManager::FinishTransaction(yb::StronglyTypedBool<yb::pggate::Commit_Tag>)
yb::pggate::PgTxnManager::AbortTransaction()
yb::pggate::PgTxnManager::~PgTxnManager()
...
yb::pggate::PgSession::~PgSession()
...
yb::pggate::PgApiImpl::~PgApiImpl()
...
YBCDestroyPgGate
YBOnPostgresBackendShutdown
quickdie
...
```

The solution is to explicitly destroy `pg_session_` before shutting down `pg_client_`.
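
The ordering can be sketched with three hypothetical classes (`ClientSketch`, `SessionSketch`, `ApiSketch`) that only mimic the ownership described above. The log records which state the client was in when the session's destructor used it.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical client that records what happens to it.
class ClientSketch {
 public:
  explicit ClientSketch(std::vector<std::string>* log) : log_(log) {}
  void Shutdown() {
    shut_down_ = true;
    log_->push_back("client shutdown");
  }
  void FinishTransaction() {
    log_->push_back(shut_down_ ? "finish on shut-down client"
                               : "finish on live client");
  }

 private:
  std::vector<std::string>* log_;
  bool shut_down_ = false;
};

// Hypothetical session whose destructor still needs the client.
class SessionSketch {
 public:
  explicit SessionSketch(ClientSketch* client) : client_(client) {
    assert(client_ != nullptr);
  }
  ~SessionSketch() { client_->FinishTransaction(); }

 private:
  ClientSketch* client_;
};

class ApiSketch {
 public:
  explicit ApiSketch(std::vector<std::string>* log)
      : client_(new ClientSketch(log)),
        session_(new SessionSketch(client_.get())) {}
  ~ApiSketch() {
    session_.reset();     // destroy the session while the client is usable
    client_->Shutdown();  // only then shut the client down
  }

 private:
  std::unique_ptr<ClientSketch> client_;
  std::unique_ptr<SessionSketch> session_;
};
```

Destroying an `ApiSketch` logs "finish on live client" followed by "client shutdown"; swapping the two statements in the destructor would reverse that order.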

Test Plan:
Jenkins

The following tests crashed without the fix:

```
./yb_build.sh asan --gtest_filter PgFKeyTest.AddFKCorrectnessOnTempTables
./yb_build.sh asan --gtest_filter PgCatalogPerfTest.CacheRefreshRPCCountWithPartitionTables
./yb_build.sh asan --gtest_filter PgMiniTest.BigInsertWithRestart
./yb_build.sh asan --gtest_filter AlterTableWithConcurrentTxnTest.TServerLeaderChange
```

Reviewers: dsrinivasan, jason, alex

Reviewed By: alex

Subscribers: mbautin, bogdan, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D17259
jayant07-yb pushed a commit that referenced this pull request Sep 26, 2022
…le changing oom_score_adj

Summary:
When a `ysqlsh` shell is opened, the postmaster tries to set `oom_score_adj` to a specific value by writing to the file `/proc/[pid]/oom_score_adj`. However, the postmaster sometimes crashes while setting this value, and the client reports the following error.

```
ysqlsh: could not connect to server: Connection refused\n\tIs the server running on host "172.151.31.53" and accepting\n\tTCP/IP connections on port 5433?
```

```
warning: File "/home/yugabyte/yb-software/yugabyte-2.15.1.0-b3-centos-x86_64/linuxbrew/lib/libthread_db.so.1" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
To enable execution of this file add
	add-auto-load-safe-path /home/yugabyte/yb-software/yugabyte-2.15.1.0-b3-centos-x86_64/linuxbrew/lib/libthread_db.so.1
line to your configuration file "/root/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/root/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

warning: File "/home/yugabyte/yb-software/yugabyte-2.15.1.0-b3-centos-x86_64/linuxbrew/lib/libthread_db.so.1" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
Core was generated by `/home/yugabyte/yb-software/yugabyte-2.15.1.0-b3-centos-x86_64/postgres/bin/post'.
Program terminated with signal 11, Segmentation fault.
#0  _IO_new_fclose (fp=0x0) at iofclose.c:53
53	iofclose.c: No such file or directory.
(gdb) bt
#0  _IO_new_fclose (fp=0x0) at iofclose.c:53
#1  0x00000000008ab237 in BackendStartup (port=0x1b7c1e0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4255
#2  ServerLoop () at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1767
#3  0x00000000008a7d01 in PostmasterMain (argc=<optimized out>, argv=0x1b92780) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1423
#4  0x00000000007c85c3 in PostgresServerProcessMain (argc=23, argv=0x1b92780) at ../../../../../../src/postgres/src/backend/main/main.c:234
#5  0x00000000004f5ae2 in main ()
```

This happens because `fclose` is still called on the file pointer when `fopen` fails, and in that case the pointer is NULL (`fp=0x0` in the backtrace above).
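
A minimal sketch of the guarded pattern, assuming a hypothetical helper `WriteOomScoreAdj` (this is not the postmaster code). The error text only mirrors the shape of the log message in this report.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Hypothetical helper: writes `value` to `path`.
// Returns false if the file cannot be opened or written.
bool WriteOomScoreAdj(const char* path, const char* value) {
  assert(path != nullptr && value != nullptr);
  std::FILE* fp = std::fopen(path, "w");
  if (fp == nullptr) {
    const int saved_errno = errno;  // keep errno before any other call
    std::fprintf(stderr, "error %d: %s, unable to open file %s\n",
                 saved_errno, std::strerror(saved_errno), path);
    return false;  // fclose must not be called on a null FILE*
  }
  bool ok = std::fputs(value, fp) >= 0;
  ok = (std::fclose(fp) == 0) && ok;  // fclose only on a valid FILE*
  return ok;
}
```

Passing a path that cannot be opened makes the function log the error and return `false` instead of crashing.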

The diff that introduced this regression: https://phabricator.dev.yugabyte.com/D14099

The log message looks like this:

```
2022-06-09 19:12:18.685 UTC [17266] LOG:  error 2: No such file or directory, unable to open file /proc/17390/oom_score_adj123
```

Test Plan:
Force `fopen` to fail by providing a garbage path for oom_score_adj.
Make sure that, after the fix, the postmaster neither restarts nor segfaults.

Reviewers: sagarwal, zyu, mihnea, smishra

Reviewed By: zyu, smishra

Subscribers: rthallam, kannan, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D17567
jayant07-yb pushed a commit that referenced this pull request Sep 26, 2022
…f tserver

Summary:
For a universe where data can be moved off a node, we now wait for tablets to move off a tserver that was first stopped on Platform and then removed. Previously, if a node was stopped on Platform and then removed, we did not wait for the tablets to move off and simply proceeded with removing the node from the universe.

We try to move tablets off of a node when possible.

We remove the `isTServer` condition in `UpdatePlacementInfo.java` when blacklisting nodes: when a node is stopped, its `isTServer` value is set to `false`, but we still want to blacklist that node.

Test Plan:
Some things to understand beforehand:
1. When a node’s tserver/master is not running, isTserver/isMaster is false
2. If a node is stopped, the node is still alive, just that the tserver/master process is not running, thus `isTserverAliveOnNode` will be false on a stopped node

Create a GCP universe with 6 nodes and rf3, with AZs comprising us-west-1, us-west-2, and us-east1. For each of the tests below, start from a clean universe in which all the nodes are live.

Perform the following tests:

a) Happy path, stopping and then immediately removing a node from the universe
  1. Stop a node in us-west-2
  2. Immediately after the node is stopped, remove the same node from the universe
  3. Go to master UI at <master-ip>:7000/tablet-servers, on a node that is currently not being removed
  4. Keep refreshing the page, we should see the values for the node's `User Tablet-Peers / Leaders` slowly decrease until it hits 0 / 0
  5. The node should successfully be removed

b) Edge Case #1, Only removing a node from the universe
  1. Remove a node in us-west-2 from the universe
  2. Go to master UI at <master-ip>:7000/tablet-servers, on a node that is currently not being removed
  3. Keep refreshing the page, we should see the values for the node's `User Tablet-Peers / Leaders` slowly decrease until it hits 0 / 0
  4. The node should successfully be removed

c) Edge Case #2: Stopping a node, wait until tablets are moved off, then remove node from universe
  1. Stop a node in us-west-2
  2. Go to master UI at <master-ip>:7000/tablet-servers
  3. Wait for around 10 - 15 mins, tablets from the stopped node should be moved off automatically after this timeframe, i.e. under the `User Tablet-Peers / Leaders` column, that node should display 0 / 0.
  4. Remove the same node from the universe
  5. Since the tablets are already moved off the node, this node should not have much of a wait time for node removal
  6. The node should successfully be removed

d) Edge case #3: Remove 2 nodes from the same AZ
  1. Remove a node in us-west-2 from the universe
  2. Remove another node from us-west-2
  3. For the second node, since there is nowhere for the tablets to go, we will not wait for the tablets to move, so on the master UI we should see x / 0 under the `User Tablet-Peers / Leaders` column, where 'x' is the number of tablet peers. The RemoveNodeFromUniverse task should finish. However, the value of 'x' should slowly decrease until it hits 0.

e) Edge case #4: Remove as many nodes as possible on Platform
  1. We should only be able to remove at most 1 node with a master server on it to maintain a majority of tablet peers (in our case, we have rf3, so 3 master servers, thus we can only remove one master server).
  2. We should be able to remove all nodes with only tservers

Every area that uses `UpdatePlacementInfo.java` has 0 nodes to be blacklisted, except `EditKubernetesUniverse.java`, where we are already using tservers, so it is safe to remove the `isTserver` check in `UpdatePlacementInfo.java`.

Reviewers: sanketh, nsingh

Reviewed By: nsingh

Subscribers: yugaware

Differential Revision: https://phabricator.dev.yugabyte.com/D18596
jayant07-yb pushed a commit that referenced this pull request Sep 26, 2022
Summary:
BlockingQueueDeathTest.TestPointerParamsMustBeEmptyOnDestruct was failing under ASAN on Jenkins.
```
    Result: died but not with expected error.
  Expected: BlockingQueue holds bare pointers
Actual msg:
[  DEATH   ]
```

It turned out that gtest's own unit tests in `gtest-death-test_test` also fail on Jenkins, with a common failure pattern:
```
[  DEATH   ] AddressSanitizer:DEADLYSIGNAL
[  DEATH   ] =================================================================
[  DEATH   ] ==2262872==ERROR: AddressSanitizer: stack-overflow on address 0x7f823b324f80 (pc 0x7f823b2edc03 bp 0x7f823b325fb0 sp 0x7f823b324f80 T0)
[  DEATH   ]     #0 0x7f823b2edc03  (/lib64/ld-linux-x86-64.so.2+0x1c03)
[  DEATH   ]     #1 0x7f823a362684  (/nfusr/alma8-gcp-cloud/jenkins-worker-74vryc/jenkins/ty/yugabyte-db-thirdparty/build/asan/gmock-1.8.0/shared/googlemock/gtest/libgtest.so+0x134684)
[  DEATH   ]     #2 0x7f8239900dd2  (/lib64/libc.so.6+0x39dd2)
[  DEATH   ]
[  DEATH   ] SUMMARY: AddressSanitizer: stack-overflow (/lib64/ld-linux-x86-64.so.2+0x1c03)
[  DEATH   ] ==2262872==ABORTING
[  DEATH   ]
```

This is related to https://chromium.googlesource.com/external/github.com/pwnall/googletest/+/681454dae48f109abf68c424c9d2e6db9a092238. With ASAN on x86_64, `ExecDeathTestChildMain` has a frame size of 1728 bytes. The call to `chdir()` in `ExecDeathTestChildMain` ends up in `_dl_runtime_resolve_xsavec`, which attempts to save register state on the stack, and the XSAVE register save area is larger than 1728 bytes on Jenkins nodes:
```
$ cpuid -i | grep -i xsave
...
   XSAVE features (0xd/0):
      bytes required by XSAVE/XRSTOR area     = 0x00000a80 (2688)

$ cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) CPU @ 2.80GHz
```

GoogleTest 1.11.0 has a fix for this issue: google/googletest@681454d.
We decided to upgrade to GoogleTest 1.12.0, the latest release, which also contains the fix.

Test Plan: Jenkins

Reviewers: esheng, jhe, mbautin

Reviewed By: esheng, jhe, mbautin

Subscribers: bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D19390
jayant07-yb pushed a commit that referenced this pull request Dec 2, 2022
…e image

Summary:
We observed a crash while running TPCC workload with CDCSDK enabled.
The stack trace is:

```
(gdb) bt
#0  0x0000557f25b11910 in yb::DatumMessagePB::MergeFrom(yb::DatumMessagePB const&) ()
#1  0x0000557f258a41ef in yb::cdc::PopulateBeforeImage(std::__1::shared_ptr<yb::tablet::TabletPeer> const&, yb::ReadHybridTime const&, yb::cdc::RowMessage*, std::__1::unordered_map<unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > > const&, std::__1::unordered_map<unsigned int, std::__1::vector<yb::master::PgAttributePB, std::__1::allocator<yb::master::PgAttributePB> >, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::vector<yb::master::PgAttributePB, std::__1::allocator<yb::master::PgAttributePB> > > > > const&, yb::docdb::SubDocKey const&, yb::Schema const&, unsigned int) ()
#2  0x0000557f258a7304 in yb::cdc::PopulateCDCSDKIntentRecord(yb::OpId const&, yb::StronglyTypedUuid<yb::TransactionId_Tag> const&, std::__1::vector<yb::docdb::IntentKeyValueForCDC, std::__1::allocator<yb::docdb::IntentKeyValueForCDC> > const&, yb::cdc::StreamMetadata const&, std::__1::shared_ptr<yb::tablet::TabletPeer> const&, std::__1::unordered_map<unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > > const&, std::__1::unordered_map<unsigned int, std::__1::vector<yb::master::PgAttributePB, std::__1::allocator<yb::master::PgAttributePB> >, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::vector<yb::master::PgAttributePB, std::__1::allocator<yb::master::PgAttributePB> > > > > const&, yb::cdc::GetChangesResponsePB*, yb::ScopedTrackedConsumption*, unsigned int*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*, yb::Schema*, unsigned int, unsigned long const&) ()
#3  0x0000557f258aaa27 in yb::cdc::ProcessIntents(yb::OpId const&, yb::StronglyTypedUuid<yb::TransactionId_Tag> const&, yb::cdc::StreamMetadata const&, std::__1::unordered_map<unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > > const&, std::__1::unordered_map<unsigned int, std::__1::vector<yb::master::PgAttributePB, std::__1::allocator<yb::master::PgAttributePB> >, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::vector<yb::master::PgAttributePB, std::__1::allocator<yb::master::PgAttributePB> > > > > const&, yb::cdc::GetChangesResponsePB*, yb::ScopedTrackedConsumption*, yb::cdc::CDCSDKCheckpointPB*, std::__1::shared_ptr<yb::tablet::TabletPeer> const&, std::__1::vector<yb::docdb::IntentKeyValueForCDC, std::__1::allocator<yb::docdb::IntentKeyValueForCDC> >*, yb::docdb::ApplyTransactionState*, yb::client::YBClient*, std::__1::shared_ptr<yb::Schema>*, unsigned int*, unsigned long const&) ()
#4  0x0000557f258b00c1 in yb::cdc::GetChangesForCDCSDK(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, yb::cdc::CDCSDKCheckpointPB const&, yb::cdc::StreamMetadata const&, std::__1::shared_ptr<yb::tablet::TabletPeer> const&, std::__1::shared_ptr<yb::MemTracker> const&, std::__1::unordered_map<unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > > const&, std::__1::unordered_map<unsigned int, std::__1::vector<yb::master::PgAttributePB, std::__1::allocator<yb::master::PgAttributePB> >, std::__1::hash<unsigned int>, std::__1::equal_to<unsigned int>, std::__1::allocator<std::__1::pair<unsigned int const, std::__1::vector<yb::master::PgAttributePB, std::__1::allocator<yb::master::PgAttributePB> > > > > const&, yb::client::YBClient*, yb::consensus::ReplicateMsgsHolder*, yb::cdc::GetChangesResponsePB*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*, std::__1::shared_ptr<yb::Schema>*, unsigned int*, yb::OpId*, long*, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >) ()
#5  0x0000557f2586c448 in yb::cdc::CDCServiceImpl::GetChanges(yb::cdc::GetChangesRequestPB const*, yb::cdc::GetChangesResponsePB*, yb::rpc::RpcContext) ()
#6  0x0000557f25908246 in std::__1::__function::__func<yb::cdc::CDCServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_3, std::__1::allocator<yb::cdc::CDCServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_3>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) ()
#7  0x0000557f2590a6af in yb::cdc::CDCServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#8  0x0000557f26227a1e in yb::rpc::ServicePoolImpl::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) ()
#9  0x0000557f2616db2f in yb::rpc::InboundCall::InboundCallTask::Run() ()
#10 0x0000557f26236583 in yb::rpc::(anonymous namespace)::Worker::Execute() ()
#11 0x0000557f268698cf in yb::Thread::SuperviseThread(void*) ()
#12 0x00007fa6fce89694 in ?? ()
#13 0x0000000000000000 in ?? ()
```

The problem is in the `PopulateBeforeImage` method.
When a column is dropped, the row has no data for that column, so nothing is added for it to the `old_tuple` member of `RowMessage`. As a result, the size of `old_tuple` does not match the number of columns in the schema,
which means the line `row_message->old_tuple(static_cast<int>(index))` can access an out-of-bounds index.
Instead, we now keep track of the columns actually found in the row.
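
The indexing problem can be sketched as follows, assuming a hypothetical `BuildOldTuple` helper that uses plain strings in place of the real protobuf messages. The tuple is indexed by the number of columns found so far, not by the schema index.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: builds a before-image tuple from the schema columns
// and the values actually present in the stored row.
std::vector<std::string> BuildOldTuple(
    const std::vector<std::string>& schema_columns,
    const std::unordered_map<std::string, std::string>& row_values) {
  std::vector<std::string> old_tuple;
  std::size_t found_columns = 0;
  for (const auto& column : schema_columns) {
    const auto it = row_values.find(column);
    if (it == row_values.end()) continue;  // e.g. a dropped column: no entry
    old_tuple.push_back(it->second);
    // Index by the count of found columns. Using the schema index here
    // would run past the end once any column is missing from the row.
    old_tuple[found_columns] = column + "=" + old_tuple[found_columns];
    ++found_columns;
  }
  assert(found_columns == old_tuple.size());
  return old_tuple;
}
```

For a schema with columns `a`, `b`, `c` and a row that only has `a` and `c`, the tuple has two entries, and the entry for `c` lives at index 1 rather than at schema index 2.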

Test Plan: Running existing ctests

Reviewers: srangavajjula, sdash, skumar

Reviewed By: sdash, skumar

Differential Revision: https://phabricator.dev.yugabyte.com/D21338
jayant07-yb pushed a commit that referenced this pull request Dec 7, 2022
…e image

Differential Revision: https://phabricator.dev.yugabyte.com/D21338