[DocDB] Core dump after PITR restore with many tables #23399

Closed
1 task done
pilshchikov opened this issue Aug 5, 2024 · 0 comments
Labels: area/docdb, kind/bug, priority/medium, qa_automation, qa_stress

pilshchikov commented Aug 5, 2024

Jira Link: DB-12318

Description

Test case:

  1. Start a 3-node, RF=3 cluster on c5.xlarge instances.
  2. Run the following loop:
    2.1. Remove old databases, tables, and PITR schedules.
    2.2. Create a new database.
    2.3. Create 1000 + (loop cycle * 100) tables.
    2.4. Enable PITR.
    2.5. Start the sample-apps workload with DDL operations enabled: --workload SqlDataLoadWithDDL --num_writes -1 --num_reads -1 --num_threads_write 30 --num_threads_read 10 --num_unique_keys 999000000000 --num_value_columns 30 --ddl_operations UPDATE_ROW,INSERT_ROW,DELETE_ROW,DROP_COLUMN,ADD_COLUMN,CHANGE_TYPE --ddl_weights 100,100,10,1,1,1 --use_datatypes true --num_tables 1200 --create_table_name test --batch_size 50 --large_value_multiplier 10
    2.6. Wait 20 minutes (nemesis events such as a VM restart or a network partition may happen in the background).
    2.7. Restore to a point in time 10 minutes ago.

At step 2.7 of cycle 3, the cluster crashed with a core dump:

(lldb) target create "/home/yugabyte/yb-software/yugabyte-2.23.0.0-b676-centos-x86_64/bin/yb-server" --core "/home/yugabyte/cores/core_1402_1722668891_!home!yugabyte!yb-software!yugabyte-2.23.0.0-b676-centos-x86_64!bin!yb-server"
Core file '/home/yugabyte/cores/core_1402_1722668891_!home!yugabyte!yb-software!yugabyte-2.23.0.0-b676-centos-x86_64!bin!yb-server' (x86_64) was loaded.
(lldb) bt all
error: yb-server GetDIE for DIE 0x40 is outside of its CU 0x6e77a1b
...... many lines like that ......
error: yb-server GetDIE for DIE 0x40 is outside of its CU 0x6e77a1b
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x00005618a3f27877 yb-server`std::__1::__function::__func<yb::tablet::TransactionParticipant::Impl::StopActiveTxnsPriorTo(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>, yb::StronglyTypedUuid<yb::TransactionId_Tag>*)::'lambda'(yb::Result<yb::TransactionStatusResult>), std::__1::allocator<yb::tablet::TransactionParticipant::Impl::StopActiveTxnsPriorTo(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>, yb::StronglyTypedUuid<yb::TransactionId_Tag>*)::'lambda'(yb::Result<yb::TransactionStatusResult>)>, void (yb::Result<yb::TransactionStatusResult>)>::operator()(yb::Result<yb::TransactionStatusResult>&&) [inlined] unsigned long std::__1::__cxx_atomic_fetch_sub[abi:ue170006]<unsigned long>(__a=0x00000000000494ca, __delta=1, __order=acq_rel) at cxx_atomic_impl.h:464:10
    frame #1: 0x00005618a3f27877 yb-server`std::__1::__function::__func<yb::tablet::TransactionParticipant::Impl::StopActiveTxnsPriorTo(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>, yb::StronglyTypedUuid<yb::TransactionId_Tag>*)::'lambda'(yb::Result<yb::TransactionStatusResult>), std::__1::allocator<yb::tablet::TransactionParticipant::Impl::StopActiveTxnsPriorTo(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>, yb::StronglyTypedUuid<yb::TransactionId_Tag>*)::'lambda'(yb::Result<yb::TransactionStatusResult>)>, void (yb::Result<yb::TransactionStatusResult>)>::operator()(yb::Result<yb::TransactionStatusResult>&&) [inlined] std::__1::__atomic_base<unsigned long, true>::fetch_sub[abi:ue170006](this=0x00000000000494ca, __op=1, __m=acq_rel) at atomic_base.h:171:14
    frame #2: 0x00005618a3f27877 yb-server`std::__1::__function::__func<yb::tablet::TransactionParticipant::Impl::StopActiveTxnsPriorTo(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>, yb::StronglyTypedUuid<yb::TransactionId_Tag>*)::'lambda'(yb::Result<yb::TransactionStatusResult>), std::__1::allocator<yb::tablet::TransactionParticipant::Impl::StopActiveTxnsPriorTo(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>, yb::StronglyTypedUuid<yb::TransactionId_Tag>*)::'lambda'(yb::Result<yb::TransactionStatusResult>)>, void (yb::Result<yb::TransactionStatusResult>)>::operator()(yb::Result<yb::TransactionStatusResult>&&) [inlined] yb::intrusive_ptr_release(state=0x00000000000494ca) at status.cc:637:22
    frame #3: 0x00005618a3f27877 yb-server`std::__1::__function::__func<yb::tablet::TransactionParticipant::Impl::StopActiveTxnsPriorTo(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>, yb::StronglyTypedUuid<yb::TransactionId_Tag>*)::'lambda'(yb::Result<yb::TransactionStatusResult>), std::__1::allocator<yb::tablet::TransactionParticipant::Impl::StopActiveTxnsPriorTo(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>, yb::StronglyTypedUuid<yb::TransactionId_Tag>*)::'lambda'(yb::Result<yb::TransactionStatusResult>)>, void (yb::Result<yb::TransactionStatusResult>)>::operator()(yb::Result<yb::TransactionStatusResult>&&) [inlined] boost::intrusive_ptr<yb::Status::State>::~intrusive_ptr(this=<unavailable>) at intrusive_ptr.hpp:98:23
    frame #4: 0x00005618a3f27872 yb-server`std::__1::__function::__func<yb::tablet::TransactionParticipant::Impl::StopActiveTxnsPriorTo(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>, yb::StronglyTypedUuid<yb::TransactionId_Tag>*)::'lambda'(yb::Result<yb::TransactionStatusResult>), std::__1::allocator<yb::tablet::TransactionParticipant::Impl::StopActiveTxnsPriorTo(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>, yb::StronglyTypedUuid<yb::TransactionId_Tag>*)::'lambda'(yb::Result<yb::TransactionStatusResult>)>, void (yb::Result<yb::TransactionStatusResult>)>::operator()(yb::Result<yb::TransactionStatusResult>&&) [inlined] boost::intrusive_ptr<yb::Status::State>::operator=(this=0x00007fdfa87530c0, rhs=0x00007fdfd8b39d58) at intrusive_ptr.hpp:154:9

.... trimmed  ....

Full text in attachment
core.txt

This happened only once, on 2.23.0.0-b676.

Logs are in the Jira attachments.

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@pilshchikov pilshchikov added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Aug 5, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Aug 5, 2024
@pilshchikov pilshchikov added qa_automation Bugs identified via itest-system, LST, Stress automation or causing automation failures qa_stress Bugs identified via Stress automation labels Aug 6, 2024
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Aug 26, 2024
amitanandaiyer added a commit that referenced this issue Sep 12, 2024
…in the call back

Summary:
Use a shared_ptr instead of variables on the stack, to handle the case where the callback may run after the function has exited (due to a timeout).
Jira: DB-12318

Test Plan: yb_build.sh --cxx-test yb-admin-snapshot-schedule-test --gtest_filter YbAdminSnapshotScheduleTestWithYsql.TransactionDuringPITRRepro23399

Reviewers: asrivastava

Reviewed By: asrivastava

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D37883
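
The pattern described in the commit summary above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual YugabyteDB code: `AbortState`, `Callback`, and `WaitForAll` are hypothetical names, and plain worker threads stand in for asynchronous status replies. The point it tries to show is that each callback owns a copy of a `shared_ptr` to the shared state, so a reply that arrives after the waiting function has timed out and returned still writes to live memory rather than to a dead stack frame.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical shared state updated by the callbacks.
// Heap-allocated so that it can outlive the function that created it.
struct AbortState {
  std::mutex mu;
  std::condition_variable cv;
  std::size_t pending = 0;
  std::string first_error;  // empty means "no error"
};

using Callback = std::function<void(const std::string& error)>;

// Hypothetical waiter: starts one simulated async reply per entry in `delays`
// and waits up to `timeout` for all of them. Returns true if every reply
// arrived in time. The caller must join `workers` before exiting.
bool WaitForAll(const std::vector<std::chrono::milliseconds>& delays,
                std::chrono::milliseconds timeout,
                std::vector<std::thread>* workers) {
  auto state = std::make_shared<AbortState>();
  state->pending = delays.size();

  for (auto delay : delays) {
    // Capturing `state` by value keeps the AbortState alive until the last
    // callback copy is destroyed, even if WaitForAll has already returned.
    // Capturing stack locals by reference here would dangle after a timeout.
    Callback cb = [state](const std::string& error) {
      std::lock_guard<std::mutex> lock(state->mu);
      assert(state->pending > 0);
      if (!error.empty() && state->first_error.empty()) {
        state->first_error = error;
      }
      if (--state->pending == 0) {
        state->cv.notify_all();
      }
    };
    workers->emplace_back([delay, cb] {
      std::this_thread::sleep_for(delay);  // simulated network latency
      cb("");                              // simulated successful reply
    });
  }

  std::unique_lock<std::mutex> lock(state->mu);
  return state->cv.wait_for(lock, timeout,
                            [&state] { return state->pending == 0; });
}
```

In this sketch a timed-out call simply returns `false`; the late callbacks still run, but they only touch the heap-allocated `AbortState`, which is released once the last callback copy is destroyed.
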
jasonyb pushed a commit that referenced this issue Sep 12, 2024
Summary:
 aec9a66 [doc][xcluster] Truncate limitation (#23833)
 fe8890d [PLAT-13984] Change default metric graph points count from 100 to 250 + make runtime configurable.
 7d9b57b [#23869] YSQL: Fix one type of ddl atomicity stress test flakiness
 7c1bca8 [PLAT-14052][PLAT-15237] :Add advanced date-time option,Restrict CP for K8s
 afce6ad [PLAT-14158] Support multiple transactional xCluster configs per universe on YBA UI
 33342b3 [PLAT-15101] Add runtime config to turn on/off rollN for k8s
 361a99a [#23809] CDCSDK: Filter out records corresponding to index tables in colocated DBs
 8bbdf66 [docs] changed date (#23885)
 ee639f4 [#22104,#23506] YSQL: auto analyze service collects mutation counts from DDL and skips indexes
 3dbf6da Delete architecture/design/multi-region-xcluster-async-replication.md
 e013578 [#23864] DocDB: Move cluster_uuid to server common flags
 2b6a2d3 [PLAT-15062][PLAT-15071] Support DB scoped on UI and display schema change mode information
 8260075 [PLAT-15079] Treat dropped on target tables as unconfigured but preselected
 f5169ca DocDB: Follow redirects for callhome, and fix URL
 eb61ef6 [PLAT-15228] Update package installation command for YBA
 e791c40 [#18822] YSQL: Add serialization/deserialization mechanism for update optimization metadata
 e69d8cb [doc] Backups and DDLs (#23840)
 e72ae64 [PLAT-14552]filter support bundle core files by date
 8796c83 [PLAT-15274]: BackFill Pitr params for DR config and add not-null constraints in DB.
 dc9cc67 Fixed NPM test:ci build error
 2e5ebef [PLAT-15287]: Add PITR Column and Restore Window Information to Backup List
 d053e45 [#238989]yugabyted: Node doesn't join using `--join` flag
 da4da45 [#23399] DocDB: Fix StopActiveTxnsPriorTo from using local variables in the call back

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Differential Revision: https://phorge.dev.yugabyte.com/D38008

3 participants