[DocDB] SafeTimeForTransactionParticipant advances safe time before Raft processes pending operation #24285

Closed
1 task done
yusong-yan opened this issue Oct 4, 2024 · 2 comments
Labels: area/docdb (YugabyteDB core features), kind/bug (This issue is a bug), priority/medium (Medium priority issue)

yusong-yan commented Oct 4, 2024

Jira Link: DB-13174

Description

TServer encountered a FATAL when adding a pending operation to RAFT:

71 mvcc.cc:395] T 162c330b82464ffeb706914c75904aa4 P daeb952b891b4d78bbb206e5228a8a8b: T 162c330b82464ffeb706914c75904aa4 P daeb952b891b4d78bbb206e5228a8a8b: Recent 32 MVCC operations:
1. Replicated { ht: { physical: 1726815991693666 } op_id: 19.6300 }
2. AddFollowerPending { ht: { physical: 1726816009801056 } op_id: 19.6606 }
3. Replicated { ht: { physical: 1726816009801056 } op_id: 19.6606 }
...
32. LastReplicatedHybridTime { last_replicated: { physical: 1726816316861045 } }
New operation's hybrid time too low: { physical: 1726816317000017 }, op id: 19.9984
  max_safe_time_returned_with_lease_={ safe_time: <min> source: kUnknown }
  ht <= max_safe_time_returned_with_lease_.safe_time=0
  static_cast<int64_t>(ht.ToUint64() - max_safe_time_returned_with_lease_.safe_time.ToUint64())=7073039634432069632
  ht.PhysicalDiff(max_safe_time_returned_with_lease_.safe_time)=1726816317000017
  
  max_safe_time_returned_without_lease_={ safe_time: { physical: 1726824340967198 } source: kNow }
  !!! ht <= max_safe_time_returned_without_lease_.safe_time=1
  static_cast<int64_t>(ht.ToUint64() - max_safe_time_returned_without_lease_.safe_time.ToUint64())=-32866169573376
  ht.PhysicalDiff(max_safe_time_returned_without_lease_.safe_time)=-8023967181
  
  max_safe_time_returned_for_follower_={ safe_time: <min> source: kUnknown }
  ht <= max_safe_time_returned_for_follower_.safe_time=0
  static_cast<int64_t>(ht.ToUint64() - max_safe_time_returned_for_follower_.safe_time.ToUint64())=7073039634432069632
  ht.PhysicalDiff(max_safe_time_returned_for_follower_.safe_time)=1726816317000017
  
  last_replicated_={ physical: 1726816316861045 }
  ht <= last_replicated_=0
  static_cast<int64_t>(ht.ToUint64() - last_replicated_.ToUint64())=569229312
  ht.PhysicalDiff(last_replicated_)=138972
  
  last_ht_in_queue=<min>
  ht <= last_ht_in_queue=0
  static_cast<int64_t>(ht.ToUint64() - last_ht_in_queue.ToUint64())=7073039634432069632
  ht.PhysicalDiff(last_ht_in_queue)=1726816317000017
  
  propagated_safe_time_=<min>
  ht <= propagated_safe_time_=0
  static_cast<int64_t>(ht.ToUint64() - propagated_safe_time_.ToUint64())=7073039634432069632
  ht.PhysicalDiff(propagated_safe_time_)=1726816317000017
  
  queue_.size()=0
  queue_=[]
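
The numbers in the dump can be cross-checked by hand. The sketch below assumes that a hybrid time packs physical microseconds above 12 logical bits, which is consistent with the raw and physical differences printed in the log. The helper names (`to_uint64`, `signed_diff`) are illustrative and are not YugabyteDB functions.

```python
# Assumption: hybrid time = (physical microseconds << 12) | logical counter.
LOGICAL_BITS = 12
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_uint64(physical_us, logical=0):
    """Pack a physical/logical pair into a 64-bit value (illustrative helper)."""
    return ((physical_us << LOGICAL_BITS) | logical) & MASK64


def signed_diff(a, b):
    """Mimic static_cast<int64_t>(a - b) for two uint64 values."""
    d = (a - b) & MASK64
    return d - (1 << 64) if d >= (1 << 63) else d


# Physical components copied from the log above.
op_ht = 1726816317000017                    # new operation, op id 19.9984
safe_time_without_lease = 1726824340967198  # source: kNow
last_replicated = 1726816316861045

raw_vs_safe = signed_diff(to_uint64(op_ht), to_uint64(safe_time_without_lease))
raw_vs_replicated = signed_diff(to_uint64(op_ht), to_uint64(last_replicated))
physical_vs_safe = op_ht - safe_time_without_lease

print(raw_vs_safe)        # -32866169573376, as in the log
print(raw_vs_replicated)  # 569229312, as in the log
print(physical_vs_safe)   # -8023967181 microseconds
```

Under that assumption, the new operation is ahead of last_replicated_ by 138972 microseconds but behind max_safe_time_returned_without_lease_ by about 8024 seconds, which is the only check flagged with `!!!` in the dump.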

The hypothesized root cause is:

  • Right after tablet bootstrap completes, SetCleanupPool is called, and it eventually calls MinRunningHybridTime.
  • MinRunningHybridTime sends a status request with the callback RunningTransaction::StatusReceived.
  • This callback is the starting point of the call stack that leads to SafeTimeForTransactionParticipant, which advanced the safe time to kNow before Raft processed the pending operation. The call stack:
    yb::tablet::MvccManager::SafeTimeForFollower
    yb::TabletPeer::SafeTimeForTransactionParticipant
    yb::tablet::TransactionParticipant::Impl::ProcessRemoveQueueUnlocked
    yb::tablet::TransactionParticipant::Impl::EnqueueRemoveUnlocked
    yb::tablet::RunningTransaction::DoStatusReceived

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@yusong-yan yusong-yan added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Oct 4, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Oct 4, 2024
@yusong-yan
Contributor Author

If the above hypothesis of the root cause is correct, then this issue can be resolved by applying the fix from #21877.

@rthallamko3
Contributor

@yusong-yan, since this is resolved by backporting #21877 to the 2.20 branch, I am closing it. Please use #21877 for the backport.
