Make MyRocks transactions never expire additional rows with TTL while running #1458
Conversation
6909760 to 6b2757f Compare
@hermanlee has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
I think there is a potential issue in rdb_compact_filter.h with the way MyRocks currently relies on RocksDB to maintain this heap. I wonder if this diff should be based on the RR/RC change to allow the read-only snapshot in RR to be created from the first read-only select. In that case, the snapshot never goes away, and the snapshot time never changes. For RC, if we want the same behavior, we could also keep the read-only snapshot around just to maintain the timestamp for compaction, but each new read-only query would still acquire its own snapshot.
@hermanlee If I understand your comment correctly, there is currently a race condition between the compaction filter and the oldest transaction temporarily releasing its snapshot. Thus maintaining only a per-transaction timestamp for the per-transaction filtering is not enough, and a transaction snapshot must be kept open constantly once it has been acquired.
@hermanlee , it seems that each transaction has to have three snapshot fields, even if it does not use them all in all cases:
The 3rd one and the 1st one obviously cannot be merged. It could be possible to reduce the complexity of a transaction object maintaining three snapshots by moving the oldest-timestamp tracking from RocksDB to MyRocks (have a heap/priority queue of timestamps in Rdb_compact_filter, and insert into/remove from it on a transaction's first snapshot creation and on commit/rollback), but a shared data structure that duplicates RocksDB internals might not be the best trade-off here.
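For illustration only, here is a rough sketch of what such a MyRocks-side tracker could look like: a mutex-protected multiset of active snapshot timestamps whose minimum the compaction filter would read. The class and method names (Rdb_oldest_snapshot_tracker, on_first_snapshot, on_transaction_end, oldest_timestamp) are hypothetical and not part of this PR, which ended up using the global transaction list instead.

// Hypothetical sketch only; the PR ultimately used the global transaction
// list instead of a MyRocks-side timestamp heap.
#include <cstdint>
#include <mutex>
#include <set>

class Rdb_oldest_snapshot_tracker {
 public:
  // Called when a transaction acquires its first snapshot.
  void on_first_snapshot(int64_t ts) {
    std::lock_guard<std::mutex> guard(m_mutex);
    m_timestamps.insert(ts);
  }

  // Called on commit or rollback of a transaction that had a snapshot.
  void on_transaction_end(int64_t ts) {
    std::lock_guard<std::mutex> guard(m_mutex);
    const auto it = m_timestamps.find(ts);
    if (it != m_timestamps.end()) m_timestamps.erase(it);
  }

  // The compaction filter would read this instead of asking RocksDB.
  int64_t oldest_timestamp() const {
    std::lock_guard<std::mutex> guard(m_mutex);
    return m_timestamps.empty() ? 0 : *m_timestamps.begin();
  }

 private:
  mutable std::mutex m_mutex;
  std::multiset<int64_t> m_timestamps;  // ordered, so begin() is the oldest
};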
6b2757f to 6d86059 Compare
@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.
As discussed offline, will use the global transaction list instead. Step 1 of this approach is at #1471
@hermanlee has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
storage/rocksdb/rdb_compact_filter.h (Outdated)
  }
- m_expiration_timestamp = oldest_snapshot_timestamp;
+ m_expiration_timestamp = oldest_transaction_timestamp;
Hey Laurynas, we recently noticed a new issue that will bring inconsistency between primary and secondary in long-running transactions. The oldest_transaction_timestamp here is retrieved from snapshot->GetUnixTime(). This timestamp will differ between primary and secondary; the secondary will have a slightly larger timestamp and hence compact more data away than the primary. Can we change the oldest_transaction_timestamp from snapshot->GetUnixTime() to mysql_bin_log.get_max_write_hlc_nsec()? This would ensure consistency between primary and secondary, because mysql_bin_log.get_max_write_hlc_nsec() returns the same value if primary and secondary have the same GTID set when taking the snapshot.
cc @abhinav04sharma
#1471 has been closed. Can we have a rebase to include that change? Thanks!
@sunshine-Chun, OK, I will make it use mysql_bin_log.get_max_write_hlc_nsec(). I am rebasing (and rewriting) this PR right now.
@sunshine-Chun , I am not too familiar with the binlog HLC patch. What value should we use if the HLC value is zero? Should we use snapshot->GetUnixTime() then?
@sunshine-Chun, the relationship between binlog and RocksDB timestamps is not clear to me. We initialize the earliest timestamp from the binlog, but then the current snapshot timestamps come from the RocksDB snapshots directly. I'd expect the invariant "the current timestamp cannot be earlier than the earliest transaction timestamp" to hold, but the rocksdb_rpl.rocksdb_binlog_ttl test, for example, shows otherwise (result is the current snapshot timestamp):
(lldb) p/x m_earliest_snapshot_ts
(int64_t) 0x00047fc94dc5cd7c
(lldb) p/x result
(const int64_t) 0x0000000066c7143b
mysql_bin_log.get_max_write_hlc_nsec() is unlikely to be zero. But yeah, snapshot->GetUnixTime() seems the only thing we can do to guard the new_earliest_ts from zero.
MTR test seems to be failing. Have you rebased it to the latest branch?
@sunshine-Chun, thank you, this makes sense. It seems that the "L" in "HLC" is not stored explicitly, thus the binlog HLC is directly comparable to RocksDB timestamps, different resolutions aside. In which case I still cannot understand what is going on with rocksdb_rpl.rocksdb_binlog_ttl. The testcase does not inject any artificial timestamps, and yet:
$ p result # The current RDB snapshot timestamp
(const int64_t) 1724752351 # OK, resolves to $NOWish timestamp at the second resolution
(lldb) p m_earliest_snapshot_ts # Binlog HLC
(int64_t) 1695877370630958 # Does not resolve to anything reasonable at any resolution
I ran the test on my local server and got the following numbers. The m_earliest_snapshot_ts looks correct to me. Note, this is the value that hits the assertion assert(m_earliest_snapshot_ts <= result || result == 0);
(gdb) p m_earliest_snapshot_ts
$1 = 1724827776952667189
(gdb) p result
$2 = 1724827777
Let me post the logic to get HLC here.
uint64_t HybridLogicalClock::get_next() {
uint64_t current_hlc, next_hlc = 0;
bool done = false;
while (!done) {
// Get the current wall clock in nanosecond precision
uint64_t current_wall_clock =
std::chrono::duration_cast<std::chrono::nanoseconds>(
std::chrono::high_resolution_clock::now().time_since_epoch())
.count();
// get the 'current' internal HLC
current_hlc = current_.load();
// Next HLC timestamp is max of current-hlc and current-wall-clock
next_hlc = std::max(current_hlc + 1, current_wall_clock);
// Conditionally update the internal hlc
done = current_.compare_exchange_strong(current_hlc, next_hlc);
}
return next_hlc;
}
Can you try to print the current_wall_clock to see if the value is abnormal? This determines the baseline value of the HLC. My guess is that the abnormal value starts from here.
The current_wall_clock value is indeed correct, thus the only issue was comparing nanoseconds to seconds directly. Adding the required conversion fixed the tests.
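For reference, a minimal sketch of the conversion in question, assuming the HLC value is nanoseconds since the epoch and snapshot->GetUnixTime() returns seconds; the function and constant names here are illustrative, not the patch's actual code:

#include <cassert>
#include <cstdint>

constexpr std::int64_t NS_PER_SEC = 1'000'000'000;

// binlog_hlc_ns: nanosecond-resolution binlog HLC value.
// snapshot_unix_ts: second-resolution value from snapshot->GetUnixTime().
// The two are comparable only after dividing the HLC down to seconds.
void check_earliest_not_after_current(std::int64_t binlog_hlc_ns,
                                      std::int64_t snapshot_unix_ts) {
  const std::int64_t binlog_hlc_sec = binlog_hlc_ns / NS_PER_SEC;
  assert(binlog_hlc_sec <= snapshot_unix_ts || snapshot_unix_ts == 0);
}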
Non-functional changes in this PR split off to #1487
The change has been committed to the codebase.
6d86059 to 29bccfd Compare
@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.
@sunshine-Chun , I have pushed my current version, which is as finished as possible, except for the binlog/RDB timestamp question above. Keeping in draft status until that is resolved.
@sunshine-Chun has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
MTR test seems to be failing. Have you rebased it to the latest branch?

Which tests? Two debug MTR tests are failing with the assert that the RDB timestamp > binlog timestamp, as I wrote before, and are the reason this PR is still a draft.

Other than the assertions:

Just thinking aloud. Feel free to correct me if I miss anything here.

Cannot reproduce these two. For the latter, I intermittently get a pre-existing failure:

The current change and the binlog TTL feature cover different scenarios. Both are needed.

Then they are not related to this change.

This was caused by concurrent accesses to

This patch is not enough. I'll remove the extra added accesses from other threads until the
29bccfd to 9c48838 Compare
@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.
@sunshine-Chun, addressed all comments as discussed above, fixed one logic error in the linked-list middle-pointer manipulation and the nanosecond-to-second conversion. Ready for review!
@sunshine-Chun has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@@ -5479,12 +5525,12 @@ class Rdb_transaction_impl : public Rdb_transaction {
  }

  virtual ~Rdb_transaction_impl() override {
    rollback();
What will happen when we finish a transaction? My understanding is that this transaction will first be reinserted without a snapshot at the end of the list in on_commit(). Then, when the transaction is destroyed, it will be moved to the end again in rollback(). Lastly, it will be removed from the list via erase(). Did I miss something, or do we have duplicated/redundant operations?
The destructor is only executed on a session disconnect, where it may be called with or without going through commit/rollback first. While a session is alive, the transaction object does not go through destruction but is reused after completing rollback/commit.
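Restating that lifecycle as code, a heavily simplified sketch (stub types and names only, not the real Rdb_transaction_impl or Rdb_transaction_list):

// Simplified sketch of the lifecycle described above; stub names only.
struct Rdb_transaction_list_stub {
  static void reinsert_without_snapshot(void *) {}
  static void erase(void *) {}
};

class Rdb_transaction_impl_sketch {
 public:
  void on_commit() {
    // The transaction completes but the object stays alive for the session;
    // it moves to the tail of the global list, now without a snapshot.
    Rdb_transaction_list_stub::reinsert_without_snapshot(this);
  }

  void rollback() {
    // Same reinsertion path; a live session reuses the object afterwards.
    Rdb_transaction_list_stub::reinsert_without_snapshot(this);
  }

  ~Rdb_transaction_impl_sketch() {
    // Runs only on session disconnect, possibly without a preceding
    // commit/rollback, which is why rollback() is called here; erase()
    // then removes the object from the global list for good.
    rollback();
    Rdb_transaction_list_stub::erase(this);
  }
};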
@@ -3823,18 +3868,22 @@ class Rdb_transaction {

  void incr_update_count(TABLE_TYPE table_type) {
    assert(!is_ac_nl_ro_rc_transaction());
    assert(table_type != TABLE_TYPE::USER_TABLE || get_snapshot_ts() != 0);
Maybe add a similar check to incr_insert_count()?
That one might be called without a snapshot during bulk loading. I have added a comment there.
    do_erase(tx);
    fixup_newest_snapshot_after_erase(tx);
    tx->m_earliest_snapshot_ts = 0;
    tx->next = nullptr;
Why only clear tx->next? How about tx->prev?
do_insert_without_snapshot handles tx->prev. It does not handle tx->next because the other caller sets it outside the mutex critical section, to have it as short as possible.
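A minimal doubly-linked-list sketch of that split, with hypothetical names (Tx_node, Tx_list_sketch); it only mirrors the prev/next handling described here and omits the newest-with-snapshot bookkeeping of the real list:

#include <cstdint>
#include <mutex>

// Hypothetical sketch, not the real Rdb_transaction_list.
struct Tx_node {
  Tx_node *prev = nullptr;
  Tx_node *next = nullptr;
  std::int64_t m_earliest_snapshot_ts = 0;
};

struct Tx_list_sketch {
  std::mutex m_mutex;
  Tx_node *m_head = nullptr;
  Tx_node *m_tail = nullptr;

  // Unlinks tx from the list; tx's own pointers are left for callers to fix.
  void do_erase(Tx_node *tx) {
    if (tx->prev != nullptr) tx->prev->next = tx->next; else m_head = tx->next;
    if (tx->next != nullptr) tx->next->prev = tx->prev; else m_tail = tx->prev;
  }

  // Appends tx at the tail: overwrites tx->prev but never touches tx->next,
  // which the reinsertion caller clears outside the critical section.
  void do_insert_without_snapshot(Tx_node *tx) {
    tx->prev = m_tail;
    if (m_tail != nullptr) m_tail->next = tx; else m_head = tx;
    m_tail = tx;
  }

  void reinsert_without_snapshot(Tx_node *tx) {
    {
      std::lock_guard<std::mutex> guard(m_mutex);
      do_erase(tx);
      tx->m_earliest_snapshot_ts = 0;
      do_insert_without_snapshot(tx);  // fixes tx->prev, not tx->next
    }
    // Cleared outside the lock to keep the critical section short, mirroring
    // the split described in the discussion above.
    tx->next = nullptr;
  }
};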
9c48838 to 93894c3 Compare
@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.
Pushed the branch again. The only changes are an extra comment at
storage/rocksdb/ha_rocksdb.cc (Outdated)
    if (new_earliest_ts == 0) new_earliest_ts = snapshot->GetUnixTime();
    Rdb_transaction_list::reinsert_with_snapshot(this, new_earliest_ts);
  } else {
    assert(earliest_ts <= snapshot->GetUnixTime());
Abhinav left some comments here:
- Can we use the mysql_bin_log.get_max_write_hlc_nsec() only when binlog ttl is enabled?
- For the assertion, when binlog ttl is enabled, we are comparing hlc with wall clock, that doesn't seem correct.
Can we do something like
if (earliest_ts == 0) {
auto new_earliest_ts =
rdb_is_binlog_ttl_enabled() && mysql_bin_log.get_max_write_hlc_nsec()
? mysql_bin_log.get_max_write_hlc_nsec() / 1'000'000'000
: snapshot->GetUnixTime();
Rdb_transaction_list::reinsert_with_snapshot(this, new_earliest_ts);
} else {
assert(earliest_ts <= (rdb_is_binlog_ttl_enabled() &&
                       mysql_bin_log.get_max_write_hlc_nsec()
                           ? mysql_bin_log.get_max_write_hlc_nsec() / 1'000'000'000
                           : snapshot->GetUnixTime()));
}
This is the last comment that needs to be addressed before committing, if Luqun and Herman don't have more comments.
I am testing the following:
[[nodiscard]] static std::int64_t binlog_hlc_or_snapshot_ts(
const rocksdb::Snapshot &snapshot) {
if (!rdb_is_binlog_ttl_enabled()) return snapshot.GetUnixTime();
const auto binlog_max_write_hlc_ns = mysql_bin_log.get_max_write_hlc_nsec();
return (binlog_max_write_hlc_ns != 0) ? binlog_max_write_hlc_ns
: snapshot.GetUnixTime();
}
void snapshot_created(const rocksdb::Snapshot *const snapshot) {
assert(snapshot != nullptr);
m_read_opts[USER_TABLE].snapshot = snapshot;
const auto earliest_ts = get_earliest_snapshot_ts();
if (earliest_ts == 0) {
const auto new_earliest_ts = binlog_hlc_or_snapshot_ts(*snapshot);
Rdb_transaction_list::reinsert_with_snapshot(this, new_earliest_ts);
} else {
assert(earliest_ts <= binlog_hlc_or_snapshot_ts(*snapshot));
}
m_is_delayed_snapshot = false;
}
Ugh, of course with a division by a billion
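Presumably the helper then picks up the missing conversion, something along these lines (a sketch based on the version quoted above, not the exact committed code):

[[nodiscard]] static std::int64_t binlog_hlc_or_snapshot_ts(
    const rocksdb::Snapshot &snapshot) {
  if (!rdb_is_binlog_ttl_enabled()) return snapshot.GetUnixTime();
  const auto binlog_max_write_hlc_ns = mysql_bin_log.get_max_write_hlc_nsec();
  // The binlog HLC is in nanoseconds; TTL timestamps are Unix seconds.
  return (binlog_max_write_hlc_ns != 0)
             ? static_cast<std::int64_t>(binlog_max_write_hlc_ns /
                                         1'000'000'000)
             : snapshot.GetUnixTime();
}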
93894c3 to 7266728 Compare
@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.
Addressed last review comments, ready for review again
@sunshine-Chun has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
rocksdb_binlog_ttl.test is failing for
… starting

The TTL row filtering is governed by the Rdb_transaction::m_snapshot_timestamp field. This field sometimes gets reset to zero when a snapshot is released, indicating that it should be reinitialized at the next snapshot, or to the current timestamp (when a single statement in a multi-statement transaction gets rolled back). The timestamp going forward means that additional rows may be hidden in the same transaction if they happened to expire between the two snapshots. Change it so that additional rows are never expired:
- Only initialize the field once per transaction, when the first snapshot is created.
- Do not touch it for single-statement rollbacks in multi-statement transactions.
- Rename the field to m_earliest_snapshot_ts to better reflect what it does.
- Change the global transaction list to be sorted by the oldest snapshot timestamp, so that the oldest global one can be found by reading the transaction at the list head. The transactions without a snapshot are inserted at its tail, and reinserted as needed upon acquiring the first snapshot. To make reinsertions fast, a third list pointer is maintained that points to the newest transaction with a snapshot.
- Extend the SHOW ENGINE ROCKSDB TRANSACTION STATUS output with transactions that currently do not have a snapshot, for both cases of having had a snapshot before and never having had one.
- Add a test to rocksdb.ttl_primary_read_filtering for the behavior difference introduced by this commit.

At the same time:
- Initialize m_earliest_snapshot_ts from the binlog HLC instead of the current system time.

Note for downstreams: if you don't carry the binlog HLC patch, the binlog_hlc_or_snapshot_ts method simplifies to a snapshot->GetUnixTime call.
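As a rough, self-contained illustration of the list ordering described in the commit message (the Txn struct and oldest_active_snapshot_ts function are made up for this sketch, not the real data structures):

#include <cstdint>

// Illustrative sketch only. In the real list, transactions holding a
// snapshot come first, ordered by m_earliest_snapshot_ts; snapshot-less
// transactions follow at the tail, after the newest-with-snapshot pointer.
struct Txn {
  Txn *next = nullptr;
  std::int64_t m_earliest_snapshot_ts = 0;  // 0 == no snapshot yet
};

// The TTL compaction filter only needs the list head to learn the oldest
// active snapshot timestamp.
std::int64_t oldest_active_snapshot_ts(const Txn *head) {
  return (head != nullptr) ? head->m_earliest_snapshot_ts : 0;
}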
7266728 to c7cc10a Compare
@laurynas-biveinis has updated the pull request. You must reimport the pull request before landing.
@sunshine-Chun , division by a billion got lost again :( Fixed
@sunshine-Chun has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This pull request has been merged in 8bf72fa.