
[mysql-5.6][PR] Range Locking support, MyRocks part, updated #1430

Open. Wants to merge 1 commit into base: fb-mysql-8.0.23.

Conversation

hermanlee (Contributor):
Summary:
(Range Locking pull request, filed against fb-mysql-8.0.23)

This adds a my.cnf parameter, rocksdb_use_range_locking.

When it is ON, MyRocks will:

  • initialize RocksDB to use the range-locking lock manager
  • for all DML operations (including SELECT ... FOR UPDATE), lock the scanned range before reading/modifying rows
  • In range locking mode, there is no snapshot checking (that cannot be done for ranges). Instead, MyRocks reads and modifies the latest committed data, just like InnoDB does (in the code, grep for (start|end)_ignore_snapshot)
  • Queries that do not have a finite range to scan, like UPDATE t1 .... ORDER BY t1.key LIMIT n, will use a "Locking iterator" which reads rows, locks the range, and re-reads the rows. See class LockingIterator.

Pull Request resolved: #1185
GitHub Author: Sergei Petrunia sergey@mariadb.com


@hermanlee hermanlee left a comment


Adding Manuel's comments.

@@ -266,6 +269,8 @@ class ha_rocksdb : public my_core::handler, public blob_buffer {
/* Type of locking to apply to rows */
Rdb_lock_type m_lock_rows;

bool m_use_range_locking;
Contributor Author:

It seems to me like this belongs more on the Rdb_transaction object than on ha_rocksdb (whereas m_lock_rows may be different per table in the same transaction).

Contributor:

> It seems to me like this belongs more on the Rdb_transaction object than on ha_rocksdb (whereas m_lock_rows may be different per table in the same transaction).

My current best understanding is 50-50 regarding the best home for this field. In its current form, its value is decided by ha_rocksdb::m_lock_rows too. If it's changed to be independent (as it already half-way is in the checks), then it can be stored in Rdb_transaction, but some of the code paths currently avoid looking at the transaction object at all, so they would incur extra get_or_create_tx calls.

Note that in a multi-statement transaction, the snapshot may have been
allocated by another statement.
*/
bool m_stmt_ignores_snapshot = false;
Contributor Author:

I wonder if this works well with TTL tables, since TTL relies on RocksDB tracking the oldest snapshot timestamp for us, so that old transactions won't see rows expire while they are still in progress. We won't get this if we don't create RocksDB snapshots.

rocksdb::Status lock_singlepoint_range(rocksdb::ColumnFamilyHandle *const cf,
const rocksdb::Slice &point) {
// Normally, one needs to "flip" the endpoint type for reverse-ordered CFs.
// But here we are locking just one point so this is not necessary.
Contributor Author:

I wonder if this comment is necessarily true. For reverse CFs, the locking iterator will always lock adjacent keys using inf_suffix=true. However, locking using inf_suffix=false here would mean that the locking iterator would not block modifications from happening.

For example, if the locking iterator iterates from 6->8, it locks the range from (8, inf_suffix=true) to (6, inf_suffix=true). A second query could then delete 8, since it would use the point lock (8, inf_suffix=false), which falls outside of the locked range.

The following shows this is possible:

CONNECTION 1:
mysql> show create table t;
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                                                                      |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| t     | CREATE TABLE `t` (
  `i` int NOT NULL,
  `j` int NOT NULL,
  PRIMARY KEY (`i`) COMMENT 'rev:cf',
  KEY `j` (`j`) COMMENT 'cf'
) ENGINE=ROCKSDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> begin; select * from t force index (PRIMARY) where i > 5 order by i asc limit 3 for update ;
Query OK, 0 rows affected (0.00 sec)
+---+---+
| i | j |
+---+---+
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
+---+---+
3 rows in set (2.60 sec)
mysql> select * From information_schema.rocksdb_locks;
+------------------+-----------------+---------------------------------------+------+
| COLUMN_FAMILY_ID | TRANSACTION_ID  | KEY                                   | MODE |
+------------------+-----------------+---------------------------------------+------+
|                3 | 140534119312896 | 0000010780000008:1-0000010780000006:1 | X    |
+------------------+-----------------+---------------------------------------+------+
1 row in set (0.00 sec)
CONNECTION 2:
mysql> delete from t where i = 8;
Query OK, 1 row affected (6.58 sec)
CONNECTION 1:
mysql> select * from t force index (PRIMARY) where i > 5 order by i asc limit 3 for update ;
+---+---+
| i | j |
+---+---+
| 6 | 6 |
| 7 | 7 |
| 9 | 9 |
+---+---+
3 rows in set (3.21 sec)

Incidentally, RangeLockManagerBase::TryLock also has the same behaviour of always setting inf_suffix=false, and it's possible that introduces similar bugs. (TryLock is called when we call get_for_update; I wonder if we should call get_for_update with nullptr as the value instead of adding a lock_singlepoint_range function here.)

@@ -5516,7 +5686,7 @@ class Rdb_snapshot_status : public Rdb_tx_list_walker {

/* Calculate the duration the snapshot has existed */
int64_t snapshot_timestamp = tx->m_snapshot_timestamp;
if (snapshot_timestamp != 0) {
if (snapshot_timestamp != 0 && tx->has_snapshot()) {
Contributor Author:

Some of this information still seems useful even if there's no actual snapshot on the transaction. It might be useful to also indicate whether this transaction has an active snapshot too.
I guess some of this information is in rocksdb_trx, but dogpiles only has show engine status.

Contributor:

> Some of this information still seems useful even if there's no actual snapshot on the transaction.

Are there any constraints on what the output should look like in the case of no snapshot, given that tooling may parse it?

Contributor:

Discussed offline

@@ -6837,6 +7050,28 @@ static int rocksdb_init_internal(void *const p) {
tx_db_options.write_policy =
static_cast<rocksdb::TxnDBWritePolicy>(rocksdb_write_policy);

if (rocksdb_use_range_locking && rocksdb_use_range_lock_manager_as_point) {
// rdb_log_status_error(
Contributor Author:

Is it useful to keep this commented out?


iter.set_use_locking() // sets m_iter_should_use_locking=true
iter.seek() // this will re-create m_scan_it to be the right kind
// of iterator
Contributor Author:

It's not clear to me whether this workflow, where we change the locking mode between seeks, is ever useful. It seems like the code would be simpler if we just specified whether we want a locking iterator at construction time, and then we wouldn't have to track whether we need to recreate the iterator.

Contributor:

It seems this would result in cleaner iterator code, but it would require postponing the creation of iterator from ha_rocksdb::index_init right to the point of attempting to take the range lock, and at that point the iterator is already in use. I will look some more.

Contributor:

Looks possible. It might result in a separate iterator reorg PR

Contributor:

Reimplemented this without much refactoring, essentially merged reset and set_use_locking.

Rdb_transaction *tx = get_tx_from_thd(m_thd);
return rdb_tx_set_status_error(tx, s, *kd, m_tbl_def);
}

Contributor Author:

We currently have most of our error handling code for iterators in is_valid_iterator. Would be nice to keep it in one place.

Contributor:

The functions are a bit different:

  • is_valid_iterator: works on a RDB-level iterator, has no THD or MyRocks transaction
  • iter_status_to_retval: works on a MyRocks-level iterator, has THD, transaction, key definition.

Almost all uses follow the pattern of calling is_valid_iterator first and then iter_status_to_retval. But there is also calculate_cardinality_table_scan, which works with an RDB-level iterator directly.

Unless we find a different way to address that last use, the best I can think of is to add a call to is_valid_iterator inside iter_status_to_retval, perhaps with a rename, so that it's more natural to use in iteration loops.


Lock the range between [target, (current m_iter position)] and position
the iterator on the first record in it.

@param call_next false means current iterator position is achieved by
Contributor Author:

We have ha_rocksdb::index_next_with_direction_intern already with skip_next that does something similar so I wonder if we should stick to one naming convention.

DEBUG_SYNC(my_core::thd_get_current_thd(), "rocksdb.locking_iter_scan");

if (my_core::thd_killed(current_thd)) {
m_status = rocksdb::Status::Aborted();
Contributor Author:

It's a little strange that we end up returning HA_ERR_ROCKSDB_STATUS_ABORTED instead of the usual HA_ERR_QUERY_INTERRUPTED here.

Contributor:

I don't see a straightforward way to do the regular THD kill flag handling here, will need to revisit.

- lock the range [-inf; $first_key]
- return, the iterator is positioned on $first_key

The problem here is that we cannot have "-infinity" bound.
Contributor Author:

Looking at the comparator functions, I would assume that "-infinity" would be Endpoint(slice="", inf_suffix=false), though I agree that this function is basically unused for us. Might be worth clarifying what the comment means.

@laurynas-biveinis (Contributor):

Continued in #1449
