Add new Iterator API Refresh(const snapshot*) #10594
Thanks for the contribution! Added some comments and questions.
We should decide and document the API for whether Refresh() supports different snapshots. If so, changes are needed to set the correct sequence number upper bound for range tombstones.
-  virtual Status Refresh() {
+  virtual Status Refresh() { return Refresh(nullptr); }
+
+  virtual Status Refresh(const class Snapshot*) {
It might be worth clarifying if the parameter needs to be the same snapshot used in read options to create the original iterator. If so, maybe there is no need to pass in snapshot since we can access it from ArenaWrappedDBIter.read_options_?
IMO it does not need to be the same, but we should still require the user to provide a snapshot, regardless of whether there was one in ReadOptions. That way we can guarantee the target snapshot still exists, so we don't accidentally refresh to a SuperVersion on which that snapshot is unavailable.
I am glad that the no-argument Refresh() refreshes to the latest state rather than to ReadOptions::snapshot. We need to make that clear in the API doc, though.
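The contract discussed above can be sketched with a toy model. This is only an illustration, not the RocksDB implementation: ToySnapshot, ToyDB and ToyIterator are hypothetical stand-ins, and the only behavior taken from the thread is that the no-argument Refresh() delegates to Refresh(nullptr) and moves to the latest state, while a non-null snapshot selects that snapshot's sequence number.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for illustration; these are not the RocksDB classes.
using SequenceNumber = std::uint64_t;

struct ToySnapshot {
  SequenceNumber seq;
  SequenceNumber GetSequenceNumber() const { return seq; }
};

struct ToyDB {
  SequenceNumber latest_seq = 0;
  SequenceNumber GetLatestSequenceNumber() const { return latest_seq; }
};

class ToyIterator {
 public:
  explicit ToyIterator(const ToyDB* db) : db_(db) {
    assert(db_ != nullptr);
    Refresh();
  }

  // No-argument form: always moves to the latest state of the DB.
  void Refresh() { Refresh(nullptr); }

  // Snapshot form: read as of `snap` when one is given, otherwise as of
  // the latest sequence number.
  void Refresh(const ToySnapshot* snap) {
    read_seq_ = snap != nullptr ? snap->GetSequenceNumber()
                                : db_->GetLatestSequenceNumber();
  }

  SequenceNumber read_seq() const { return read_seq_; }

 private:
  const ToyDB* db_;
  SequenceNumber read_seq_ = 0;
};
```

In this model the caller passes the snapshot explicitly on every refresh, which matches the suggestion that the user should provide it rather than have the iterator reuse whatever was in its read options.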
db/arena_wrapped_db_iter.cc
Outdated
@@ -66,7 +77,7 @@ Status ArenaWrappedDBIter::Refresh() {
   new (&arena_) Arena();

   SuperVersion* sv = cfd_->GetReferencedSuperVersion(db_impl_);
-  SequenceNumber latest_seq = db_impl_->GetLatestSequenceNumber();
+  SequenceNumber latest_seq = GetSeqNum(db_impl_, snap);
If snap is different from read_options_.snapshot, we may have an incorrect sequence number upper bound for range tombstones. For example, see here:
Lines 231 to 233 in f3b359a
  SequenceNumber read_seq = options.snapshot != nullptr
                                ? options.snapshot->GetSequenceNumber()
                                : kMaxSequenceNumber;
rocksdb/table/block_based/block_based_table_reader.cc
Lines 2026 to 2028 in f3b359a
  if (read_options.snapshot != nullptr) {
    snapshot = read_options.snapshot->GetSequenceNumber();
  }
where we use the sequence number from read_options_.snapshot as the upper bound on the sequence number for range tombstones.
This also applies to the case where the SV number did not change.
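To make the concern concrete, here is a minimal sketch of how a stale upper bound changes what a read sees. It is a simplified toy model, not RocksDB code: ToyRangeTombstone and IsDeleted are hypothetical, and the visibility rule (a tombstone applies only if its sequence number is at or below the read's upper bound and above the key's sequence number) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using SequenceNumber = std::uint64_t;

// Hypothetical toy model of a range tombstone covering [start, end).
struct ToyRangeTombstone {
  std::string start;  // inclusive
  std::string end;    // exclusive
  SequenceNumber seq;
};

// A key written at `key_seq` is hidden when some tombstone covers it, is
// newer than the key, and is itself visible at `upper_bound`.
bool IsDeleted(const std::vector<ToyRangeTombstone>& tombstones,
               const std::string& key, SequenceNumber key_seq,
               SequenceNumber upper_bound) {
  for (const auto& t : tombstones) {
    const bool visible = t.seq <= upper_bound;
    const bool covers = t.start <= key && key < t.end;
    if (visible && covers && t.seq > key_seq) {
      return true;
    }
  }
  return false;
}
```

Suppose the iterator was created with a snapshot at sequence number 10, a tombstone over [a, m) was written at 15, and the iterator is then refreshed to a snapshot at 20. With the refreshed bound, IsDeleted(tombstones, "c", 5, 20) hides the key; if the bound is still read from the original snapshot, IsDeleted(tombstones, "c", 5, 10) incorrectly keeps it visible.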
include/rocksdb/iterator.h
Outdated
@@ -92,7 +92,9 @@ class Iterator : public Cleanable {
   // If supported, renew the iterator to represent the latest state. The
   // iterator will be invalidated after the call. Not supported if
The comment here regarding snapshot can be updated.
I agree, there is no need to pass the same snapshot object; the snapshot in ArenaWrappedDBIter.read_options_ should be updated to the passed-in snapshot. This is complicated by del_range tombstones, a case I missed (I only use this in our myrocks branch, with del_range disabled).
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@rockeet has updated the pull request. You must reimport the pull request before landing.
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
LGTM, thanks!
@rockeet has updated the pull request. You must reimport the pull request before landing.
Added Iterator::Refresh(snapshot), refreshing to the same snapshot, to the stress test.
@rockeet has updated the pull request. You must reimport the pull request before landing.
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Added support for setting a new read sequence number in LevelIterator, so that it uses the new value when creating a new range tombstone iterator.
…#11848)

Summary: CI has been hitting assertion errors like
```
#8 0x00007fafd9294fd6 in __GI___assert_fail (assertion=assertion@entry=0x7fafda270300 "!*memtable_range_tombstone_iter_ || sv_number_ != cfd_->GetSuperVersionNumber()", file=file@entry=0x7fafda270350 "db/arena_wrapped_db_iter.cc", line=line@entry=124, function=function@entry=0x7fafda270288 "virtual rocksdb::Status rocksdb::ArenaWrappedDBIter::Refresh(const rocksdb::Snapshot*)") at assert.c:101
```
This is due to
* Iterator::Refresh() passing in `cur_sv_number` instead of `sv->version_number` here: https://github.com/facebook/rocksdb/blob/1c6faf35871a236222bcbf0b69718ee43376a951/db/arena_wrapped_db_iter.cc#L94-L96
* `super_version_number_` can be incremented before thread local SV is installed: https://github.com/facebook/rocksdb/blob/main/db/column_family.cc#L1287-L1306
* The optimization in #11452 removed the check for SV number, such that `cur_sv_number > sv.version_number` is possible in the following code.
```
uint64_t cur_sv_number = cfd_->GetSuperVersionNumber();
SuperVersion* sv = cfd_->GetReferencedSuperVersion(db_impl_);
```
Not sure why the assertion only started failing after #10594; maybe it's because Refresh() is called more often in the stress test.
Pull Request resolved: #11848

Test Plan:
* This repro hits the assertion pretty consistently before this change:
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=1 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_one_in=0 --block_size=16384 --bloom_bits=0.7161318870366848 --cache_index_and_filter_blocks=0 --cache_size=8388608 --charge_table_reader=0 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_readahead_size=0 --compaction_ttl=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=none --flush_one_in=1000000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=14 --index_type=2 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=30 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_manifest_file_size=1073741824
--max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=1048576 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=500000 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=30 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --subcompactions=1 --sync=0 --sync_fault_injection=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=1 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35 --use_io_uring=0 --db=/tmp/rocksdb_crashtest_blackboxnf3pyv_0 --expected_values_dir=/tmp/rocksdb_crashtest_expected_6opy9nqg ``` 
Reviewed By: ajkr

Differential Revision: D49344066

Pulled By: cbi42

fbshipit-source-id: d5373ddb48d933acb42a5dd8fae3f3019b0241e5
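The ordering problem described in the commit message above can be replayed deterministically with a small model. This is a sketch under stated assumptions, not the actual fix: ToyColumnFamily, ToySuperVersion and both helper functions are hypothetical, and the two-step install (counter bumped first, SuperVersion made visible later) is a simplification of the behavior the message describes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy types; the real RocksDB classes are far richer.
struct ToySuperVersion {
  std::uint64_t version_number;
};

struct ToyColumnFamily {
  std::uint64_t super_version_number = 1;
  ToySuperVersion installed{1};  // what a reader currently gets

  // Step 1 of an install: the counter is bumped first...
  void BumpNumber() { ++super_version_number; }
  // Step 2: ...and the new SuperVersion becomes visible to readers later.
  void PublishSuperVersion() {
    installed.version_number = super_version_number;
  }
};

struct RefreshResult {
  std::uint64_t recorded;  // number the iterator remembers
  std::uint64_t actual;    // number of the SuperVersion it really holds
};

// Problematic ordering: record the counter that was read separately from
// the SuperVersion that is then referenced.
RefreshResult RefreshRecordingCounter(const ToyColumnFamily& cfd) {
  const std::uint64_t cur_sv_number = cfd.super_version_number;
  const ToySuperVersion sv = cfd.installed;
  return {cur_sv_number, sv.version_number};
}

// Alternative: record the number carried by the referenced SuperVersion.
RefreshResult RefreshRecordingSvNumber(const ToyColumnFamily& cfd) {
  const ToySuperVersion sv = cfd.installed;
  return {sv.version_number, sv.version_number};
}
```

If a reader runs between BumpNumber() and PublishSuperVersion(), the first helper remembers a number newer than the SuperVersion it holds, which is the `cur_sv_number > sv.version_number` situation; the second helper cannot diverge in this model.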
This PR resolves #10487 and #10536; user code needs to call Refresh() periodically.
The main code change is to support range deletions. A range tombstone iterator uses a sequence number as an upper bound to decide which range tombstones are effective. During an Iterator refresh, this sequence number upper bound needs to be updated for all range tombstone iterators under DBIter and LevelIterator. LevelIterator may create new table iterators and range tombstone iterators during scanning, so it needs to be aware of iterator refresh. The code path that propagates this change is db_iter_->set_sequence(read_seq) -> MergingIterator::SetRangeDelReadSeqno() -> TruncatedRangeDelIterator::SetRangeDelReadSeqno() and LevelIterator::SetRangeDelReadSeqno().
This change also fixes an issue where range tombstone iterators created by LevelIterator may access ReadOptions::snapshot, even though we do not explicitly require users to keep a snapshot alive after creating an Iterator.
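The propagation path described above can be sketched as follows. The class names borrow from the description but carry a Toy prefix because they are hypothetical simplifications, not the RocksDB classes; in particular, OpenNextFile() is an invented stand-in for LevelIterator lazily creating a range tombstone iterator during a scan.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using SequenceNumber = std::uint64_t;

// Hypothetical stand-in for a truncated range tombstone iterator.
struct ToyRangeDelIter {
  SequenceNumber upper_bound;
  void SetRangeDelReadSeqno(SequenceNumber s) { upper_bound = s; }
};

// Hypothetical stand-in for LevelIterator: it stores the read sequence
// number itself rather than a pointer to a snapshot.
class ToyLevelIterator {
 public:
  explicit ToyLevelIterator(SequenceNumber read_seq) : read_seq_(read_seq) {}
  void SetRangeDelReadSeqno(SequenceNumber s) { read_seq_ = s; }
  // Called lazily when the scan moves to a new file.
  ToyRangeDelIter OpenNextFile() const { return ToyRangeDelIter{read_seq_}; }

 private:
  SequenceNumber read_seq_;
};

struct ToyMergingIterator {
  std::vector<ToyRangeDelIter> range_del_iters;
  std::vector<ToyLevelIterator> level_iters;

  // Fan the new bound out to existing tombstone iterators and to the level
  // iterators that will create tombstone iterators later.
  void SetRangeDelReadSeqno(SequenceNumber s) {
    for (auto& it : range_del_iters) it.SetRangeDelReadSeqno(s);
    for (auto& it : level_iters) it.SetRangeDelReadSeqno(s);
  }
};

class ToyDBIter {
 public:
  ToyDBIter(ToyMergingIterator* merging, SequenceNumber read_seq)
      : merging_(merging), sequence_(read_seq) {
    assert(merging_ != nullptr);
  }

  void set_sequence(SequenceNumber read_seq) {
    sequence_ = read_seq;
    merging_->SetRangeDelReadSeqno(read_seq);
  }
  SequenceNumber sequence() const { return sequence_; }

 private:
  ToyMergingIterator* merging_;
  SequenceNumber sequence_;
};
```

Because the toy level iterator keeps a plain sequence number, tombstone iterators it opens after a refresh pick up the new bound without touching any snapshot object, which mirrors the lifetime issue the description says this change fixes.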
Test plan:
TODO in a following PR: