-
Notifications
You must be signed in to change notification settings - Fork 6.4k
Index Block Format
An index
block contains one entry per data block, where the key is a string >=
last key in that data block and <
the first key in the successive data block. The value is the BlockHandle
(file offset and length) for the data block.
If kTwoLevelIndexSearch is used as IndexType, the index
block is a 2nd level index on index partitions, i.e., each entry points to another index
block that contains one entry per data block. In this case, the format will be
[index block - 1st level]
[index block - 1st level]
...
[index block - 1st level]
[index block - 2nd level]
As described above, the key stored for a block is between the last key of the block and the first key of the next block. There are usually a range of potential keys met this condition. Choosing a smaller one can reduce the index size. If BlockBasedTableOptions.index_shortening
is set to kShortenSeparators
or kShortenSeparatorsAndSuccessor
, the last key of the block, and the first key of the next block will be passed to Comparator::FindShortestSeparator() to find out the shortest separator key. This function is implemented in builtin byte-wise and reverse byte-wise comparators. User comparators need to implement the function to take advantage this feature.
Similarly the index key for the last block is determined by Comparator::FindShortSuccessor(), which provides any key that is greater or equal to the last key of the last block. Users also need to implement this function for customized comparator to take advantage of the memory saving feature.
The default value for the option is kShortenSeparators
, which shortens index key for all blocks but the last one. It's because that the last key has minimal impacts to index size while can have positive impact of preventing the last data blocks to be read.
Value kNoShortening
can also be used together with index_type = kBinarySearchWithFirstKey
to prevent reading some blocks using a special function, which is explained below.
Up to RocksDB version 5.14, BlockBasedTableOptions::format_version
=2, the format of index and data blocks are the same, where the index blocks use same key format of <user_key
,seq
> but special values, <offset
,size
>, that point to data blocks. Different from data blocks, the option controlling restart block size is BlockBasedTableOptions.index_block_restart_interval
, rather than BlockBasedTableOptions.block_restart_interval
. The default value is 1, rather than 16 for data blocks. So the default is relatively memory costly. Setting the value to 8 or 16 can usually shrink index block size by half, but the CPU overhead might increase based on workloads.
format_version=
3,4 further optimized size, yet forward-incompatible format for index blocks.
-
format_version
=3 (Since RocksDB 5.15): In most of the cases the sequence numberseq
is not necessary for keys in the index blocks. In such cases, thisformat_version
skips encoding the sequence number and setsindex_key_is_user_key
in TableProperties, which is used by the reader to know how to decode the index block. -
format_version
=4 (Since RocksDB 5.16): Changes the format of index blocks by delta encoding the index values, which are the block handles. This saves the encoding ofBlockHandle::offset
of the non-head index entries in each restart interval. If used,TableProperties::index_value_is_delta_encoded
is set, which is used by the reader to know how to decode the index block. The format of each key is (shared_size, non_shared_size, shared, non_shared). The format of each value, i.e., block handle, is (offset, size) whenever the shared_size is 0, which included the first entry in each restart point. Otherwise the format is delta-size = block handle size - size of last block handle.
The index format in format_version=4
would be as follows:
restart_point 0: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz)
restart_point 1: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz)
...
restart_point n-1: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz)
where, k is key, v is value, and its encoding is in parenthesis.
The format applies to the case when index_type != kBinarySearchWithFirstKey
. The kBinarySearchWithFirstKey
case is described in the next section.
The feature of index_type == kBinarySearchWithFirstKey
is to allow RocksDB to see first key of a data block without reading it from the disk. With this feature, RocksDB knows the first key of the block, so it can defer reading the block until a value is actually needed. This can effectively prevent I/O and other overhead for some special workloads. For example, during range scans, when RocksDB merges key-values from multiple sources using a heap structure, reading a data block from an SST file can be delayed until the block actually reaches the top of the heap (and can be avoided completely if it never reaches the top).
If this option is used, for each entry in the index block, following the BlockHandle
(offset, size) part, the first key of the block is stored, in the form of length prefixed string. For example, version_format = 5 will take the format as:
restart_point 0:
k (block_key), v (block_offset, block_size, size_of_first_key, first_key)
k (block_key), v (delta_size, size_of_first_key, first_key)
k (block_key), v (delta_size, size_of_first_key, first_key)
restart_point 1:
k (block_key), v (block_offset, block_size, size_of_first_key, first_key)
k (block_key), v (delta_size, size_of_first_key, first_key)
k (block_key), v (delta_size, size_of_first_key, first_key)
...
It's similar to other format versions and restart block size.
Contents
- RocksDB Wiki
- Overview
- RocksDB FAQ
- Terminology
- Requirements
- Contributors' Guide
- Release Methodology
- RocksDB Users and Use Cases
- RocksDB Public Communication and Information Channels
-
Basic Operations
- Iterator
- Prefix seek
- SeekForPrev
- Tailing Iterator
- Compaction Filter
- Multi Column Family Iterator
- Read-Modify-Write (Merge) Operator
- Column Families
- Creating and Ingesting SST files
- Single Delete
- Low Priority Write
- Time to Live (TTL) Support
- Transactions
- Snapshot
- DeleteRange
- Atomic flush
- Read-only and Secondary instances
- Approximate Size
- User-defined Timestamp
- Wide Columns
- BlobDB
- Online Verification
- Options
- MemTable
- Journal
- Cache
- Write Buffer Manager
- Compaction
- SST File Formats
- IO
- Compression
- Full File Checksum and Checksum Handoff
- Background Error Handling
- Huge Page TLB Support
- Tiered Storage (Experimental)
- Logging and Monitoring
- Known Issues
- Troubleshooting Guide
- Tests
- Tools / Utilities
-
Implementation Details
- Delete Stale Files
- Partitioned Index/Filters
- WritePrepared-Transactions
- WriteUnprepared-Transactions
- How we keep track of live SST files
- How we index SST
- Merge Operator Implementation
- RocksDB Repairer
- Write Batch With Index
- Two Phase Commit
- Iterator's Implementation
- Simulation Cache
- [To Be Deprecated] Persistent Read Cache
- DeleteRange Implementation
- unordered_write
- Extending RocksDB
- RocksJava
- Lua
- Performance
- Projects Being Developed
- Misc