mv dynamic block size

Status

  • merged to develop December 10, 2013
  • code complete November 30, 2013
  • development started November 12, 2013

History / Context

Warning: there are some environments where this feature does not help performance. Please review the Notes section for discussion.

leveldb's random reads perform best when the data is already in the block cache. The second best performance is when the potential files of each level that could contain the data are already in the file cache. The third best performance is when the file open operation and data read are minimized. The branch works to automatically maximize the second case so that the file cache has the potential to contain the entire, active set of files even as the dataset grows beyond the block cache and file cache size. Once the dataset grows beyond any possible maximization of the second case, the files' metadata has already been incrementally rebuilt to maximize the third case.

This branch's motivation comes from heavy, long-running write tests that create large datasets. The throughput of these tests suddenly degrades by 50 to 60% once the file cache is full. The underlying cause is file thrashing: opening and closing hundreds to thousands of files per second. The problem can be postponed by manually changing the block_size to a larger value and rerunning the tests. This pushes the severe degradation point out to a larger dataset size, but also slows down random access. The better solution is to manipulate the block_size parameter dynamically: keep it small during early operations, then gradually raise it as needed to both maximize the number of files in the file cache and maintain the highest random access rates.

Key Hierarchy

The underlying technique for dynamic block sizing builds upon knowledge of how leveldb maintains its key indexes. leveldb tracks the storage location of a key via a three-layer hierarchy:

  • MANIFEST layer: stores the first and last key of every leveldb .sst table file. The manifest is kept completely in memory. There is one entry per file, and every file is always represented in memory. Larger file sizes result in fewer entries.

  • .sst metadata layer: stores one representative key per file data block. The metadata size increases as the block size decreases. This layer is stored in the file cache. It is loaded and unloaded based upon file usage and available file cache space.

  • block metadata layer: stores an index of "restart points" within the block along with the offset to each restart point within the block. This layer is stored in the block cache. It is loaded and unloaded based upon block access and available block cache space. Each block contains at least block_size bytes of user data before compression. The block's index size is influenced by the block_restart_interval option: the higher the number, the smaller the block index (but the more time needed to linearly scan the block for the exact key).

To find the value of a key:

  • leveldb first determines the appropriate .sst table via the MANIFEST.
  • It then locates the appropriate data block within the .sst file via the file's metadata index.
  • leveldb next finds a starting point within the data block that is "close" to the actual key/value pair via the block's metadata index.
  • Finally, it performs a linear walk from the restart point in the block to the actual key/value data.
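
The following is a minimal sketch of that three-layer walk. All type and function names here are illustrative assumptions, not leveldb's actual classes; the real implementation uses binary search and its configured comparator at each layer.

```cpp
// Illustrative sketch of the three-layer key lookup (not leveldb's actual API).
#include <cstdint>
#include <string>
#include <vector>

struct FileEntry   { std::string smallest, largest; uint64_t file_number; };  // MANIFEST layer
struct BlockHandle { uint64_t offset, size; };
struct IndexEntry  { std::string last_key; BlockHandle handle; };             // .sst metadata layer

// 1. MANIFEST layer: pick the .sst file whose [smallest, largest] key range covers the key.
const FileEntry* FindTableFile(const std::vector<FileEntry>& level, const std::string& key) {
  for (const FileEntry& f : level) {
    if (key >= f.smallest && key <= f.largest) return &f;
  }
  return nullptr;
}

// 2. .sst metadata layer: pick the first data block whose index key is >= the search key.
const IndexEntry* FindDataBlock(const std::vector<IndexEntry>& index, const std::string& key) {
  for (const IndexEntry& e : index) {
    if (key <= e.last_key) return &e;
  }
  return nullptr;
}

// 3. Block metadata layer: binary-search the block's restart points for one "close" to the
//    key, then walk the entries linearly until the exact key is found (details omitted).
```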

So there is a simple size tradeoff between the file cache's metadata and the block cache's blocks. Smaller blocks cause larger metadata and are quicker to read from disk IF the file is already present in the file cache. Larger blocks are slower to read from the disk, but they result in smaller file cache entries. Smaller file cache entries allow more files to be simultaneously present in the file cache.
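
A rough worked example of that tradeoff, assuming a 100 MB .sst file and roughly 40 bytes of index metadata per data block (both figures are assumptions for illustration only):

```cpp
// Approximate .sst metadata size for several block_size choices.
#include <cstdio>
#include <initializer_list>

int main() {
  const double file_bytes  = 100.0 * 1024 * 1024;  // assumed 100 MB .sst file
  const double entry_bytes = 40.0;                 // assumed index bytes per data block
  for (double block_size : {4096.0, 16384.0, 65536.0}) {
    const double blocks   = file_bytes / block_size;
    const double metadata = blocks * entry_bytes;
    std::printf("block_size %6.0f -> ~%6.0f blocks, ~%5.1f KB of index metadata\n",
                block_size, blocks, metadata / 1024);
  }
  return 0;
}
```

Under those assumptions, 4096-byte blocks need roughly 1 MB of index metadata per file, while 65536-byte blocks need roughly 62 KB, which is why larger blocks let more files share the file cache.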

About leveldb's block_size

The size of a data block is controlled by the block_size parameter of the Options structure (sst_block_size in Riak's app.config). The size represents a threshold, not a fixed count. Whenever a newly created block reaches this uncompressed size, leveldb considers it full and writes the block with its metadata to disk. The number of keys contained in the block depends upon the size of the values and keys.
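
A minimal sketch of that threshold behavior, assuming nothing about leveldb's real TableBuilder beyond what the paragraph above states (restart points and length prefixes are omitted):

```cpp
// Sketch: keys are appended until the uncompressed block reaches block_size,
// then the caller flushes the block and writes its index entry.
#include <cstddef>
#include <string>

struct BlockSketch {
  std::string buffer;         // uncompressed key/value data accumulated so far
  size_t block_size = 4096;   // Options::block_size (sst_block_size in Riak's app.config)

  // Returns true once the block has reached the block_size threshold.
  bool Add(const std::string& key, const std::string& value) {
    buffer.append(key).append(value);
    return buffer.size() >= block_size;
  }
};
```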

This branch dynamically raises block_size as space in the file cache becomes constrained.

Performance Stages

Stage 1: Physical RAM larger than dataset

When leveldb starts a new database (Riak vnode), there is plenty of space in the file cache for .sst metadata. The default block_size of 4096 is a reasonable value that, for many scenarios, emphasizes keeping a greater number of keys in the .sst metadata layer. This emphasis allows the data blocks to be small and read more quickly from disk. The quicker disk read results in a faster response to the user.

Because of this dynamic block_size branch, users can now choose to lower the initial block_size to match their value size (assuming values smaller than 4096 bytes). A block_size matching the user's value size will enhance random read operations. As already discussed, the smaller block_size will likely inflate the file metadata and fill the file cache sooner. But the branch's dynamic component should gradually raise the block_size automatically before the file cache actually starts thrashing.
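
For developers using leveldb directly, lowering the starting block_size is a one-line Options change. The snippet below is a minimal example; the 1024-byte value and the database path are assumptions, and Riak users would instead set sst_block_size in app.config.

```cpp
// Open a database with a smaller starting block_size to match small values.
#include <cassert>
#include "leveldb/db.h"

int main() {
  leveldb::Options options;
  options.create_if_missing = true;
  options.block_size = 1024;   // assumed ~1 KB average value size

  leveldb::DB* db = nullptr;
  leveldb::Status status = leveldb::DB::Open(options, "/tmp/dynamic_block_example", &db);
  assert(status.ok());
  delete db;
  return 0;
}
```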

If physical RAM is dramatically larger than the total dataset size and is expected to remain that way for an extended period of time, also consider enabling this branch: https://github.com/basho/leveldb/wiki/mv-fadvise-control

Stage 2: Dataset size greater than physical RAM, but all file metadata fits in file cache allocation

There comes a point where the number of files and the size of their metadata exceed the size of the file cache. Riak 2.0's flexcache delays when that point is reached, but it still happens eventually. The random read of an individual key/value pair becomes quite expensive when leveldb must first load the file and its metadata. The 4096 byte read of a data block now additionally requires a file open (costly at the operating system level) and the read of all the metadata (maybe 4Mbytes): 4,096 bytes becomes 4,004,096. These larger random read operations are a performance killer.

This branch's first improvement is to have leveldb adjust the block_size parameter dynamically once the file_cache approaches its full state. leveldb adjusts the block_size incrementally higher, reducing the metadata size of newly created .sst files. This dynamic block_size adjustment makes individual read actions of an already open file a little slower due to the data block being larger. That read size cost is offset by two gains: more files can fit into the file cache reducing the number of open operations, and metadata reads of a newly opened file will be less costly. Over time, leveldb will recompact all active .sst files using the dynamic block_size.

This branch's second improvement is a result of adjusting the block_size based upon the size of data currently being compacted. Not all .sst table files contain the same size key/value pairs. The size of Riak's 2i key/value data is quite small compared to most user key/value data. And the 2i key/value data clumps together as leveldb compacts it into higher-level .sst files. The 2i-centric .sst files can use a dynamic block_size that is different from the block_size of the user-data-centric .sst files, optimizing both. No more "one size fits all" non-optimization.

This branch's third improvement is thanks to the Riak 2.0 flexcache feature. flexcache automatically unloads file cache entries after 5 days if they have no activity. The unload of stale files creates an opportunity for more file cache space. The dynamic block_size algorithm watches for this opportunity and will decrease the block_size to again emphasize random read performance over .sst metadata size.

Stage 3: Total file metadata exceeds file cache size

By the time the dataset reaches this point, the dynamic block_size algorithm will have gradually increased the block size to the point that it represents the minimum size of reading both a data block and the file metadata. The underlying concept is that every random read is likely to open a new file, read its metadata, then read one block. Therefore the metadata and block sizes should be tuned such that their aggregate represents the minimum attainable via block_size manipulations.

This initial dynamic block_size implementation does not consider the success / failure rate of the bloom filter in the block_sizing. It also does not consider the impact of operating system and/or disk controller caches.

There is an upper limit to leveldb. That limit is due to the size of the "manifest" (mentioned previously in the Key Hierarchy section). There must be one manifest entry for every .sst table file, without regard to the file's presence in the file cache. And the entry contains the highest and lowest keys within the file. Google's original leveldb code used 2Mbyte file size limits. Riak had at least one site using Google's original 2Mbyte size where each of the ~50 databases on the server contained over 78,000 files. The manifest size became a limitation in memory.

Riak has modified leveldb to allow the size of .sst files in higher levels to approach 500Mbytes. The larger file sizes primarily assist with write throughput (which is not discussed here). The larger file sizes also reduce the memory resident manifest size, extending leveldb's upper limit on dataset size.

Block size algorithm

The dynamic block size algorithm uses four computed values:

  • original block size
  • maximum block size that also minimizes the total size of the .sst metadata read plus the block read
  • incremental block size based upon (maximum - original) / block_size_steps
  • current block size

When available space in the file cache is low and the block size has not been raised within the last 5 minutes, the current block size is incremented by the incremental block size and used for subsequent compactions. The new block size will not exceed the maximum block size.
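
A minimal sketch of that stepping rule follows. All names and the exact trigger check are illustrative assumptions; the real logic lives in Basho's leveldb branch.

```cpp
// Sketch of the stepping rule: raise the current block size by one increment when
// file cache space is low and the last raise was more than 5 minutes ago,
// never exceeding the maximum block size.
#include <algorithm>
#include <cstdint>

struct BlockSizeState {
  uint64_t original;        // block_size from Options (or average value size, if larger)
  uint64_t maximum;         // block size minimizing metadata read + block read (see below)
  uint64_t increment;       // (maximum - original) / block_size_steps
  uint64_t current;         // block_size handed to subsequent compactions
  uint64_t last_raise_sec;  // time of the previous raise, in seconds
};

uint64_t NextBlockSize(BlockSizeState& s, bool file_cache_space_low, uint64_t now_sec) {
  if (file_cache_space_low && (now_sec - s.last_raise_sec) >= 5 * 60) {
    s.current = std::min(s.current + s.increment, s.maximum);
    s.last_raise_sec = now_sec;
  }
  return s.current;
}
```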

original block size: this is the greater of either block_size specified by the user options or the average value size of input files to the compaction. Where average value size is not available, the size of the next value to compact is substituted.

maximum block size: this is computed by taking the first derivative of the function f(x) = block_size + metadata_size and solving for the block size where f'(x) is 0, i.e. the block size that minimizes the combined metadata and block read.

block_size_steps: this is a Riak specific option added to leveldb's Options structure. It represents how many incremental steps to take between the original block_size Options parameter and maximum block size. Example: given block_size is 4,096 and average value size is 1,024, the maximum block size can be around 121,000. The default block_size_steps is 16. So (121000 - 4096) / 16 is 7,306. Therefore the first block_size change is from 4,096 to 11,402 (4096+7306), then to 18,708, etc.
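
The arithmetic of that example, reproduced as a small sketch (the ~121,000 maximum is taken from the example above rather than re-derived here):

```cpp
// Worked example: original block_size 4,096, maximum ~121,000, block_size_steps 16.
#include <cstdio>

int main() {
  const long original = 4096;
  const long maximum  = 121000;  // from minimizing f(x) = block_size + metadata_size
  const int  steps    = 16;      // default block_size_steps

  const long increment = (maximum - original) / steps;  // 7,306
  long current = original;
  for (int i = 1; i <= 3 && current + increment <= maximum; ++i) {
    current += increment;
    std::printf("step %d: block_size = %ld\n", i, current);  // 11402, 18708, 26014
  }
  return 0;
}
```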
