Skip to content

mv shadow tables

Matthew Von-Maszewski edited this page Mar 27, 2017 · 4 revisions

Status

  • merged to master -
  • code complete -
  • development started - March 11, 2017

History / Context

Traditional databases often split content into index files and data files. MySql's MYISAM did this. Faircom's Ctree low level function did this too. And Riak's secondary indexes (2i) are essentially doing the same thing by creating a second key (index) that simply points to the primary key and its data. Google's leveldb keeps keys (indexes) and data together. This works reasonably well for sorted data but causes extra write volume with unsorted data.

A little known fact is that leveldb does not run its compaction process on files that contain unique keys that do not overlap keys on higher levels. leveldb simply "moves" the entire file to next level. This is a simple directory operation, not a wholesale read and rewrite of the file. The performance opportunity is to place the data within a space that never overlaps other data and reduce the actual keys needing to be sorted to having minimal data.

A baseline test of unsorted keys with 800 bytes data values took 20 hours with Basho's shipping leveldb. A quick implementation that split the keys from the values reduced the time to 10 hours. A second, completely different test wrote unsorted keys with 10 kilobyte values into Riak for 3 hours. The number of keys written within the time limit doubled with the quick implementation. The feature name for splitting the keys from the data is "shadow tables".

This investigation was inspired by Martin Sumner's work creating a new database engine called "leveled". The notable difference is that Martin created an independent storage component for the data. This branch keeps everything within the existing code structure. The keys and the data just live within two independent key spaces of the same database. All other Basho extensions of leveldb continue to work just fine without special consideration for shadow tables, i.e expiry, hot backup, cache object warming, etc.

Future performance improvements

This branch does not change the existing compaction logic. There is a another, significant performance improvement in making the compaction logic aware of shadow tables. The current logic will continue to read and rewrite shadow tables at level 1 and level 2. Only at level 3 do the shadow tables become truly independent of compaction.

Shadow tables will not benefit from having a bloom filter. The bloom filter takes up space both on disk and in memory. Further, the bloom filter takes additional time to read from disk if the target data is within a shadow table's .sst table file that is not already open. leveldb should dynamically disable the bloom filter on new .sst table files that contain only shadow table keys.

Branch description

include/leveldb/options.h & util/options.cc

Clone this wiki locally